Chris Santerre wrote:
>Just curious as to what average percent of spam people see SURBL
>hitting. In a non scientific manor, I average about 85% ...
I've run multiple analyses on historical datasets, and get a consistent
*average* of 82%-86%, so 84% is a decent estimate.
The most noteworthy statistical characteristic of the SURBL hit rate
over time is its large *variance*. Some days, the SURBL hit
rate I observe in my data is in the 60s (percent), while other days it's in the
90s. The fluctuation appears to be at least somewhat periodic in
nature (several "low" days in a row, followed by several "high" days).
I've not actually run the numbers, but my totally informal, *purely gut*
sense is that the magnitude of that variance may have diminished lately,
but the periodic pattern persists. These periodic fluctuations imply
that there is probably some systematic cause underlying this variance,
and that cause is itself almost certainly periodic in nature.
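
For anyone who wants to check their own logs for the same pattern, here
is a minimal sketch of how the day-to-day spread and the periodicity
could be measured. It assumes you already have per-day counts of total
spam and SURBL hits; the function names and the sample counts are made
up purely for illustration and are not my data.

```python
from statistics import mean, pstdev


def daily_hit_rates(daily_counts):
    """Turn (spam_total, surbl_hits) pairs into per-day hit rates."""
    return [hits / total for total, hits in daily_counts if total > 0]


def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag (in days)."""
    n = len(series)
    if lag <= 0 or lag >= n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mu = mean(series)
    denom = sum((x - mu) ** 2 for x in series)
    if denom == 0:
        return 0.0  # a constant series has no variation to correlate
    num = sum((series[i] - mu) * (series[i + lag] - mu) for i in range(n - lag))
    return num / denom


# Invented counts: three "low" days followed by three "high" days, repeated.
counts = [(1000, 650), (1000, 660), (1000, 640),
          (1000, 920), (1000, 930), (1000, 910)] * 4

rates = daily_hit_rates(counts)
print("mean hit rate:", round(mean(rates), 3))
print("std deviation:", round(pstdev(rates), 3))
for lag in range(1, 8):
    print("lag", lag, "autocorrelation:", round(autocorrelation(rates, lag), 3))
```

A clearly positive autocorrelation at some lag, with negative values at
about half that lag, would suggest a cycle of that length. It says
nothing about what causes the cycle.
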
>I have a feeling if I clean up my
>results a bit, that number would be even higher.
I've talked about this with Jeff several times, and he's even shared
some of my comments with this list. No one in the anti-spam world likes
hearing this, but there is very strong evidence of a "hard" statistical
detection limit at roughly 85%. This limit appears to be more or
less independent of data set or detection method.
This is not to say that only 85% of spam can be stopped; rather, it
implies that no one *single* detection method broadly applicable to
spam-as-a-whole can reliably (repeatably, predictably, consistently)
snag more than ~85%.
(At the risk of sounding testy, I am *not* looking to debate this point.
Under the heading of "you can lead a horse to water," I'm content to
let anyone believe anything they like, even if it's demonstrably
wrong. All I can do is report empirical results based on multiple
analyses, multiple detection methods, multiple datasets, and multiple
time periods. The observed ~85% detection limit is remarkably
consistent across all these conditions.)
One of the vaguely interesting research questions implicit in this
result is the determination of the *unique* contribution of any given
approach. That is, what percentage of spam *missed by all other approaches*
does a particular detection method snag? On my *personal* spam corpus,
SURBL's *unique* contribution is in the 18%-20% range.
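
As a rough sketch of how that figure could be computed, reading the
question literally: take the spam that every *other* method missed, and
ask what fraction of it the method in question caught. The function
name, the method labels ("bayes", "dnsbl") and the message IDs below are
hypothetical, chosen only to make the example run.

```python
def unique_contribution(corpus, detections, method):
    """Fraction of spam missed by all other methods that `method` caught.

    corpus     -- set of IDs for every spam message in the corpus
    detections -- dict mapping method name -> set of IDs that method caught
    method     -- the method whose unique contribution is wanted
    """
    caught_by_others = set()
    for name, ids in detections.items():
        if name != method:
            caught_by_others |= ids
    missed_by_others = corpus - caught_by_others
    if not missed_by_others:
        return 0.0  # the other methods left nothing to catch
    return len(missed_by_others & detections[method]) / len(missed_by_others)


# Invented ten-message corpus with three detection methods.
corpus = set(range(1, 11))
detections = {
    "surbl": {1, 2, 3, 4, 5, 6, 7, 8},
    "bayes": {1, 2, 3, 4, 5, 6},
    "dnsbl": {4, 5, 6, 7},
}

surbl_unique = unique_contribution(corpus, detections, "surbl")
print("SURBL unique contribution:", round(surbl_unique, 3))
```

Here messages 8, 9 and 10 are missed by the other two methods, and SURBL
catches one of the three. If you would rather express the unique catches
as a share of the whole corpus, divide by len(corpus) instead.
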