Re: quick poll on SURBL hit % - Discuss

10 Jan 2005


      Chris Santerre wrote:
...
Just curious as to what average percent of spam people see SURBL 
hitting. In a non-scientific manner, I average about 85% ...
I've run multiple analyses on historical datasets, and get a consistent 
*average* of 82%-86%, so 84% is a decent estimate.
The most noteworthy statistical characteristic of the SURBL hit rate 
over time is the large *variance* in hit rate.  Some days, the SURBL hit 
rate I observe in my data is in the 60% range, while other days it's in 
the 90% range.  The fluctuation appears to be at least somewhat periodic 
in nature (several "low" days in a row, followed by several "high" days).  
I've not actually run the numbers, but my totally informal, *purely gut* 
sense is that the magnitude of that variance may have diminished lately, 
but the periodic pattern persists.  These periodic fluctuations imply 
that there is probably some systematic cause underlying this variance, 
and that cause is itself almost certainly periodic in nature.
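
For anyone who wants to check their own logs for this pattern, here is a minimal sketch of one way to do it. It assumes you can produce a per-day pair of (spam received, SURBL hits); the function names and the counts below are made up purely for illustration and are not my data. The spread is summarized with a standard deviation, and the periodicity with a plain sample autocorrelation, which is my choice of tool here rather than anything SURBL-specific.

```python
from statistics import mean, pstdev


def daily_hit_rates(daily_counts):
    """Turn (spam_total, surbl_hits) pairs into per-day hit rates."""
    return [hits / total for total, hits in daily_counts if total > 0]


def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    if lag <= 0 or lag >= n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mu = mean(series)
    denom = sum((x - mu) ** 2 for x in series)
    if denom == 0:
        return 0.0  # constant series: no variation to correlate
    num = sum((series[i] - mu) * (series[i + lag] - mu) for i in range(n - lag))
    return num / denom


# Hypothetical counts, NOT real measurements: three "low" days followed by
# three "high" days, repeated, to mimic the pattern described above.
example_counts = [
    (1000, 650), (1000, 660), (1000, 640),
    (1000, 930), (1000, 940), (1000, 920),
    (1000, 650), (1000, 670), (1000, 640),
    (1000, 930), (1000, 950), (1000, 920),
]

rates = daily_hit_rates(example_counts)
print(f"mean hit rate: {mean(rates):.3f}")
print(f"std deviation: {pstdev(rates):.3f}")
for lag in range(1, 7):
    print(f"lag {lag}: autocorrelation {autocorrelation(rates, lag):+.2f}")
```

If the fluctuation really is periodic, the autocorrelation should swing negative at about half the cycle length and back to positive at the full cycle length. A long run of real daily counts is needed before reading much into it.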
...
I have a feeling if I clean up my
results a bit, that number would be even higher.
I've talked about this with Jeff several times, and he's even shared 
some of my comments with this list.  No one in the anti-spam world likes 
hearing this, but there is very strong evidence of a "hard" statistical 
detection limit right around 85%.  This limit appears to be more or 
less independent of data set or detection method.
This is not to say that only 85% of spam can be stopped; rather, it 
implies that no one *single* detection method broadly applicable to 
spam-as-a-whole can reliably (repeatably, predictably, consistently) 
snag more than ~85%.
(At the risk of sounding testy, I am *not* looking to debate this point.  
Under the heading of "you can lead a horse to water," I'm content to 
have anyone believe anything they like, even if it's demonstrably 
wrong.  All I can do is report empirical results based on multiple 
analyses, multiple detection methods, multiple datasets, and multiple 
time periods.  The observed ~85% detection limit is remarkably 
consistent across all these conditions.)
One of the vaguely interesting research questions implicit in this 
result is the determination of the *unique* contribution of any given 
approach.  That is, what %-age of spam *missed by all other approaches* 
does a particular detection method snag?  On my *personal* spam corpus, 
SURBL's *unique* contribution is in the 18%-20% range.
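
To make the definition concrete, here is a rough sketch of how the unique contribution could be computed, assuming each message in the corpus has an ID and you have recorded which IDs each method flagged. The function name, the method names ("surbl", "bayes", "dnsbl"), and the ten-message corpus are all hypothetical, chosen only to show the arithmetic.

```python
def unique_contribution(all_spam, caught_by, method):
    """Fraction of the spam missed by every other method that `method` catches.

    all_spam:  iterable of message IDs in the spam corpus
    caught_by: dict mapping method name -> set of message IDs it flagged
    method:    the method whose unique contribution is wanted
    """
    # Everything caught by at least one of the *other* methods.
    others = set().union(
        *(ids for name, ids in caught_by.items() if name != method)
    )
    missed_by_others = set(all_spam) - others
    if not missed_by_others:
        return 0.0  # the other methods already catch everything
    return len(missed_by_others & caught_by[method]) / len(missed_by_others)


# Hypothetical ten-message corpus, for illustration only.
all_spam = range(1, 11)
caught_by = {
    "surbl": {1, 2, 3, 4, 5, 6, 7, 8},
    "bayes": {1, 2, 3, 4, 5, 9},
    "dnsbl": {1, 2, 3, 6},
}

for name in caught_by:
    share = unique_contribution(all_spam, caught_by, name)
    print(f"{name}: {share:.0%} of what the others miss")
```

Note the denominator is the spam missed by all the other methods, not the whole corpus, so the figure depends heavily on which other methods are in the comparison.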