Chris Santerre wrote:
Just curious as to what average percent of spam people see SURBL hitting. In a non-scientific manner, I average about 85% ...
I've run multiple analyses on historical datasets, and get a consistent *average* of 82%-86%, so 84% is a decent estimate.
The most noteworthy statistical characteristic of the SURBL hit rate over time is the large *variance* in hit rate. Some days, the SURBL hit rate I observe in my data is in the 60s, while other days it's in the 90s. The fluctuation appears to be at least somewhat periodic in nature (several "low" days in a row, followed by several "high" days). I've not actually run the numbers, but my totally informal, *purely gut* sense is that the magnitude of that variance may have diminished lately, while the periodic pattern persists. These periodic fluctuations suggest that there is probably some systematic cause underlying this variance, and that cause is itself almost certainly periodic in nature.
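For anyone who wants to check their own logs for this kind of pattern, here is a minimal Python sketch of one way to do it: compute the mean and spread of the daily hit rate, then look at the autocorrelation at a few lags. This is not the analysis I ran. The function names are hypothetical and the numbers in the example are made up purely to show the shape of a "several low days, several high days" cycle.

```python
from statistics import mean, pstdev


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a positive lag (in days)."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mu = mean(series)
    denom = sum((x - mu) ** 2 for x in series)
    if denom == 0:
        return 0.0  # a constant series has no variation to correlate
    num = sum((series[i] - mu) * (series[i + lag] - mu) for i in range(n - lag))
    return num / denom


def summarize(daily_rates, max_lag):
    """Mean, population standard deviation, and autocorrelation by lag."""
    return {
        "mean": mean(daily_rates),
        "stdev": pstdev(daily_rates),
        "autocorr": {
            lag: autocorrelation(daily_rates, lag)
            for lag in range(1, max_lag + 1)
        },
    }


# Made-up illustration, NOT real SURBL data:
# three "low" days followed by three "high" days, repeated four times.
synthetic = [0.65, 0.66, 0.64, 0.92, 0.93, 0.91] * 4
summary = summarize(synthetic, max_lag=6)

for lag, value in summary["autocorr"].items():
    print(f"lag {lag}: {value:+.2f}")
```

A strongly positive autocorrelation at some lag, with negative values around half that lag, is what a periodic fluctuation would look like. A series that just bounces around at random would show values near zero at every lag.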
I have a feeling if I clean up my results a bit, that number would be even higher.
I've talked about this with Jeff several times, and he's even shared some of my comments with this list. No one in the anti-spam world likes hearing this, but there is very strong evidence of a "hard" statistical detection limit right around 85%. This limit appears to be more or less independent of data set or detection method.
This is not to say that only 85% of spam can be stopped; rather, it implies that no one *single* detection method broadly applicable to spam-as-a-whole can reliably (repeatably, predictably, consistently) snag more than ~85%.
(At the risk of sounding testy, I am *not* looking to debate this point. Under the heading of "you can lead a horse to water," I'm content to have anyone believe anything they like, even if it's demonstrably wrong. All I can do is report empirical results based on multiple analyses, multiple detection methods, multiple datasets, and multiple time periods. The observed ~85% detection limit is remarkably consistent across all these conditions.)
One of the vaguely interesting research questions implicit in this result is the determination of the *unique* contribution of any given approach. That is, what percentage of the spam *missed by all other approaches* does a particular detection method snag? On my *personal* spam corpus, SURBL's *unique* contribution is in the 18%-20% range.
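To make the definition concrete, here is a small Python sketch of how the unique contribution could be computed from per-message results. It assumes you have, for each spam message, the set of detection methods that flagged it. The function name, the method labels ("surbl", "bayes", "rbl"), and the eight sample messages are all hypothetical and are not drawn from my corpus.

```python
def unique_contribution(hits, method):
    """Fraction of spam missed by every *other* method that `method` catches.

    hits: one set per spam message, naming the methods that flagged it.
    Returns None when no message was missed by all the other methods.
    """
    # Messages that no method other than `method` flagged.
    missed_by_others = [h for h in hits if not (h - {method})]
    if not missed_by_others:
        return None
    caught = sum(1 for h in missed_by_others if method in h)
    return caught / len(missed_by_others)


# Made-up example: eight spam messages and the methods that flagged each.
sample_hits = [
    {"surbl", "bayes"},
    {"surbl"},           # caught by SURBL alone
    set(),               # missed by everything
    {"bayes"},
    set(),
    set(),
    set(),
    {"surbl", "rbl"},
]

result = unique_contribution(sample_hits, "surbl")
print(f"unique contribution: {result:.0%}")
```

Note that the denominator is only the spam that slipped past everything else, not the whole corpus, so this figure says nothing by itself about a method's overall hit rate.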