On Friday, November 12, 2004, 5:41:26 AM, Ryan Thompson wrote:
Jeff Chan wrote to SURBL Discussion list:
We could probably experiment and try some different approaches and see how they test out on corpora and live mail servers.
A simple join(1) on the data files might be a better start:
SC and AB and WS and JP and OB
Matches 202 records. That's going to have an extremely low detection rate. The problem is that "and" means "intersection", and by including OB in particular, you're automatically limiting the maximum size of the result to about 350 records.
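Something like the following join(1) pipeline would reproduce that intersection (untested sketch; the file names are made up, assuming one domain per line in each list dump):

  # made-up file names; one domain per line per list
  export LC_ALL=C    # keep sort(1) and join(1) in the same collation
  sort -u sc.txt > sc.s; sort -u ab.txt > ab.s; sort -u ws.txt > ws.s
  sort -u jp.txt > jp.s; sort -u ob.txt > ob.s
  # join(1) on single-column files is just set intersection
  join sc.s ab.s | join - ws.s | join - jp.s | join - ob.s | wc -l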
((SC or AB) and (JP or OB))
Matches 1,187 records. Probably still too few.
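Same idea, but union the pairs first and then intersect the two unions (again an untested sketch with invented file names):

  export LC_ALL=C
  sort -u sc.txt ab.txt > sc_or_ab.s    # SC or AB
  sort -u jp.txt ob.txt > jp_or_ob.s    # JP or OB
  join sc_or_ab.s jp_or_ob.s | wc -l    # and the two together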
Don't let the size of the list fool you though. Remember that a few spam gangs send out most of the spam using zombies and other ways that are hard to block with conventional RBLs. Getting their domains at any given time probably does not entail having a list of 100k domains. Just a few hundred domains probably appear in a majority of spams at any particular time. The question is "which hundreds?"... :-)
or PH
Didn't feel like pulling PH out of multi for this test.
Better, IMHO, is to use something like
(SC + AB + JP + OB + WS) >= 3
Matches 16,560 records. Aha! Now we're getting something useful.
Without WS in that equation, the number drops to 906.
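The counting version doesn't need join(1) at all; a rough, untested sketch (made-up file names again) that pulls out the domains appearing on three or more lists:

  # dedupe each list so a domain counts at most once per list,
  # then count occurrences across lists and keep those seen 3 or more times
  export LC_ALL=C
  for f in sc ab jp ob ws; do sort -u "$f".txt; done \
    | sort | uniq -c | awk '$1 >= 3 { print $2 }'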
Qualifying by the number of lists could be useful to try, though SC and AB should be lumped together since they're both mostly from SpamCop URI reports. In other words, SC and AB aren't very independent in terms of their data source; they're mostly different slices of the same data and should probably be treated as a single source.
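If SC and AB are to be counted as one source, merging them before the count is a small change to the same sketch (same caveats: untested, invented file names):

  # fold SC and AB into a single source, then count as before
  export LC_ALL=C
  sort -u sc.txt ab.txt > sc_ab.s
  for f in sc_ab.s jp.txt ob.txt ws.txt; do sort -u "$f"; done \
    | sort | uniq -c | awk '$1 >= 3 { print $2 }'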
You can try this with different lists if you want, or even mix in some judicious "and" and "or" matching. For instance, since there is a large overlap between JP and WS, you might want to choose one or the other.
All of JP is currently included in WS. They will be more independent when we take JP out of WS, as we're planning to do when SpamAssassin 3.1 gets released.
But maybe it doesn't matter so much: in that case you might just set the cutoff lower to compensate, so having the additional list would still add some small bit of confidence.
To me, 3 currently looks like the likely sweet spot, although the ~2,500 domains present in four or more lists could still put a sizeable dent in spam at the MTA level, at a lower FP rate. I'd recommend looking at 3 and 4 a little more closely:
http://ry.ca/surbl/ab+jp+ob+sc+ws+uc3.txt
http://ry.ca/surbl/ab+jp+ob+sc+ws+uc4.txt
By definition, the domains in four or more lists are a strict subset of those in three or more, so if FP(n>=N) is the false positive rate of the list of domains appearing in N or more source lists, then FP(n>=3) >= FP(n>=4): any ham domain caught at the higher cutoff is also caught at the lower one. Thus, this approach also has the added benefit of letting you control the FP rate, at least in discrete steps.
FP rates should increase with "ors" and decrease with "ands". I probably won't be using UC, but the principle is the same for whatever lists are used.
Thanks for sharing your ideas,
Jeff C. -- "If it appears in hams, then don't list it."