On Friday, November 12, 2004, 7:38:45 AM, John Wilcock wrote:
> On Fri, 12 Nov 2004 04:31:37 -0800, Jeff Chan wrote:
> > We could probably experiment and try some different approaches and see how they test out on corpora and live mail servers.
> Should be very easy to test with SpamAssassin for anyone with a decent corpus - just write some meta rules to simulate the intersections (or Ryan's suggested additive combinations).
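For anyone who wants to try the same comparison outside SpamAssassin, here is a rough Python sketch of what such a simulation might look like. Everything in it is an assumption for illustration: the list names "a", "b", "c", the toy corpus, and the reading of "additive" as "at least N of the lists hit". It is not SpamAssassin rule syntax and not necessarily what Ryan had in mind.

```python
# Hypothetical hand-checked corpus: (is_spam, set of lists the message's URIs hit).
# The list names "a", "b", "c" are placeholders, not real list names.
corpus = [
    (True, {"a", "b", "c"}),
    (True, {"b"}),
    (True, {"a", "c"}),
    (True, set()),
    (False, {"b"}),
    (False, set()),
]

def intersection_rule(lists):
    """Fire only when every list in the combination hit."""
    wanted = frozenset(lists)
    return lambda hits: wanted <= hits

def additive_rule(lists, threshold):
    """Fire when at least `threshold` of the lists hit (assumed meaning of 'additive')."""
    wanted = frozenset(lists)
    return lambda hits: len(wanted & hits) >= threshold

def rates(corpus, rule):
    """Return (spam hit rate, ham hit rate) for a rule over the corpus."""
    spam = [hits for is_spam, hits in corpus if is_spam]
    ham = [hits for is_spam, hits in corpus if not is_spam]
    spam_rate = sum(rule(h) for h in spam) / len(spam)
    ham_rate = sum(rule(h) for h in ham) / len(ham)
    return spam_rate, ham_rate

if __name__ == "__main__":
    print("a AND c:", rates(corpus, intersection_rule(["a", "c"])))
    print("2 of a,b,c:", rates(corpus, additive_rule(["a", "b", "c"], 2)))
    print("1 of a,b,c:", rates(corpus, additive_rule(["a", "b", "c"], 1)))
```

The ham rate is the number to watch: a combination is only interesting if it keeps most of the spam hits while pushing ham hits toward zero.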
And I have another technique I can use here: take each list, and each permutation of lists, and see what percentage of the DNS queries matching blocklists in general it hits. Recall that we now have statistics on whitelist, blocklist and unmatched DNS queries sampled from a DNS server. That means we can estimate spam detection rates for lists and permutations of lists based purely on SURBL DNS hits.
This is not as good as proper corpus checks, since our blocklist hits may include some FPs, but it does give some indication of the general spam detection rates of the lists or their permutations. The best of those results could then be checked against hand-checked corpora with some confidence that we're at least checking the most promising ones.
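Roughly, the estimate could be sketched like this in Python. The sample format (one set of matched lists per sampled query), the placeholder list names and the function name are all assumptions made up for the example. It also treats "permutations" as unordered combinations, on the assumption that the order of the lists does not matter for an intersection.

```python
from itertools import combinations

def combo_coverage(samples, lists):
    """For each combination of lists, estimate the fraction of
    blocklist-matching queries that every list in the combination hits.

    samples: one set per sampled DNS query, holding the names of the
             blocklists the queried domain matched (empty set means the
             query was whitelisted or unmatched). Hypothetical format.
    """
    # The denominator is queries that matched blocklists in general.
    blocklisted = [hits for hits in samples if hits]
    results = {}
    for size in range(1, len(lists) + 1):
        for combo in combinations(lists, size):
            wanted = frozenset(combo)
            matched = sum(1 for hits in blocklisted if wanted <= hits)
            results[combo] = matched / len(blocklisted) if blocklisted else 0.0
    return results

if __name__ == "__main__":
    # Toy sample with placeholder list names "a", "b", "c".
    samples = [{"a", "b"}, {"a"}, {"b", "c"}, set(), {"a", "b", "c"}]
    coverage = combo_coverage(samples, ["a", "b", "c"])
    # Show the most promising combinations first.
    for combo, rate in sorted(coverage.items(), key=lambda kv: -kv[1]):
        print("+".join(combo), round(rate, 3))
```

Sorting by coverage gives the short list of combinations worth taking to the hand-checked corpora.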
Gonna code this up....
Jeff C.
-- 
"If it appears in hams, then don't list it."