Jeff Chan wrote to SURBL Discuss and SpamAssassin Users:
In order to reduce false positives in the SURBL data, we would like to have access to ham corpora. Does anyone know of any public ham corpora, including just the URI domain names from the hams? Or is there anyone who would be willing to run our URI domain lists against their ham?
Does anyone know if messages from the Enron corpus have been categorized for ham and spam?
http://www-2.cs.cmu.edu/~enron/
Thanks in advance for any suggestions, comments, thoughts....
FWIW, the mass-check I did on that 75K corpus took about 1.75h, on a beefy machine with rbldnsd running on localhost, with 20 concurrent jobs. (mass-check is slower than molasses for anything that blocks if you don't let it run concurrent jobs :-)
Now, I know not everybody runs SpamAssassin, but it *does* have a really easy log format and hit-frequencies program. It's possible to concatenate ham and spam logs from different sources to effectively get statistics on a larger corpus... and only the test hits are stored in the log, so the results are effectively anonymous.
There's ham.log for ham, and spam.log for spam, and the entries look like this, one line per message:
Y 7 /spamdir/11710. URIBL_OB_SURBL,URIBL_WS_SURBL time=1089946124
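To make the layout concrete, here is a rough Python sketch that splits such a line into its columns and counts how many messages hit each test. It is not hit-frequencies and does not try to reproduce its output. The function names are made up for illustration. It assumes that columns are whitespace-separated, that trailing fields of the form key=value (like time=) are not tests, and that lines starting with # are comments.

```python
from collections import Counter


def parse_log_line(line):
    """Split one log line into (flag, score, ident, tests), or None to skip it.

    Assumptions (not checked against the SpamAssassin source): columns are
    whitespace-separated, '#' starts a comment line, and any trailing field
    containing '=' (e.g. time=...) is metadata rather than a test name.
    """
    if line.lstrip().startswith("#"):
        return None
    parts = line.split()
    if len(parts) < 3:
        return None
    flag, score, ident = parts[0], parts[1], parts[2]
    tests = []
    for field in parts[3:]:
        if "=" in field:
            continue  # skip key=value metadata such as time=1089946124
        tests.extend(name for name in field.split(",") if name)
    return flag, score, ident, tests


def tally_hits(lines):
    """Return (message_count, Counter of messages hitting each test)."""
    total = 0
    counts = Counter()
    for line in lines:
        parsed = parse_log_line(line)
        if parsed is None:
            continue
        total += 1
        counts.update(set(parsed[3]))  # count each test once per message
    return total, counts
```

Since each line is independent, feeding it the lines of several ham.log (or several spam.log) files in a row gives counts over the combined corpus.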
Rather than re-invent the wheel, you can have your checkers output simplified mass-check logs. The only column that matters is the tests column. Something like this should work well enough for hit-frequencies:
N 0 <any_string> URIBL_TESTS_HIT,COMMA_DELIMITED time=<any_integer>
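If your checker is not SpamAssassin, something along these lines could emit that simplified line. This is only a sketch: the function name and its parameters are hypothetical, and it assumes the message identifier must not contain whitespace because the columns are space-separated.

```python
import time


def simplified_log_line(ident, tests, flag="N", score=0, timestamp=None):
    """Build one line in the simplified layout shown above.

    flag and score default to the placeholder values from the template;
    only the tests column is expected to matter to hit-frequencies.
    """
    if timestamp is None:
        timestamp = int(time.time())
    # Assumption: whitespace inside the identifier would break the column
    # layout, so collapse it to underscores; use "-" if nothing is left.
    safe_ident = "_".join(str(ident).split()) or "-"
    # The tests column is empty when nothing hit. Whether hit-frequencies
    # accepts an empty column has not been verified here.
    tests_field = ",".join(tests)
    return f"{flag} {score} {safe_ident} {tests_field} time={timestamp}"
```

Write one such line per ham message to ham.log and one per spam message to spam.log.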
Then, grab hit-frequencies from the SA distribution and you can reproduce the output that others have been posting.
If you *do* have SA installed (even if you don't filter your mail with it), it's even easier. Just set up a simple .cf file with the URIBL rules (I'll provide one on request), and invoke mass-check in the tools directory like so:
./mass-check -p=../rules -c=../rules --net -j=20 --progress \
    spam:dir:${SPAMDIR} ham:dir:${HAMDIR}
Then run:
./hit-frequencies -s 3 -p
It's almost worth extracting Mail-SpamAssassin from CPAN just to gain that functionality. You don't even have to *use* SA. :-)
- Ryan