[SURBL-Discuss] Ham corpora needed

Ryan Thompson ryan at sasknow.com
Sun Sep 5 23:56:04 CEST 2004


Jeff Chan wrote to SURBL Discuss and SpamAssassin Users:

> In order to reduce false positives in the SURBL data, we would
> like to have access to ham corpora.  Does anyone know of any
> public ham corpora, including just the URI domain names from the
> hams?  Or is there anyone who would be willing to run our URI
> domain lists against their ham?
>
> Does anyone know if messages from the Enron corpus have been
> categorized for ham and spam?
>
>  http://www-2.cs.cmu.edu/~enron/
>
> Thanks in advance for any suggestions, comments, thoughts....

FWIW, the mass-check I did on that 75K corpus took about 1.75h, on a
beefy machine with rbldnsd running on localhost, with 20 concurrent
jobs. (mass-check is slower than molasses for anything that blocks if
you don't let it run concurrent jobs :-)

Now, I know not everybody runs SpamAssassin, but it *does* have a really
easy log format and a hit-frequencies program to go with it. It's possible to
concatenate ham and spam logs from different sources to effectively get
statistics on a larger corpus... and only the test hits are stored in
the log, so the results are effectively anonymous.

There's ham.log for ham, and spam.log for spam, and the entries look
like this, one line per message:

Y  7 /spamdir/11710. URIBL_OB_SURBL,URIBL_WS_SURBL time=1089946124

Rather than re-invent the wheel, you can have your checkers output
simplified mass-check logs. The only column that matters is the tests
column. Something like this should work well enough for hit-frequencies:

N  0 <any_string> URIBL_TESTS_HIT,COMMA_DELIMITED time=<any_integer>
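
For a checker that isn't SpamAssassin, emitting that line could look
something like the Python sketch below. The function name and its
parameters are hypothetical; the first two columns are simply copied
from the template above, and the identifier is any string without
whitespace.

```python
def simplified_log_line(tests, ident="any_string", timestamp=0):
    """Format one simplified log line in the template layout shown above.

    tests:     names of the URIBL tests the message hit
    ident:     any whitespace-free string identifying the message
    timestamp: any integer
    """
    if not ident or any(ch.isspace() for ch in ident):
        raise ValueError("ident must be a non-empty string without whitespace")
    # "N  0" is taken verbatim from the template; only the tests column
    # is meant to carry information.
    return "N  0 {} {} time={}".format(ident, ",".join(tests), int(timestamp))
```

Write one such line per message to ham.log or spam.log, depending on
which corpus the message came from.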

Then, grab hit-frequencies from the SA distribution and you can
reproduce the output that others have been posting.

If you *do* have SA installed (even if you don't filter your mail with
it), it's even easier. Just set up a simple .cf file with the URIBL
rules (I'll provide one on request), and invoke mass-check in the tools
directory like so:

     ./mass-check -p=../rules -c=../rules --net -j=20 --progress \
         spam:dir:${SPAMDIR} ham:dir:${HAMDIR}

Then run:

     ./hit-frequencies -s 3 -p

It's almost worth extracting Mail-SpamAssassin from CPAN just to gain
that functionality. You don't even have to *use* SA. :-)

- Ryan

-- 
   Ryan Thompson <ryan at sasknow.com>

   SaskNow Technologies - http://www.sasknow.com
   901-1st Avenue North - Saskatoon, SK - S7K 1Y4

         Tel: 306-664-3600   Fax: 306-244-7037   Saskatoon
   Toll-Free: 877-727-5669     (877-SASKNOW)     North America

