[SURBL-Discuss] Ham corpora needed

Ryan Thompson ryan at sasknow.com
Mon Sep 6 02:17:39 CEST 2004


Jeff Chan wrote to SURBL Discussion list:

> On Sunday, September 5, 2004, 3:45:20 PM, Ryan Thompson wrote:
>> Jeff Chan wrote to SURBL Discussion list and SpamAssassin Users:
>
>>> On Sunday, September 5, 2004, 2:56:04 PM, Ryan Thompson wrote:
>>>
>>>> FWIW, the mass-check I did on that 75K corpus took about 1.75h, on a
>>>> beefy machine with rbldnsd running on localhost, with 20 concurrent
>>>> jobs. (mass-check is slower than molasses for anything that blocks if
>>>> you don't let it run concurrent jobs :-)
>>>
>>> One shortcut, which may be adequate for purposes of cleaning up the
>>> SURBL data, might be to simply extract the URI domains from the ham
>>> corpus, sort and unique that list, then compare that ham URI domain
>>> list against the SURBL under test.  Hits could be matched up against
>>> the source message.  Since the hits are relatively few that could save
>>> much processing over using full SA on every message.
>
>> Yeah. The *best* solution would be to have our own mass-checker. My
>> GetURI (http://ry.ca/geturi/) could probably be extended for the task
>> without much work, since it already extracts URIs and is capable of
>> producing statistics.
>
>> Maybe if I wired it to *also* accept a ham directory, it could
>> cross-check domains in both corpora and list possible FPs.
>
> But we should be able to use it singly against a directory
> of ham messages, right?

Well, yes, but GetURI excludes any domains that are listed in multi,
so I'd almost need to throw in a switch to flip the logic. It's a good
idea, but I have a friend's graduation party to attend, so it isn't
going to happen tonight. :-)

That, and the sorting would be all back-asswards.

I have many good plans (and code) for GetURI 1.5 that'll make all of
this easy. I really want to figure out a better way to get the domain
age, though!

> The only difference is that the
> output would be ham domains and not spam domains....
> We'd then compare that ham list to a SURBL and find the
> FPs....
>
> I may try that against the SpamAssassin public corpora
> that Justin replied about, unless you beat me to it. :-)
>
>  http://spamassassin.apache.org/publiccorpus/

Just for kicks, I ran it on the easy_ham corpus (about 2500 ham
messages), and got the following:

http://ry.ca/geturi/easy_ham.html (1MB)

The output would be *much* cleaner with just a few optimizations/flags
I have in mind for GetURI. Still, the above might be of some value.
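
For reference, the shortcut discussed above (pull the URI domains out of
the ham, unique them, compare against the list under test, and tie each
hit back to its source message) could be sketched roughly as below. This
is not GetURI and not how mass-check works. The function names are
made up for illustration, `listed` is assumed to be a plain set of
domains exported from the SURBL being tested (no DNS lookups are done),
and the host-to-domain reduction is deliberately crude, with no TLD or
ccTLD table.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Rough http(s) URI matcher; a real extractor would also decode MIME parts.
URI_RE = re.compile(r"https?://[^\s<>\"')\]]+", re.IGNORECASE)


def uri_hosts(text):
    """Return the set of lowercased hostnames from http(s) URIs in text."""
    hosts = set()
    for uri in URI_RE.findall(text):
        try:
            host = urlsplit(uri).hostname
        except ValueError:
            # Malformed URI (e.g. a broken bracketed host); skip it.
            continue
        if host:
            hosts.add(host.rstrip(".").lower())
    return hosts


def candidate_domains(host):
    """Yield the host and each parent domain that has at least two labels.

    Assumption: without a TLD table we cannot pick "the" registered
    domain, so every suffix is tried against the list.
    """
    labels = host.split(".")
    for i in range(len(labels) - 1):
        yield ".".join(labels[i:])


def possible_fps(ham_messages, listed):
    """Map each listed domain found in ham to the messages containing it.

    ham_messages: dict of message name -> raw message text (hypothetical).
    listed: set of lowercased domains from the list under test.
    """
    hits = defaultdict(set)
    for name, text in ham_messages.items():
        for host in uri_hosts(text):
            for dom in candidate_domains(host):
                if dom in listed:
                    hits[dom].add(name)
    return {dom: sorted(names) for dom, names in sorted(hits.items())}
```

Feeding it a dict built from the files in a ham directory, plus a set
loaded from a dump of the list, gives the possible FPs together with the
messages to go look at. Since the hits should be few, that is a lot
cheaper than running full SA over every message.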

- Ryan

-- 
   Ryan Thompson <ryan at sasknow.com>

   SaskNow Technologies - http://www.sasknow.com
   901-1st Avenue North - Saskatoon, SK - S7K 1Y4

         Tel: 306-664-3600   Fax: 306-244-7037   Saskatoon
   Toll-Free: 877-727-5669     (877-SASKNOW)     North America

