[SURBL-Discuss] Ham corpora needed

Jeff Chan jeffc at surbl.org
Mon Sep 6 02:39:54 CEST 2004


On Sunday, September 5, 2004, 5:17:39 PM, Ryan Thompson wrote:
> Jeff Chan wrote to SURBL Discussion list:

>> On Sunday, September 5, 2004, 3:45:20 PM, Ryan Thompson wrote:
>>> Jeff Chan wrote to SURBL Discussion list and SpamAssassin Users:
>>
>>>> On Sunday, September 5, 2004, 2:56:04 PM, Ryan Thompson wrote:
>>>>
>>>>> FWIW, the mass-check I did on that 75K corpus took about 1.75h, on a
>>>>> beefy machine with rbldnsd running on localhost, with 20 concurrent
>>>>> jobs. (mass-check is slower than molasses for anything that blocks if
>>>>> you don't let it run concurrent jobs :-)
>>>>
>>>> One shortcut, which may be adequate for purposes of cleaning up the
>>>> SURBL data, might be to simply extract the URI domains from the ham
>>>> corpus, sort and unique that list, then compare that ham URI domain
>>>> list against the SURBL under test.  Hits could be matched up against
>>>> the source message.  Since the hits are relatively few that could save
>>>> much processing over using full SA on every message.
>>
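[Editorial sketch] The shortcut described above (collect the ham URI domains, de-duplicate them, and compare them against the list under test, keeping track of which messages produced each hit) could look roughly like the following. This is only an illustration under stated assumptions: the function name, the message IDs, and the idea of holding the list under test as an in-memory set are hypothetical and are not part of SURBL, SpamAssassin, or GetURI. Matching is exact; a real check would likely reduce hosts to their registered domains first.

```python
def find_listed_ham_domains(ham_domains_by_message, listed_domains):
    """Return {listed domain: [message ids]} for domains seen in ham.

    ham_domains_by_message: hypothetical mapping of message id -> iterable
        of URI domains extracted from that ham message.
    listed_domains: hypothetical in-memory copy of the list under test.
    """
    listed = {d.lower() for d in listed_domains}
    hits = {}
    for message_id, domains in ham_domains_by_message.items():
        # De-duplicate per message, then keep only the listed domains.
        for domain in {d.lower() for d in domains} & listed:
            hits.setdefault(domain, []).append(message_id)
    # Sort for stable, reviewable output.
    return {domain: sorted(ids) for domain, ids in sorted(hits.items())}
```

Each key in the result is a possible false positive, and the message IDs point back to the ham messages worth inspecting by hand.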
>>> Yeah. The *best* solution would be to have our own mass-checker. My
>>> GetURI (http://ry.ca/geturi/) could probably be extended for the task
>>> without much work, since it already extracts URIs and is capable of
>>> producing statistics.
>>
>>> Maybe if I wired it to *also* accept a ham directory, it could
>>> cross-check domains in both corpora and list possible FPs.
>>
>> But we should be able to use it singly against a directory
>> of ham messages, right?

> Well, yes, but GetURI excludes any domains that are listed in multi,
> so I'd almost need to throw in a switch to flip the logic. It's a good
> idea, but I have a friend's graduation party to attend, so it isn't
> going to happen tonight. :-)

Hmm, could you make a leaner, separate program, say
"ExtractURIDomain" that produced a list of URI hosts
on standard output when fed mail text on standard
input?  That could be useful as a general filter
program.
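
[Editorial sketch] A filter of the kind proposed here, reading mail text on standard input and writing URI hosts to standard output, might be sketched as below. It is not GetURI and not an existing "ExtractURIDomain" program; the regular expression and the function name are assumptions made for illustration. It only recognizes http and https URIs with an explicit scheme and does no MIME decoding, so it would miss hosts in encoded message parts.

```python
import re
import sys

# Assumption: only URIs with an explicit http/https scheme are considered.
# An optional "user@" part is skipped; the host ends at the first character
# that is not a letter, digit, dot, or hyphen (so ports and paths are dropped).
URI_HOST_RE = re.compile(
    r"https?://(?:[^\s/@<>\"']*@)?([A-Za-z0-9.-]+)", re.IGNORECASE
)


def extract_uri_hosts(text):
    """Return the sorted, de-duplicated, lowercased URI hosts found in text."""
    hosts = set()
    for match in URI_HOST_RE.finditer(text):
        host = match.group(1).strip(".").lower()
        if host:
            hosts.add(host)
    return sorted(hosts)


# Skip reading when stdin is an interactive terminal, to avoid blocking.
if __name__ == "__main__" and not sys.stdin.isatty():
    for found_host in extract_uri_hosts(sys.stdin.read()):
        print(found_host)
```

Saved under a hypothetical name such as extract_uri_hosts.py, it could be run per message as "python extract_uri_hosts.py < message.txt", with the output of many messages merged by sort -u.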

>> I may try that against the SpamAssassin public corpora
>> that Justin replied about, unless you beat me to it. :-)
>>
>>  http://spamassassin.apache.org/publiccorpus/

> Just for kicks, I ran it on the easy_ham corpus (about 2500 ham
> messages), and got the following:

> http://ry.ca/geturi/easy_ham.html (1MB)

> The output would be *much* cleaner with just a few optimizations/flags
> I have in mind for GetURI. Still, the above might be of some value.

> - Ryan

Grabbing...

Jeff C.


