[SURBL-Discuss] Ham corpora needed

Jeff Chan jeffc at surbl.org
Mon Sep 6 01:16:51 CEST 2004


On Sunday, September 5, 2004, 3:45:20 PM, Ryan Thompson wrote:
> Jeff Chan wrote to SURBL Discussion list and SpamAssassin Users:

>> On Sunday, September 5, 2004, 2:56:04 PM, Ryan Thompson wrote:
>>
>>> FWIW, the mass-check I did on that 75K corpus took about 1.75h, on a
>>> beefy machine with rbldnsd running on localhost, with 20 concurrent
>>> jobs. (mass-check is slower than molasses for anything that blocks if
>>> you don't let it run concurrent jobs :-)
>>
>> One shortcut, which may be adequate for purposes of cleaning up the
>> SURBL data, might be to simply extract the URI domains from the ham
>> corpus, sort and unique that list, then compare that ham URI domain
>> list against the SURBL under test.  Hits could be matched up against
>> the source message.  Since the hits are relatively few that could save
>> much processing over using full SA on every message.

> Yeah. The *best* solution would be to have our own mass-checker. My
> GetURI (http://ry.ca/geturi/) could probably be extended for the task
> without much work, since it already extracts URIs and is capable of
> producing statistics.

> Maybe if I wired it to *also* accept a ham directory, it could
> cross-check domains in both corpora and list possible FPs.

But we should be able to run it on its own against a directory
of ham messages, right?  The only difference is that the
output would be ham domains rather than spam domains.
We'd then compare that ham list to a SURBL, and any
matches would be the possible FPs.
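
To make the idea concrete, here is a rough sketch of that
shortcut in Python.  It is not GetURI and not how SA does it;
the function names, the regex, and the "last two labels"
domain rule are all my own assumptions for illustration, and
the SURBL data is assumed to be available as a plain list of
domains rather than looked up over DNS.

```python
import re
from collections import defaultdict

# Assumption: only plain http/https URIs are considered; real mail
# needs MIME decoding and much more careful URI extraction.
URI_RE = re.compile(r"https?://([A-Za-z0-9.-]+)", re.IGNORECASE)


def registered_domain(host):
    """Naively reduce a hostname to its last two labels.

    Assumption: this ignores two-level ccTLDs (co.uk etc.) and
    numeric IP hosts, which need separate handling.
    """
    labels = host.lower().strip(".").split(".")
    return ".".join(labels[-2:])


def ham_domains(messages):
    """Map each URI domain to the set of message ids that mention it.

    `messages` is a dict of {message_id: raw_text}.
    """
    seen = defaultdict(set)
    for msg_id, text in messages.items():
        for host in URI_RE.findall(text):
            domain = registered_domain(host)
            if domain:
                seen[domain].add(msg_id)
    return seen


def possible_fps(messages, surbl_domains):
    """Return {domain: [message ids]} for ham domains found on the list."""
    seen = ham_domains(messages)
    listed = {d.lower() for d in surbl_domains}
    return {d: sorted(seen[d]) for d in sorted(seen) if d in listed}
```

Feeding it a dict of ham messages and the domains from the
list under test gives back each hit together with the source
messages, so the few hits can be checked by hand.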

I may try that against the SpamAssassin public corpora
that Justin replied about, unless you beat me to it. :-)

  http://spamassassin.apache.org/publiccorpus/

Jeff C.


