[SURBL-Discuss] Ham corpora needed

Ryan Thompson ryan at sasknow.com
Mon Sep 6 00:45:20 CEST 2004


Jeff Chan wrote to SURBL Discussion list and SpamAssassin Users:

> On Sunday, September 5, 2004, 2:56:04 PM, Ryan Thompson wrote:
>
>> FWIW, the mass-check I did on that 75K corpus took about 1.75h, on a
>> beefy machine with rbldnsd running on localhost, with 20 concurrent
>> jobs. (mass-check is slower than molasses for anything that blocks if
>> you don't let it run concurrent jobs :-)
>
> One shortcut, which may be adequate for purposes of cleaning up the
> SURBL data, might be to simply extract the URI domains from the ham
> corpus, sort and unique that list, then compare that ham URI domain
> list against the SURBL under test.  Hits could be matched up against
> the source message.  Since the hits are relatively few that could save
> much processing over using full SA on every message.

Yeah. The *best* solution would be to have our own mass-checker. My
GetURI (http://ry.ca/geturi/) could probably be extended for the task
without much work, since it already extracts URIs and is capable of
producing statistics.

Maybe if I wired it to *also* accept a ham directory, it could
cross-check domains in both corpora and list possible FPs.
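
A rough sketch of that cross-check, in Python rather than as an actual
GetURI patch. Everything here is an assumption made for illustration:
the function names, the dict-of-messages input, the naive URL regex,
and especially the "last two labels" rule for picking the domain (it is
wrong for things like co.uk, and it does no MIME or HTML decoding).

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Naive URL matcher (assumption): plain http/https URLs in decoded text only.
URL_RE = re.compile(r"https?://[^\s<>\"')]+", re.IGNORECASE)


def uri_domains(text):
    """Return the set of domains for the URLs found in a message body."""
    domains = set()
    for url in URL_RE.findall(text):
        url = url.rstrip(".,;:!?")  # drop trailing sentence punctuation
        try:
            host = urlsplit(url).hostname
        except ValueError:
            continue  # malformed URL, e.g. an unbalanced IPv6 bracket
        if not host:
            continue
        labels = host.rstrip(".").split(".")
        if len(labels) < 2 or labels[-1].isdigit():
            continue  # skip bare hostnames and IPv4 literals
        # Assumption: the last two labels are the registered domain.
        domains.add(".".join(labels[-2:]))
    return domains


def possible_fps(ham_messages, suspect_domains):
    """Map each suspect domain seen in ham to the ham messages containing it.

    ham_messages: dict of message id -> message text (hypothetical shape).
    suspect_domains: domains from the list under test, or from a spam corpus.
    """
    suspects = {d.lower() for d in suspect_domains}
    hits = defaultdict(set)
    for msg_id, text in ham_messages.items():
        for domain in uri_domains(text) & suspects:
            hits[domain].add(msg_id)
    return {domain: sorted(ids) for domain, ids in sorted(hits.items())}
```

Feed it the domains pulled from the spam corpus (run uri_domains over
each spam message and union the results), or the list under test, and
whatever comes back is a candidate FP along with the ham messages to go
look at.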

Yeah. Next version.

;-)

- Ryan

-- 
   Ryan Thompson <ryan at sasknow.com>

   SaskNow Technologies - http://www.sasknow.com
   901-1st Avenue North - Saskatoon, SK - S7K 1Y4

         Tel: 306-664-3600   Fax: 306-244-7037   Saskatoon
   Toll-Free: 877-727-5669     (877-SASKNOW)     North America

