On Sunday, September 5, 2004, 3:45:20 PM, Ryan Thompson wrote:
Jeff Chan wrote to SURBL Discussion list and SpamAssassin Users:
On Sunday, September 5, 2004, 2:56:04 PM, Ryan Thompson wrote:
FWIW, the mass-check I did on that 75K corpus took about 1.75 hours on a beefy machine, with rbldnsd running on localhost and 20 concurrent jobs. (mass-check is slower than molasses for anything that blocks if you don't let it run concurrent jobs. :-)
One shortcut, which may be adequate for the purpose of cleaning up the SURBL data, might be to simply extract the URI domains from the ham corpus, sort and de-duplicate that list, then compare the resulting ham URI domain list against the SURBL under test. Each hit could then be matched back to its source message. Since the hits are relatively few, that could save much processing compared to running full SpamAssassin on every message.
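A rough sketch of that shortcut in Python, just to make the idea concrete. The one-message-per-file corpus layout, the multi.surbl.org zone, and the naive two-label domain reduction are all assumptions here (a real SURBL client uses a proper registrar-boundary table, and GetURI/SpamAssassin do their own URI parsing):

    #!/usr/bin/env python3
    """Sketch: extract URI domains from a ham corpus, de-duplicate,
    and look each one up in a SURBL to find candidate FPs."""

    import email
    import re
    import socket
    import sys
    from pathlib import Path

    SURBL_ZONE = "multi.surbl.org"   # assumed list to test against
    URL_RE = re.compile(r'https?://([^/\s"\'<>]+)', re.IGNORECASE)

    def domains_in_message(path: Path) -> set[str]:
        """Very rough URI-host extraction from one RFC-2822 message."""
        msg = email.message_from_bytes(path.read_bytes())
        hosts = set()
        for part in msg.walk():
            if part.get_content_type() not in ("text/plain", "text/html"):
                continue
            payload = part.get_payload(decode=True) or b""
            text = payload.decode("utf-8", errors="replace")
            for host in URL_RE.findall(text):
                host = host.split(":")[0].lower().strip(".")
                labels = host.split(".")
                # Naive two-label reduction; skips bare IPs crudely.
                if len(labels) >= 2 and not labels[-1].isdigit():
                    hosts.add(".".join(labels[-2:]))
        return hosts

    def listed_in_surbl(domain: str) -> bool:
        """True if <domain>.<zone> resolves, i.e. the domain is listed."""
        try:
            socket.gethostbyname(f"{domain}.{SURBL_ZONE}")
            return True
        except socket.gaierror:
            return False

    def main(ham_dir: str) -> None:
        seen: dict[str, Path] = {}
        for path in Path(ham_dir).rglob("*"):
            if path.is_file():
                for dom in domains_in_message(path):
                    seen.setdefault(dom, path)   # remember one source message
        for dom, src in sorted(seen.items()):
            if listed_in_surbl(dom):
                print(f"possible FP: {dom} (from {src})")

    if __name__ == "__main__":
        main(sys.argv[1])

Since only the hits ever get a DNS lookup result, the expensive step (matching back to the source message and eyeballing it) happens for a handful of domains rather than the whole corpus.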
Yeah. The *best* solution would be to have our own mass-checker. My GetURI (http://ry.ca/geturi/) could probably be extended for the task without much work, since it already extracts URIs and is capable of producing statistics.
Maybe if I wired it to *also* accept a ham directory, it could cross-check domains in both corpora and list possible FPs.
But we should be able to use it on its own against a directory of ham messages, right? The only difference is that the output would be ham domains instead of spam domains.... We'd then compare that ham list to a SURBL and find the FPs....
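The cross-check itself is just a set intersection. A minimal sketch, assuming one-domain-per-line output files (the file names and format here are hypothetical, not GetURI's actual output):

    # Intersect a domain list from the SURBL under test (or a spam corpus)
    # with a domain list extracted from ham; the overlap is the FP candidates.

    def load_domains(path: str) -> set[str]:
        """Read a one-domain-per-line file into a lowercased set."""
        with open(path) as fh:
            return {line.strip().lower() for line in fh if line.strip()}

    surbl_domains = load_domains("surbl-domains.txt")   # hypothetical file
    ham_domains = load_domains("ham-domains.txt")        # hypothetical file

    for dom in sorted(surbl_domains & ham_domains):
        print("possible FP:", dom)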
I may try that against the SpamAssassin public corpora that Justin replied about, unless you beat me to it. :-)
http://spamassassin.apache.org/publiccorpus/
Jeff C.