Jeff Chan wrote to SURBL Discussion list:
On Sunday, September 5, 2004, 3:45:20 PM, Ryan Thompson wrote:
Jeff Chan wrote to SURBL Discussion list and SpamAssassin Users:
On Sunday, September 5, 2004, 2:56:04 PM, Ryan Thompson wrote:
FWIW, the mass-check I did on that 75K corpus took about 1.75 hours, on a beefy machine with rbldnsd running on localhost, with 20 concurrent jobs. (mass-check is slower than molasses for anything that blocks if you don't let it run concurrent jobs :-)
One shortcut, which may be adequate for purposes of cleaning up the SURBL data, might be to simply extract the URI domains from the ham corpus, sort and uniq that list, then compare that ham URI domain list against the SURBL under test. Hits could then be matched back up against their source messages. Since the hits are relatively few, that could save a lot of processing compared with running full SA on every message.
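A rough sketch of that shortcut in Python, just to make the steps concrete. None of this is GetURI or SpamAssassin code: the regex, the two-label domain reduction, and the names `ham_domains` and `possible_fps` are all assumptions made up for illustration, and the SURBL under test is assumed to be available as a plain set of domains.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Loose pattern for http/https URIs in message text (an assumption; real
# URI extraction has to cope with far more obfuscation than this).
URI_RE = re.compile(r"https?://[^\s<>\"')]+", re.IGNORECASE)


def registered_domain(host):
    """Naively reduce a hostname to its last two labels.

    Assumption: a real checker would consult a list of two-level TLDs
    (co.uk and the like) and treat IP-address hosts separately.
    """
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def ham_domains(messages):
    """Map each URI domain to the set of message ids that mention it.

    `messages` is a dict of {message_id: raw_text}.
    """
    seen = defaultdict(set)
    for msg_id, text in messages.items():
        for uri in URI_RE.findall(text):
            try:
                # Drop trailing sentence punctuation picked up by the regex.
                host = urlsplit(uri.rstrip(".,;:!?")).hostname
            except ValueError:
                continue  # malformed URI, skip it
            if host:
                seen[registered_domain(host)].add(msg_id)
    return seen


def possible_fps(messages, blocklist):
    """Return sorted (domain, [message ids]) pairs for listed ham domains."""
    seen = ham_domains(messages)
    return [(d, sorted(seen[d])) for d in sorted(seen) if d in blocklist]
```

Keeping the message ids alongside each domain is what lets the few hits be matched back to their source messages without re-scanning the corpus.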
Yeah. The *best* solution would be to have our own mass-checker. My GetURI (http://ry.ca/geturi/) could probably be extended for the task without much work, since it already extracts URIs and is capable of producing statistics.
Maybe if I wired it to *also* accept a ham directory, it could cross-check domains in both corpora and list possible FPs.
But we should be able to use it singly against a directory of ham messages, right?
Well, yes, but GetURI excludes any domains that are listed in multi, so I'd almost need to throw in a switch to flip the logic. It's a good idea, but I have a friend's graduation party to attend, so it isn't going to happen tonight. :-)
That, and the sorting would be all back-asswards.
I have many good plans (and code) for GetURI 1.5 that'll make all of this easy. I really want to figure out a better way to get the domain age, though!
The only difference is that the output would be ham domains and not spam domains. We'd then compare that ham list to a SURBL and find the FPs.
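If the SURBL is only reachable over DNS rather than as a flat file, the comparison step might look roughly like the sketch below. It assumes the usual lookup convention (prepend the domain to the zone name and treat an A record in 127.0.0.0/8 as "listed"). The function names and the injectable `resolve` parameter are hypothetical, added so the logic can be exercised without touching the network.

```python
import socket


def surbl_listed(domain, zone="multi.surbl.org", resolve=socket.gethostbyname):
    """Return True if `domain` appears to be listed in `zone`.

    `resolve` is any callable mapping a hostname to an IPv4 string; it
    defaults to the system resolver and can be swapped out for testing.
    """
    query = f"{domain.rstrip('.')}.{zone}"
    try:
        answer = resolve(query)
    except OSError:  # socket.gaierror (no such name) is a subclass of OSError
        return False
    return answer.startswith("127.")


def find_fps(ham_domain_list, zone="multi.surbl.org", resolve=socket.gethostbyname):
    """Return the ham domains that the list under test also lists."""
    return [d for d in ham_domain_list if surbl_listed(d, zone, resolve)]
```

Pointing `zone` at a local rbldnsd instance's zone name would keep the lookups off the public mirrors.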
I may try that against the SpamAssassin public corpora that Justin replied about, unless you beat me to it. :-)
Just for kicks, I ran it on the easy_ham corpus (about 2500 ham messages), and got the following:
http://ry.ca/geturi/easy_ham.html (1MB)
The output would be *much* cleaner with just a few optimizations/flags I have in mind for GetURI. Still, the above might be of some value.
- Ryan