On Sunday, September 5, 2004, 2:56:04 PM, Ryan Thompson wrote:
FWIW, the mass-check I did on that 75K corpus took about 1.75h, on a beefy machine with rbldnsd running on localhost, with 20 concurrent jobs. (mass-check is slower than molasses for anything that blocks if you don't let it run concurrent jobs :-)
One shortcut, which may be adequate for purposes of cleaning up the SURBL data, might be to simply extract the URI domains from the ham corpus, sort and unique that list, then compare that ham URI domain list against the SURBL under test. Hits could be matched up against the source message. Since the hits are relatively few, that could save a lot of processing compared with running full SA on every message.
Yes, it doesn't get the full stats, and yes, it could miscategorize a few, but the hits are so few that it could be usable. On the other hand, because the hits *are* few, missing a few may be a bigger deal.
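For what it's worth, here is a rough Python sketch of the shortcut described above. It is only an illustration under several assumptions: the URL regex is a naive stand-in for SA's URI parsing, the list under test is assumed to be available as a local set of domains rather than queried over DNS, and the names `extract_hosts`, `candidate_domains` and `ham_hits` are made up for this example.

```python
import re
from collections import defaultdict

# Naive URL matcher: an assumption for illustration, not SpamAssassin's URI parser.
URL_RE = re.compile(r"https?://([A-Za-z0-9.-]+)", re.IGNORECASE)


def extract_hosts(text):
    """Return the set of lowercased hostnames found in http(s) URLs in text."""
    return {m.group(1).lower().strip(".") for m in URL_RE.finditer(text)}


def candidate_domains(host):
    """Yield the host and each parent domain that has at least two labels."""
    labels = host.split(".")
    for i in range(len(labels) - 1):
        yield ".".join(labels[i:])


def ham_hits(messages, listed):
    """Map each listed domain seen in ham to the ids of the messages containing it.

    messages: dict of message id -> message text (hypothetical in-memory corpus)
    listed:   set of domains on the list under test (assumed loaded locally)
    """
    # Remember which messages each host came from, so hits can be traced back.
    sources = defaultdict(set)
    for msg_id, text in messages.items():
        for host in extract_hosts(text):
            sources[host].add(msg_id)

    hits = defaultdict(set)
    for host in sorted(sources):  # the sorted, unique host list
        for domain in candidate_domains(host):
            if domain in listed:
                hits[domain] |= sources[host]
    return {domain: sorted(ids) for domain, ids in hits.items()}
```

Checking every parent domain of a host avoids having to guess where the registered domain starts, at the cost of a few extra set lookups. Only the messages that come back as hits would then need a full SA run.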
Might be interesting to try it both ways and see if the results differ much.
Jeff C.