Re: [SURBL-Discuss] Ham corpora needed

6 Sep 2004


On Sunday, September 5, 2004, 2:56:04 PM, Ryan Thompson wrote:
> ...
> FWIW, the mass-check I did on that 75K corpus took about 1.75h, on a
> beefy machine with rbldnsd running on localhost, with 20 concurrent
> jobs. (mass-check is slower than molasses for anything that blocks if
> you don't let it run concurrent jobs :-)

One shortcut, which may be adequate for purposes of cleaning up
the SURBL data, might be to simply extract the URI domains from
the ham corpus, sort and unique that list, then compare that ham
URI domain list against the SURBL under test.  Hits could be
matched up against the source message.  Since the hits are
relatively few, that could save much processing over running full
SA on every message.

Yes, it doesn't get the full stats, and yes, it could
miscategorize a few, but the hits are so few that it could
be usable.  On the other hand, because the hits *are* few,
missing a few may be a bigger deal.

Might be interesting to try it both ways and see if the
results differ much.

Jeff C.
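
For illustration only, the shortcut described above (extract the URI
domains from the ham, unique them, compare against the list under test,
and map hits back to their source messages) might be sketched in Python
as below. This is not code from the thread. The function names, the
URI regex, the tiny two-level suffix set, and the in-memory
{message id: text} layout are all assumptions made for the sketch; a
real run would read messages from disk, need a fuller suffix list, and
handle IP-address hosts and obfuscated URIs that this ignores.

```python
import re
from collections import defaultdict

# Assumption: a tiny illustrative set of two-level suffixes.
# A real run would need a much fuller list.
TWO_LEVEL_SUFFIXES = {"co.uk", "com.au", "co.jp"}

# Assumption: only plain http/https URIs are considered.
URI_RE = re.compile(r"https?://([A-Za-z0-9.-]+)", re.IGNORECASE)


def base_domain(host):
    """Reduce a hostname to the domain that would be looked up.

    Keeps the last two labels, or the last three when the last two
    form a known two-level suffix. IP-address hosts are not handled.
    """
    labels = host.lower().strip(".").split(".")
    if len(labels) >= 3 and ".".join(labels[-2:]) in TWO_LEVEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


def ham_domain_index(messages):
    """Map each unique URI domain to the ids of the ham messages using it."""
    index = defaultdict(set)
    for msg_id, text in messages.items():
        for host in URI_RE.findall(text):
            index[base_domain(host)].add(msg_id)
    return index


def ham_hits(messages, listed_domains):
    """Return {listed domain: sorted ham message ids} for domains seen in ham."""
    index = ham_domain_index(messages)
    listed = {d.lower() for d in listed_domains}
    return {d: sorted(index[d]) for d in sorted(index.keys() & listed)}


if __name__ == "__main__":
    # Hypothetical ham messages and a hypothetical list under test.
    ham = {
        "ham1": "See http://www.example.com/page and https://news.example.co.uk/x",
        "ham2": "Visit http://shop.example.com/ today",
        "ham3": "no links here",
    }
    print(ham_hits(ham, ["example.com", "spammy.biz"]))
```

Each listed domain that turns up in ham comes back with the messages
that contain it, so the few hits can be reviewed by hand. Unlike a full
mass-check, this gives no per-rule statistics, which is the trade-off
noted above.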
