On Friday, November 12, 2004, 5:41:26 AM, Ryan Thompson wrote:
Jeff Chan wrote to SURBL Discussion list:
We could probably experiment and try some different approaches and see how they test out on corpora and live mail servers.
A simple join(1) on the data files might be a better start:
SC and AB and WS and JP and OB
Matches 202 records. That's going to have an extremely low detection rate. The problem is that "and" means "intersection", and by including OB in particular, you're automatically limiting the maximum size of the result to about 350 records.
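Something like the following join(1) pipeline would reproduce that intersection (untested sketch; the file names are made up, assuming one domain per line in each list dump):

  # made-up file names; one domain per line per list
  export LC_ALL=C    # keep sort(1) and join(1) in the same collation
  sort -u sc.txt > sc.s; sort -u ab.txt > ab.s; sort -u ws.txt > ws.s
  sort -u jp.txt > jp.s; sort -u ob.txt > ob.s
  # join(1) on single-column files is just set intersection
  join sc.s ab.s | join - ws.s | join - jp.s | join - ob.s | wc -l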
((SC or AB) and (JP or OB))
Matches 1,187 records. Probably still too few.
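Same idea, but union the pairs first and then intersect the two unions (again an untested sketch with invented file names):

  export LC_ALL=C
  sort -u sc.txt ab.txt > sc_or_ab.s    # SC or AB
  sort -u jp.txt ob.txt > jp_or_ob.s    # JP or OB
  join sc_or_ab.s jp_or_ob.s | wc -l    # and the two together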
Don't let the size of the list fool you though. Remember that a few spam gangs send out most of the spam using zombies and other ways that are hard to block with conventional RBLs. Getting their domains at any given time probably does not entail having a list of 100k domains. Just a few hundred domains probably appear in a majority of spams at any particular time. The question is "which hundreds?"... :-)
or PH
Didn't feel like pulling PH out of multi for this test.
Better, IMHO, is to use something like
(SC + AB + JP + OB + WS) >= 3
Matches 16,560 records. Aha! Now we're getting something useful.
Without WS in that equation, the number drops to 906.
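The counting version doesn't need join(1) at all; a rough, untested sketch (made-up file names again) that pulls out the domains appearing on three or more lists:

  # dedupe each list so a domain counts at most once per list,
  # then count occurrences across lists and keep those seen 3 or more times
  export LC_ALL=C
  for f in sc ab jp ob ws; do sort -u "$f".txt; done \
    | sort | uniq -c | awk '$1 >= 3 { print $2 }'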
Qualifying by the number of lists could be useful to try, though SC and AB should be lumped together since they're both mostly from SpamCop URI reports. In other words, SC and AB aren't very independent in terms of their data source; they're mostly different slices of the same data and should probably be treated as a single source.
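If SC and AB are to be counted as one source, merging them before the count is a small change to the same sketch (same caveats: untested, invented file names):

  # fold SC and AB into a single source, then count as before
  export LC_ALL=C
  sort -u sc.txt ab.txt > sc_ab.s
  for f in sc_ab.s jp.txt ob.txt ws.txt; do sort -u "$f"; done \
    | sort | uniq -c | awk '$1 >= 3 { print $2 }'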
You can try this with different lists if you want, or even mix in some judicious "and" and "or" matching. For instance, since there is a large overlap between JP and WS, you might want to choose one or the other.
All of JP is currently included in WS. They will be more independent when we take JP out of WS, as we're planning to do when SpamAssassin 3.1 gets released.
But maybe it doesn't matter so much: in that case you might just set the cutoff lower to compensate, so having the additional list would still add some small bit of confidence.
To me, 3 currently looks like the likely sweet spot, although the ~2,500 domains present in four or more lists could still put a sizeable dent in spam at the MTA level, at a lower FP rate. I'd recommend looking at 3 and 4 a little more closely:
http://ry.ca/surbl/ab+jp+ob+sc+ws+uc3.txt
http://ry.ca/surbl/ab+jp+ob+sc+ws+uc4.txt
By definition, the domains in four or more lists are a strict subset of those in three or more, so if FP(n>=N) is the false positive rate of the list of domains appearing in N or more source lists, then FP(n>=3) >= FP(n>=4): any ham domain caught at the higher cutoff is also caught at the lower one. Thus, this approach also has the added benefit of letting you control the FP rate, at least in discrete steps.
FP rates should increase with "ors" and decrease with "ands". I probably won't be using UC, but the principle is the same for whatever lists are used.
Thanks for sharing your ideas,
Jeff C. -- "If it appears in hams, then don't list it."