[SURBL-Discuss] RFC: consensus list?

Ryan Thompson ryan at sasknow.com
Fri Nov 12 14:41:26 CET 2004


Jeff Chan wrote to SURBL Discussion list:

> We could probably experiment and try some different approaches
> and see how they test out on corpora and live mail servers.

A simple join(1) on the data files might be a better start:

>  SC and AB and WS and JP and OB

Matches 202 records. That's going to have an extremely low detection
rate. The problem is that "and" means "intersection", and by including
OB in particular, you're automatically limiting the maximum size of the
result to about 350 records.

>  ((SC or AB) and (JP or OB))

Matches 1,187 records. Probably still too few.
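
For anyone who would rather not juggle join(1), the two boolean
combinations above map directly onto set operations. A minimal sketch,
assuming each list has already been loaded into a Python set of domains
(the sample domains below are made up for illustration, not taken from
the real data files):

```python
# Hypothetical sample data: one set of domains per list.
sc = {"a.example", "b.example", "c.example"}
ab = {"a.example", "d.example"}
ws = {"a.example", "b.example", "d.example", "e.example"}
jp = {"a.example", "b.example", "e.example"}
ob = {"a.example", "d.example"}

# "SC and AB and WS and JP and OB": intersection of all five lists.
all_five = sc & ab & ws & jp & ob

# "((SC or AB) and (JP or OB))": union within each pair, then intersect.
mixed = (sc | ab) & (jp | ob)

print(sorted(all_five))
print(sorted(mixed))
```

The full intersection can never be larger than the smallest input list,
which is why a small list such as OB caps the result.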

> or PH

Didn't feel like pulling PH out of multi for this test.

Better, IMHO, is to use something like

   (SC + AB + JP + OB + WS) >= 3

Matches 16,560 records. Aha! Now we're getting something useful.

Without WS in that equation, the number drops to 906.

With UC added as a sixth list (and WS kept), the number rises to 18,964.
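
The counting rule above can be sketched as one vote per list, keeping
any domain with enough votes. This is only an illustration: the function
name at_least_n and the sample domains are hypothetical, and the real
lists would need to be loaded from the data files first.

```python
from collections import Counter

def at_least_n(lists, n):
    """Return the domains that appear in n or more of the given sets."""
    votes = Counter()
    for domains in lists:
        votes.update(domains)  # a set contributes one vote per domain
    return {d for d, v in votes.items() if v >= n}

# Hypothetical sample data: one set of domains per list.
sc = {"a.example", "b.example", "c.example"}
ab = {"a.example", "d.example"}
jp = {"a.example", "b.example", "e.example"}
ob = {"a.example", "d.example"}
ws = {"a.example", "b.example", "d.example", "e.example"}

# (SC + AB + JP + OB + WS) >= 3
consensus = at_least_n([sc, ab, jp, ob, ws], 3)
print(sorted(consensus))
```

Dropping or adding a list only means changing what is passed in, so it
is cheap to compare variants such as "without WS" or "with UC".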

Other numbers, with SC + AB + JP + OB + WS + UC:

     SC+AB+JP+OB+WS+UC    # of records
     -----------------    ------------
     1                          39,759
     2                          25,549
     3                          16,369
     4                           2,298
     5                             292
     6                               5

     >= 2                       44,513 superset of ...
     >= 3                       18,964 ...
     >= 4                        2,595 ..
     >= 5                          297 .
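
The ">=" rows are just running sums of the exact-count rows, taken from
the bottom up. A quick check, using only the figures in the table (the
variable names are arbitrary):

```python
# Domains appearing in exactly k of the six lists (figures from the table).
exact = {1: 39759, 2: 25549, 3: 16369, 4: 2298, 5: 292, 6: 5}

# Domains appearing in k or more lists: running sum from k = 6 downward.
at_least = {}
running = 0
for k in sorted(exact, reverse=True):
    running += exact[k]
    at_least[k] = running

for k in sorted(at_least):
    print(f">= {k}: {at_least[k]:,}")
```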

You can try this with different lists if you want, or even mix in some
judicious "and" and "or" matching. For instance, since there is a large
overlap between jp and ws, you might want to choose one or the other.
But maybe it doesn't matter so much, because, in that case, you might
just set the cutoff lower to compensate, so having the additional list
would still add some small bit of confidence.

To me, 3 currently looks like the likely sweet spot, although the
~2,600 domains present in four or more lists could still put a sizeable
dent in spam at the MTA level, at a lower FP rate. I'd recommend looking
at 3 and 4 a little more closely:

     http://ry.ca/surbl/ab+jp+ob+sc+ws+uc3.txt
     http://ry.ca/surbl/ab+jp+ob+sc+ws+uc4.txt

By construction, the >= 4 list is a subset of the >= 3 list, so if
FP(n>=N) is the false positive rate of a list of domains that appear in
N or more lists, then FP(n>=3) >= FP(n>=4). Thus, this approach also has
the added benefit of giving you at least coarse, stepwise control over
the FP rate.
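
The nesting argument can be checked mechanically. A rough sketch,
assuming a hand-verified set of legitimate domains (called ham here) is
available to test against; the helper at_least_n, the ham set, and all
sample domains are hypothetical:

```python
from collections import Counter

def at_least_n(lists, n):
    """Return the domains that appear in n or more of the given sets."""
    votes = Counter()
    for domains in lists:
        votes.update(domains)
    return {d for d, v in votes.items() if v >= n}

# Hypothetical sample data: one set of domains per list.
lists = [
    {"a.example", "b.example", "c.example"},
    {"a.example", "d.example"},
    {"a.example", "b.example", "e.example"},
    {"a.example", "d.example"},
    {"a.example", "b.example", "d.example", "e.example"},
]
# Hypothetical legitimate domains; any of these on a list is a false positive.
ham = {"b.example", "c.example", "e.example"}

levels = {n: at_least_n(lists, n) for n in range(1, len(lists) + 1)}
fp_counts = {n: len(levels[n] & ham) for n in levels}

for n in sorted(levels):
    print(f">= {n}: {len(levels[n])} domains, {fp_counts[n]} false positives")
```

Each higher cutoff yields a subset of the one before it, so the false
positive count can only stay the same or fall as the cutoff rises.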

Have fun!
- Ryan

-- 
   Ryan Thompson <ryan at sasknow.com>

   SaskNow Technologies - http://www.sasknow.com
   901-1st Avenue North - Saskatoon, SK - S7K 1Y4

         Tel: 306-664-3600   Fax: 306-244-7037   Saskatoon
   Toll-Free: 877-727-5669     (877-SASKNOW)     North America

