[SURBL-Discuss] RFC: consensus list?
jeffc at surbl.org
Sat Nov 13 08:19:10 CET 2004
On Friday, November 12, 2004, 5:41:26 AM, Ryan Thompson wrote:
> Jeff Chan wrote to SURBL Discussion list:
>> We could probably experiment and try some different approaches
>> and see how they test out on corpora and live mail servers.
> A simple join(1) on the data files might be a better start:
>> SC and AB and WS and JP and OB
> Matches 202 records. That's going to have an extremely low detection
> rate. The problem is that "and" means "intersection", and by including
> ob in particular, you're automatically limiting the maximum size of the
> data to about 350 records.
>> ((SC or AB) and (JP or OB))
> Matches 1,187 records. Probably still too few.
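The two expressions being compared above can be sketched as plain set algebra. This is a hypothetical illustration, not the original join(1) pipeline, and the domain names are made-up placeholders rather than real listings:

```python
# Model each SURBL data file as a set of domains (placeholder data).
sc = {"spam-a.com", "spam-b.com", "spam-c.com"}
ab = {"spam-a.com", "spam-b.com"}
ws = {"spam-a.com", "spam-c.com", "spam-d.com"}
jp = {"spam-a.com", "spam-d.com"}
ob = {"spam-a.com", "spam-b.com"}

# "SC and AB and WS and JP and OB": strict intersection of all five lists.
strict = sc & ab & ws & jp & ob

# "((SC or AB) and (JP or OB))": union of SC and AB, intersected with
# the union of JP and OB -- a looser requirement, so a larger result.
looser = (sc | ab) & (jp | ob)

print(sorted(strict))  # only domains present on every single list
print(sorted(looser))
```

With real data, the strict intersection shrinks toward the smallest list (hence the ~202 records), while the or/and mix keeps more candidates.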
Don't let the size of the list fool you though. Remember that
a few spam gangs send out most of the spam using zombies and
other ways that are hard to block with conventional RBLs.
Getting their domains at any given time probably does not entail
having a list of 100k domains. Just a few hundred domains
probably appear in a majority of spams at any particular time.
The question is "which hundreds?"... :-)
>> or PH
> Didn't feel like pulling PH out of multi for this test.
> Better, IMHO, is to use something like
> (SC + AB + JP + OB + WS) >= 3
> Matches 16,560 records. Aha! Now we're getting something useful.
> Without WS in that equation, the number drops to 906.
Qualifying by the number of lists could be useful to try,
though SC and AB should be lumped together since they're
both mostly from SpamCop URI reports. In other words,
SC and AB aren't very independent in terms of their data
source. They're mostly different slices of the same data
and should probably be treated as a single source.
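The ">= 3 lists" counting idea, with SC and AB lumped together as one source, might look like the following sketch. The source names and domains are placeholders for illustration only:

```python
from collections import Counter

# Hypothetical sources; SC and AB are merged into one entry since both
# derive mostly from SpamCop URI reports.
sources = {
    "sc+ab": {"spam-a.com", "spam-b.com", "spam-c.com"},
    "ws":    {"spam-a.com", "spam-b.com", "spam-d.com"},
    "jp":    {"spam-a.com", "spam-b.com"},
    "ob":    {"spam-a.com", "spam-d.com"},
}

# Count how many independent sources list each domain.
counts = Counter(d for s in sources.values() for d in s)

def consensus(min_sources):
    """Domains appearing on at least min_sources independent sources."""
    return {d for d, n in counts.items() if n >= min_sources}

print(sorted(consensus(3)))
```

Raising the threshold trades detection for safety: `consensus(4)` is always contained in `consensus(3)`.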
> You can try this with different lists if you want, or even mix in some
> judicious "and" and "or" matching. For instance, since there is a large
> overlap between jp and ws, you might want to choose one or the other.
All of JP is currently included in WS. They will be more
independent when we take JP out of WS, as we're planning to do
when SpamAssassin 3.1 gets released.
> But maybe it doesn't matter so much, because, in that case, you might
> just set the cutoff lower to compensate, so having the additional list
> would still add some small bit of confidence.
> To me, 3 currently looks like the likely sweet spot, although the hit
> rate on the ~2,500 domains present in four or more lists could still
> potentially put a sizeable dent in spam at the MTA level at a lower FP
> rate. I'd recommend looking at 3 and 4 a little more closely:
> By definition, 4 is a strict subset of 3, so if FP(n>=N) is the false
> positive rate of a list with domains in N-or-more lists,
> FP(n>=3) >= FP(n>=4). Thus, this approach also has the added benefit of
> allowing you to at least discretely control the FP rate somewhat.
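The subset argument above holds for any per-domain source counts, whatever the actual data. A minimal check, using invented counts where "ham-" marks a hypothetical false positive:

```python
# Placeholder counts: number of source lists each domain appears on.
counts = {"spam-a.com": 4, "spam-b.com": 3, "spam-c.com": 2, "ham-x.com": 3}

list3 = {d for d, n in counts.items() if n >= 3}
list4 = {d for d, n in counts.items() if n >= 4}

# By construction the n>=4 list is a subset of the n>=3 list, so its
# false positives are a subset of the n>=3 false positives too.
assert list4 <= list3
fps3 = {d for d in list3 if d.startswith("ham-")}
fps4 = {d for d in list4 if d.startswith("ham-")}
assert fps4 <= fps3
```

So raising N can only hold the FP rate steady or lower it, which is what makes the threshold a usable knob.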
FP rates should increase with "ors" and decrease with "ands".
I probably won't be using UC, but the principle is the same
for whatever lists are used.
Thanks for sharing your ideas,
"If it appears in hams, then don't list it."