Pondering the question of how to make a "telco grade" SURBL that had as close to zero false positives as possible, but would still catch many spams, I remembered that many of the biggest spam domains seem to appear in several different SURBL lists.
What does anyone think about creating a "consensus" list that a telco or ISP might use to block at the MTA level?
For example a domain that appears on:
((SC or AB) and (JP or OB)) or PH
might be a candidate for such a list. The main reason I don't include WS is that it's a hand-built list and I don't have a feeling for the latencies from it.
SC and AB are both mostly based on SpamCop user reports. JP and OB are both mostly based on spamtrap data. PH represents really destructive fraud and phishing and probably should be included unless the FP rates from it are significantly above zero.
I realize this is a simplistic scheme and other ways of combining the lists are possible, but what does anyone think of these ideas?
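As a sketch, the rule above is just set algebra over per-list domain sets. A minimal Python illustration (the file-loading helper and its one-domain-per-line file format are assumptions, not the actual SURBL data layout):

```python
# Sketch of the proposed consensus rule over per-list domain sets.
# load_list() and its file format are hypothetical illustrations.

def load_list(path):
    """Load one list's data file into a set of domains (assumed
    format: one domain per line)."""
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def consensus(sc, ab, jp, ob, ph):
    """((SC or AB) and (JP or OB)) or PH, as set operations."""
    return ((sc | ab) & (jp | ob)) | ph
```

"or" maps to set union and "and" to set intersection, so the whole rule is one expression.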
Conceivably we could have other combinations.
Another possibility might be records that appear in
SC and AB and WS and JP and OB
I think we can nearly guarantee that those are 100% spam. :-) (Would want to check those that are in WS separately from JP, which is currently included in WS.)
What other ways to combine lists might produce near zero FPs yet still hit most spam?
Shall we just try some of them and see how well they work?
Comments?
Jeff C. -- "If it appears in hams, then don't list it."
Jeff Chan wrote:
Another possibility might be records that appear in
SC and AB and WS and JP and OB
I think we can nearly guarantee that those are 100% spam. :-) (Would want to check those that are in WS separately from JP, which is currently included in WS.)
What other ways to combine lists might produce near zero FPs yet still hit most spam?
Shall we just try some of them and see how well they work?
Comments?
From my eval:
- JP & SC & AB have been 0 FP for me.
- OB: more FPs than expected - the direct whitelisting site has been a godsend. The OB guys are very fast.
- WS: 0 FPs since the last clean-up. (As FPs are not always made public, it's hard to say.)
- SC & AB are very slow in publishing new data - JP & WS beat them all
This would mean fewer subzones, reducing DNS queries overall and allowing faster SA processing.
my vote: merge JP,SC,AB. - "safe.surb.org"
Alex
Alex Broens wrote:
Jeff Chan wrote:
Another possibility might be records that appear in
SC and AB and WS and JP and OB
I think we can nearly guarantee that those are 100% spam. :-) (Would want to check those that are in WS separately from JP, which is currently included in WS.)
What other ways to combine lists might produce near zero FPs yet still hit most spam?
Shall we just try some of them and see how well they work?
Comments?
From my eval:
JP & SC & AB have been 0 FP for me.
- OB: more FPs than expected - the direct whitelisting site has been a godsend. The OB guys are very fast.
- WS: 0 FPs since the last clean-up. (As FPs are not always made public, it's hard to say.)
- SC & AB are very slow in publishing new data - JP & WS beat them all
This would mean fewer subzones, reducing DNS queries overall and allowing faster SA processing.
my vote: merge JP,SC,AB. - "safe.surb.org"
Sorry, forgot to add PH to "safe.surbl.org"
Hi!
This would mean fewer subzones, reducing DNS queries overall and allowing faster SA processing.
my vote: merge JP,SC,AB. - "safe.surb.org"
Sorry, forgot to add PH to "safe.surbl.org"
Perhaps I am missing the point, but we have multi; how does this minimize lookups? I already do one lookup now for all zones.
Bye, Raymond.
On Friday, November 12, 2004, 4:11:10 AM, Raymond Dijkxhoorn wrote:
This would mean fewer subzones, reducing DNS queries overall and allowing faster SA processing.
my vote: merge JP,SC,AB. - "safe.surb.org"
Sorry, forgot to add PH to "safe.surbl.org"
Perhaps I am missing the point, but we have multi; how does this minimize lookups? I already do one lookup now for all zones.
The idea is to have a "golden" list that is so accurate (for spams it does identify) and has so few FPs that it would be safe to block outright at the MTA level, for example.
An MTA milter or plugin probably can't determine that from multi alone (i.e. looking at all the lists), so I'm wondering if we can create a list for that purpose out of the existing lists.
The new list would be just another bit in multi, so it would still be a single lookup, but it would be looking up a highly accurate, hopefully zero FP new subset within multi.
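To illustrate the single-lookup idea: membership in each sublist of multi is encoded as a bit in the last octet of the returned A record, so an MTA plugin could test just the one bit. A sketch, where the bit value 128 for the hypothetical new consensus list is purely an assumption (the existing bit values are inferred from the combination addresses quoted later in the thread):

```python
# Testing one bit of a multi answer.  CONSENSUS_BIT = 128 is a
# hypothetical value for the proposed new sublist, not a real
# SURBL assignment.

CONSENSUS_BIT = 128

def in_consensus_list(answer_ip):
    """answer_ip: the A record returned by a multi lookup,
    e.g. '127.0.0.84'.  Returns True if the consensus bit is set."""
    bits = int(answer_ip.rsplit(".", 1)[1])
    return bool(bits & CONSENSUS_BIT)
```

The same single DNS answer still carries all the other list bits, so existing clients are unaffected.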
SC and JP already come pretty close to that goal, but I'm wondering if some clever combination of the lists might be usable to create a single list that outperforms the individual lists. Obviously any subset will hit less spam, but the goal would be to make a subset that hits no FPs.
We could probably experiment and try some different approaches and see how they test out on corpora and live mail servers.
Jeff C. -- "If it appears in hams, then don't list it."
Jeff Chan wrote to SURBL Discussion list:
We could probably experiment and try some different approaches and see how they test out on corpora and live mail servers.
A simple join(1) on the data files might be a better start:
SC and AB and WS and JP and OB
Matches 202 records. That's going to have an extremely low detection rate. The problem is that "and" means "intersection", and by including ob in particular, you're automatically limiting the maximum size of the data to about 350 records.
((SC or AB) and (JP or OB))
Matches 1,187 records. Probably still too few.
or PH
Didn't feel like pulling PH out of multi for this test.
Better, IMHO, is to use something like
(SC + AB + JP + OB + WS) >= 3
Matches 16,560 records. Aha! Now we're getting something useful.
Without WS in that equation, the number drops to 906.
With UC and WS, the number rises to 18,964.
Other numbers, with SC + AB + JP + OB + WS + UC:
SC+AB+JP+OB+WS+UC    # of records
-----------------    ------------
        1                39,759
        2                25,549
        3                16,369
        4                 2,298
        5                   292
        6                     5

     >= 2                44,513    superset of ...
     >= 3                18,964    ...
     >= 4                 2,595    ..
     >= 5                   297    .
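The threshold scheme is easy to reproduce; a possible sketch, assuming each list has already been loaded as a set of domains:

```python
# Count how many lists each domain appears in, and keep those at
# or above a threshold -- i.e. (SC + AB + JP + OB + WS) >= n.
from collections import Counter

def threshold_combine(lists, n):
    """lists: iterable of per-list domain sets.
    Returns domains appearing in at least n of the lists."""
    counts = Counter()
    for domains in lists:
        counts.update(domains)
    return {d for d, c in counts.items() if c >= n}
```

Raising n shrinks the result toward the high-consensus core, which is exactly the FP/detection trade-off discussed below.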
You can try this with different lists if you want, or even mix in some judicious "and" and "or" matching. For instance, since there is a large overlap between jp and ws, you might want to choose one or the other. But maybe it doesn't matter so much, because, in that case, you might just set the cutoff lower to compensate, so having the additional list would still add some small bit of confidence.
To me, 3 currently looks like the likely sweet spot, although the hit rate on the ~2,500 domains present in four or more lists could still potentially put a sizeable dent in spam at the MTA level at a lower FP rate. I'd recommend looking at 3 and 4 a little more closely:
http://ry.ca/surbl/ab+jp+ob+sc+ws+uc3.txt
http://ry.ca/surbl/ab+jp+ob+sc+ws+uc4.txt
By definition, 4 is a strict subset of 3, so if FP(n>=N) is the false positive rate of a list with domains in N-or-more lists, FP(n>=3) >= FP(n>=4). Thus, this approach also has the added benefit of allowing you to at least discretely control the FP rate somewhat.
Have fun! - Ryan
On Friday, November 12, 2004, 5:41:26 AM, Ryan Thompson wrote:
Jeff Chan wrote to SURBL Discussion list:
We could probably experiment and try some different approaches and see how they test out on corpora and live mail servers.
A simple join(1) on the data files might be a better start:
SC and AB and WS and JP and OB
Matches 202 records. That's going to have an extremely low detection rate. The problem is that "and" means "intersection", and by including ob in particular, you're automatically limiting the maximum size of the data to about 350 records.
((SC or AB) and (JP or OB))
Matches 1,187 records. Probably still too few.
Don't let the size of the list fool you though. Remember that a few spam gangs send out most of the spam using zombies and other ways that are hard to block with conventional RBLs. Getting their domains at any given time probably does not entail having a list of 100k domains. Just a few hundred domains probably appear in a majority of spams at any particular time. The question is "which hundreds?"... :-)
or PH
Didn't feel like pulling PH out of multi for this test.
Better, IMHO, is to use something like
(SC + AB + JP + OB + WS) >= 3
Matches 16,560 records. Aha! Now we're getting something useful.
Without WS in that equation, the number drops to 906.
Qualifying by the number of lists could be useful to try, though SC and AB should be lumped together since they're both mostly from SpamCop URI reports. In other words, SC and AB aren't too independent in terms of their data source. They're mostly a different slice of the same data and should probably be treated as a single source.
You can try this with different lists if you want, or even mix in some judicious "and" and "or" matching. For instance, since there is a large overlap between jp and ws, you might want to choose one or the other.
All of JP is currently included in WS. They will be more independent when we take JP out of WS, as we're planning to do when SpamAssassin 3.1 gets released.
But maybe it doesn't matter so much, because, in that case, you might just set the cutoff lower to compensate, so having the additional list would still add some small bit of confidence.
To me, 3 currently looks like the likely sweet spot, although the hit rate on the ~2,500 domains present in four or more lists could still potentially put a sizeable dent in spam at the MTA level at a lower FP rate. I'd recommend looking at 3 and 4 a little more closely:
http://ry.ca/surbl/ab+jp+ob+sc+ws+uc3.txt
http://ry.ca/surbl/ab+jp+ob+sc+ws+uc4.txt
By definition, 4 is a strict subset of 3, so if FP(n>=N) is the false positive rate of a list with domains in N-or-more lists,
FP(n>=3) >= FP(n>=4). Thus, this approach also has the added benefit of
allowing you to at least discretely control the FP rate somewhat.
FP rates should increase with "ors" and decrease with "ands". I probably won't be using UC, but the principle is the same for whatever lists are used.
Thanks for sharing your ideas,
Jeff C. -- "If it appears in hams, then don't list it."
On Fri, 12 Nov 2004 04:31:37 -0800, Jeff Chan wrote:
We could probably experiment and try some different approaches and see how they test out on corpora and live mail servers.
Should be very easy to test with SpamAssassin for anyone with a decent corpus - just write some meta rules to simulate the intersections (or Ryan's suggested additive combinations).
John.
On Friday, November 12, 2004, 7:38:45 AM, John Wilcock wrote:
On Fri, 12 Nov 2004 04:31:37 -0800, Jeff Chan wrote:
We could probably experiment and try some different approaches and see how they test out on corpora and live mail servers.
Should be very easy to test with SpamAssassin for anyone with a decent corpus - just write some meta rules to simulate the intersections (or Ryan's suggested additive combinations).
And I have another technique I can use here: Take the lists and permutations of lists then see what percentage of each of those hit DNS queries matching blocklists in general. Recall that we now have statistics about whitelist, blocklist and unmatched DNS queries sampled from a DNS server. That means we can estimate spam detection rates by lists and permutations of lists purely based on SURBL DNS hits.
This is not as good as proper corpus checks, since our blocklist hits may include some FPs, but it does give some indication of the general spam detection rates of the lists or their permutations. The best of those results could then be checked against hand-checked corpora with some confidence that we're at least checking the most promising ones.
Gonna code this up....
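Roughly, such an estimate might look like this (the shape of the sampled query data, a simple iterable of queried domains known to hit some blocklist, is an assumption):

```python
# Estimate a candidate combination's detection rate from sampled
# DNS queries that already hit some SURBL blocklist.  The input
# format here is hypothetical.

def estimated_detection_rate(blocklist_hits, candidate):
    """blocklist_hits: iterable of queried domains that matched
    some list; candidate: set of domains in the combination.
    Returns the fraction of those queries the candidate covers."""
    hits = total = 0
    for domain in blocklist_hits:
        total += 1
        if domain in candidate:
            hits += 1
    return hits / total if total else 0.0
```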
Jeff C. -- "If it appears in hams, then don't list it."
On Saturday, November 13, 2004, 12:14:24 AM, Jeff Chan wrote:
And I have another technique I can use here: Take the lists and permutations of lists then see what percentage of each of those hit DNS queries matching blocklists in general. Recall that we now have statistics about whitelist, blocklist and unmatched DNS queries sampled from a DNS server. That means we can estimate spam detection rates by lists and permutations of lists purely based on SURBL DNS hits.
This is not as good as proper corpus checks, since our blocklist hits may include some FPs, but it does give some indication of the general spam detection rates of the lists or their permutations. The best of those results could then be checked against hand-checked corpora with some confidence that we're at least checking the most promising ones.
OK as advertised, here are some results of looking at the intersections of different lists and seeing how many of the blocklist DNS queries they are responsible for:
[sc][ws][ob][jp]      767 records of 82587    68084 hits of 232031 is 29%
[sc][ws][ob]          861 records of 82587    68296 hits of 232031 is 29%
[sc][ws][jp]          904 records of 82587    71545 hits of 232031 is 30%
[sc][ws]             1068 records of 82587    72565 hits of 232031 is 31%
[sc][ob][jp]          793 records of 82587    70468 hits of 232031 is 30%
[sc][ob]              920 records of 82587    71218 hits of 232031 is 30%
[sc][jp]              939 records of 82587    73947 hits of 232031 is 31%
[sc]                 1197 records of 82587    76438 hits of 232031 is 32%
[ws][ob][jp]        16381 records of 82587   144955 hits of 232031 is 62%
[ws][ob]            21788 records of 82587   150104 hits of 232031 is 64%
[ws][jp]            33123 records of 82587   186359 hits of 232031 is 80%
[ws]                58465 records of 82587   209344 hits of 232031 is 90%
[ob][jp]            17143 records of 82587   150525 hits of 232031 is 64%
[ob]                44630 records of 82587   167906 hits of 232031 is 72%
[jp]                34669 records of 82587   195783 hits of 232031 is 84%
This is for 10 days of queries, with 10,000 sampled every 2 hours. It undercounts the SC hits since those have an inherent time period of 3 days, not 10. The results for SC would be higher when looking at shorter time periods such as 3 days.
Probably the most useful ones to test further, for example against hand-built corpora, would be:
[ws][ob][jp]    16381 records of 82587   144955 hits of 232031 is 62%
[ws][ob]        21788 records of 82587   150104 hits of 232031 is 64%
[ws][jp]        33123 records of 82587   186359 hits of 232031 is 80%
[ob][jp]        17143 records of 82587   150525 hits of 232031 is 64%
[ws][ob][jp] is 127.0.0.84
[ws][ob]     is 127.0.0.20
[ws][jp]     is 127.0.0.68
[ob][jp]     is 127.0.0.80
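For reference, those four addresses pin down the per-list bit values: 84 = 4 + 16 + 64, 20 = 4 + 16, 68 = 4 + 64, 80 = 16 + 64, giving ws=4, ob=16, jp=64. A sketch that rebuilds an intersection's multi address from those inferred bits:

```python
# Per-list bit values inferred from the combination addresses
# above (ws=4, ob=16, jp=64).

BITS = {"ws": 4, "ob": 16, "jp": 64}

def combo_address(*lists):
    """Return the multi answer address for an intersection of
    the named lists, e.g. combo_address('ws', 'ob', 'jp')."""
    return "127.0.0.%d" % sum(BITS[name] for name in lists)
```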
Theo, Daniel and other SA mass-checkers, would you please consider testing these using urirhsbl to find the results for these as intersections (instead of the usual individual lists with urirhssub)?
We'd be particularly interested to see if any of these intersections have unusually low FP rates.
Jeff C. -- "If it appears in hams, then don't list it."
On Saturday, November 13, 2004, 4:28:52 AM, Jeff Chan wrote:
OK as advertised, here are some results of looking at the intersections of different lists and seeing how many of the blocklist DNS queries they are responsible for:
[sc][ws][ob][jp]    767 records of 82587   68084 hits of 232031 is 29%
[sc][ws][ob]        861 records of 82587   68296 hits of 232031 is 29%
[sc][ws][jp]        904 records of 82587   71545 hits of 232031 is 30%
[sc][ws]           1068 records of 82587   72565 hits of 232031 is 31%
[...]
For completeness, I've added checking of AB and PH as individual (not permuted) lists, and the output can be found at:
http://www.surbl.org/permuted-hits.out.txt
[sc][ws][ob][jp]      762 records of 82592    67115 hits of 232463 is 28%
[sc][ws][ob]          857 records of 82592    67325 hits of 232463 is 28%
[sc][ws][jp]          899 records of 82592    70622 hits of 232463 is 30%
[sc][ws]             1066 records of 82592    71725 hits of 232463 is 30%
[sc][ob][jp]          788 records of 82592    69526 hits of 232463 is 29%
[sc][ob]              916 records of 82592    70292 hits of 232463 is 30%
[sc][jp]              934 records of 82592    73050 hits of 232463 is 31%
[sc]                 1193 records of 82592    75597 hits of 232463 is 32%
[ws][ob][jp]        16383 records of 82592   144989 hits of 232463 is 62%
[ws][ob]            21793 records of 82592   150159 hits of 232463 is 64%
[ws][jp]            33123 records of 82592   186633 hits of 232463 is 80%
[ws]                58471 records of 82592   209710 hits of 232463 is 90%
[ob][jp]            17145 records of 82592   150595 hits of 232463 is 64%
[ob]                44636 records of 82592   168053 hits of 232463 is 72%
[jp]                34669 records of 82592   196112 hits of 232463 is 84%
[ab]                  368 records of 82592    61920 hits of 232463 is 26%
[ph]                  996 records of 82592      307 hits of 232463 is  0%
It is run nightly around midnight using the script:
http://www.surbl.org/permuted-hits
This gives some measure of the performance of the different lists, though it likely undercounts rapidly changing data since it's based on the previous ten days of data. The more quickly changing lists like AB and SC have higher detection rates in actual, real-time operation.
Jeff C. -- "If it appears in hams, then don't list it."
On Friday, November 12, 2004, 4:03:26 AM, Alex Broens wrote:
JP & SC & AB have been 0 FP for me.
- OB: more FPs than expected - the direct whitelisting site has been a godsend. The OB guys are very fast.
- WS: 0 FPs since the last clean-up. (As FPs are not always made public, it's hard to say.)
- SC & AB are very slow in publishing new data - JP & WS beat them all
This would mean fewer subzones, reducing DNS queries overall and allowing faster SA processing.
my vote: merge JP,SC,AB. - "safe.surb.org"
Sorry, forgot to add PH to "safe.surbl.org"
Thanks for the feedback on the lists Alex! "And"ing some of them together should reduce the FPs. For example, even if OB or WS had FPs, "and"ing them with SC or AB would reduce the FP levels. Unfortunately it also lowers the spam detection rates.
Jeff C. -- "If it appears in hams, then don't list it."
On Fri, 12 Nov 2004 02:45:15 -0800, Jeff Chan <jeffc@surbl.org> wrote:
Pondering the question of how to make a "telco grade" SURBL that had as close to zero false positives as possible, but would still catch many spams, I remembered that many of the biggest spam domains seem to appear in several different SURBL lists.
What does anyone think about creating a "consensus" list that a telco or ISP might use to block at the MTA level?
For example a domain that appears on:
((SC or AB) and (JP or OB)) or PH
I think the percentile-based lists are probably the best way to go - i.e. the top 50% of all requested SURBL-listed domains, or something like that?
We should probably work on developing some more diverse spamtrap feeds. Quite a lot of ISPs have well-established spamtraps that they are either not using or are completely underutilising.
Lists like SC, AB and JP all seem to be good data sources, but if you were trying to be certain of 0 FPs, you'd need something to reliably and continuously rebuild your data against and from.
On Friday, November 12, 2004, 4:00:47 AM, David Hooton wrote:
On Fri, 12 Nov 2004 02:45:15 -0800, Jeff Chan <jeffc@surbl.org> wrote:
Pondering the question of how to make a "telco grade" SURBL that had as close to zero false positives as possible, but would still catch many spams, I remembered that many of the biggest spam domains seem to appear in several different SURBL lists.
What does anyone think about creating a "consensus" list that a telco or ISP might use to block at the MTA level?
For example a domain that appears on:
((SC or AB) and (JP or OB)) or PH
I think the percentile-based lists are probably the best way to go - i.e. the top 50% of all requested SURBL-listed domains, or something like that?
Percentiles are good, but they're only possible when you have frequencies of reports, queries, etc. The only list I have report frequencies for is SC, so it's not possible for me to compare percentiles across other lists.
One thing we could take percentiles on is DNS queries, and that could be useful, but it doesn't exclude FPs. If we didn't whitelist w3.org for example, it would have lots of DNS query FPs. Frequencies of DNS query hits against blocklists could get us an approximation of the "top spammers" with some possible FPs included among the most frequent queries.
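A sketch of that percentile idea, assuming a sampled query log (an iterable of queried domains) and a whitelist set; all names here are illustrative:

```python
# Pick the most-queried domains covering the top `fraction` of
# query volume, skipping whitelisted domains such as w3.org.
from collections import Counter

def top_queried(query_log, whitelist, fraction=0.5):
    """query_log: iterable of queried domains; whitelist: set of
    domains to exclude.  Returns domains, most-queried first,
    until `fraction` of the remaining query volume is covered."""
    counts = Counter(d for d in query_log if d not in whitelist)
    total = sum(counts.values())
    picked, covered = [], 0
    for domain, n in counts.most_common():
        if covered >= fraction * total:
            break
        picked.append(domain)
        covered += n
    return picked
```

As noted above, this approximates the "top spammers" by query volume but does nothing by itself to exclude FPs; the whitelist filter is doing that work.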
We should probably work on developing some more diverse spamtrap feeds. Quite a lot of ISPs have well-established spamtraps that they are either not using or are completely underutilising.
Lists like SC, AB and JP all seem to be good data sources, but if you were trying to be certain of 0 FPs, you'd need something to reliably and continuously rebuild your data against and from.
More traps and more data are definitely desirable, but we're also interested in seeing if we can make smarter use of the existing data, so thanks for your suggestions.
Jeff C. -- "If it appears in hams, then don't list it."