To explore combinations (intersections) of existing lists that might have fewer false positives, we've created some stats measuring how many of the DNS hits against all blocklists each individual list gets, along with the intersections of several of those lists.
For completeness, we've added checking of AB and PH as individual (not permuted) lists, and the output can be found at:
http://www.surbl.org/permuted-hits.out.txt
[sc][ws][ob][jp]   762 records of 82592  67115 hits of 232463 is 28%
[sc][ws][ob]       857 records of 82592  67325 hits of 232463 is 28%
[sc][ws][jp]       899 records of 82592  70622 hits of 232463 is 30%
[sc][ws]          1066 records of 82592  71725 hits of 232463 is 30%
[sc][ob][jp]       788 records of 82592  69526 hits of 232463 is 29%
[sc][ob]           916 records of 82592  70292 hits of 232463 is 30%
[sc][jp]           934 records of 82592  73050 hits of 232463 is 31%
[sc]              1193 records of 82592  75597 hits of 232463 is 32%
[ws][ob][jp]     16383 records of 82592 144989 hits of 232463 is 62%
[ws][ob]         21793 records of 82592 150159 hits of 232463 is 64%
[ws][jp]         33123 records of 82592 186633 hits of 232463 is 80%
[ws]             58471 records of 82592 209710 hits of 232463 is 90%
[ob][jp]         17145 records of 82592 150595 hits of 232463 is 64%
[ob]             44636 records of 82592 168053 hits of 232463 is 72%
[jp]             34669 records of 82592 196112 hits of 232463 is 84%
[ab]               368 records of 82592  61920 hits of 232463 is 26%
[ph]               996 records of 82592    307 hits of 232463 is  0%
The records column shows the size of each list or intersection. The hits column shows how many DNS queries, out of all blocklist hits, matched that list or intersection.
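To make the records and hits columns concrete, here is a minimal Python sketch of how such numbers could be computed. It is not the actual permuted-hits script. The function name, the toy domains, and the data layout (one set of domains per list, plus a dict of DNS query counts per listed domain) are assumptions for illustration only. The percentage uses integer division because the output above appears to truncate rather than round (67115 of 232463 is about 28.9%, shown as 28%).

```python
from itertools import combinations

def permuted_hits(lists, query_counts, order):
    """Return (label, records, hits, percent) for every combination of lists.

    lists        -- dict mapping a list name to the set of domains on it
    query_counts -- dict mapping a domain to its DNS query count
                    (assumed to contain only domains that hit some list)
    order        -- list names in the order used for labels
    """
    total_hits = sum(query_counts.values())
    rows = []
    # Largest intersections first, then smaller ones, then single lists.
    for size in range(len(order), 0, -1):
        for combo in combinations(order, size):
            # Records: domains present on every list in the combination.
            common = set.intersection(*(lists[name] for name in combo))
            # Hits: queries that matched any of those shared domains.
            hits = sum(query_counts.get(domain, 0) for domain in common)
            # Truncated percentage of all blocklist hits.
            percent = hits * 100 // total_hits if total_hits else 0
            label = "".join(f"[{name}]" for name in combo)
            rows.append((label, len(common), hits, percent))
    return rows

# Toy data, invented for this example.
toy_lists = {
    "sc": {"a.example", "b.example"},
    "ws": {"b.example", "c.example", "d.example"},
}
toy_counts = {"a.example": 5, "b.example": 10, "c.example": 3, "d.example": 2}

toy_rows = permuted_hits(toy_lists, toy_counts, ["sc", "ws"])
for label, records, hits, percent in toy_rows:
    print(f"{label} {records} records {hits} hits is {percent}%")
```

An intersection can never have more records or hits than any list that takes part in it, which matches the pattern in the real output: [sc][ws] is smaller than both [sc] and [ws].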
This is run nightly around midnight using the script:
http://www.surbl.org/permuted-hits
This gives some measure of the performance of the different lists, though it likely undercounts rapidly changing data since it's based on the previous ten days of data. The more quickly changing lists like AB and SC have higher detection rates in actual, real-time operation. The stats above also do not take into account false positives at all, just hits against existing blocklists (which do have some FPs).
Additionally, we've increased the number of days that frequency data of DNS queries against our whitelist are kept from 10 to 90. However, it will take another 80 days before a full 90 days have accumulated:
http://www.surbl.org/dns-queries.whitelist.counts.txt
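As a rough illustration of the retention change, here is a Python sketch of a rolling window over daily counts. It is not taken from the actual scripts. The function name and the layout (a dict keyed by date, each value a dict of per-domain query counts) are hypothetical.

```python
from datetime import date, timedelta

def prune_daily_counts(daily_counts, today, keep_days=90):
    """Keep only the most recent keep_days days of data, today included."""
    cutoff = today - timedelta(days=keep_days)
    return {day: counts for day, counts in daily_counts.items() if day > cutoff}

# Toy example: only 10 days exist so far, so nothing is dropped yet.
example_today = date(2005, 1, 10)
ten_days = {
    example_today - timedelta(days=n): {"example.com": 1} for n in range(10)
}
kept = prune_daily_counts(ten_days, example_today)
days_until_full = 90 - len(kept)
print(len(kept), days_until_full)
```

With only 10 days on hand, the window holds 10 days and needs 80 more before it is full, which is the delay mentioned above.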
I re-wrote the scripts to accommodate the additional data more efficiently:
http://www.surbl.org/hourly-dns
http://www.surbl.org/daily-dns (run around midnight)
After things stabilize we will probably change from the current 10,000 DNS queries sampled every 2 hours to 20,000 sampled every hour. This will make the results represent about a half million queries per day. The larger sample sizes should make the results more accurate, but it means absolute numbers from the past won't be comparable. Percentages and relative rankings should always be comparable though.
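The arithmetic behind the "half million" figure can be checked with a few lines of Python. The function name is just for illustration, and the sketch assumes samples are taken at evenly spaced intervals around the clock.

```python
def queries_per_day(sample_size, interval_hours):
    """Total sampled queries per day for a given sample size and interval."""
    samples_per_day = 24 // interval_hours
    return sample_size * samples_per_day

current = queries_per_day(10_000, 2)  # 12 samples a day
planned = queries_per_day(20_000, 1)  # 24 samples a day

print(current, planned, planned // current)
```

That works out to 120,000 queries per day now and 480,000 per day after the change, a factor of four. Absolute counts grow by about that factor, which is why only percentages and relative rankings stay comparable across the switch.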
Jeff C.