On Saturday, November 13, 2004, 12:14:24 AM, Jeff Chan wrote:
And I have another technique I can use here: Take the lists and permutations of lists then see what percentage of each of those hit DNS queries matching blocklists in general. Recall that we now have statistics about whitelist, blocklist and unmatched DNS queries sampled from a DNS server. That means we can estimate spam detection rates by lists and permutations of lists purely based on SURBL DNS hits.
This is not as good as proper corpus checks, since our blocklist hits may include some FPs, but it does give some indication of the general spam detection rates of the lists or their permutations. The best of those results could then be checked against hand-checked corpora with some confidence that we're at least checking the most promising ones.
OK, as advertised, here are some results from looking at the intersections of different lists and seeing how many of the blocklist DNS queries each one is responsible for:
[sc][ws][ob][jp]    767 records of 82587   68084 hits of 232031 is 29%
[sc][ws][ob]        861 records of 82587   68296 hits of 232031 is 29%
[sc][ws][jp]        904 records of 82587   71545 hits of 232031 is 30%
[sc][ws]           1068 records of 82587   72565 hits of 232031 is 31%
[sc][ob][jp]        793 records of 82587   70468 hits of 232031 is 30%
[sc][ob]            920 records of 82587   71218 hits of 232031 is 30%
[sc][jp]            939 records of 82587   73947 hits of 232031 is 31%
[sc]               1197 records of 82587   76438 hits of 232031 is 32%
[ws][ob][jp]      16381 records of 82587  144955 hits of 232031 is 62%
[ws][ob]          21788 records of 82587  150104 hits of 232031 is 64%
[ws][jp]          33123 records of 82587  186359 hits of 232031 is 80%
[ws]              58465 records of 82587  209344 hits of 232031 is 90%
[ob][jp]          17143 records of 82587  150525 hits of 232031 is 64%
[ob]              44630 records of 82587  167906 hits of 232031 is 72%
[jp]              34669 records of 82587  195783 hits of 232031 is 84%
This is for 10 days of queries, with 10,000 sampled every 2 hours. It undercounts the SC hits since those have an inherent time period of 3 days, not 10. The results for SC would be higher when looking at shorter time periods such as 3 days.
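For anyone who wants to reproduce this kind of count on their own data, here is a rough sketch of the tallying behind the table above. It is not the script actually used; the function names and the input shapes (a dict of record -> set of list names, and a dict of record -> sampled query count) are assumptions for illustration. The percentages in the table appear to be truncated rather than rounded, so the sketch uses integer division.

```python
from itertools import combinations

def truncated_pct(hits, total):
    """Whole-number percentage, truncated (76438 of 232031 gives 32)."""
    return hits * 100 // total if total else 0

def intersection_stats(memberships, query_counts, lists):
    """Tally records and sampled query hits for every combination of lists.

    memberships:  {record: set of list names it appears on}  (hypothetical input)
    query_counts: {record: number of sampled DNS queries}    (hypothetical input)
    Returns {combo_tuple: (records, hits, pct_of_all_blocklist_hits)}.
    """
    total_hits = sum(query_counts.get(rec, 0) for rec in memberships)
    stats = {}
    for size in range(1, len(lists) + 1):
        for combo in combinations(lists, size):
            wanted = set(combo)
            # A record counts toward an intersection only if it is on ALL lists in it.
            members = [rec for rec, on in memberships.items() if wanted <= on]
            hits = sum(query_counts.get(rec, 0) for rec in members)
            stats[combo] = (len(members), hits, truncated_pct(hits, total_hits))
    return stats

if __name__ == "__main__":
    # Toy data, not real SURBL records.
    memberships = {
        "a.example": {"ws", "jp"},
        "b.example": {"ws"},
        "c.example": {"ob", "jp"},
    }
    query_counts = {"a.example": 50, "b.example": 30, "c.example": 20}
    for combo, (records, hits, pct) in intersection_stats(
        memberships, query_counts, ["ws", "ob", "jp"]
    ).items():
        label = "".join(f"[{name}]" for name in combo)
        print(f"{label} {records} records, {hits} hits, {pct}%")
```

Note that a record on more lists than the ones named still counts toward the smaller intersection, which is why [ws] alone has more records than [ws][jp].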
Probably the most useful ones to test further, for example against hand-built corpora, would be:
[ws][ob][jp]      16381 records of 82587  144955 hits of 232031 is 62%
[ws][ob]          21788 records of 82587  150104 hits of 232031 is 64%
[ws][jp]          33123 records of 82587  186359 hits of 232031 is 80%
[ob][jp]          17143 records of 82587  150525 hits of 232031 is 64%
[ws][ob][jp] is 127.0.0.84
[ws][ob]     is 127.0.0.20
[ws][jp]     is 127.0.0.68
[ob][jp]     is 127.0.0.80
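The four addresses above are consistent with one bit per list in the last octet: ws = 4, ob = 16, jp = 64 (for example 4 + 16 + 64 = 84). The sketch below rebuilds the addresses from those bit values and shows one way to test whether a returned A record covers an intersection. The function names are made up for illustration, and the "all bits set" test is an assumption about how one might interpret the answer, not a description of what urirhsbl does.

```python
# Bit values inferred from the four addresses listed above.
BITS = {"ws": 4, "ob": 16, "jp": 64}

def combo_mask(names):
    """OR together the bit for each named list."""
    mask = 0
    for name in names:
        mask |= BITS[name]
    return mask

def combo_address(names):
    """Address whose last octet has exactly the named lists' bits set."""
    return f"127.0.0.{combo_mask(names)}"

def on_all_lists(answer, names):
    """True if the answer's last octet has every named list's bit set.

    Extra bits (the record is on other lists too) do not prevent a match.
    """
    last_octet = int(answer.rsplit(".", 1)[1])
    mask = combo_mask(names)
    return (last_octet & mask) == mask

if __name__ == "__main__":
    for combo in (["ws", "ob", "jp"], ["ws", "ob"], ["ws", "jp"], ["ob", "jp"]):
        print("".join(f"[{n}]" for n in combo), "is", combo_address(combo))
```

With this interpretation a record answering 127.0.0.84 also satisfies the [ws][jp] intersection, while one answering 127.0.0.20 does not. An exact-match comparison against the address would instead select records that are on those lists and no others.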
Theo, Daniel and other SA mass-checkers, would you please consider testing these using urirhsbl to find the results for these as intersections (instead of the usual individual lists with urirhssub)?
We'd be particularly interested to see if any of these intersections have unusually low FP rates.
Jeff C.
-- 
"If it appears in hams, then don't list it."