[SURBL-Discuss] RFC: consensus list?

Jeff Chan jeffc at surbl.org
Sat Nov 13 13:28:52 CET 2004


On Saturday, November 13, 2004, 12:14:24 AM, Jeff Chan wrote:
> And I have another technique I can use here:  Take the lists
> and permutations of lists then see what percentage of each of
> those hit DNS queries matching blocklists in general.  Recall
> that we now have statistics about whitelist, blocklist and
> unmatched DNS queries sampled from a DNS server.  That means
> we can estimate spam detection rates by lists and permutations
> of lists purely based on SURBL DNS hits.

> This is not as good as proper corpus checks, since our
> blocklist hits may include some FPs, but it does give some
> indication of the general spam detection rates of the lists
> or their permutations.  The best of those results could then
> be checked against hand-checked corpora with some confidence
> that we're at least checking the most promising ones.

OK, as advertised, here are some results from looking at the
intersections of different lists and seeing what share of
the blocklist DNS queries each one is responsible for:

[sc][ws][ob][jp] 767 records of 82587  68084 hits of 232031 is 29%
[sc][ws][ob]     861 records of 82587  68296 hits of 232031 is 29%
[sc][ws][jp]     904 records of 82587  71545 hits of 232031 is 30%
[sc][ws]        1068 records of 82587  72565 hits of 232031 is 31%
[sc][ob][jp]     793 records of 82587  70468 hits of 232031 is 30%
[sc][ob]         920 records of 82587  71218 hits of 232031 is 30%
[sc][jp]         939 records of 82587  73947 hits of 232031 is 31%
[sc]            1197 records of 82587  76438 hits of 232031 is 32%
[ws][ob][jp]   16381 records of 82587 144955 hits of 232031 is 62%
[ws][ob]       21788 records of 82587 150104 hits of 232031 is 64%
[ws][jp]       33123 records of 82587 186359 hits of 232031 is 80%
[ws]           58465 records of 82587 209344 hits of 232031 is 90%
[ob][jp]       17143 records of 82587 150525 hits of 232031 is 64%
[ob]           44630 records of 82587 167906 hits of 232031 is 72%
[jp]           34669 records of 82587 195783 hits of 232031 is 84%

This is for 10 days of queries, with 10,000 sampled every 2
hours.  It undercounts the SC hits, since SC has an inherent
time period of 3 days, not 10.  The results for SC would be
higher over a shorter time period such as 3 days.
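
For anyone wanting to reproduce this kind of table, here is a
minimal sketch of the counting, not the script actually used.
It assumes each record is known by the set of lists it appears
on plus the number of sampled blocklist hits it received; the
function name, the data layout and the example domains are all
hypothetical.  Percentages are truncated rather than rounded,
which is what the table above appears to do (76438 of 232031
is about 32.9%, shown as 32%).

```python
from itertools import combinations


def intersection_stats(records, lists):
    """Count records and hits for every intersection of `lists`.

    records: dict mapping domain -> (set of list tags, hit count).
    Returns a dict mapping a tuple of list tags to
    (records in all of those lists, their total hits, truncated percent).
    """
    total_hits = sum(hits for _tags, hits in records.values())
    stats = {}
    for size in range(len(lists), 0, -1):
        for combo in combinations(lists, size):
            wanted = set(combo)
            # A record belongs to the intersection if it is on every list in combo.
            member_hits = [hits for tags, hits in records.values() if wanted <= tags]
            n_hits = sum(member_hits)
            # Integer division truncates; guard against an empty sample.
            pct = 100 * n_hits // total_hits if total_hits else 0
            stats[combo] = (len(member_hits), n_hits, pct)
    return stats


# Hypothetical toy data, not taken from the real query samples.
sample = {
    "a.example": ({"ws", "jp"}, 50),
    "b.example": ({"ws"}, 30),
    "c.example": ({"ob", "jp"}, 20),
}
table = intersection_stats(sample, ["ws", "ob", "jp"])
```

Note that a record on [ws][jp] is also counted under [ws] and
under [jp], so the rows overlap and are not meant to add up to
the totals.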

Probably the most useful ones to test further, for example
against hand-built corpora, would be:

[ws][ob][jp]   16381 records of 82587 144955 hits of 232031 is 62%
[ws][ob]       21788 records of 82587 150104 hits of 232031 is 64%
[ws][jp]       33123 records of 82587 186359 hits of 232031 is 80%
[ob][jp]       17143 records of 82587 150525 hits of 232031 is 64%

[ws][ob][jp]  is  127.0.0.84
[ws][ob]      is  127.0.0.20
[ws][jp]      is  127.0.0.68
[ob][jp]      is  127.0.0.80
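
The addresses above are just the per-list bits added together
in the last octet.  A small sketch of that arithmetic follows.
The bit values are inferred from the four addresses listed
(20 = ws+ob, 68 = ws+jp, 80 = ob+jp), and the function names
are made up for illustration.  It also assumes that a record on
additional lists simply has further bits set, so membership in
an intersection is tested with a mask rather than by exact
match.

```python
import ipaddress

# Bit values inferred from the addresses in this message.
LIST_BITS = {"ws": 4, "ob": 16, "jp": 64}


def _mask(lists):
    """OR together the bits for the named lists."""
    mask = 0
    for name in lists:
        mask |= LIST_BITS[name]
    return mask


def intersection_address(lists):
    """Return the 127.0.0.x address whose last octet has exactly these bits."""
    return str(ipaddress.IPv4Address("127.0.0.0") + _mask(lists))


def in_all_lists(answer, lists):
    """True if the A record `answer` has every bit for `lists` set."""
    last_octet = int(ipaddress.IPv4Address(answer)) & 0xFF
    mask = _mask(lists)
    return (last_octet & mask) == mask
```

For example, intersection_address(["ws", "ob", "jp"]) gives
127.0.0.84, and in_all_lists("127.0.0.68", ["ws", "ob"]) is
false because the ob bit is not set in 68.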

Theo, Daniel and other SA mass-checkers, would you please
consider testing these with urirhsbl, so that the results
are for the intersections themselves (instead of the usual
individual lists with urirhssub)?

We'd be particularly interested to see if any of these
intersections have unusually low FP rates.

Jeff C.
--
"If it appears in hams, then don't list it."


