As has already been mentioned, Theo is patching SpamAssassin to locally whitelist some common whitehat URI domains for use in URIBL (which typically uses SBL and SURBL data). This will prevent DNS queries on the whitehats and probably save a significant amount of traffic on the SURBL, Spamhaus, etc. name servers.
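As a rough illustration of the idea (not Theo's actual patch, which is not shown here), a local whitelist check that runs before any DNS query might look like the sketch below. The function names and the sample whitelist are hypothetical, and matching on the hostname plus each parent domain is an assumption about how such a check could work.

```python
def candidate_domains(hostname):
    """Yield the hostname and each parent domain, e.g. a.b.com -> a.b.com, b.com."""
    labels = hostname.lower().rstrip(".").split(".")
    # Stop before the bare TLD so that "com" alone can never match.
    for i in range(len(labels) - 1):
        yield ".".join(labels[i:])


def should_query_uribl(hostname, local_whitelist):
    """Return False when the hostname or a parent domain is whitelisted locally."""
    return not any(d in local_whitelist for d in candidate_domains(hostname))


# Hypothetical whitelist; a real one would be built from the whitehat data.
LOCAL_WHITELIST = {"latimes.com", "zdnet.com", "purdue.edu"}

for host in ("www.latimes.com", "red-hot1.com"):
    action = "query" if should_query_uribl(host, LOCAL_WHITELIST) else "skip"
    print(host, action)
```

Only hosts that miss the local whitelist would go on to generate a DNS query, which is where the traffic savings would come from.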
In order to get some better whitehat data, we increased the sampling of DNS queries on a name server from 32k (2k every 3 hours for 2 days) to 1.2 million (10k every 2 hours for 10 days). We're only about a third of the way through the initial 10 days, so the stats are still building up, but the current results are at:
http://www.surbl.org/dns-queries.whitelist.counts.txt
http://www.surbl.org/dns-queries.blocklist.counts.txt
(These files have been mentioned before, but they're starting to get a lot more data behind them now.)
Something else that was probably suggested before, but that we *hadn't looked at before*, is the set of DNS queries that *don't match* either our blocklists or whitelists. Those, sorted in order of decreasing frequency, are at:
http://www.surbl.org/dns-queries.unmatched.30thpercentile.txt
That's the top 30th percentile of them (about 3.6k records). The full list of unique domains and IPs with frequencies (about 110k records) is at:
http://www.surbl.org/dns-queries.unmatched.count.txt
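For anyone who wants to build a similar list from their own query logs, here is a minimal sketch of the counting step. It is not the script used to produce the files above. The function names and the sample data are hypothetical, and it assumes a query matches a list when the name itself or any parent domain is on that list (IPs are simply treated as strings).

```python
from collections import Counter


def base_matches(name, domains):
    """True if the name or any parent domain of the name is in domains."""
    labels = name.lower().rstrip(".").split(".")
    return any(".".join(labels[i:]) in domains for i in range(len(labels)))


def unmatched_counts(queries, blocklist, whitelist):
    """Count queries on neither list, most frequent first."""
    counts = Counter(
        q.lower().rstrip(".")
        for q in queries
        if not base_matches(q, blocklist) and not base_matches(q, whitelist)
    )
    return counts.most_common()


# Made-up sample data for illustration only.
sample_queries = ["beliefnet.com", "spam.example", "beliefnet.com",
                  "www.good.example", "cf.st"]
for name, count in unmatched_counts(sample_queries,
                                    blocklist={"spam.example"},
                                    whitelist={"good.example"}):
    print(count, name)
```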
Taking a look at the top few of these:

333 56.227.117.38
211 internet.e
196 wwwlowmortnow.info
123 beliefnet.com
119 grisoft.com)
107 specialmax.net
99 democrats.org
99 115.14.249.209
96 and
90 c
82 centrport.net
78 charter.net
73 zdnet.com
65 cf.st
63 nuri1.net
62 red-hot1.com
62 imomentum.net
61 justsaywow.com
60 173.213.115.211
59 www
57 www.cool-loanco.kr
57 superduperfun.com
53 healthinsrus.com
51 e-directnet.net
51 agoramail.net
50 tmcs.net
50 latimes.com
50 dw.com.com
50 168.228.186.64
49 iscsimg.com
48 livedaily.com
48 eversave.com
47 1shoppingcart.com
46 srvimg.com
46 realone.com
46 goodnewsdelivery.com
45 rockbridgemedia.com
45 purdue.edu
It's clear that a few are errors, probably due to problems in the applications using SURBLs. Yet it's probably useful to not suppress the errors so that the programs can be updated to handle them correctly. (Unfortunately the source URIs generating the errors are not directly available, but they may be identifiable in other ways if anyone would like to look for them.)
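One way to separate the obvious errors from the rest, without throwing them away, is a purely syntactic check. The sketch below is only an assumption about what "obviously an error" could mean: an entry passes if it parses as an IPv4 address or looks like a hostname ending in an alphabetic TLD of two or more letters. The names `looks_valid` and `partition_entries` are made up for this example, and a syntactically valid entry is not necessarily a real domain.

```python
import ipaddress
import re

# One or more dot-terminated labels of letters, digits and hyphens,
# followed by an alphabetic TLD of at least two letters.
HOSTNAME_RE = re.compile(
    r"(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}", re.IGNORECASE
)


def looks_valid(entry):
    """True if the entry is an IPv4 address or a plausible hostname."""
    try:
        ipaddress.IPv4Address(entry)
        return True
    except ValueError:
        pass
    return HOSTNAME_RE.fullmatch(entry) is not None


def partition_entries(entries):
    """Split entries into (plausible, suspect), keeping the suspects for review."""
    plausible = [e for e in entries if looks_valid(e)]
    suspect = [e for e in entries if not looks_valid(e)]
    return plausible, suspect


plausible, suspect = partition_entries(
    ["56.227.117.38", "internet.e", "beliefnet.com", "grisoft.com)", "and", "c", "www"]
)
print("plausible:", plausible)
print("suspect:", suspect)
```

Keeping the suspect entries in their own list, rather than dropping them, fits with not suppressing the errors: they remain available for tracking down the applications that generate them.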
Minus the errors, I fed this list into Ryan's GetURI to see what it could find. The results are at:
http://ry.ca/cgi-bin/geturi.cgi?id=ham-5lCTzHkxan3xE38RKHa0vx
Quite a few appear ok to whitelist, like democrats.org, purdue.edu, latimes.com, charter.net, zdnet.com, etc. I'll probably go ahead and whitelist obvious ones like these, so some of them will probably be off this "unmatched" list and onto the whitelist hits by the time you read this.
Nonetheless I recommend we all take a look at this unmatched list periodically, especially the top few dozen, to look for potential domains to whitelist or blacklist. These most frequently appearing domains are probably good candidates for one or the other.
Since this is a list of the unknown "wild" domains coming from live, real-world message URIs, it may be another useful and different source of some data.
Cheers,
Jeff C. -- "If it appears in hams, then don't list it."