As has already been mentioned, Theo is patching SpamAssassin to locally whitelist some common whitehat URI domains for use in URIBL (which typically uses SBL and SURBL data). This will prevent DNS queries on the whitehats and will probably save significant traffic on the SURBL, Spamhaus, etc. name servers.
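To illustrate the idea only (this is not Theo's actual patch), here is a minimal Python sketch of a local whitelist check that runs before any DNS lookup is built. The whitelist contents and the function name are hypothetical; multi.surbl.org is used as the example zone.

```python
# Hypothetical local whitelist; the real SpamAssassin list will differ.
LOCAL_WHITELIST = {"yahoo.com", "w3.org", "microsoft.com"}

# Example zone that the queried domain gets prepended to.
SURBL_ZONE = "multi.surbl.org"


def surbl_query_name(domain, whitelist=LOCAL_WHITELIST, zone=SURBL_ZONE):
    """Return the DNS name to look up, or None if the domain is whitelisted locally."""
    domain = domain.lower().rstrip(".")
    if domain in whitelist:
        return None  # whitelisted: no DNS query is sent at all
    return f"{domain}.{zone}"


print(surbl_query_name("Yahoo.com"))     # whitelisted, so no query name
print(surbl_query_name("example.com"))   # name that would be queried
```

Every whitelisted domain that returns None here is one query the public name servers never see.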
In order to get some better whitehat data, we increased the sampling of DNS queries on a name server from 32k (2k every 3 hours for 2 days) to 1.2 million (10k every 2 hours for 10 days). We're only about a third of the way through the initial 10 days, so the stats are still building up, but the current results are at:
http://www.surbl.org/dns-queries.whitelist.counts.txt
http://www.surbl.org/dns-queries.blocklist.counts.txt
(These files have been mentioned before, but they're starting to get a lot more data behind them now.)
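As a quick check of the sample sizes quoted above, the arithmetic can be written out as follows (assuming "k" means 1,000 and that batches are taken at evenly spaced intervals around the clock; the function name is just for illustration):

```python
def total_samples(per_batch, hours_between, days):
    """Total queries sampled: batch size x batches per day x number of days."""
    batches_per_day = 24 // hours_between
    return per_batch * batches_per_day * days


old_total = total_samples(2_000, 3, 2)     # 2k every 3 hours for 2 days
new_total = total_samples(10_000, 2, 10)   # 10k every 2 hours for 10 days

print(old_total)  # 32000
print(new_total)  # 1200000
```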
Something else that was probably suggested before, but that we *hadn't looked at before*, is the set of DNS queries that *don't match* either our blocklists or whitelists. Those, sorted in order of decreasing frequency, are at:
http://www.surbl.org/dns-queries.unmatched.30thpercentile.txt
That's the top 30th percentile of them (about 3.6k records). The full list of unique domains and IPs with frequencies (about 110k records) is at:
http://www.surbl.org/dns-queries.unmatched.count.txt
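For anyone who wants to reproduce this kind of list from their own query logs, here is a rough Python sketch of the counting step. It is not the script used to generate the files above; the function name and the sample data are made up for illustration.

```python
from collections import Counter


def unmatched_counts(queried_domains, whitelist, blocklist):
    """Count queries that hit neither list, sorted by decreasing frequency."""
    counts = Counter(
        d.lower()
        for d in queried_domains
        if d.lower() not in whitelist and d.lower() not in blocklist
    )
    # most_common() returns (domain, count) pairs, highest count first
    return counts.most_common()


# Made-up sample data, not taken from the real logs
sample = ["a.com", "b.com", "a.com", "good.org", "bad.biz", "b.com", "a.com"]
for domain, count in unmatched_counts(sample, {"good.org"}, {"bad.biz"}):
    print(count, domain)
```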
Taking a look at the top few of these:
333 56.227.117.38
211 internet.e
196 wwwlowmortnow.info
123 beliefnet.com
119 grisoft.com)
107 specialmax.net
99 democrats.org
99 115.14.249.209
96 and
90 c
82 centrport.net
78 charter.net
73 zdnet.com
65 cf.st
63 nuri1.net
62 red-hot1.com
62 imomentum.net
61 justsaywow.com
60 173.213.115.211
59 www
57 www.cool-loanco.kr
57 superduperfun.com
53 healthinsrus.com
51 e-directnet.net
51 agoramail.net
50 tmcs.net
50 latimes.com
50 dw.com.com
50 168.228.186.64
49 iscsimg.com
48 livedaily.com
48 eversave.com
47 1shoppingcart.com
46 srvimg.com
46 realone.com
46 goodnewsdelivery.com
45 rockbridgemedia.com
45 purdue.edu
It's clear that a few are errors (entries like "and", "c" and "www"), probably due to problems in the applications using SURBLs. Yet it's probably useful not to suppress the errors, so that the programs can be updated to handle them correctly. (Unfortunately the source URIs generating the errors are not directly available, but they may be identifiable in other ways if anyone would like to look for them.)
Minus the errors, I fed this list into Ryan's GetURI to see what it could find. The results are at:
http://ry.ca/cgi-bin/geturi.cgi?id=ham-5lCTzHkxan3xE38RKHa0vx
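The errors were removed by eye, but a simple syntax check would catch the most obvious ones in the list above. The following Python sketch is only an assumption about how one might do that; the pattern and function name are illustrative, and a syntax check cannot catch entries that are well-formed but still wrong.

```python
import ipaddress
import re

# One DNS label: letters, digits and hyphens, not starting or ending with a hyphen
_LABEL = r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?"
# At least one label, a dot, and an alphabetic TLD of two or more letters
_DOMAIN_RE = re.compile(rf"^(?:{_LABEL}\.)+[a-z]{{2,}}$")


def looks_valid(entry):
    """True if the entry is a dotted-quad IPv4 address or a plausible domain name."""
    entry = entry.lower()
    try:
        ipaddress.IPv4Address(entry)
        return True
    except ValueError:
        pass
    return _DOMAIN_RE.match(entry) is not None


entries = ["56.227.117.38", "internet.e", "beliefnet.com", "grisoft.com)", "and", "c", "www"]
print([e for e in entries if looks_valid(e)])
```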
Quite a few appear OK to whitelist, like democrats.org, purdue.edu, latimes.com, charter.net, zdnet.com, etc. I'll probably go ahead and whitelist obvious ones like these, so some of them will probably be off this "unmatched" list and onto the whitelist hits by the time you read this.
Nonetheless I recommend we all take a look at this unmatched list periodically, especially the top few dozen, to look for potential domains to whitelist or blacklist. These most frequently appearing domains are probably good candidates for one or the other.
Since this is a list of the unknown "wild" domains coming from live, real-world message URIs, it may be another useful and different source of some data.
Cheers,
Jeff C. -- "If it appears in hams, then don't list it."
On Sunday, October 10, 2004, 12:52:59 AM, Jeff Chan wrote:
http://www.surbl.org/dns-queries.unmatched.30thpercentile.txt
Quite a few appear OK to whitelist, like democrats.org, purdue.edu, latimes.com, charter.net, zdnet.com, etc. I'll probably go ahead and whitelist obvious ones like these, so some of them will probably be off this "unmatched" list and onto the whitelist hits by the time you read this.
OK, I went ahead and whitelisted a few dozen obvious whitehats from this unmatched DNS query data:
http://spamcheck.freeapp.net/whitelists/unmatched-9oct04.sort
Most of them are very obviously whitehats. A couple I had to look up. None of these are FPs; I'm just adding them to keep them off the lists, and eventually out of the DNS queries from SpamAssassin.
At the same time, I noticed a couple of domains in this data that belong to companies with multiple legitimate domains.
For example, tmcs.net belongs to Ticketmaster, which in turn belongs to a consolidation company, iac.com, which owns many other Internet content companies like Expedia, match.com, etc. Since they all appear to have legitimate uses, I tried to find most of their domains, then whitelisted them all as:
http://spamcheck.freeapp.net/whitelists/iac.sort
Likewise, Reed Electronics publishes many professional electronics journals like Electronic Design News (EDN), and they essentially have a links page to their various organizations, so I whitelisted them all:
http://spamcheck.freeapp.net/whitelists/reedelectronics.sort
Reed Electronics is owned by elsevier.com.
There were also a few stray Yahoo domains in those top unmatched queries that we didn't already have whitelisted, so I took our existing Yahoo domain list and added more hand-checked Yahoo domains from DMOZ, Wikipedia and the DNS query data into:
http://spamcheck.freeapp.net/whitelists/yahoo.sort
It's probably still not a complete list of Yahoo domains, but it's a pretty good start, especially based on volume of queries.
On the blackhat side, there are definitely some spammers in the top unmatched DNS queries that should be checked and probably listed in SURBLs:
http://www.surbl.org/dns-queries.unmatched.30thpercentile.txt
I'll leave it to some of you folks who enjoy listing spammers (more than whitelisting hammers ;-) to look into some of these....
Since spam domains tend to be a lot more dynamic than ham domains, I'd recommend checking this list every few days.
Certainly there are more ham domains in there too, and if anyone spots any, please report or whitelist them.
Cheers,
Jeff C. -- "If it appears in hams, then don't list it."