I've extracted the plaintext* URI domains from a 14 GB ham corpus, taken the top 70th and 85th percentiles of the most frequently occurring domains, and compared them against all SURBL domains. The master list of SURBL domains can be found at:
http://spamcheck.freeapp.net/multi.domains.sort
At the 70th percentile level, there were only two matches:
automotivedigest.com processrequest.com
At the 85th percentile there were a few more:
automotivedigest.com chartshop.com ct002.com dakotaairparts.com hallogram.com infoaeroplan.ca investorsinsight.com processrequest.com sitepronews.com topachat.com
These are arguably false positives. What do we know about them? Should any of them be whitelisted?
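For anyone who wants to repeat this kind of comparison on their own ham, here is a rough Python sketch. It is not the script behind the numbers above. The regex, the function names, and the two-label domain reduction are all illustrative assumptions, and it reads "Nth percentile" as "the most frequent domains that together account for N% of all URI occurrences", which is only one possible reading.

```python
import re
from collections import Counter

# Assumed pattern: host part of http/https URIs found in plain text.
URI_HOST = re.compile(r"https?://([A-Za-z0-9.-]+)", re.IGNORECASE)

def registered_domain(host):
    """Naive reduction to the last two labels (assumption; wrong for e.g. co.uk)."""
    labels = host.lower().strip(".").split(".")
    return ".".join(labels[-2:])

def count_domains(lines):
    """Count how often each domain appears in plaintext URIs."""
    counts = Counter()
    for line in lines:
        for host in URI_HOST.findall(line):
            counts[registered_domain(host)] += 1
    return counts

def top_by_coverage(counts, fraction):
    """Most frequent domains that together cover `fraction` of all occurrences."""
    total = sum(counts.values())
    selected, covered = set(), 0
    for domain, n in counts.most_common():
        if covered >= fraction * total:
            break
        selected.add(domain)
        covered += n
    return selected

def blocklist_hits(counts, blocklist, fraction):
    """Frequent ham domains that also appear in the blocklist set."""
    return sorted(top_by_coverage(counts, fraction) & blocklist)
```

Here `blocklist` would be a set built from the master list, assuming it holds one domain per line. Calling `blocklist_hits(counts, blocklist, 0.70)` and then again with `0.85` would give the two candidate false-positive lists.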
* Looking only at plaintext has advantages and disadvantages:
1. It is quick and easy.
2. It does not "double or triple count" messages which also have Base64 or quoted-printable encoded versions of the same URIs.
3. It misses encoded URIs which don't have plaintext equivalents in a different part of the message.
Nonetheless, the data are probably still generally useful.
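On point 3, one way to catch the encoded-only URIs would be to decode each text part before extracting domains. The following is a minimal sketch using Python's standard `email` package, not something that was done for the numbers above. The function name is made up for illustration, and deduplicating domains per message (to avoid the double counting in point 2) is left to the caller.

```python
from email import message_from_bytes, policy

def decoded_text_parts(raw_message):
    """Yield the decoded text of every text/* part of a raw message (bytes)."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    for part in msg.walk():
        # Multipart containers have maintype "multipart" and are skipped.
        if part.get_content_maintype() == "text":
            # get_content() undoes Base64 / quoted-printable and the charset.
            yield part.get_content()
```

Collecting the domains from all parts of one message into a set before counting would keep a URI that appears in both a plain and an encoded part from being counted twice.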
Jeff C.