Large ham corpus hits against SURBLs - Discuss

10 Sep 2004


I've extracted the plaintext* URI domains from a 14 GB ham corpus,
taken the top 70th and 85th percentiles of the most frequently
occurring domains, and compared them against all SURBL domains.
The master list of those domains can be found at:
http://spamcheck.freeapp.net/multi.domains.sort
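For anyone who wants to try a similar comparison, here is a rough Python sketch. It is only an illustration, not the script that produced the results in this message. It makes several assumptions: "top Nth percentile" is read as "the most frequent domains that together account for N% of all domain occurrences" (a reading under which the 85% set contains the 70% set); the URI pattern is simplified; and domains are reduced naively to their last two labels, which is wrong for suffixes like co.uk. All function names are hypothetical.

```python
import re
from collections import Counter

# Simplified, hypothetical pattern: host part of http/https URIs in plain text.
URI_HOST = re.compile(r"https?://([A-Za-z0-9.-]+)", re.IGNORECASE)


def registered_domain(host):
    """Naively keep the last two labels (an assumption; wrong for e.g. co.uk)."""
    labels = host.lower().strip(".").split(".")
    return ".".join(labels[-2:])


def count_domains(texts):
    """Count how often each domain occurs across an iterable of message texts."""
    counts = Counter()
    for text in texts:
        for host in URI_HOST.findall(text):
            counts[registered_domain(host)] += 1
    return counts


def top_by_coverage(counts, fraction):
    """Most frequent domains that together account for `fraction` of all hits."""
    total = sum(counts.values())
    selected, running = set(), 0
    for domain, n in counts.most_common():
        if running >= fraction * total:
            break
        selected.add(domain)
        running += n
    return selected


def blocklist_hits(counts, fraction, blocklist):
    """Domains in the top `fraction` of ham hits that are also on the blocklist."""
    return sorted(top_by_coverage(counts, fraction) & set(blocklist))
```

With counts built from a ham corpus and the blocklist loaded as one domain per line, calling blocklist_hits(counts, 0.70, blocklist) and blocklist_hits(counts, 0.85, blocklist) would produce two candidate lists of the kind shown in this message.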
At the 70th percentile level, there were only two matches:
  automotivedigest.com
  processrequest.com
At the 85th percentile there were a few more:
  automotivedigest.com
  chartshop.com
  ct002.com
  dakotaairparts.com
  hallogram.com
  infoaeroplan.ca
  investorsinsight.com
  processrequest.com
  sitepronews.com
  topachat.com
These are arguably false positives.  What do we know about them?
Should we whitelist any of them?
* Looking only at plaintext has advantages and disadvantages:
1. It is quick and easy.
2. It does not "double or triple count" messages which also
have Base64 or quoted-printable encoded versions of the same URIs.
3. It misses encoded URIs which don't have plaintext
equivalents in a different part of the message.
Nonetheless, the data are probably still generally useful.
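To illustrate the trade-off in the footnote, here is a hedged sketch of the alternative: decoding each text part with Python's standard email package and collecting the URI hosts as a set per message. Base64 and quoted-printable parts are decoded, so URIs that exist only in encoded form are not missed, and a URI that appears in several parts of one message counts once. This is not what was done for the numbers above; the function name and the simplified URI pattern are assumptions.

```python
import re
from email import message_from_string

# Simplified, hypothetical pattern: host part of http/https URIs.
URI_HOST = re.compile(r"https?://([A-Za-z0-9.-]+)", re.IGNORECASE)


def hosts_per_message(raw_message):
    """Return the set of URI hosts found in all text parts, after decoding."""
    msg = message_from_string(raw_message)
    hosts = set()
    for part in msg.walk():
        # Skip multipart containers and non-text attachments.
        if part.get_content_maintype() != "text":
            continue
        # decode=True undoes base64 / quoted-printable transfer encoding.
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "ascii"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset label: fall back to ASCII with replacement.
            text = payload.decode("ascii", errors="replace")
        hosts.update(h.lower() for h in URI_HOST.findall(text))
    return hosts
```

Because the result is a set, summing per-message sets over the corpus counts each host at most once per message, which avoids the double counting described in point 2 while still catching the URIs described in point 3.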
Jeff C.
-- 
Jeff Chan
mailto:jeffc@surbl.org
http://www.surbl.org/