[SURBL-Discuss] Large ham corpus hits against SURBLs

Jeff Chan jeffc at surbl.org
Fri Sep 10 08:39:49 CEST 2004


I've extracted the plaintext * URI domains from a 14 GB ham corpus,
ranked them by frequency, and compared the most frequently occurring
domains, at the 70th and 85th percentile levels, against all SURBL
domains.  The master list of SURBL domains can be found at:

  http://spamcheck.freeapp.net/multi.domains.sort
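
For anyone who wants to repeat this kind of comparison, here is a
minimal sketch in Python.  It is not the script that produced the
results in this message.  It assumes one reading of "Nth percentile":
the most frequent domains that together account for that fraction of
all domain occurrences.  The function names and the example domains
are hypothetical.

```python
from collections import Counter


def top_domains_by_coverage(domain_counts, fraction):
    """Return the most frequent domains that together account for
    roughly `fraction` of all occurrences.

    Assumption: this is only one possible reading of "percentile".
    """
    total = sum(domain_counts.values())
    selected = set()
    running = 0
    # most_common() yields (domain, count) pairs, highest count first.
    for domain, count in domain_counts.most_common():
        if running >= fraction * total:
            break
        selected.add(domain)
        running += count
    return selected


def blocklist_hits(domain_counts, blocklist, fraction):
    """Domains that are both frequent in ham and present in the blocklist."""
    return sorted(top_domains_by_coverage(domain_counts, fraction) & blocklist)


# Made-up counts, standing in for the ham corpus and the SURBL list.
ham_counts = Counter({
    "a.example": 50,
    "b.example": 30,
    "c.example": 15,
    "d.example": 5,
})
listed = {"b.example", "c.example", "d.example"}

hits_70 = blocklist_hits(ham_counts, listed, 0.70)
hits_85 = blocklist_hits(ham_counts, listed, 0.85)
```

Raising the fraction can only add domains to the selected set, which
is why the 85th percentile matches include the 70th percentile ones.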

At the 70th percentile level, there were only two matches:

  automotivedigest.com
  processrequest.com

At the 85th percentile there were a few more:

  automotivedigest.com
  chartshop.com
  ct002.com
  dakotaairparts.com
  hallogram.com
  infoaeroplan.ca
  investorsinsight.com
  processrequest.com
  sitepronews.com
  topachat.com

These are arguably false positives.  What do we know about them?
Should any of them be whitelisted?


* Looking only at plaintext has advantages and disadvantages:

1. It is quick and easy.
2. It does not "double or triple count" messages which also
have Base64 or quoted-printable encoded versions of the same URIs.
3. It misses encoded URIs which don't have plaintext
equivalents in a different part of the message.

Nonetheless, the data are probably still generally useful.
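
To illustrate the tradeoff in the footnote, here is a rough Python
sketch that contrasts scanning the raw message source with decoding
each text part first.  It uses the standard library email package.
The regular expression is a deliberately simple assumption, not how
SURBL clients extract URIs, and the function names are hypothetical.

```python
import re
from email import message_from_string

# Simplistic host matcher for http/https URIs (an assumption for this sketch).
URI_HOST = re.compile(r"https?://([A-Za-z0-9.-]+)", re.IGNORECASE)


def _hosts(text):
    """Lowercased hosts found in text, with any trailing dot removed."""
    return {h.lower().rstrip(".") for h in URI_HOST.findall(text)}


def hosts_in_raw_text(raw_message):
    """Plaintext scan: encoded parts are left encoded, so URIs inside
    Base64 or quoted-printable bodies are generally missed."""
    return _hosts(raw_message)


def hosts_in_decoded_parts(raw_message):
    """Decode every text/* part before scanning.  A URI present in both
    the plain and the HTML part is seen twice, but the set keeps one copy."""
    msg = message_from_string(raw_message)
    found = set()
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        # decode=True undoes the Content-Transfer-Encoding, if any.
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "ascii"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset label: fall back to a byte-preserving decode.
            text = payload.decode("latin-1")
        found |= _hosts(text)
    return found
```

Counting occurrences rather than collecting a set is where the
"double or triple count" issue would show up, since the same URI can
appear once per alternative part of a message.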

Jeff C.
-- 
Jeff Chan
mailto:jeffc at surbl.org
http://www.surbl.org/


