I've extracted the plaintext* URI domains from a 14 GB ham corpus, taken the top 70th and 85th percentiles of the most frequently occurring domains, and compared them against all SURBL domains. The master list of SURBL domains can be found at:
http://spamcheck.freeapp.net/multi.domains.sort
At the 70th percentile level, there were only two matches:
automotivedigest.com
processrequest.com
At the 85th percentile there were a few more:
automotivedigest.com
chartshop.com
ct002.com
dakotaairparts.com
hallogram.com
infoaeroplan.ca
investorsinsight.com
processrequest.com
sitepronews.com
topachat.com
These are arguably false positives. What do we know about them? Should we whitelist any of them?
* Looking only at plaintext has advantages and disadvantages:
1. It is quick and easy.
2. It does not "double or triple count" messages that also carry Base64 or quoted-printable encoded versions of the same URIs.
3. It misses encoded URIs that have no plaintext equivalent elsewhere in the message.
Nonetheless, the data are probably still generally useful.
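For anyone who wants to repeat the comparison on their own ham, here is a minimal Python sketch of the procedure described above. It is not the script actually used. It assumes that "top Nth percentile" means the most frequent domains that together account for N percent of all URI occurrences, it reduces hosts to their last two labels (which is wrong for suffixes such as co.uk), and all function names are made up for illustration.

```python
import re
from collections import Counter

# Rough pattern for http(s) URIs in plaintext; captures only the host part.
URI_HOST = re.compile(r"https?://([A-Za-z0-9.-]+)", re.IGNORECASE)

def registered_domain(host):
    """Naive reduction to the last two labels (assumption, wrong for co.uk etc.)."""
    labels = host.lower().strip(".").split(".")
    return ".".join(labels[-2:])

def count_domains(messages):
    """Count how often each domain appears in plaintext URIs."""
    counts = Counter()
    for text in messages:
        for host in URI_HOST.findall(text):
            counts[registered_domain(host)] += 1
    return counts

def top_by_cumulative_share(counts, pct):
    """Most frequent domains that together cover pct percent of all hits
    (one possible reading of 'top Nth percentile')."""
    total = sum(counts.values())
    kept, running = set(), 0
    for domain, n in counts.most_common():
        if running >= total * pct / 100:
            break
        kept.add(domain)
        running += n
    return kept

def possible_false_positives(messages, blacklist, pct):
    """Frequent ham domains that also appear on the blacklist."""
    frequent = top_by_cumulative_share(count_domains(messages), pct)
    return sorted(frequent & set(blacklist))

# Toy data with reserved example domains, not real corpus contents.
ham = [
    "see http://www.example.com/a and http://example.com/b",
    "http://example.com/c http://news.example.org/",
    "http://rare.example.net/x",
]
listed = {"example.org", "example.net"}
print(possible_false_positives(ham, listed, 60))
print(possible_false_positives(ham, listed, 85))
```

Raising the percentage pulls in rarer domains, which is why the 85th percentile run can show more matches than the 70th.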
Jeff C.
For the record, here are the SURBL list hits:
automotivedigest.com [ws]
chartshop.com [ws]
ct002.com [ws][ob]
dakotaairparts.com [ws]
hallogram.com [ws]
infoaeroplan.ca [ob]
investorsinsight.com [ws]
processrequest.com [ws]
sitepronews.com [ws]
topachat.com [ws]
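To see at a glance which list is responsible for which hit, the tags above can be regrouped by list. This is only a small sketch working on the text of this message. The parse_hits name and the one-domain-per-line input layout are assumptions for illustration.

```python
import re
from collections import defaultdict

# The hits from this message, one "domain [list][list]" entry per line.
HITS = """\
automotivedigest.com [ws]
chartshop.com [ws]
ct002.com [ws][ob]
dakotaairparts.com [ws]
hallogram.com [ws]
infoaeroplan.ca [ob]
investorsinsight.com [ws]
processrequest.com [ws]
sitepronews.com [ws]
topachat.com [ws]
"""

def parse_hits(text):
    """Map each list tag (e.g. 'ws') to the domains that carry it."""
    by_list = defaultdict(list)
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        domain, tags = parts
        for tag in re.findall(r"\[(\w+)\]", tags):
            by_list[tag].append(domain)
    return dict(by_list)

for tag, domains in sorted(parse_hits(HITS).items()):
    print(tag, len(domains), " ".join(domains))
```

Nine of the ten domains are on ws, two are on ob, and only ct002.com is on both.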
Jeff C.
I checked for overlaps with my blacklists.
ct002.com goes with 123greetings.com, which is *not* blacklisted on SURBL.
I blacklisted ct002.com on September 3, 2004 when I found it in a spammy-looking mail from Raymond's spamfeed. It was less than a year old and here's what SA thought about the triggering message:
spam, SpamAssassin (score=15.844, required 5, BAYES_99 5.40, CLICK_BELOW 0.10, HTML_FONT_INVISIBLE 0.60, HTML_MESSAGE 0.10, MIME_HTML_ONLY 0.32, MSGID_FROM_MTA_HEADER 0.70, OUTBLAZE_URI_RBL 3.50, RATWARE_HASH_2_V2 1.62, WS_URI_RBL 3.50)
So obviously my listing wasn't the first one on SURBL. I can't rule out that the mail was solicited though.
Joe
In reply to Joe Wein's message of Friday, September 10, 2004, 5:05:17 AM:
Thanks for checking and for the feedback, Joe! :-)
Does anyone have any info on the others?
Jeff C.