[SURBL-Discuss] Possible large whitelist from DMOZ data

Wed Oct 6 11:58:08 CEST 2004

Daniel Quinlan, one of the principal SpamAssassin architects, had
some good suggestions for reducing false positives in the SURBL
data.  One was using public databases of URIs, particularly
hand-built ones like dmoz.org and wikipedia.org or even yahoo.com
as sources of mostly legitimate domains.  (The wikipedia is not a
web directory in a conventional sense; it's more like an open
encyclopedia, but it has a relatively large collection of URIs.)

Presumably most of the URIs in these are legitimate and don't
belong to spammers, especially in DMOZ since it's hand-built.
So the question is: can these be useful as whitelist sources, or
perhaps as one of the checks on new SURBL additions?

The DMOZ open directory publishes its data in RDF form at:

  http://rdf.dmoz.org/

So we downloaded the URL data, extracted the domains and
compared them against the SURBL block and whitelists:

% join dmoz.srt ../multi.domains.sort | wc
    1338    1338   20533
% join dmoz.srt ../whitelist-domains.sort | wc
    7375    7375   96720
% join dmoz.srt ../multi.domains.sort > dmoz-blocklist.txt
% join dmoz.srt ../whitelist-domains.sort > dmoz-whitelist.txt

There were 1338 DMOZ hits against our blocklisted domains and
7375 against our whitelists.  You can view those matches at:

  http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.txt
  http://spamcheck.freeapp.net/whitelists/dmoz-whitelist.txt
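
For anyone who wants to repeat the comparison without join(1),
here is a rough Python sketch of the same idea: pull host names
out of URLs, then intersect them with a list.  The function
names and the sample data are made up for illustration, and the
sketch does not chop hosts down to registered domains the way
our extraction scripts do.

```python
from urllib.parse import urlsplit

def hostname_of(url):
    """Return the lowercased host part of a URL, or None if it has none."""
    host = urlsplit(url).hostname
    return host.lower() if host else None

def common_domains(candidates, listed):
    """Sorted entries present in both inputs, which is what `join`
    prints for sorted, de-duplicated one-column files."""
    return sorted(set(candidates) & set(listed))

# Made-up sample data, not taken from the real DMOZ or SURBL lists.
dmoz_urls = [
    "http://www.example.org/about",
    "http://Example.COM/",
    "http://example.net/x",
]
dmoz_hosts = {hostname_of(u) for u in dmoz_urls}
blocklist = ["example.com", "spam.example"]
print(common_domains(dmoz_hosts, blocklist))
```

With real data, the two inputs would be the lines of dmoz.srt
and of one of the SURBL domain files.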

Of the 1338 DMOZ hits against our blocklists, which arguably
could be false positives, most are in WS.  Here is a list with
the data from multi.surbl.org showing list membership included:

% join dmoz.srt ../multi.domains.summed > dmoz-blocklist.summed.txt

  http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.summed.txt

And some list counts from those hits:

    [ws] hits:     1173
    [ob] hits:      165
    [jp] hits:       61
    [sc] hits:        8
    [ab] hits:        4
    [ph] hits:        2

These add up to more than 1338 since some records hit multiple
lists.  The actual hits are in:

  http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.ws
  http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.ob
  http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.jp
  http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.sc
  http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.ab
  http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.ph
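
To illustrate why the per-list counts sum to more than the
number of records, here is a small Python sketch.  It assumes a
hypothetical record format of a domain followed by bracketed
list tags such as "[ws][ob]"; the real layout of the summed file
may differ, and the sample records are invented.

```python
import re
from collections import Counter

# Assumed tag syntax: lowercase list names in square brackets.
TAG = re.compile(r"\[([a-z]+)\]")

def count_list_hits(records):
    """Count, per list, how many records carry that list's tag."""
    per_list = Counter()
    for line in records:
        # set() so a tag repeated on one line is counted once
        per_list.update(set(TAG.findall(line)))
    return per_list

# Invented sample records, one domain per line.
sample = [
    "example.com [ws][ob]",
    "example.net [ws]",
    "example.org [jp]",
]
counts = count_list_hits(sample)
print(len(sample), sum(counts.values()))
```

Here three records produce four list hits, because the first
record is on two lists.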

Data source folks, please review these and try to determine
which ones are FPs and which would result in false negatives
if they came off the lists.  For ones that are FPs you may
want to eliminate them on your end.  For the ones that could
cause FNs, we'd like to know about those as a measure of the
risk of using the DMOZ data for whitelisting.  Right now I'm
leaning towards whitelisting all of these, so please speak
up!

The 7.4k DMOZ whitelist hits represent a majority of the 12.25k
whitelist entries that are not reserved .us geographic domains,
so there is significant overlap between DMOZ and our existing
whitelists, which probably speaks well for both lists.

% wc dotus_reservedlist_v3.lower.sort
   52049   52049 1012735 dotus_reservedlist_v3.lower.sort

% wc ../whitelist-domains.sort
   64299   64299 1169155 ../whitelist-domains.sort

% join dmoz.srt  dotus_reservedlist_v3.lower.sort | wc
       7       7     112
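
The arithmetic behind the 12.25k figure can be checked with a
few lines of Python.  This assumes that the reserved .us list is
wholly contained in whitelist-domains.sort and that the 7
DMOZ/.us matches are among the 7375 whitelist hits; neither is
verified here.

```python
whitelist_total = 64299      # lines in whitelist-domains.sort
dotus_reserved = 52049       # lines in dotus_reserved_list
dmoz_whitelist_hits = 7375   # DMOZ domains found in the whitelist
dmoz_dotus_hits = 7          # DMOZ domains found in the .us list

# Whitelist entries that are not reserved .us domains
non_us = whitelist_total - dotus_reserved
# DMOZ hits among those entries, under the assumptions above
non_us_hits = dmoz_whitelist_hits - dmoz_dotus_hits
share = non_us_hits / non_us
print(non_us, non_us_hits, f"{share:.1%}")
```

That comes to 12250 non-.us entries, of which roughly 60% also
appear in DMOZ.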

The DMOZ data has about 2.3 million domains.  How does anyone
feel about adding them to our whitelists?   A 1.2 MB gzip of
the extracted domains is at:

  http://spamcheck.freeapp.net/whitelists/dmoz.srt.gz

I think we can safely say that whitelisting DMOZ domains
will reduce FPs.  Probably a more important question is: how many
FNs would that cause?  In other words, how many purely spam
domains are in DMOZ, where whitelisting them would wrongly
exclude spam domains from SURBLs?

One way to answer that is to note that the lists ab, sc, jp, ph,
which have much lower FP rates than ws (measured by the
SpamAssassin corpora checks, for example, and also anecdotally by
human FP reports) appear relatively infrequently in the DMOZ
hits.  In other words, SURBL lists that we know are quite spammy
like sc, jp, etc. don't match DMOZ often, so the DMOZ data may
not have too many spam domains.
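
As a rough worked figure, using only the counts reported
earlier in this message: the sum below is an upper bound on the
number of distinct DMOZ domains on the low-FP lists, since one
domain can be on several lists.  Without the sizes of the lists
themselves this is only a coarse indicator.

```python
total_hits = 1338  # DMOZ domains found on any SURBL list
list_hits = {"ws": 1173, "ob": 165, "jp": 61, "sc": 8, "ab": 4, "ph": 2}
low_fp_lists = ("ab", "sc", "jp", "ph")

# Upper bound: a domain on two of these lists is counted twice.
low_fp_hits = sum(list_hits[name] for name in low_fp_lists)
upper_share = low_fp_hits / total_hits
print(low_fp_hits, f"{upper_share:.1%}")
```

So at most 75 of the 1338 hits, under 6%, involve ab, sc, jp or
ph.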

Similar tests could be done against other proposed whitelists.
(We'll probably try the wikipedia data next.)

Another concern is that since these directories are relatively
open, spammers could simply add themselves and effectively get
whitelisted.  However, I intend to take a snapshot of these and
probably not try to refresh the data very often in the future (if
at all), instead using them as relatively static snapshots of
established domains.  Doing that would miss some new additions,
but could also prevent some future abuse by spammers.  On the
other hand 2 million domains is a pretty good start....  :-)

Extraction scripts are not perfect, particularly in the
simplistic chopping to three levels of cctlds, but they're
probably adequate:

  http://spamcheck.freeapp.net/whitelists/extract-dmoz-domains
  http://spamcheck.freeapp.net/whitelists/chop-two-level-domains.sed
  http://spamcheck.freeapp.net/whitelists/reduce-to-third-level.sed
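
For readers who don't want to read sed, the general idea of the
chopping can be sketched in Python as below.  This is not a
translation of the scripts above; it is a guess at the approach,
and the suffix set is a tiny hypothetical sample rather than a
real list of two-level ccTLDs.

```python
# Hypothetical, deliberately incomplete sample of two-level suffixes.
TWO_LEVEL_SUFFIXES = {"co.uk", "com.au", "co.jp"}

def registered_domain(host):
    """Reduce a host name to two labels, or to three labels when it
    ends in one of the known two-level suffixes."""
    labels = host.lower().rstrip(".").split(".")
    if len(labels) >= 3 and ".".join(labels[-2:]) in TWO_LEVEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])

for host in ("www.example.com", "www.example.co.uk", "example.net"):
    print(host, "->", registered_domain(host))
```

Any suffix missing from the set gets chopped one level too far
(to something like "co.nz"), which is the sort of imperfection
mentioned above.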

Comments please,

Jeff C.
-- 
Jeff Chan
mailto:jeffc at surbl.org
http://www.surbl.org/