[SURBL-Discuss] Re: Possible large whitelist from DMOZ data

Wed Oct 6 15:37:55 CEST 2004

Hi Jeff,

You might want to reconsider your use of the entire DMOZ directory.
There may be some subtrees that you can ignore.  Of the 1338 DMOZ hits
against the blocklists (the possible false positives), how many come
from the same sections of DMOZ?
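
To make the question concrete, here is a rough sketch of the tally I
have in mind.  It assumes you can get (domain, DMOZ topic path) pairs
out of the RDF dump; the function name, the pair format and the sample
data below are all made up for illustration, not taken from your
extraction scripts.

```python
from collections import Counter


def count_hits_by_section(hit_domains, domain_topics, depth=2):
    """Count blocklist hits per DMOZ section, truncating topic paths to `depth` levels."""
    hits = set(hit_domains)
    counts = Counter()
    for domain, topic in domain_topics:
        if domain in hits:
            # "Top/Shopping/Gifts" with depth=2 becomes "Top/Shopping"
            section = "/".join(topic.split("/")[:depth])
            counts[section] += 1
    return counts


# Made-up sample data (hypothetical domains and topic paths).
sample_topics = [
    ("example.com", "Top/Shopping/Health/Pharmacy"),
    ("example.net", "Top/Shopping/Gifts"),
    ("example.org", "Top/Science/Math"),
    ("example.edu", "Top/Reference/Education"),
]
sample_hits = ["example.com", "example.net", "example.org"]

for section, n in count_hits_by_section(sample_hits, sample_topics).most_common():
    print(section, n)
```

A domain listed under several topics is counted once per topic here.
If a handful of sections account for most of the hits, those subtrees
could be left out of the whitelist source.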

Henry

Jeff Chan wrote:

>Daniel Quinlan, one of the principal SpamAssassin architects, had
>some good suggestions for reducing false positives in the SURBL
>data.  One was using public databases of URIs, particularly
>hand-built ones like dmoz.org and wikipedia.org or even yahoo.com
>as sources of mostly legitimate domains.  (The wikipedia is not a
>web directory in a conventional sense; it's more like an open
>encyclopedia, but it has a relatively large collection of URIs.)
>
>Presumably most of the URIs in these are legitimate and don't
>belong to spammers, especially in DMOZ since it's hand-built.
>So the question is: can these be useful as whitelist sources or
>perhaps as one of the checks on new SURBL additions.
>
>The DMOZ open directory publishes its data in RDF form at:
>
>  http://rdf.dmoz.org/
>
>So we downloaded the URL data, extracted the domains and
>compared them against the SURBL block and whitelists:
>
>% join dmoz.srt ../multi.domains.sort | wc
>    1338    1338   20533
>% join dmoz.srt ../whitelist-domains.sort | wc
>    7375    7375   96720
>% join dmoz.srt ../multi.domains.sort > dmoz-blocklist.txt
>% join dmoz.srt ../whitelist-domains.sort > dmoz-whitelist.txt
>
>There were 1338 DMOZ hits against our blocklisted domains and
>7375 against our whitelists.  You can view those matches at:
>
>  http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.txt
>  http://spamcheck.freeapp.net/whitelists/dmoz-whitelist.txt
>
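
For anyone without the sorted files handy: with one domain per line,
the join above amounts to a set intersection.  A minimal sketch of the
same comparison, assuming one domain per line and case-insensitive
matching (the function name and sample lines are hypothetical):

```python
def common_domains(left_lines, right_lines):
    """Return the sorted domains present in both inputs (one domain per line)."""
    left = {line.strip().lower() for line in left_lines if line.strip()}
    right = {line.strip().lower() for line in right_lines if line.strip()}
    return sorted(left & right)


# Made-up sample lines standing in for dmoz.srt and a blocklist file.
dmoz_lines = ["example.com\n", "example.net\n", "example.org\n"]
blocklist_lines = ["example.net\n", "example.invalid\n"]

print(common_domains(dmoz_lines, blocklist_lines))
```

Unlike join, this does not need sorted input, at the cost of holding
both lists in memory.
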
>Of the 1338 DMOZ hits against our blocklists, which arguably
>could be false positives, most are in WS.  Here is a list with
>the data from multi.surbl.org showing list membership included:
>
>% join dmoz.srt ../multi.domains.summed > dmoz-blocklist.summed.txt
>
>  http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.summed.txt
>
>And some list counts from those hits:
>
>    [ws] hits:     1173
>    [ob] hits:      165
>    [jp] hits:       61
>    [sc] hits:        8
>    [ab] hits:        4
>    [ph] hits:        2
>
>These add up to more than 1338 since some records hit multiple
>lists.  The actual hits are in:
>
>  http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.ws
>  http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.ob
>  http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.jp
>  http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.sc
>  http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.ab
>  http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.ph
>
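
A small sketch of why the per-list counts exceed the number of
records: a domain on two lists is counted under each.  The membership
mapping and sample domains are made up; the reported figures are the
ones quoted above.

```python
from collections import Counter


def per_list_counts(membership):
    """Count how many domains appear on each list; multi-list domains count once per list."""
    counts = Counter()
    for lists in membership.values():
        counts.update(lists)
    return counts


# Made-up sample: three records, five list memberships.
sample = {
    "example.com": {"ws"},
    "example.net": {"ws", "ob"},
    "example.org": {"ob", "jp"},
}
counts = per_list_counts(sample)
print(sum(counts.values()), len(sample))

# Figures quoted in the message above.
reported = {"ws": 1173, "ob": 165, "jp": 61, "sc": 8, "ab": 4, "ph": 2}
extra_memberships = sum(reported.values()) - 1338
print(extra_memberships)
```

With the quoted figures the list counts total 1413, which is 75 more
list memberships than there are records.
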
>Data source folks, please review these and try to determine
>which ones are FPs and which would result in false negatives
>if they came off the lists.  For ones that are FPs you may
>want to eliminate them on your end.  For the ones that could
>cause FNs, we'd like to know about those as a measure of
>using the DMOZ data for whitelisting.  Right now I'm
>leaning towards whitelisting all of these, so please speak
>up!
>
>The 7.4k DMOZ whitelist hits represent a majority of the 12.25k
>whitelist entries that are not reserved .us geographic domains,
>so there is significant overlap between DMOZ and our existing
>whitelists, which probably speaks well for both lists.
>
>% wc dotus_reservedlist_v3.lower.sort
>   52049   52049 1012735 dotus_reservedlist_v3.lower.sort
>
>% wc ../whitelist-domains.sort
>   64299   64299 1169155 ../whitelist-domains.sort
>
>% join dmoz.srt  dotus_reservedlist_v3.lower.sort | wc
>       7       7     112
>
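
The arithmetic behind the 12.25k figure, spelled out from the wc
output above.  It assumes every entry in the reserved .us list is also
in whitelist-domains.sort, which the message implies but does not
state outright.

```python
whitelist_total = 64299      # lines in whitelist-domains.sort
dotus_reserved = 52049       # lines in dotus_reservedlist_v3.lower.sort
dmoz_whitelist_hits = 7375   # DMOZ domains matching the whitelist

# Whitelist entries that are not reserved .us geographic domains.
non_us_entries = whitelist_total - dotus_reserved
share = dmoz_whitelist_hits / non_us_entries

print(non_us_entries, f"{share:.1%}")
```

So roughly six in ten of the non-.us whitelist entries also appear in
DMOZ.
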
>The DMOZ data has about 2.3 million domains.  How does anyone
>feel about adding them to our whitelists?   A 1.2 MB gzip of
>the extracted domains is at:
>
>  http://spamcheck.freeapp.net/whitelists/dmoz.srt.gz
>
>I think we can safely say that whitelisting DMOZ domains
>will reduce FPs.  Probably a more important question is: how many
>FNs would that cause?  In other words, how many purely spam
>domains are in DMOZ, where whitelisting them would wrongly
>exclude spam domains from SURBLs?
>
>One way to answer that is to note that the lists ab, sc, jp, ph,
>which have much lower FP rates than ws (measured by the
>SpamAssassin corpora checks, for example, and also anecdotally by
>human FP reports) appear relatively infrequently in the DMOZ
>hits.  In other words, SURBL lists that we know are quite spammy
>like sc, jp, etc. don't match DMOZ often, so the DMOZ data may
>not have too many spam domains.
>
>Similar tests could be done against other proposed whitelists.
>(We'll probably try the wikipedia data next.)
>
>Another concern is that since these directories are relatively
>open, spammers could simply add themselves and effectively get
>whitelisted.  However, I intend to take a snapshot of these and
>probably not try to refresh the data very often in future (if at
>all), instead using them as relatively static snapshots of
>established domains.  Doing that would miss some new additions,
>but could also prevent some future abuse by spammers.  On the
>other hand 2 million domains is a pretty good start....  :-)
>
>Extraction scripts are not perfect, particularly in the
>simplistic chopping to three levels of cctlds, but they're
>probably adequate:
>
>  http://spamcheck.freeapp.net/whitelists/extract-dmoz-domains
>  http://spamcheck.freeapp.net/whitelists/chop-two-level-domains.sed
>  http://spamcheck.freeapp.net/whitelists/reduce-to-third-level.sed
>
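
For readers who do not want to dig through the sed scripts, a sketch
of the general idea of chopping a hostname down to its registered
domain: keep three levels under known two-level ccTLD suffixes and two
levels otherwise.  This is not a port of the scripts linked above; the
suffix set is a tiny hypothetical sample and the function name is made
up.

```python
# Hypothetical sample; a real list of two-level ccTLD suffixes is much longer.
TWO_LEVEL_SUFFIXES = {"co.uk", "com.au", "co.jp"}


def registered_domain(host):
    """Reduce a hostname to two levels, or three under a known two-level suffix."""
    labels = host.lower().rstrip(".").split(".")
    if len(labels) >= 3 and ".".join(labels[-2:]) in TWO_LEVEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


for host in ["www.example.com", "shop.example.co.uk", "example.org"]:
    print(host, registered_domain(host))
```

Any suffix missing from the set gets chopped to two levels, which is
the kind of imperfection mentioned above.
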
>Comments please,
>
>Jeff C.
>
>