Following up on the earlier check of DMOZ domains against SURBL data, I applied some of Quinlan's suggestions: I grabbed different revisions of the DMOZ data and joined (intersected) them in order to eliminate changes such as spammer/abuser domains removed by editors. This also means that new additions are ignored, but the corpus is so large that the benefit of capturing editor removals probably outweighs that loss.
The three most recent snapshots available, with file dates of 9/9/04, 9/25/04 and 10/7/04, were intersected, resulting in fewer records (only those that are constant across all three snapshots):
http://rdf.dmoz.org/rdf/archive/
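For anyone who wants to replicate the intersection step, here's a minimal sketch in Python. The snapshot filenames are hypothetical, and a sort/comm pipeline would do the same job:

    # A minimal sketch of the snapshot intersection, assuming one domain
    # per line in each file; the snapshot filenames are hypothetical.
    def read_domains(path):
        with open(path) as f:
            return set(line.strip().lower() for line in f if line.strip())

    snapshots = ["dmoz-20040909.txt", "dmoz-20040925.txt", "dmoz-20041007.txt"]
    common = read_domains(snapshots[0])
    for path in snapshots[1:]:
        common &= read_domains(path)    # keep only domains in every snapshot

    with open("dmoz.srt", "w") as out:
        for domain in sorted(common):
            out.write(domain + "\n")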
Here are the line, word and character counts:
2300851 2300851 38065969 dmoz.srt
   1169   11690  123909 dmoz-blocklist.summed.txt
   1141   11410  120860 dmoz-blocklist.ws
   1169    1169   17977 dmoz-blocklist.txt
   7394    7394   97011 dmoz-whitelist.txt
The above are revised versions of the joins on the blocklists, in order: with list info (.summed.txt), with ws hits (.ws), with just the domains (.txt), and against the whitelist.
These are in the whitelists directory, though we are still *not* applying the dmoz domains as whitelists:
http://spamcheck.freeapp.net/whitelists/
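To make the joins concrete, here's a rough sketch of how the blocklist and whitelist checks could be reproduced in Python. The filenames and formats are assumptions: one domain per line for dmoz.srt and the whitelist, and "domain listname ..." per line for the summed blocklist data:

    # Sketch of joining the intersected dmoz domains against blocklist
    # and whitelist data; only the first whitespace-separated field of
    # each line is treated as the domain.
    def read_first_field(path):
        with open(path) as f:
            return set(line.split()[0].lower() for line in f if line.strip())

    dmoz = read_first_field("dmoz.srt")
    blocked = read_first_field("blocklist.summed.txt")  # hypothetical export
    whitelisted = read_first_field("whitelist.txt")     # hypothetical export

    hits = sorted(dmoz & blocked)       # potential FPs to review by hand
    print(len(hits), "blocklist hits;", len(dmoz & whitelisted),
          "already whitelisted")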
The previous dmoz (9/25 version only IIRC) and hits are archived as:
2326173 2326173 38494184 dmoz.srt1
   1338   13380  141946 dmoz-blocklist1.summed.txt
   1173   11730  124298 dmoz-blocklist1.ws
   1338    1338   20533 dmoz-blocklist1.txt
   7375    7375   96720 dmoz-whitelist1.txt
I was also able to grab four snapshots of Wikipedia, all sections (all languages). Only the two most recent snapshots had comparable numbers of sections, so I used only those two. Using only two sets may be OK, since these are much smaller corpora and there's probably more hand-editing of them.
http://download.wikimedia.org/
Where dmoz has about 2.3 million domains, wikipedia has about 174k domains:
173828 173828 2633441 wikipedia.srt
    188    1880   19631 wikipedia-blocklist.summed.txt
    188     188    2713 wikipedia-blocklist.txt
   2437    2437   29581 wikipedia-whitelist.txt
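The wikipedia dumps are article data rather than domain lists, so the domains have to be pulled out of the external links first. A rough sketch follows, with the caveat that the regex and the naive two-label reduction are simplifications; real processing would need a two-level-TLD table for hosts like example.co.uk, and the dump filename is hypothetical:

    # Sketch: extract external-link hostnames from a dump file and reduce
    # them to base domains.
    import re

    url_re = re.compile(r'https?://([A-Za-z0-9.-]+)', re.IGNORECASE)

    def base_domain(host):
        labels = host.lower().strip(".").split(".")
        return ".".join(labels[-2:]) if len(labels) >= 2 else host

    domains = set()
    with open("wikipedia-dump.xml", errors="ignore") as f:
        for line in f:
            for host in url_re.findall(line):
                domains.add(base_domain(host))

    with open("wikipedia.srt", "w") as out:
        for d in sorted(domains):
            out.write(d + "\n")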
I also took the intersection of the three-snapshot dmoz list and the two-snapshot wikipedia list to get a smaller list containing only the ~102k domains found in both wikipedia and dmoz:
101619 101619 1498653 wikipedia-dmoz.srt
    116    1160   11928 wikipedia-dmoz-blocklist.summed.txt
    116     116    1591 wikipedia-dmoz-blocklist.txt
   2223    2223   26854 wikipedia-dmoz-whitelist.txt
This intersection of dmoz and wikipedia domains probably represents the best hope for large whitelist additions so far. The data is probably imperfect, but at least it has had some checks by human editors, and techniques were applied to reduce spammer domains, including comparing the snapshots over time and intersecting two relatively unrelated sources.
Can anyone think of any other hand-edited databases, directories, encyclopedias, etc. of URIs of hopefully legitimate (non-spammer) domains that are publicly available? Please think about it a little, and speak up!
While 102k domains isn't nearly as large as the 2.3M in dmoz, it's certainly more than the 12k or so whitelist records we currently have. How does the intersected list look as a potential whitelist?
http://spamcheck.freeapp.net/whitelists/wikipedia-dmoz.srt
Please also take a look at these blocklist hits (potential FPs) and share what you think:
http://spamcheck.freeapp.net/whitelists/wikipedia-dmoz-blocklist.summed.txt
Would there be many FNs (missed spams) if we whitelisted all of these? In other words, are these all truly False Positives? If not, which ones do you feel are true spammers, and why?
Jeff C. -- "If it appears in hams, then don't list it."