I took the top 50th percentile of the multiple-version-intersected DMOZ domains and matched them against the top 70th percentile of the multiple-version-intersected Wikipedia domains:
http://spamcheck.freeapp.net/whitelists/dmoz-50thpercentile.srt
http://spamcheck.freeapp.net/whitelists/wikipedia-70thpercentile.srt
resulting in a list of about 15k domains:
http://spamcheck.freeapp.net/whitelists/percentile-wikipedia-dmoz.srt
(first column is word counts:)
 255838  255838 3955772 dmoz-50thpercentile.srt
  28323   28323  402512 wikipedia-70thpercentile.srt
  14982   14982  202925 percentile-wikipedia-dmoz.srt
These are domains that appeared at least two or three times in both data sources. (The percentiles were chosen to require about two or three hits per domain within the same source.)

I then matched those 15k against the list of all SURBL domains and got the following hits, all possible FPs:
http://spamcheck.freeapp.net/whitelists/percentile-wikipedia-dmoz-blocklist....
0catch.com
1asphost.com
741.com
8bit.co.uk
8m.net
anzwers.org
arena.ne.jp
away.com
centralhome.com
cheapass.com
f2g.net
faithweb.com
fateback.com
fortunecity.de
freewebpage.org
galeon.com
htmlplanet.com
i8.com
itgo.com
iwarp.com
kit.net
kki.net.pl
ledger-enquirer.com
nana.co.il
online-dictionary.biz
p5.org.uk
quuxuum.org
republika.pl
s5.com
spaceports.com
t35.com
telepolis.com
transnationale.org
up.co.il
xiloo.com
zip.net
zonai.com
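The matching steps above amount to set intersections: domains common to both ham sources, then intersected with the blocklist. A minimal Python sketch follows, assuming each file holds one domain per line. The function names `read_domains` and `possible_fps` are hypothetical and are not part of any existing SURBL tooling.

```python
def read_domains(path):
    """Read one domain per line, lowercased, skipping blank lines."""
    with open(path) as fh:
        return {line.strip().lower() for line in fh if line.strip()}


def possible_fps(dmoz, wikipedia, blocklist):
    """Return the sorted domains found in both ham sources and in the blocklist."""
    return sorted(set(dmoz) & set(wikipedia) & set(blocklist))


if __name__ == "__main__":
    # Tiny in-memory illustration; real input would come from read_domains().
    dmoz = {"a.example", "b.example", "c.example"}
    wikipedia = {"b.example", "c.example", "d.example"}
    blocklist = {"c.example", "x.example"}
    print(possible_fps(dmoz, wikipedia, blocklist))
```

With real data, the three arguments would be the sets read from the two percentile files and from the list of all SURBL domains.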
One additional test I'd like to apply to all these data is to remove any domains that are listed in SBL, but I haven't coded that up yet. However, Ryan Thompson's GetURI does include an SBL check, along with other goodies like domain age, so I fed these into his CGI version, with the results at:
http://ry.ca/cgi-bin/geturi.cgi?id=ham-es0EnYAUmBru8HxYCzvQ5x
It looks like all are between three and ten years old, aside from:
fortunecity.de          2.7 years old    16 NANAS
online-dictionary.biz   214 days          0 NANAS
nana.co.il              552 days        754 NANAS
1asphost.com            786 days         60 NANAS
And only these two had SBL hits:
8bit.co.uk   1796 days   131 NANAS
xiloo.com    1667 days   342 NANAS
Aside from those two, the rest may be candidates for whitelisting, though I did not check them further. (Note also that GetURI does not count NANAS; I did those few manually.)
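For the SBL test that is not yet coded up, one possible shape is a plain DNSBL lookup. The sketch below makes several assumptions: it checks the domain's A record (GetURI may well check something else, such as name server addresses), it uses the public `sbl.spamhaus.org` zone, and it treats only `127.0.0.x` answers as listings, since other answers can be resolver error codes. The function names are hypothetical.

```python
import socket

SBL_ZONE = "sbl.spamhaus.org"  # assumption: public Spamhaus SBL zone


def sbl_query_name(ipv4, zone=SBL_ZONE):
    """Build a DNSBL query name: reversed IPv4 octets followed by the zone."""
    octets = ipv4.split(".")
    if len(octets) != 4 or not all(o.isdigit() and int(o) <= 255 for o in octets):
        raise ValueError("not an IPv4 address: %r" % ipv4)
    return ".".join(reversed(octets)) + "." + zone


def is_sbl_listed(domain):
    """Resolve the domain's A record and look that address up in the SBL zone."""
    try:
        ip = socket.gethostbyname(domain)
    except socket.gaierror:
        return False  # domain does not resolve, so nothing to check
    try:
        answer = socket.gethostbyname(sbl_query_name(ip))
    except socket.gaierror:
        return False  # no record in the zone means not listed
    # Assumption: only 127.0.0.x answers count as real listings.
    return answer.startswith("127.0.0.")
```

Something like `[d for d in candidates if not is_sbl_listed(d)]` would then drop the SBL-listed domains from a candidate list, subject to the caveats above.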
May I ask for some help in checking these?
Note that we should still continue to check the DMOZ hits since there are probably some more FPs in there also:
http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.summed.txt
http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.txt
http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.ws
1338 13380 141946 dmoz-blocklist1.summed.txt
1338  1338  20533 dmoz-blocklist1.txt
1173 11730 124298 dmoz-blocklist1.ws
Most are in WS.
Jeff C.
--
"If it appears in hams, then don't list it."