[SURBL-Discuss] Took top percentiles of DMOZ and wikipedia domains, some results

Jeff Chan jeffc at surbl.org
Sat Oct 9 12:52:00 CEST 2004


I took the top 50th percentile of the multiple-version-intersected
DMOZ domains and matched them with the top 70th percentile of the
multiple-version-intersected wikipedia domains:

  http://spamcheck.freeapp.net/whitelists/dmoz-50thpercentile.srt

  http://spamcheck.freeapp.net/whitelists/wikipedia-70thpercentile.srt

resulting in a list of about 15k domains:

  http://spamcheck.freeapp.net/whitelists/percentile-wikipedia-dmoz.srt

(wc output; the first column is the line count, one domain per line:)

  255838  255838 3955772 dmoz-50thpercentile.srt
   28323   28323  402512 wikipedia-70thpercentile.srt
   14982   14982  202925 percentile-wikipedia-dmoz.srt

that appeared at least two or three times in both data sources.
(Percentiles were chosen to give about two or three hits of
domains within the same source.)  I then matched those 15k
against the list of all SURBL domains and got the following
hits, all possible FPs:

  http://spamcheck.freeapp.net/whitelists/percentile-wikipedia-dmoz-blocklist.txt

0catch.com
1asphost.com
741.com
8bit.co.uk
8m.net
anzwers.org
arena.ne.jp
away.com
centralhome.com
cheapass.com
f2g.net
faithweb.com
fateback.com
fortunecity.de
freewebpage.org
galeon.com
htmlplanet.com
i8.com
itgo.com
iwarp.com
kit.net
kki.net.pl
ledger-enquirer.com
nana.co.il
online-dictionary.biz
p5.org.uk
quuxuum.org
republika.pl
s5.com
spaceports.com
t35.com
telepolis.com
transnationale.org
up.co.il
xiloo.com
zip.net
zonai.com
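
For anyone wanting to reproduce the matching above, here is a minimal sketch in Python. It is not the script actually used. It assumes each source is available as a plain list of domains with repeats, and the names frequent_domains, possible_fps and min_hits are made up for illustration, with min_hits standing in for the percentile cut. The domains in the demo are placeholders, not real data.

```python
from collections import Counter


def frequent_domains(domains, min_hits=2):
    """Return the set of domains seen at least min_hits times in one source."""
    counts = Counter(d.strip().lower() for d in domains if d.strip())
    return {domain for domain, hits in counts.items() if hits >= min_hits}


def possible_fps(dmoz, wikipedia, blocklist, min_hits=2):
    """Domains frequent in BOTH sources that also appear on the blocklist."""
    common = frequent_domains(dmoz, min_hits) & frequent_domains(wikipedia, min_hits)
    listed = {d.strip().lower() for d in blocklist if d.strip()}
    return sorted(common & listed)


if __name__ == "__main__":
    # Placeholder inputs only; real runs would read the .srt files instead.
    dmoz = ["a.example", "a.example", "b.example", "b.example", "c.example"]
    wikipedia = ["a.example", "a.example", "b.example", "b.example", "d.example"]
    blocklist = ["b.example", "d.example"]
    for domain in possible_fps(dmoz, wikipedia, blocklist):
        print(domain)
```

Using sets keeps the intersection simple and order-independent; the result is sorted only so the output is stable.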

One additional test I'd like to apply to all these data is to
remove any that are listed in SBL, but I haven't coded that up
yet.  However, Ryan Thompson's GetURI does include an SBL check,
along with other goodies like domain age, so I fed these into
his CGI version, with the results at:

http://ry.ca/cgi-bin/geturi.cgi?id=ham-es0EnYAUmBru8HxYCzvQ5x
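
As a rough idea of what a standalone SBL test could look like, here is a sketch in Python. It makes several assumptions: that the check is done against the addresses the domain itself resolves to (how GetURI does it is not described here, and checking nameserver addresses instead is another option), that the zone is sbl.spamhaus.org, and that a listing is signalled by an answer in 127.0.0.x. The function names are hypothetical.

```python
import ipaddress
import socket

SBL_ZONE = "sbl.spamhaus.org"  # assumed zone name for SBL lookups


def dnsbl_query_name(ip, zone=SBL_ZONE):
    """Build the DNSBL query name: reversed IPv4 octets followed by the zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on a malformed address
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def ip_is_listed(ip, zone=SBL_ZONE):
    """True if the DNSBL answers with a 127.0.0.x address for this IP."""
    try:
        answer = socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        # No such name means "not listed"; a resolver failure looks the same.
        return False
    # Assumption: other 127.x answers are error codes, not listings.
    return answer.startswith("127.0.0.")


def domain_in_sbl(domain):
    """True if any address the domain resolves to is listed."""
    try:
        _, _, addresses = socket.gethostbyname_ex(domain)
    except socket.gaierror:
        return False  # unresolvable domains are treated as not listed
    return any(ip_is_listed(ip) for ip in addresses)
```

Calling domain_in_sbl() on each possible FP would then give a first-pass filter, though lookup failures and "not listed" are indistinguishable in this sketch, so hits and misses would both still deserve a manual look.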

It looks like all are between three and ten years old, aside
from:

fortunecity.de         2.7 years old     16 NANAS
online-dictionary.biz  214 days           0 NANAS
nana.co.il             552 days         754 NANAS
1asphost.com           786 days          60 NANAS

And only these two had SBL hits:

8bit.co.uk            1796 days         131 NANAS
xiloo.com             1667 days         342 NANAS

Aside from those two, the rest may be candidates for
whitelisting, though I did not check them further.
(Note also that GetURI does not count NANAS; I did those
few manually.)

May I ask for some help in checking these?


Note that we should still continue to check the DMOZ hits since
there are probably some more FPs in there also:

  http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.summed.txt
  http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.txt
  http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.ws

    1338   13380  141946 dmoz-blocklist1.summed.txt
    1338    1338   20533 dmoz-blocklist1.txt
    1173   11730  124298 dmoz-blocklist1.ws

Most are in WS.

Jeff C.
--
"If it appears in hams, then don't list it."


