[SURBL-Discuss] Took top percentiles of DMOZ and wikipedia domains, some results

Rob McEwen rob at powerviewsystems.com
Sat Oct 9 15:27:00 CEST 2004


Jeff,

As usual, this stuff is moving so fast, I can hardly keep up.

The following was the **original** list of Whitelist candidates:

http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.txt

Can you give us an update as to which of these have now been whitelisted,
which are still candidates for whitelisting, and which have been decided
against? I just want to make sure that the original list wasn't
whitelisted across the board... and if it was, I'd like to know so that I
can make better decisions about which of these to keep in my private
blocklist.

Also, while testing against this list, I did find a couple of FPs (as I
previously reported). However, I also found a lot of spam. Would it still be
helpful for me to post to this discussion a list of those URIs which DID
catch spam (from the original list)? So far I've been making it my
priority to find and report FPs... but don't let my lack of reporting spam
"hits" fool you. This original list did catch many spams.

Or, perhaps I should move on and just start fresh applying this same kind of
testing to the new list below:

http://spamcheck.freeapp.net/whitelists/percentile-wikipedia-dmoz-blocklist.txt

I do notice some overlap between the two lists.

Sorry this post was "all over the map"... just answer as best you can and
we'll go from there.

Thanks,

Rob McEwen

-----Original Message-----
From: discuss-bounces at lists.surbl.org
[mailto:discuss-bounces at lists.surbl.org] On Behalf Of Jeff Chan
Sent: Saturday, October 09, 2004 6:52 AM
To: SURBL Discuss
Subject: [SURBL-Discuss] Took top percentiles of DMOZ and wikipedia
domains, some results

I took the top 50th percentile of the multiple-version-intersected
DMOZ domains and matched them with the top 70th percentile of the
multiple-version-intersected wikipedia domains:

  http://spamcheck.freeapp.net/whitelists/dmoz-50thpercentile.srt

  http://spamcheck.freeapp.net/whitelists/wikipedia-70thpercentile.srt

resulting in a list of about 15k domains:

  http://spamcheck.freeapp.net/whitelists/percentile-wikipedia-dmoz.srt

(first column is word counts:)

  255838  255838 3955772 dmoz-50thpercentile.srt
   28323   28323  402512 wikipedia-70thpercentile.srt
   14982   14982  202925 percentile-wikipedia-dmoz.srt

that appeared at least two or three times in both data sources.
(Percentiles were chosen to give about two or three hits of
domains within the same source.)  I then matched those 15k
against the list of all SURBL domains and got the following
hits, all possible FPs:

 
  http://spamcheck.freeapp.net/whitelists/percentile-wikipedia-dmoz-blocklist.txt

0catch.com
1asphost.com
741.com
8bit.co.uk
8m.net
anzwers.org
arena.ne.jp
away.com
centralhome.com
cheapass.com
f2g.net
faithweb.com
fateback.com
fortunecity.de
freewebpage.org
galeon.com
htmlplanet.com
i8.com
itgo.com
iwarp.com
kit.net
kki.net.pl
ledger-enquirer.com
nana.co.il
online-dictionary.biz
p5.org.uk
quuxuum.org
republika.pl
s5.com
spaceports.com
t35.com
telepolis.com
transnationale.org
up.co.il
xiloo.com
zip.net
zonai.com
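
For anyone who wants to repeat this kind of matching on their own data,
here is a minimal sketch of the steps described above: keep the
higher-count domains from each source, intersect the two sets, then match
the result against a blocklist. It is not the script used to build the
lists above. The function names, the nearest-rank reading of "top Nth
percentile", and the per-domain hit counts as input are all assumptions
made for illustration.

```python
def top_percentile(counts, percentile):
    """Return the domains whose hit count is at or above the value found
    at the given percentile of all counts (nearest-rank style cutoff).

    `counts` maps domain -> number of hits within one data source.
    This reading of "top Nth percentile" is an assumption.
    """
    if not counts:
        return set()
    ordered = sorted(counts.values())
    # Index of the count that sits at the requested percentile.
    idx = min(len(ordered) - 1, int(len(ordered) * percentile / 100))
    cutoff = ordered[idx]
    return {domain for domain, n in counts.items() if n >= cutoff}


def possible_fps(dmoz_counts, wiki_counts, blocklist, dmoz_pct=50, wiki_pct=70):
    """Domains that rank high in BOTH sources and also appear in the
    blocklist, i.e. possible false positives worth a manual look."""
    dmoz_top = top_percentile(dmoz_counts, dmoz_pct)
    wiki_top = top_percentile(wiki_counts, wiki_pct)
    common = dmoz_top & wiki_top          # present in both sources
    return sorted(common & set(blocklist))


if __name__ == "__main__":
    # Placeholder data only; these are not real measurements.
    dmoz = {"a.example": 1, "b.example": 2, "c.example": 3, "d.example": 5}
    wiki = {"c.example": 1, "d.example": 4, "e.example": 2, "f.example": 1}
    listed = ["d.example", "x.example"]
    for domain in possible_fps(dmoz, wiki, listed):
        print(domain)
```

Using sets keeps both the intersection and the blocklist match a single
operation each, which matters little at 15k domains but keeps the logic
easy to check by eye.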

One additional test I'd like to apply to all these data is to
remove any that are listed in SBL, but I haven't coded that up
yet.  However Ryan Thompson's GetURI does include an SBL check,
along with other goodies like domain age, so I fed these into
his CGI version, with the results at:

http://ry.ca/cgi-bin/geturi.cgi?id=ham-es0EnYAUmBru8HxYCzvQ5x

It looks like all are between three and ten years old, aside
from:

fortunecity.de         2.7 years old     16 NANAS
online-dictionary.biz  214 days           0 NANAS
nana.co.il             552 days         754 NANAS
1asphost.com           786 days          60 NANAS

And only these two had SBL hits:

8bit.co.uk            1796 days         131 NANAS
xiloo.com             1667 days         342 NANAS

Aside from those two, the rest may be candidates for
whitelisting, though I did not check them further.
(Note also that GetURI does not count NANAS; I did those
few manually.)
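
The sorting above can be sketched roughly as follows, using only the
figures quoted in this message. This does not call GetURI or query SBL;
the record layout, the function name, and the use of three years as an
age threshold are assumptions for illustration (three years is just the
lower end of the age range mentioned above, not a policy).

```python
from dataclasses import dataclass


@dataclass
class DomainReport:
    domain: str
    age_days: float
    nanas_hits: int
    sbl_listed: bool


# Assumed threshold: "three years", taken from the age range noted above.
MIN_AGE_DAYS = 3 * 365

# Figures as quoted in this message; 2.7 years is converted to days.
REPORTS = [
    DomainReport("fortunecity.de", 2.7 * 365, 16, False),
    DomainReport("online-dictionary.biz", 214, 0, False),
    DomainReport("nana.co.il", 552, 754, False),
    DomainReport("1asphost.com", 786, 60, False),
    DomainReport("8bit.co.uk", 1796, 131, True),
    DomainReport("xiloo.com", 1667, 342, True),
]


def triage(reports, min_age_days=MIN_AGE_DAYS):
    """Split reports into three buckets:
    'sbl_hit' - listed in SBL, so not a whitelist candidate
    'young'   - no SBL hit but younger than the threshold; check further
    'older'   - no SBL hit and at least the threshold age
    """
    buckets = {"sbl_hit": [], "young": [], "older": []}
    for r in reports:
        if r.sbl_listed:
            buckets["sbl_hit"].append(r.domain)
        elif r.age_days < min_age_days:
            buckets["young"].append(r.domain)
        else:
            buckets["older"].append(r.domain)
    return buckets


if __name__ == "__main__":
    for name, domains in triage(REPORTS).items():
        print(name, domains)
```

The SBL check is applied first so that a listed domain is never reported
as a candidate regardless of its age.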

May I ask for some help in checking these?


Note that we should still continue to check the DMOZ hits since
there are probably some more FPs in there also:

  http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.summed.txt
  http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.txt
  http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.ws

    1338   13380  141946 dmoz-blocklist1.summed.txt
    1338    1338   20533 dmoz-blocklist1.txt
    1173   11730  124298 dmoz-blocklist1.ws

Most are in WS.

Jeff C.
--
"If it appears in hams, then don't list it."

_______________________________________________
Discuss mailing list
Discuss at lists.surbl.org
http://lists.surbl.org/mailman/listinfo/discuss



