[SURBL-Discuss] Took top percentiles of DMOZ and wikipedia domains, some results

Jeff Chan jeffc at surbl.org
Sat Oct 9 17:22:08 CEST 2004


OK, for completeness, or to thoroughly compound the confusion, ;-)
here are joins of the (much smaller) percentiled dmoz and
wikipedia lists:

http://spamcheck.freeapp.net/whitelists/dmoz-50thpercentile.srt

  255838  255838 3955772 dmoz-50thpercentile.srt

http://spamcheck.freeapp.net/whitelists/wikipedia-70thpercentile.srt

   28323   28323  402512 wikipedia-70thpercentile.srt

against the SURBL whitelist and blocklist domains (and WS):

http://spamcheck.freeapp.net/whitelists/dmoz-50thpercentile-whitelist.txt
http://spamcheck.freeapp.net/whitelists/dmoz-50thpercentile-blocklist.txt
http://spamcheck.freeapp.net/whitelists/dmoz-50thpercentile-blocklist.summed.txt
http://spamcheck.freeapp.net/whitelists/dmoz-50thpercentile-blocklist.ws

    2962    2962   36518 dmoz-50thpercentile-whitelist.txt
     236     236    3312 dmoz-50thpercentile-blocklist.txt
     236    2360   24355 dmoz-50thpercentile-blocklist.summed.txt
     233    2330   24044 dmoz-50thpercentile-blocklist.ws

http://spamcheck.freeapp.net/whitelists/wikipedia-70thpercentile-whitelist.txt
http://spamcheck.freeapp.net/whitelists/wikipedia-70thpercentile-blocklist.txt
http://spamcheck.freeapp.net/whitelists/wikipedia-70thpercentile-blocklist.summed.txt
http://spamcheck.freeapp.net/whitelists/wikipedia-70thpercentile-blocklist.ws

    1260    1260   14702 wikipedia-70thpercentile-whitelist.txt
      47      47     574 wikipedia-70thpercentile-blocklist.txt
      47     470    4685 wikipedia-70thpercentile-blocklist.summed.txt
      45     450    4471 wikipedia-70thpercentile-blocklist.ws
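
As a rough sketch of what a join like the ones above amounts to: it is a set intersection between a percentiled domain list and a SURBL list. This is not the actual tooling used to build these files; the function names and sample domains are made up for illustration, and one-domain-per-line input is assumed.

```python
def load_domains(lines):
    """Normalize one-domain-per-line input into a set (lowercased, blanks skipped)."""
    return {line.strip().lower() for line in lines if line.strip()}


def join(candidates, listed):
    """Return the sorted domains that appear in both inputs."""
    return sorted(load_domains(candidates) & load_domains(listed))


# Made-up sample data, NOT taken from the real lists.
dmoz_top = ["example.com", "example.org", "example.net"]
blocklist = ["example.net", "spam.example"]
whitelist = ["example.com", "EXAMPLE.ORG"]

blocklist_hits = join(dmoz_top, blocklist)
whitelist_hits = join(dmoz_top, whitelist)

print(blocklist_hits)
print(whitelist_hits)
```

With real data, the inputs would be the lines of the .srt files and of the SURBL whitelist or blocklist, and the number of results would correspond to the line counts shown above.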

One reason I didn't mention these before is that they fall
midway between the larger lists and the smaller one combining
them all (with 37 records), so I didn't want to focus on them.


For comparison purposes, the percentiled lists are much smaller
than the non-percentiled ones, because there are many domains in
each corpus with only one entry.  Here are the original
(un-percentiled) sizes compared with the percentiled ones:

http://spamcheck.freeapp.net/whitelists/dmoz.srt
http://spamcheck.freeapp.net/whitelists/dmoz-50thpercentile.srt

 2300851 2300851 38065969 dmoz.srt
  255838  255838 3955772 dmoz-50thpercentile.srt

http://spamcheck.freeapp.net/whitelists/wikipedia.srt
http://spamcheck.freeapp.net/whitelists/wikipedia-70thpercentile.srt

  173828  173828 2633441 wikipedia.srt
   28323   28323  402512 wikipedia-70thpercentile.srt
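
To put the size difference in proportion, here is a small sketch that works out what fraction of each list survives the percentile cut. The line counts are copied from the wc output above; the helper name kept_fraction is just for illustration.

```python
# (lines in the full list, lines in the percentiled list), from the wc output
sizes = {
    "dmoz": (2300851, 255838),
    "wikipedia": (173828, 28323),
}


def kept_fraction(full, kept):
    """Fraction of the full list that remains after the percentile cut."""
    return kept / full


for name, (full, kept) in sizes.items():
    print(f"{name}: kept {kept_fraction(full, kept):.1%} of {full} domains")
```

That works out to roughly 11% of the dmoz domains and roughly 16% of the wikipedia domains.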

So you can see why the matches of the percentiled data against
SURBLs are fewer.
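
For anyone wondering why a percentile cut shrinks a list this sharply when most domains have only one entry, here is a minimal sketch. The exact method used to build the percentiled lists isn't described in this message, so the rule below is an assumption: take a nearest-rank percentile of the per-domain counts and keep only domains strictly above that count. The function name and sample data are hypothetical.

```python
import math
from collections import Counter


def percentile_cut(occurrences, pct):
    """Keep domains whose count is strictly above the pct-th percentile count.

    Assumed rule (nearest-rank percentile, strict comparison); the real
    lists may have been built differently.
    """
    counts = Counter(occurrences)
    ordered = sorted(counts.values())
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    threshold = ordered[rank - 1]
    return sorted(d for d, n in counts.items() if n > threshold)


# Made-up corpus: two popular domains and four that appear only once.
sample = (
    ["a.example"] * 5
    + ["b.example"] * 3
    + ["c.example", "d.example", "e.example", "f.example"]
)

kept = percentile_cut(sample, 50)
print(kept)
```

Here the 50th-percentile count is 1, so all four single-entry domains drop out and only the two repeated ones remain.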

Jeff C.
--
"If it appears in hams, then don't list it."


