[SURBL-Discuss] Took top percentiles of DMOZ and wikipedia domains, some results

Jeff Chan jeffc at surbl.org
Sat Oct 9 16:43:37 CEST 2004

On Saturday, October 9, 2004, 6:27:00 AM, Rob McEwen wrote:
> As usual, this stuff is moving so fast, I can hardly keep up.

> The following was the **original** list of Whitelist candidates:

> http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.txt

Actually, these are the revised matches from October 8 with the
domains intersected (joined) across three different snapshots
of DMOZ data from a time period spanning about a month.  Since
that intersection was joined against SURBLs a couple days later
than the original one, this later version has a few whitelisted
records already removed.
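
For anyone wanting to reproduce this kind of join locally, here is a
rough sketch in Python. It is not the script actually used here; the
function names and the example.* domains are made up for illustration,
and it assumes each snapshot and the SURBL data are already reduced
to plain lists of domain names.

```python
def intersect_snapshots(snapshots):
    """Return the set of domains that appear in every snapshot.

    `snapshots` is an iterable of iterables of domain strings.
    Names are normalized (trimmed, lowercased) before comparing.
    """
    sets = [
        {d.strip().lower() for d in snap if d.strip()}
        for snap in snapshots
    ]
    if not sets:
        return set()
    return set.intersection(*sets)


def match_against_blocklist(domains, blocklist, whitelist=()):
    """Return domains that are also on the blocklist, minus any
    entries that have already been whitelisted."""
    return sorted((set(domains) & set(blocklist)) - set(whitelist))


# Hypothetical data: three snapshots taken at different times.
snapshots = [
    ["a.example", "b.example", "c.example"],
    ["b.example", "c.example", "d.example"],
    ["C.example ", "b.example"],
]
stable = intersect_snapshots(snapshots)
matches = match_against_blocklist(
    stable,
    blocklist={"c.example", "d.example"},
    whitelist={"d.example"},
)
```

Running the join again after new whitelistings simply drops those
records from the matches, which is why a later run can be a little
shorter than an earlier one.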

Probably confusingly, I renamed the original one (from October
6), based on a single snapshot of dmoz, to:


(It's reasonable to assume that would be a later version, but
it's an older one.  I usually give the latest file the original
name, and add a revision number to the name of the old version
with the number incremented as each "current" one gets archived,
i.e. the previous current one would become dmoz-blocklist2.txt
when a new current one replaces it.)
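
The naming scheme can be sketched as a small Python helper. This is
only an illustration of the convention described above, not a tool
that exists here; the function name and the "list.txt" file names
are hypothetical.

```python
import re


def next_archive_name(current, existing):
    """Pick the name the current file gets when it is archived.

    The newest file always keeps the plain name (`current`); archived
    copies carry a revision number that grows by one each time.
    `existing` is the list of file names already present.
    """
    stem, dot, ext = current.rpartition(".")
    if not dot:  # no extension at all
        stem, ext = current, ""
    pattern = re.compile(re.escape(stem) + r"(\d+)" + re.escape(dot + ext))
    revisions = [
        int(m.group(1))
        for name in existing
        if (m := pattern.fullmatch(name))
    ]
    return f"{stem}{max(revisions, default=0) + 1}{dot}{ext}"


# With one archived copy already present, the file being replaced
# takes the next number.
first = next_archive_name("list.txt", ["list.txt"])
second = next_archive_name("list.txt", ["list.txt", "list1.txt"])
```

Under this scheme a higher number means a more recently archived
file, but the unnumbered name is always the newest of all.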

I hope there isn't too much version confusion, but most of the
matches remain unchanged regardless.

> Can you give us an update as to which of these have now been whitelisted,
> which are still candidates for whitelisting, and which have been decided
> against whitelisting. I just want to make sure that the original list wasn't
> whitelisted across the board... and if it was, I'd like to know so that I
> can make better decisions about which of these to keep these in my private
> blocklist.

No bulk lists have been whitelisted, only individual FPs
specifically checked and reported based on the matches.
Certainly those whitelisted FPs are only a small fraction
of the DMOZ matches so far.  Therefore most of the matches
in the current version still need to be checked.

> Also, while testing against this list, I did find a couple of FPs (as I
> previously reported). However, I also found much spam. Would it still be
> helpful for me to post to this discussion a list of those URIs which DID
> catch spam (from the original list)?

Yes.  It's useful to know how spammy the DMOZ domains are.

> Instead, I've been making it my
> priority to find and report FPs... but don't let my lack of reporting spam
> "hits" fool you. This original list did catch many spams.

Both spams and FPs are of interest.  FPs are probably more urgent
to detect and get out of the data, however.

> Or, perhaps I should move on and just start fresh applying this same kind of
> testing to the new list below:

> http://spamcheck.freeapp.net/whitelists/percentile-wikipedia-dmoz-blocklist.txt

This is a much smaller list with only 37 matches between the
percentiled (much smaller) wikipedia and dmoz lists and SURBLs.
Given that it has tighter inclusion criteria, I think it would
be good to focus on finding FPs to whitelist in these first,
then go back to the larger list of DMOZ matches.

Hope this helps, and thanks much for your help.  Multiple opinions
on these 37 would be welcomed.  Comparing notes could be useful.

Jeff C.
"If it appears in hams, then don't list it."

