[SURBL-Discuss] Re: Possible large whitelist from DMOZ data

Daniel Quinlan quinlan at pathname.com
Wed Oct 6 20:26:33 CEST 2004


Chris Santerre <csanterre at MerchantsOverseas.com> writes:

> Wow, it looks like some of the DMOZ data can't be trusted. Some of those
> domains in this WS blocklist are pure spammers. 

DMOZ (and, as far as I know, Wikipedia) doesn't filter URLs based on the
email policies of the linked sites.  However, the links *should*
generally be categorized correctly in the case of DMOZ and useful in the
case of Wikipedia.

I would not suggest using either to whitelist automatically, but if you
get several of these sources and count the number of hits for each
domain, then you should be able to prioritize and possibly automatically
whitelist the ones that hit in a large number of databases.
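
To make the counting idea concrete, here is a minimal sketch in Python.
It assumes each source has already been reduced to a plain list of
domain names.  The function names, the source labels, and the threshold
are all hypothetical, not anything DMOZ, Wikipedia, or SURBL provides.

```python
from collections import Counter


def count_source_hits(sources):
    """Count, for each domain, how many distinct sources list it.

    `sources` maps a source name (hypothetical labels such as "dmoz")
    to an iterable of domain names taken from that source.
    """
    hits = Counter()
    for domains in sources.values():
        # Build a set first so a domain counts at most once per source.
        hits.update({d.strip().lower() for d in domains})
    return hits


def whitelist_candidates(sources, min_sources):
    """Return domains seen in at least `min_sources` sources, best first."""
    hits = count_source_hits(sources)
    picked = [d for d, n in hits.items() if n >= min_sources]
    # Most widely listed first; ties broken alphabetically.
    return sorted(picked, key=lambda d: (-hits[d], d))


if __name__ == "__main__":
    # Made-up data, for illustration only.
    example = {
        "dmoz": ["example.com", "example.org", "Example.com"],
        "wikipedia": ["example.com", "example.net"],
        "other": ["example.com", "example.org"],
    }
    print(whitelist_candidates(example, min_sources=2))
```

Where to set the threshold is a judgment call.  A low value is probably
only good for prioritizing manual review, and a high one is closer to
something you could whitelist automatically.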

I would also take snapshots, but for a different reason than the one
Jeff suggested.  For each source, I would take two snapshots (two
separate days of DMOZ, etc.) and use their intersection as the
authoritative list, since some spammer links (especially ones added by
a bot) will drop off once they are found.
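
The snapshot intersection is just a set operation.  A rough sketch in
Python, assuming each snapshot is a list of domain names from the same
source on two different days (the function names and the sample domains
are made up):

```python
def normalize(domains):
    """Lowercase and trim domain names, dropping empty entries."""
    return {d.strip().lower() for d in domains if d.strip()}


def stable_domains(earlier, later):
    """Keep only the domains present in both snapshots of one source.

    A link that was added and then removed between the two days
    (for example, one a bot slipped in) does not survive.
    """
    return normalize(earlier) & normalize(later)


if __name__ == "__main__":
    # Made-up snapshots, for illustration only.
    day_one = ["example.com", "example.org", "spammy.example"]
    day_two = ["EXAMPLE.com", "example.org", "example.net"]
    print(sorted(stable_domains(day_one, day_two)))
```

This only catches links that were removed between the two days, so the
gap between snapshots matters.  Newly added legitimate links are also
held back until they appear in a second snapshot.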

Clearly, given that most of the hits are in .ws, etc., you're in the
tail region of false positives.  It'll be hard to find a lot of them.
Adding more sources and looking at source counts seems like the best
way.

Daniel

-- 
Daniel Quinlan                     ApacheCon! 13-17 November (3 SpamAssassin
http://www.pathname.com/~quinlan/  http://www.apachecon.com/  sessions & more)

