Chris Santerre csanterre@MerchantsOverseas.com writes:
Wow, it looks like some of the DMOZ data can't be trusted. Some of those domains in this WS blocklist are pure spammers.
DMOZ and (as far as I know) Wikipedia don't filter URLs based on the email policies of the sites they link to. However, the links *should* generally be categorized correctly in the case of DMOZ and useful in the case of Wikipedia.
I would not suggest using either to whitelist automatically, but if you get several of these sources and count the number of hits for each domain, then you should be able to prioritize and possibly automatically whitelist the ones that hit in a large number of databases.
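To make the counting idea concrete, here is a rough Python sketch. It assumes each source has already been exported as a list of full URLs (with a scheme); the names domain_of, count_source_hits and whitelist_candidates and the min_sources threshold are made up for illustration and are not part of any existing tool.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url):
    """Return the lowercased host of a URL, minus a leading 'www.'."""
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return host


def count_source_hits(sources):
    """Count, for each domain, how many sources list it.

    sources: dict mapping a source name (hypothetical, e.g. "dmoz")
    to an iterable of URLs taken from that source.
    """
    hits = Counter()
    for urls in sources.values():
        # Build a set first so a domain counts at most once per source.
        domains = {domain_of(u) for u in urls}
        domains.discard("")  # drop URLs that had no parseable host
        hits.update(domains)
    return hits


def whitelist_candidates(sources, min_sources=2):
    """Domains seen in at least min_sources sources, most hits first."""
    hits = count_source_hits(sources)
    keep = [d for d, n in hits.items() if n >= min_sources]
    return sorted(keep, key=lambda d: (-hits[d], d))
```

The threshold is a judgment call: a higher min_sources gives a smaller but safer list, and anything below it can still be queued for manual review in order of hit count.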
I would also take snapshots, but for a different reason than the one Jeff suggested. I would take snapshots and take the intersection of two snapshots for each source (two separate days of DMOZ, etc.) as the authoritative list since some spammer links (especially if added by some bot) will drop off once they are found.
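The snapshot intersection is just a set operation per source. A minimal sketch, assuming each snapshot has already been reduced to a collection of domains; the function name and the input layout are hypothetical:

```python
def authoritative_lists(snapshots):
    """Keep only the domains present in both snapshots of each source.

    snapshots: dict mapping a source name to a pair
    (domains_day1, domains_day2) taken on two separate days.
    """
    return {
        name: set(day1) & set(day2)
        for name, (day1, day2) in snapshots.items()
    }
```

A domain that was added and then removed between the two days (or vice versa) falls out of the result, which is the point: short-lived spammer links do not survive the intersection.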
Clearly, given that most of the hits are in .ws, etc., you're in the tail region of false positives. It'll be hard to find many more. Adding more sources and looking at source counts seems like the best way.
Daniel
On Wednesday, October 6, 2004, 11:26:33 AM, Daniel Quinlan wrote:
I would not suggest using either to whitelist automatically, but if you get several of these sources and count the number of hits for each domain, then you should be able to prioritize and possibly automatically whitelist the ones that hit in a large number of databases.
Let us know if you think of any others. dmoz and wikipedia hadn't occurred to me before.
Can anyone think of any other large, hand-built or checked directories or databases of (legitimate) URIs?
Is it possible to pull URIs out of semantic webs?
I would also take snapshots, but for a different reason than the one Jeff suggested. I would take snapshots and take the intersection of two snapshots for each source (two separate days of DMOZ, etc.) as the authoritative list since some spammer links (especially if added by some bot) will drop off once they are found.
Those are all good ideas. Do you know if spammer links do get deleted? How do the folks who maintain the sites find abusers or bots?
Jeff C. -- "If it appears in hams, then don't list it."