-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Jeff Chan writes:
> We are also looking into some other potential spam URI data sources such as proxypots, etc.:
Jeff --
A quick note on this: it has to be done very carefully. Many spammers are using "link poisoning" tricks like this:
Get ov<A href="http://www.gimbel.org"></A>er 300 medicat<B><FONT size=3>l</FONT></B>ons online sh<B><FONT size=3>l</FONT></B>pp<A href="http://www.omniscient.com"></A>ed over<A href="http://www.proton.net"></A>nig<A href="http://www.cravet.org"></A>ht to your fr<A href="http://www.aristotelean.org"></A>ont do<A href="http://www.barnacle.com"></A>or with no pr<A href="http://www.lordosis.net"></A>escr<B><FONT size=3>l</FONT></B>ption.</FONT>
All of those are "www.{RANDOMWORD}.{com|net|org}". Eventually there's one real link, which *is* SURBL-listed. These are chaff.
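To make the shape of the chaff concrete, here is a minimal Python sketch that separates anchors with no visible text from anchors that wrap some text. The names `AnchorCollector` and `split_links` are made up for illustration, and this is not how SURBL or any blocklist actually processes mail; it only shows that the chaff links in the sample above are zero-length anchors.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, visible_text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.anchors = []   # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def split_links(html):
    """Return (hrefs of empty anchors, hrefs of anchors with visible text)."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    empty = [href for href, text in parser.anchors if not text]
    visible = [href for href, text in parser.anchors if text]
    return empty, visible
```

Fed a fragment such as `ov<A href="http://www.gimbel.org"></A>er`, the first list would contain `http://www.gimbel.org`. A spamtrap feed that skips this separation treats every href in the message as equally spamvertised.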
Now, SORBS for one seems to be listing some of these sites; presumably because they have a spamtrap-driven feed without enough human moderation. That's the danger here.
(btw, there are arguments to be made that a better selection mechanism can "weed those out", but that needs to be done carefully too.
- Ignore .org/.net/.com? spammer will use .biz, .info, and ccTLDs.
- Ignore 0-length links (<a href=...></a>)? spammer will change to use <a href=...>{RANDOMWORD}</a>.
- Ignore "dictionary words" somehow? spammer will use random URLs from google, so "real" sites.
so I don't think those approaches have much merit alone.)
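As a rough illustration of why the first of those filters is brittle, here is a Python sketch of a naive "chaff shape" test. The pattern and the name `looks_like_chaff` are hypothetical, written only to mirror the "www.{RANDOMWORD}.{com|net|org}" description; nothing here is taken from a real blocklist implementation.

```python
import re

# Hypothetical pattern for the chaff shape seen in the sample:
# www.{RANDOMWORD}.{com|net|org}
CHAFF_SHAPE = re.compile(r"^www\.[a-z]+\.(?:com|net|org)$")


def looks_like_chaff(host):
    """True if host matches the naive www.WORD.(com|net|org) shape."""
    return bool(CHAFF_SHAPE.match(host.lower()))
```

`looks_like_chaff("www.gimbel.org")` is true, but the same random word under `.biz` no longer matches, so the spammer evades the rule by changing one token in the template.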
--j.