On Tue, 22 Feb 2005 04:35:51 -0600 (CST), David B Funk dbfunk@engineering.uiowa.edu wrote:
I'm seeing a new spam variant that is clearly designed to get past SURBL. It is an HTML message that contains many (50~100) 'invisible' links: links that have no link text, just: <A href="http://garbage.sitename.tld"></A>
The intention is clear: they want to fill up the 20 'slots' of the spamcop_uri_limit with their junk links so the real "payload" URL can slip past unchecked. That's playing a statistical game: there's a 1 in 20 chance of the "payload" getting picked by the randomizer, which means that 95% slip by.
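A quick way to sanity-check those odds is sketched below. It assumes, hypothetically, that the randomizer picks `limit` distinct links uniformly at random from every link in the message; the function name and parameters are made up for illustration, and the real sampling may work differently. Under that assumption the payload is checked with probability limit/total, so the exact figure depends on how many decoys are used.

```python
def payload_check_probability(decoys, limit=20):
    """Chance that the single payload link is among the links checked.

    Assumes `limit` distinct links are drawn uniformly at random
    from the decoys plus the one payload link (an assumption, not
    a statement about how the actual randomizer works).
    """
    total = decoys + 1  # decoy links plus the payload link
    return min(1.0, limit / total)


if __name__ == "__main__":
    for decoys in (49, 99, 399):
        p = payload_check_probability(decoys)
        print(f"{decoys} decoys: payload checked {p:.0%} of the time")
```

With these assumptions, 99 decoys leave the payload checked about 20% of the time, and it takes roughly 399 decoys to push that down to 1 in 20.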
To add insult to injury, they're tossing random "\r" (ASCII-CR) characters into the "payload" hostname to try to break SpamAssassin's URI parsing.
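One way to blunt the embedded-CR trick is to strip control whitespace from the URL before extracting the hostname, on the assumption that the mail client does the same when it follows the link. The sketch below is not SpamAssassin's code; `clean_hostname` is a hypothetical helper and `payload.example.com` is a placeholder domain.

```python
from urllib.parse import urlsplit

# Map CR, LF and TAB to None so str.translate() deletes them.
_STRIP_CONTROL = dict.fromkeys(map(ord, "\r\n\t"))


def clean_hostname(url):
    """Return the lowercased hostname of `url` with CR/LF/TAB removed.

    Returns an empty string when no hostname can be parsed.
    """
    cleaned = url.translate(_STRIP_CONTROL)
    return (urlsplit(cleaned).hostname or "").lower()


if __name__ == "__main__":
    print(clean_hostname("http://pay\rload.exam\rple.com/offer"))
```

The cleaned hostname is what would then be handed to the blocklist lookup.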
Because of all these games that are played to break the parser, I discussed an idea a while back on the SpamCop newsgroups that looked at using Java (or some other API, maybe with Internet Explorer) to render a spam's HTML into a virtual page and then scan its Document Objects (post HTML parsing) one at a time for links. It's similar to what a user would "see" in a browser.
I have a hunch that "null" links, strange parsing, etc. will be handled correctly by a DOM parser for HTML, but I've never done any tests for lack of time. A Java API could be called under Linux, but IE's? Just an idea... I'm sure the spammers could figure out how to get around that method, too. But the trick is, their HTML still has to show up correctly to the user for the spam to work.
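As a rough illustration of the "scan the parsed document for links" idea, here is a minimal sketch using the HTML parser in Python's standard library rather than the Java or IE DOM mentioned above. It is only a tokenizer-level approximation, not a rendered page: it keeps links whose anchor contains some text and drops the empty ones. `LinkCollector` and `visible_links` are hypothetical names, and a link wrapped around an image with no text would also be dropped, which a real implementation would need to handle.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []    # each entry is [href, accumulated text]
        self._open = None  # the link currently being read, if any

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open = [href, ""]
                self.links.append(self._open)

    def handle_data(self, data):
        if self._open is not None:
            self._open[1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._open = None


def visible_links(html):
    """Return hrefs of links that carry non-blank anchor text."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return [href for href, text in parser.links if text.strip()]


if __name__ == "__main__":
    sample = ('<A href="http://garbage.sitename.tld"></A>'
              '<a href="http://payload.example/">Click here</a>')
    print(visible_links(sample))
```

Only the links that survive this filter would then compete for the limited number of lookup slots.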