I'd just like to summarize the current position with regard to URL types which are not currently parsed correctly by SpamAssassin (SA), and to ask for some help with tests using version 3.
Yahoo offers a public redirection service. You can enter a URL like this: http://rds.yahoo.com/*http://www.google.com and you get sent to www.google.com. (By the way, I'm not sure what the point of this is, because unlike tinyurl.com the Yahoo URL is longer. However, it sure comes in handy to spammers who are trying to get past SA URI rulesets.)
Spam which is not picked up correctly by SA URI filters often contains redirection URLs, even though the redirected domain is in sc.surbl.org. Jeff Chan has opened a bug against URIDNSBL.pm to ask for support for parsing the spammer domain out of redirected URLs: http://bugzilla.spamassassin.org/show_bug.cgi?id=3261
Things are getting more complicated, because incoming spam seems to contain features designed to avoid being picked up even by an altered parser which strips off the http://rds.yahoo.com/* part.
I wanted to make a summary of our current understanding of the URL types which break parsing. I've tested these with SpamCopURI and version 2.63. If someone offers to test (from case 2 onwards) with URIDNSBL and version 3, I'll post suitable test cases.
1. http://rds.yahoo.com/*http://spammer.domain.tld/aaaaaaaaaa (bug 3261)
Workaround in PerMsgStatus.pm:
$uri =~ s{^http://(?:rds|rd)\.yahoo\.com/[^*]*\*(.*)$}{$1}g;
2. http://rds.yahoo.com/*%68ttp://spammer.domain.tld/aaaaaaaa (follow-up to bug 3261, including a test case). Other possible variations on this, which I haven't seen as yet, can use %NN in place of any or all of the 'http' characters in the redirected domain, e.g. http://rds.yahoo.com/*%68%74%74%70://spammer.domain.tld/aaaaaaaa
Workaround in PerMsgStatus.pm:
$uri =~ s/%68/h/g; $uri =~ s/%74/t/g; $uri =~ s/%70/p/g;
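A more general alternative (my own untested sketch, not something from the bug report) is to decode every %NN escape before the URI is parsed, so new variations don't need new substitutions:

# Decode any %NN hex escape in one pass. Assumes $uri holds the raw
# URL; note a hostile sender could also double-encode (e.g. %2568).
$uri =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;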
3. http://rd.yahoo.com/winery/college/banbury/*http:/len= derserv.com?partid=3Darlenders
The redirect URL is formally incorrect (there is a single slash after 'http:'), but browsers have no problem with it. The parser cannot handle it.
Workaround in PerMsgStatus.pm:
$uri =~ s{http:/([^/])}{http://$1}g;
By the way, this URL contains quoted-printable artifacts ('=' followed by a newline, and '=3D' for '=') which are not causing problems for the parser. Neither is the absence of a slash before the '?' causing problems in parsing.
4. URLs without 'http:' in front of them. The following, seen in a browser, reads: "Please copy and paste this link into your browser healthyexchange.biz"
<p> P<advisory>l<aboveboard>e<compose>a<geochronology>s<moral>e<palfrey> <rada= r>c<symptomatic>o<yankee>p<conduit>y<souffle> <intake>a<arise>n<eocene>d <= thickish>paste <impact>this <broadloom>link <road>i<dichotomous>n<quinine>= t<scoreboard>o y<eager>o<impact>ur b<archenemy>r<band>o<wallop>wser <b> he= althyexchange.biz</b>
Probably not much can be done about this.
5. http://http://www.eager-18.com/_7953f10b575a18d044cdec5a40bd4f22//?d=vision Here the doubled 'http://' prevents this from being parsed. (Granted, it wasn't in sc.surbl.org, but even if it had been it wouldn't have been picked up.)
Workaround in PerMsgStatus.pm:
$uri =~ s{http://http://}{http://}g;
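For reference, the workarounds chained together would look roughly like this (an untested sketch, not a patch; the sub name is made up, and the generic %NN decode from case 2 is used instead of the three single-character substitutions):

# Sketch: normalize a raw URI before the SURBL/URIDNSBL lookup.
# Order matters: decode escapes first, then repair the scheme,
# then strip the known redirector prefix.
sub normalize_uri {
    my ($uri) = @_;
    $uri =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;          # case 2: %NN escapes
    $uri =~ s{http:/([^/])}{http://$1}g;                  # case 3: single slash
    $uri =~ s{http://http://}{http://}g;                  # case 5: doubled scheme
    $uri =~ s{^http://(?:rds|rd)\.yahoo\.com/[^*]*\*}{};  # case 1: redirector
    return $uri;
}

For example, normalize_uri('http://rds.yahoo.com/*%68ttp://spammer.domain.tld/x') comes out as http://spammer.domain.tld/x, ready for the lookup.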
John
Thanks for the rules fodder!
BTW, MSN also has an open redirector that is seeing much use:
uri LWTEST_REDIRECT1 m'http://g.msn.com/0AD0000[A-Z]/\d{6}\.1[/\?]'i
describe LWTEST_REDIRECT1 Open MSN redirector found in URL
Loren
At 22:22 17/04/2004, John Fawcett wrote:
[...]
Just wondering whether it's a good idea putting so many highly specific workarounds in for current redirection techniques and sites? Wouldn't it be better to try to handle most cases more generically? Otherwise we're forever playing catch-up with the spammers...
It seems like most cases could be caught by first decoding %-escaped characters, then any quoted-printable characters, then trying to extract each visible URL-like string and testing them separately; there's a rough sketch of this below. (So a single URL may or may not result in two URIs to test.)
So for example in case 1, both rds.yahoo.com and www.google.com would get tested, but that's OK, as it's better to err on the safe side.
Case 2 would be handled automatically because the %-escaped characters would be decoded before parsing.
Case 3 probably needs a specific workaround.
Case 4 may not be too much of a problem, as most people are unlikely to go to the trouble of copying and pasting a link to follow the spam. If it's not just a matter of clicking, most people would be too lazy to follow it, so that kind of spam would be unlikely to flourish. (He says hopefully :)
Case 5 may need a specific workaround, or just a change in the way the parser works.
I haven't looked at the existing code (and probably couldn't understand it anyway since it's in Perl :) but if it's just based on regular expressions it may not be sufficient to reliably extract two URIs from one URL; it probably needs an algorithm rather than just a regex...
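In rough Perl the idea might look like this (an untested sketch only; the sub name is made up and the patterns are illustrative, not tuned):

# Sketch: decode first, then pull out every URL-like string, so a
# redirect URL like case 1 yields two URIs to test.
sub extract_uris {
    my ($raw) = @_;
    $raw =~ s/=\n//g;                             # quoted-printable soft breaks
    $raw =~ s/=([0-9A-Fa-f]{2})/chr(hex($1))/ge;  # quoted-printable =NN
    $raw =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;  # URL %NN escapes
    # Excluding '*' from the body splits the Yahoo form at the '*',
    # so both the redirector and the target come out as matches.
    return $raw =~ m{(https?://[^\s*'"<>]+)}gi;
}

For case 1 this returns both http://rds.yahoo.com/ and http://www.google.com, and case 2 falls out automatically once the escapes are decoded. Case 3 would still need its single-slash repair first, as noted above.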
Regards, Simon
----- Original Message ----- From: "Simon Byrnand"
Just wondering whether it's a good idea putting so many highly specific workarounds in for current redirection techniques and sites? Wouldn't it be better to try to handle most cases more generically? Otherwise we're forever playing catch-up with the spammers...
You're absolutely right. I am hoping that the seasoned SA and Perl developers will come up with suitable code revisions for version 3 of Mail::SpamAssassin.
One suggestion (see http://bugzilla.spamassassin.org/show_bug.cgi?id=3261) was to use a configuration file parameter for redirection services, which looks promising in terms of future flexibility.
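To make that concrete, it might look something like this (purely hypothetical; the option name and code are made up, not anything in SA today):

# Hypothetical config syntax (option name invented for illustration):
#   redirector_host rds.yahoo.com
#   redirector_host rd.yahoo.com
# In the parser those entries would drive something like:
for my $host (@redirector_hosts) {    # loaded from the config above
    # Strip "http://<host>/<path>*" so the embedded URI is exposed.
    $uri =~ s{^http://\Q$host\E/[^*]*\*}{};
}

New redirection services could then be handled with a one-line config change instead of a code patch.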
For the future revisions to the URL parsing code, it's important to take into account our current knowledge of the URLs which are failing to be parsed. This is my main reason for summarizing them. It would be arrogant of me (never having written a line of Perl before two days ago) to believe that the workarounds are real solutions.
The more pressing task at the moment is to verify that the examples I collected as failing in version 2.63 really do not work with version 3, and then to open bug reports/RFEs so that they can be officially logged as open SA issues. At the moment only case 1 is open. Is anyone in a position to do this?
Also, it will be interesting to continue monitoring the characteristics of URLs that are going undetected by SA and feed them back to the SA developer list.
John
At 11:43 19/04/2004, John Fawcett wrote:
Just wondering whether it's a good idea putting so many highly specific workarounds in for current redirection techniques and sites? Wouldn't it be better to try to handle most cases more generically? Otherwise we're forever playing catch-up with the spammers...
You're absolutely right. I am hoping that the seasoned SA and Perl developers will come up with suitable code revisions for version 3 of Mail::SpamAssassin.
One suggestion (see http://bugzilla.spamassassin.org/show_bug.cgi?id=3261) was to use a configuration file parameter for redirection services, which looks promising in terms of future flexibility.
For the future revisions to the URL parsing code, it's important to take into account our current knowledge of the URLs which are failing to be parsed. This is my main reason for summarizing them.
Fair enough. Hopefully someone can come up with a more generic way of processing it, as I think most cases of redirection can be handled fairly generically if we're able to extract multiple URIs from one URL. Apart from % encoding, in all the cases I've seen so far (Yahoo, MSN) the final URL is pretty much out in the clear.
I think it will take more than a bunch of regular expressions to handle it, though; it will need a custom-written algorithm with a little bit of intelligence, which can make a few basic deductions based on what we know so far about the different techniques.
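For instance (a hypothetical sketch, reusing the extract_uris helper sketched earlier in the thread), one simple deduction is that the redirectors seen so far put the real destination last:

my @uris = extract_uris($raw);   # every URL-like string, decoded
my $destination = $uris[-1];     # Yahoo/MSN style puts the target last
# ...but test every extracted URI anyway, to err on the safe side.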
The more pressing task at the moment is to verify that the examples I collected as failing in version 2.63 really do not work with version 3, and then to open bug reports/RFEs so that they can be officially logged as open SA issues. At the moment only case 1 is open. Is anyone in a position to do this?
Not me unfortunately, since I run 2.63, but I would be able to test patches to the SpamCopURI plugin.
Also, it will be interesting to continue monitoring the characteristics of URLs that are going undetected by SA and feed them back to the SA developer list.
I should collate a list of sample redirected URLs; I'll see if I can find any not already mentioned...
Regards, Simon