I have tested SpamCopURI 0.14 and SA 2.63 against my collection of URLs that were not being parsed. This new version deals with many of the cases, so the ugly workarounds I was using for those can be removed.
By the way, if you're reading this Eric, it might be worthwhile adding ads.msn.com and g.msn.com to the list of known redirection services in the sample spamcop_uri.cf.
Here are the cases that are not picked up:
1. URLs that aren't URLs (missing protocol, even missing www)
<p> P<advisory>l<aboveboard>e<compose>a<geochronology>s<moral>e<palfrey> <rada= r>c<symptomatic>o<yankee>p<conduit>y<souffle> <intake>a<arise>n<eocene>d <= thickish>paste <impact>this <broadloom>link <road>i<dichotomous>n<quinine>= t<scoreboard>o y<eager>o<impact>ur b<archenemy>r<band>o<wallop>wser <b> he= althyexchange.biz</b>
2. Double protocol
http://http://www.eager-18.com/_7953f10b575a18d044cdec5a40bd4f22//?d=vision
Workaround in PerMsgStatus.pm
$uri =~ s/http:\/\/http:\/\//http:\/\//gi;
(NB: compared with the previously published workaround, I added case insensitivity.)
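For anyone who wants to try the substitution outside SpamAssassin, here is a minimal standalone sketch. The helper name collapse_double_protocol is made up for illustration and is not part of SpamAssassin or SpamCopURI; it uses s{...}{...} delimiters so the slashes do not need escaping.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of SpamAssassin): collapse a doubled
# "http://" prefix into a single one, ignoring case.
sub collapse_double_protocol {
    my ($uri) = @_;
    # Brace delimiters avoid having to backslash every "/".
    $uri =~ s{http://http://}{http://}gi;
    return $uri;
}

# The sample URL from case 2 above.
my $fixed = collapse_double_protocol(
    'http://http://www.eager-18.com/_7953f10b575a18d044cdec5a40bd4f22//?d=vision');
print "$fixed\n";
```

Note that a single pass only removes one level of doubling, so a tripled prefix would still come out doubled.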
3. HTML escape sequences in URL
http://toform.net/mcp/879/1352/cap112.html
Workaround in PerMsgStatus.pm
$_ = HTML::Entities::decode($_); use HTML::Entities;
(NB: this differs from the previously published workaround in that it does the conversion earlier on, and so also handles the case where "http" itself is coded with escape sequences. It seems to work despite the comment in get_uri_list saying not to modify $_.)
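To show why the order matters, here is a small standalone sketch. The pattern in $uri_re is a much simplified stand-in for SpamAssassin's $uriRe, the helper names are hypothetical, and the encoded input is an invented example rather than the original spam. It decodes a private copy of the text, which would also sidestep the "do not modify $_" note.

```perl
use strict;
use warnings;
use HTML::Entities ();

# Simplified stand-in for SpamAssassin's $uriRe (an assumption, not the real pattern).
my $uri_re = qr{https?://[^\s<>"']+}i;

# Hypothetical helper: return every URI matched in the text exactly as given.
sub find_uris {
    my ($text) = @_;
    my @uris = $text =~ /($uri_re)/g;
    return @uris;
}

# Hypothetical helper: decode HTML entities on a private copy, then match.
sub find_uris_decoded {
    my ($text) = @_;                          # copy, so the caller's string is untouched
    HTML::Entities::decode_entities($text);   # void context: decodes $text in place
    return find_uris($text);
}

# Invented sample: "http" and the dot in the host written as numeric entities.
my $line = 'click &#104;&#116;&#116;&#112;://example&#46;com/a.html today';

print "raw matches:     ", scalar(find_uris($line)), "\n";
print "decoded matches: ", join(' ', find_uris_decoded($line)), "\n";
```

Matching the raw text finds nothing because the literal string "http" never appears; matching after decoding finds the URL.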
Here's a diff of PerMsgStatus.pm with SpamCopURI 0.14 compared to the version with the workarounds mentioned above.
John
diff -u PerMsgStatus.pm.orig PerMsgStatus.pm
-----------cut-------------
--- PerMsgStatus.pm.orig 2004-04-25 12:50:05.000000000 +0200
+++ PerMsgStatus.pm 2004-04-25 13:01:11.000000000 +0200
@@ -44,6 +44,7 @@
 use Mail::SpamAssassin::Conf;
 use Mail::SpamAssassin::Received;
 use Mail::SpamAssassin::Util;
+use HTML::Entities;

 use constant HAS_MIME_BASE64 => eval { require MIME::Base64; };

@@ -1748,6 +1749,7 @@

   for (@$textary) {
     # NOTE: do not modify $_ in this loop
+    $_ = HTML::Entities::decode($_);
     while (/($uriRe)/go) {
       my $uri = $1;

@@ -1776,6 +1778,7 @@
           $uri = "${base_uri}$uri";
         }
       }
+      $uri =~ s/http:\/\/http:\/\//http:\/\//gi;

       # warn("Got URI: $uri\n");
       push @uris, $uri;
-----------------cut---------------