[SURBL-Discuss] More news on unparsed urls

John Fawcett johnml at michaweb.net
Sun Apr 25 14:40:34 CEST 2004

I have tested SpamCopURI 0.14 and SA 2.63 with my
collection of unparsed urls. This new version deals
with many of the cases, so that the ugly workarounds
I was using can be removed.

By the way, if you're reading this Eric, it might be
worthwhile adding ads.msn.com and g.msn.com
to the list of known redirection services in the
sample spamcop_uri.cf.

Here are the cases that are not picked up:

1. URLs that aren't URLs (missing protocol, even
missing www)

P<advisory>l<aboveboard>e<compose>a<geochronology>s<moral>e<palfrey> <rada=
r>c<symptomatic>o<yankee>p<conduit>y<souffle> <intake>a<arise>n<eocene>d <=
thickish>paste <impact>this <broadloom>link <road>i<dichotomous>n<quinine>=
t<scoreboard>o y<eager>o<impact>ur b<archenemy>r<band>o<wallop>wser <b> he=

2. Double protocol


Workaround in PerMsgStatus.pm

    $uri =~ s/http:\/\/http:\/\//http:\/\//gi;

(NB compared with the previously published workaround, I added case insensitivity.)
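
The effect of that substitution can be checked outside SpamAssassin.
What follows is only a minimal sketch: the helper name
collapse_double_protocol and the example.com URIs are made up for
illustration and are not part of SpamAssassin or SpamCopURI.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of SpamAssassin: applies the same
# substitution as the workaround to a single URI string.
sub collapse_double_protocol {
    my ($uri) = @_;
    # /i also matches mixed case such as "HTTP://Http://";
    # /g replaces every non-overlapping occurrence in the string.
    $uri =~ s{http://http://}{http://}gi;
    return $uri;
}

# Made-up sample URIs, for illustration only
for my $uri ('http://http://www.example.com/',
             'HTTP://Http://www.example.com/',
             'http://www.example.com/') {
    print collapse_double_protocol($uri), "\n";
}
```

One thing to keep in mind: a single pass only halves the repetition,
so a tripled prefix (http://http://http://) would come out doubled
rather than fully collapsed.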

3. HTML escape sequences in URL


Workaround in PerMsgStatus.pm

        use HTML::Entities;
        $_ = HTML::Entities::decode($_);

(NB this differs from the previously published workaround: it
does the conversion earlier on, so it also takes into account
that http itself could be coded with escape sequences.
It seems to work despite the comment in get_uri_list
saying not to modify $_.)
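
Why the decoding has to happen before the URI match can be shown in a
small standalone sketch. The pattern below is a deliberately
simplified stand-in for SpamAssassin's $uriRe, and the helper name
find_uris and the sample line are made up; only the use of
HTML::Entities comes from the workaround. The sketch calls
decode_entities, which is the documented name that module exports.

```perl
use strict;
use warnings;
use HTML::Entities;

# Simplified stand-in for $uriRe (an assumption, not the real pattern)
my $simple_uri_re = qr{https?://[^\s"'<>]+}i;

# Hypothetical helper: returns every match of the simplified pattern
sub find_uris {
    my ($text) = @_;
    return $text =~ /($simple_uri_re)/g;
}

# Made-up line in which "h" and "." are written as numeric entities
my $line = 'visit &#104;ttp://www&#46;example&#46;com/ now';

# Match once on the raw text and once on a decoded copy
my @before = find_uris($line);
my @after  = find_uris(decode_entities($line));

print "before decoding: @before\n";
print "after decoding:  @after\n";
```

On the raw line the simplified pattern finds nothing, because the
literal string "http" never appears; after decoding it finds
http://www.example.com/.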

Here's a diff of PerMsgStatus.pm with SpamCopURI 0.14
compared to the version with the workarounds mentioned
above:

diff -u PerMsgStatus.pm.orig PerMsgStatus.pm

--- PerMsgStatus.pm.orig        2004-04-25 12:50:05.000000000 +0200
+++ PerMsgStatus.pm     2004-04-25 13:01:11.000000000 +0200
@@ -44,6 +44,7 @@
 use Mail::SpamAssassin::Conf;
 use Mail::SpamAssassin::Received;
 use Mail::SpamAssassin::Util;
+use HTML::Entities;

 use constant HAS_MIME_BASE64 =>                eval { require MIME::Base64; };

@@ -1748,6 +1749,7 @@

   for (@$textary) {
     # NOTE: do not modify $_ in this loop
+    $_ = HTML::Entities::decode($_);
     while (/($uriRe)/go) {
       my $uri = $1;

@@ -1776,6 +1778,7 @@
           $uri = "${base_uri}$uri";
+      $uri =~ s/http:\/\/http:\/\//http:\/\//gi;

       # warn("Got URI: $uri\n");
       push @uris, $uri;

