[SURBL-Discuss] Re: finding RegEx for extracting URIs

Jeff Chan jeffc at surbl.org
Mon May 17 19:09:20 CEST 2004

Hello Rob,
I'm forwarding your message to the SURBL discussion and
SpamAssassin developers lists.

Rob is looking for regular expressions to extract URIs from
message bodies.  As we know, this is a little more complex than it
may appear at first, given MIME decoding, weird cases, deliberate
obfuscation, etc.

Jeff C.

On Monday, May 17, 2004, 6:01:57 PM, Rob McEwen wrote:
> Jeff:

> Thanks for the resources you provided. At the least, these will give me a
> rough roadmap. The reason that these won't do much more for me than this is
> because I program using a different platform, different filtering software,
> a different mail server, and in an entirely different programming language
> (not Perl or PHP). Some of this stuff does not translate very well and,
> frankly, some of it looks like Greek to me.

> I program in Visual Basic.NET using the Microsoft .NET platform. Given that
> I'm also decent at C# programming, I think I could have done better if these
> examples were written in Java, for example.

> Nevertheless, my original question... the search for a really good regular
> expression for extracting URIs... is a VERY platform-independent issue.
> Finding the best regular expression for this could assist a variety of
> people using a variety of programming platforms and programming languages
> who might also attempt to write software to work with SURBL.

> Therefore, please consider posting this question on your website in the
> hopes that someone will provide a solution. I wouldn't be surprised if
> someone has already discovered the best Regular Expression for extracting
> URIs. Also, I believe that posting the answer on your site will be very
> helpful to others in my same predicament.

> (Note that it is easy to find Regular Expressions which extract the full URL
> where the URL is preceded by an "href". But this is obviously not enough.
> I'd like to find one which takes into account the country codes for
> determining whether two or three levels are needed and then only extracts
> what SURBL actually needs. Also, I fully admit that I'd have this one solved
> if Regular Expressions were my specialty. They are not!)
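
For list readers who want something concrete to react to, here is a
minimal sketch of the idea described above: find candidate URIs with a
loose pattern, then reduce each host to two or three levels depending on
the country-code suffix. It is written in Python only for brevity. The
names TWO_LEVEL_SUFFIXES, URI_PATTERN, registered_domain and
extract_domains are made up for this illustration, and the suffix set is
a tiny hypothetical sample, not a real list. It is not the SpamAssassin
or SpamCopURI logic, and it ignores MIME decoding, obfuscation,
scheme-less URIs and numeric IP hosts.

```python
import re
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny sample of two-level country-code suffixes.
# A real filter would load a full, maintained list instead.
TWO_LEVEL_SUFFIXES = {"co.uk", "org.uk", "com.au", "co.jp", "com.br"}

# Loose pattern: an http/https scheme followed by a run of characters that
# are not whitespace, quotes, or angle brackets.
URI_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def registered_domain(host):
    """Reduce a host name to two or three levels based on its suffix."""
    labels = host.lower().rstrip(".").split(".")
    if len(labels) >= 3 and ".".join(labels[-2:]) in TWO_LEVEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


def extract_domains(text):
    """Return the distinct reduced domains of URIs in text, in first-seen order."""
    domains = []
    for match in URI_PATTERN.finditer(text):
        # Drop sentence punctuation that often trails a URI in plain text.
        candidate = match.group(0).rstrip(".,;:!?)")
        try:
            host = urlsplit(candidate).hostname
        except ValueError:
            continue  # malformed authority, e.g. an unclosed "[" bracket
        if not host:
            continue
        domain = registered_domain(host)
        if domain not in domains:
            domains.append(domain)
    return domains
```

Calling extract_domains() on the decoded body text gives the list of
domains to look up. Keeping the pattern loose and doing the suffix logic
in ordinary code avoids packing every country-code rule into one regular
expression, which should also make it easier to port to .NET.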

> Thanks for your consideration. Feel free to quote some or all of this e-mail
> at will.

> Rob McEwen
> PowerView Systems
> rob at PowerViewSystems.com
> (478) 475-9032

> -----Original Message-----
> From: Jeff Chan [mailto:jeffc at surbl.org] 
> Sent: Monday, May 17, 2004 5:45 PM
> To: Rob McEwen
> Cc: jeffc at surbl.org; webmaster at PowerViewSystems.com
> Subject: Re: finding RegEx for extracting URIs

> On Monday, May 17, 2004, 5:22:50 AM, Rob McEwen wrote:
>> I'm trying to find a good regular expression for extracting URIs from the
>> raw text of an e-mail message. This would help tremendously for programming
>> a SURBL based filter. Do you know of any such regular expression? The ones
>> I've found on the internet so far are not very good, for numerous reasons.

> Hi Rob,
> I think your best bet would be to copy the code from urirhsbl in
> SpamAssassin URIDNSBL or SpamCopURI's code:

>   http://spamassassin.org/full/3.0.x/dist/lib/Mail/SpamAssassin/Plugin/URIDNSBL.pm

>   http://sourceforge.net/projects/spamcopuri/

> Jeff C.

Jeff Chan
mailto:jeffc at surbl.org
