Hello Rob, I'm forwarding your message to the SURBL discussion and SpamAssassin developers lists.
All, Rob is looking for regular expressions to extract URIs from message bodies. As we know, this is a little more complex than it may appear at first, given MIME decoding, weird cases, deliberate obfuscation, etc.
Jeff C.
On Monday, May 17, 2004, 6:01:57 PM, Rob McEwen wrote:
Jeff:
Thanks for the resources you provided. At the least, these will give me a rough roadmap. They won't do much more for me than that because I program on a different platform, with different filtering software, a different mail server, and an entirely different programming language (not Perl or PHP). Some of this stuff does not translate very well and, frankly, some of it looks like "greek" to me.
I program in Visual Basic.NET using the Microsoft .NET platform. Given that I'm also decent at C# programming, I think I could have done better if these examples were written in Java, for example.
Nevertheless, my original question... the search for a really good regular expression for extracting URIs... is a VERY platform-independent issue. Finding the best regular expression for this could assist a variety of people using a variety of programming platforms and programming languages who might also attempt to write software to work with SURBL.
Therefore, please consider posting this question on your website in the hope that someone will provide a solution. I wouldn't be surprised if someone has already discovered the best regular expression for extracting URIs. Also, I believe that posting the answer on your site will be very helpful to others in the same predicament.
(Note that it is easy to find Regular Expressions which extract the full URL where the URL is preceded by an "href". But this is obviously not enough. I'd like to find one which takes into account the country codes for determining whether two or three levels are needed and then only extracts what SURBL actually needs. Also, I fully admit that I'd have this one solved if Regular Expressions were my specialty. They are not!)
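To make the note above concrete, here is a minimal sketch in Python (chosen only for illustration; the same pattern should port to .NET regular expressions with little change). It finds the host of each http/https URI and then keeps two or three trailing labels depending on whether the host ends in a known two-level country-code suffix. Everything named here is an assumption for illustration: the function names, the regular expression, and especially the TWO_LEVEL_SUFFIXES set, which is a tiny hand-picked sample and not a complete list. It does not attempt MIME decoding, scheme-less URIs, or deliberate obfuscation.

```python
import re

# Hypothetical, deliberately incomplete sample of two-level suffixes.
# A real filter would need a maintained list.
TWO_LEVEL_SUFFIXES = {"co.uk", "org.uk", "com.au", "co.jp", "com.br"}

# Assumption: only URIs with an explicit http:// or https:// scheme.
# Optional "user:pass@" is skipped; the captured group is the host.
URI_HOST_RE = re.compile(
    r"https?://(?:[^\s/@<>\"']*@)?([A-Za-z0-9.-]+)", re.IGNORECASE
)


def registered_domain(host):
    """Reduce a host name to the two or three labels a lookup would use."""
    labels = host.lower().strip(".").split(".")
    if len(labels) < 2:
        return None
    # Dotted-quad IP address: keep it whole (how to query it is left open).
    if len(labels) == 4 and all(label.isdigit() for label in labels):
        return ".".join(labels)
    # Three levels when the last two labels form a known two-level suffix.
    if len(labels) >= 3 and ".".join(labels[-2:]) in TWO_LEVEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


def extract_domains(text):
    """Return the unique reduced domains found in text, in order of appearance."""
    found = []
    for match in URI_HOST_RE.finditer(text):
        domain = registered_domain(match.group(1))
        if domain and domain not in found:
            found.append(domain)
    return found
```

For example, extract_domains('see http://www.example.com/a and http://shop.example.co.uk/b') would give example.com and example.co.uk. Note that the match does not depend on a preceding "href", so it also picks up URIs in plain text.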
Thanks for your consideration. Feel free to quote some or all of this e-mail at will.
Rob McEwen PowerView Systems rob@PowerViewSystems.com (478) 475-9032
-----Original Message-----
From: Jeff Chan [mailto:jeffc@surbl.org]
Sent: Monday, May 17, 2004 5:45 PM
To: Rob McEwen
Cc: jeffc@surbl.org; webmaster@PowerViewSystems.com
Subject: Re: finding RegEx for extracting URIs
On Monday, May 17, 2004, 5:22:50 AM, Rob McEwen wrote:
I'm trying to find a good regular expression for extracting URIs from the raw text of an e-mail message. This would help tremendously for programming a SURBL-based filter. Do you know of any such regular expression? The ones I've found on the internet so far are not very good, for numerous reasons.
Hi Rob, I think your best bet would be to copy the code from urirhsbl in SpamAssassin URIDNSBL or SpamCopURI's code:
http://spamassassin.org/full/3.0.x/dist/lib/Mail/SpamAssassin/Plugin/URIDNSBL.pm
Jeff C.