On Mon, May 17, 2004 at 05:19:39PM -0700, Jeff Chan wrote:
On Monday, May 17, 2004, 5:07:32 PM, David Coulson wrote:
I've got a decent mailbox containing a variety of spam e-mail. Is there a nice little Perl script out there which will spit out the URLs so I can submit them to Bill's list?
Someone else asked about this recently, saying he could not find a good message body URI parser. Presumably the reason is that it's a little more complicated than it may seem at first, given the need to decode MIME, handle weird cases, etc. I suggested starting with some of the code from SpamCopURI or urirhsbl from the SA 3.0 URIBL module.
I wrote one for my company's use, but I'm not certain of my ability to release it publicly... have to check in on that.
A short list of the necessary decodes:
- MIME
- UUencode
- undo MIME word wraps (quoted-printable soft line breaks, /=$/)
- URL %HH encoding
- HTML &#decimal; encoding
- HTML &#xhex; encoding
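As a rough illustration of the text-level decodes in that list, here is a minimal Python sketch. It is not the parser described in this thread. The function names (undo_soft_breaks, decode_layers) are made up for the example, and the order in which the layers are peeled is an assumption. MIME part handling and UUencode are left out.

```python
import html
import re
from urllib.parse import unquote


def undo_soft_breaks(text):
    # A quoted-printable soft line break is "=" at the very end of a line;
    # dropping it together with the newline rejoins the wrapped text.
    return re.sub(r"=\r?\n", "", text)


def decode_layers(text):
    # Assumed order: rejoin wrapped lines first, then HTML character
    # references (&#104; and &#x68;), then URL %HH escapes.
    text = undo_soft_breaks(text)
    text = html.unescape(text)
    text = unquote(text)
    return text
```

For instance, decode_layers("&#104;ttp://&#x65;xample.com/") gives back "http://example.com/". A real parser would probably repeat the decodes until the text stops changing, since spammers nest encodings.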
Note, it's a good idea to parse for URL-like things both before and after the MIME and UUencode decodings.
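The "before and after" idea can be sketched in Python with the standard email package. This is only an assumption about how such a scan might look, not the code discussed here. URL_RE is a deliberately simple pattern and extract_urls is a hypothetical name. UUencoded bodies are not handled.

```python
import email
import re
from email import policy

# Deliberately simple pattern: http(s) URLs up to whitespace, quotes or angle brackets.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def extract_urls(raw_message):
    # Pass 1: scan the raw text, headers included, before any decoding.
    found = set(URL_RE.findall(raw_message))

    # Pass 2: let the email package undo the transfer encoding
    # (base64, quoted-printable) of each text part, then scan again.
    msg = email.message_from_string(raw_message, policy=policy.default)
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers and binary attachments
        found.update(URL_RE.findall(part.get_content()))
    return found
```

The first pass can pick up fragments of wrapped or encoded URLs, so the result is a superset that still needs filtering before submission to a list.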
We wrote a generic message parser that our Perl SMTPD replacement uses, and the command line tools run against the same code. This means we're always consistent in our interpretation of URLs. This becomes necessary when you're also matching for "traditional" URIs like stock symbols and phone numbers.