[SURBL-Discuss] mbox parser
tdeppner at surewest.net
Mon May 17 23:38:31 CEST 2004
On Mon, May 17, 2004 at 05:19:39PM -0700, Jeff Chan wrote:
> On Monday, May 17, 2004, 5:07:32 PM, David Coulson wrote:
> > I've got a decent mailbox containing a variety of spam e-mail. Is there
> > a nice little Perl script out there which will spit out the URLs so I
> > can submit them to Bill's list?
> Someone else asked about this recently, saying he could not find
> a good message body URI parser. Presumably the reason is that
> it's a little more complicated than it may seem at first, given
> the need to decode MIME, weird cases, etc. I suggested starting
> with some of the code from SpamCopURI or urirhsbl from the SA 3.0
> URIBL module.
I wrote one for my company's use, but I'm not certain of my ability to
release it publicly... have to check in on that.
A short list of the necessary decodes:
- undo MIME (quoted-printable) soft line breaks (/=$/)
- URL %HH encoding
- HTML &#decimal; encoding
- HTML &#xhex; encoding
Note, it's a good idea to scan for URL-like things both before and after
MIME and uuencode decoding.
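
To make the decode list above concrete, here is a rough sketch in Python.
It is not the parser described in this post (that one is Perl and has not
been released); the function names, the URL regex, and the sample text are
all made up for illustration. It only covers the soft line break, %HH, and
numeric entity cases, and leaves base64 and uuencode decoding out.

```python
import html
import re
from urllib.parse import unquote

# Deliberately loose pattern for "URL-like things"; an assumption for this
# sketch, not a complete URI grammar.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def undo_soft_breaks(text):
    """Remove quoted-printable soft line breaks ('=' at end of line)."""
    return re.sub(r"=\r?\n", "", text)


def decode_layers(text):
    """Apply the decodes from the list: soft breaks, entities, %HH."""
    text = undo_soft_breaks(text)
    text = html.unescape(text)  # handles &#NNN; and &#xHH; references
    text = unquote(text)        # handles %HH escapes
    return text


def extract_urls(raw):
    """Collect URL-like strings both before and after decoding."""
    found = set(URL_RE.findall(raw))
    found.update(URL_RE.findall(decode_layers(raw)))
    return sorted(found)


if __name__ == "__main__":
    sample = "Visit http://exam=\nple.com/%70ills and http://&#115;pam.example/ now"
    for url in extract_urls(sample):
        print(url)
```

Scanning the raw text as well as the decoded text means the obfuscated
forms are kept alongside the cleaned-up ones, which can be useful if you
want to match on the obfuscation itself.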
We wrote a generic message parser that our Perl SMTPD replacement uses,
and the command line tools run against the same code. This means we're
always consistent in our interpretation of URLs. This becomes necessary
when you're also matching for "traditional" URIs like stock symbols and