This is a forwarded message
From: Theo Van Dinter felicity@kluge.net
To: SURBL Discussion list discuss@lists.surbl.org, SpamAssassin Developers spamassassin-dev@incubator.apache.org
Date: Saturday, September 4, 2004, 10:36:53 AM
Subject: [SURBL-Discuss] checking plain domains in message bodies against SURBLs reportedly effective
===8<==============Original message text===============
On Sat, Sep 04, 2004 at 10:45:44AM -0600, Ryan Thompson wrote:
Yep. Good idea, overall. There are a few gotchas:
TLD extensions sometimes map to file extensions. We might have to whitelist command.com, and the entire country of Poland. :-)
Since the domain is in plain text and doesn't contain a protocol or subdomain (i.e., 'www'), I haven't yet seen a mail client that will display it as a clickable URL.
This is generally the tack we're taking in SpamAssassin -- if a general MUA doesn't display it as a link, then we don't consider it a URL.
Another issue for the generic domains thing is performance -- lots of messages have lots of things that could potentially look like a domain, and querying for them all adds a bit of a load on the client and the server.
For instance: /\b([a-zA-Z0-9_.-]{1,256}\.[a-zA-Z]{2,6})\b/
in theory (I haven't tested it), will grab anything that looks like a generic domain name in text. If you check the matches against a list of valid TLDs, you'd probably end up with a decent list, but you'd hit the first issue quoted above: in "Go take a look at command.com", it isn't clear whether command.com is a URL or a filename.
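
A rough sketch of that idea in Python, not SpamAssassin's actual code: it applies the pattern above (with the dot before the TLD escaped, which is an assumption about the intent), keeps only matches whose last label is in a TLD list, and de-duplicates and caps the result to limit the query load. The names candidate_domains, VALID_TLDS and MAX_CANDIDATES are made up for illustration, and the TLD set is a tiny sample rather than the real list.

```python
import re

# Pattern from the message, with the dot before the TLD escaped (assumed intent).
DOMAIN_RE = re.compile(r"\b([a-zA-Z0-9_.-]{1,256}\.[a-zA-Z]{2,6})\b")

# Illustrative subset only; a real check would load the full list of valid TLDs.
VALID_TLDS = {"com", "net", "org", "info", "biz", "pl"}

# Hypothetical cap on how many names one message may contribute to lookups.
MAX_CANDIDATES = 20


def candidate_domains(text, valid_tlds=VALID_TLDS, limit=MAX_CANDIDATES):
    """Return unique domain-like strings from text whose TLD is in valid_tlds."""
    found = []
    for match in DOMAIN_RE.finditer(text):
        name = match.group(1).lower()
        # The pattern guarantees at least one dot, so the split always yields a TLD.
        tld = name.rsplit(".", 1)[1]
        if tld not in valid_tlds:
            continue  # e.g. "foo.txt" is dropped because "txt" is not a TLD here
        if name not in found:
            found.append(name)  # de-duplicate so each name is queried once
        if len(found) >= limit:
            break  # bound the number of lookups per message
    return found
```

The TLD filter does nothing for the ambiguity described above: candidate_domains("Go take a look at command.com") still returns command.com, and with "pl" in the set a Perl file name such as script.pl passes as well, so a whitelist or the "would an MUA make it clickable" rule is still needed on top.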