[SURBL-Discuss] Re: Bug in Spamcop's surbl add-on module

Jeff Chan jeffc at surbl.org
Thu May 6 22:29:53 CEST 2004


On Thursday, May 6, 2004, 8:17:39 PM, Robert Menschel wrote:

> Thursday, May 6, 2004, 3:52:55 AM, you wrote:
JC>> OK I've added a "new"-style regex to remove any subdomains on
JC>> generic TLD domains:
JC>>     http://www.icann.org/tlds/
JC>>
JC>> s/^([^\.]*\.)+([^\.]*)\.(com|net|org|edu|mil|biz|info|int|arpa|name|museum|coop|aero|pro)$/\2.\3/
>                                                                   ^^^^

> The name.tld started life as a 3-level TLD.  Many people have individual
> abc.def.name domains (eg: my own robert.menschel.name).

> If you strip that third level, then if someone registers
> spammer.menschel.name (which I have no control over, since I cannot
> register menschel.name) and spammer.menschel.name then gets added to
> your lists, my robert.menschel.name will be collateral damage.

> I realize that since the .name TLD now accepts both 2-level and 3-level
> domains (2-level is OK if nobody owns a 3-level domain with that 2nd
> level), this may be a very complex issue.

Thanks for the heads up, Bob!  To be safe, I've removed .name from
this regex.  This list should contain only TLDs where just the second
level is directly registrable.  Now it looks like:

  s/^([^\.]*\.)+([^\.]*)\.(com|net|org|edu|mil|biz|info|int|arpa|museum|coop|aero|pro)$/\2.\3/

Any name not on this list may be processed at the third level,
which of course includes many geographic TLDs.
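
For anyone who wants to try the effect of dropping .name without
running sed, here is a minimal sketch in Python.  It is only a
rendering of the sed expression above, not what runs on the SURBL
side; make_reducer, reduce_new and reduce_old are hypothetical names
made up for the illustration, and the example hostnames other than
Bob's are invented.

```python
import re

# Generic TLDs from the revised sed expression (no "name").
NEW_TLDS = "com|net|org|edu|mil|biz|info|int|arpa|museum|coop|aero|pro"
# The earlier list, which still included "name".
OLD_TLDS = "com|net|org|edu|mil|biz|info|int|arpa|name|museum|coop|aero|pro"


def make_reducer(tlds):
    # Python rendering of the sed substitution (hypothetical helper).
    # [^.] here plays the role of [^\.] in the sed version.
    pattern = re.compile(r"^([^.]*\.)+([^.]*)\.(" + tlds + r")$")
    # Keep only group 2 and group 3, like \2.\3 in sed.
    return lambda host: pattern.sub(r"\2.\3", host)


reduce_new = make_reducer(NEW_TLDS)
reduce_old = make_reducer(OLD_TLDS)

for host in ("www.example.com", "a.b.example.biz", "example.org",
             "robert.menschel.name", "spammer.menschel.name"):
    print(host, "->", reduce_new(host), "| old list:", reduce_old(host))
```

With the old list both .name hosts collapse to menschel.name, which
is the collateral damage Bob described; with the new list they are
left alone.  A name that is already at the second level, such as
example.org, does not match and passes through unchanged.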

Anyone know if .int or .arpa have any similar properties?  OTOH,
the more unusual TLDs don't seem to be used in spam very often;
we see a lot of com and biz mostly.

<teaching mode for anyone interested>

I probably should have mentioned that this is POSIX "new-style"
(extended) regex syntax as used with sed, and that it differs from
Perl, for example, in referring to the memorized portions in the
replacement as \2 and \3 instead of $2 and $3 as they would be in
a Perl regex.

To explain the regex: [^\.] is the class of characters other than
dot, so ([^\.]*\.)+ means one or more repetitions of "zero or more
non-dot characters followed by a dot".  That is followed by
([^\.]*)\. , i.e. zero or more non-dot characters and then a dot,
followed by com, or net, or org, etc.  Only the last two groups are
output, as \2 dot \3; \1 can be compound (several labels) and is
dropped.  Caret ^ and dollar $ anchor the pattern to the start and
end of line, probably unnecessarily.  Those *s could probably be +s,
where + means one or more and * means zero or more.
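
The two "probably"s above can be poked at with a small sketch, again
in Python rather than sed, so take it as an approximation of the
behaviour and not a statement about the sed script itself.  The
names star, plus, unanchored and strip_subdomains are hypothetical,
and the test hostnames are invented.

```python
import re

TLDS = "com|net|org|edu|mil|biz|info|int|arpa|museum|coop|aero|pro"

# As posted, with * (zero or more non-dot characters per label).
star = re.compile(r"^([^.]*\.)+([^.]*)\.(" + TLDS + r")$")
# Same pattern with the *s changed to +s (at least one character).
plus = re.compile(r"^([^.]+\.)+([^.]+)\.(" + TLDS + r")$")
# Same as the posted pattern, but without the ^ and $ anchors.
unanchored = re.compile(r"([^.]*\.)+([^.]*)\.(" + TLDS + r")")


def strip_subdomains(pattern, host):
    # Equivalent of the \2.\3 replacement in the sed expression.
    return pattern.sub(r"\2.\3", host)


# An empty label: * accepts it and leaves a bare ".com" behind,
# while + refuses to match and the input is left untouched.
print(strip_subdomains(star, "www..com"))
print(strip_subdomains(plus, "www..com"))

# A generic TLD appearing in the middle of a longer name: without
# the anchors the substitution fires on the leading part.
print(strip_subdomains(star, "www.example.com.evil.cn"))
print(strip_subdomains(unanchored, "www.example.com.evil.cn"))
```

In this rendering the +s look like the safer choice for malformed
input, and the anchors seem to do some work after all when a line
is not guaranteed to end in one of the listed TLDs.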

</teaching mode for anyone interested>

Jeff C.


