On Thursday, May 6, 2004, 8:17:39 PM, Robert Menschel wrote:
Thursday, May 6, 2004, 3:52:55 AM, you wrote:
JC>> OK I've added a "new"-style regex to remove any subdomains on
JC>> generic TLD domains:
JC>> http://www.icann.org/tlds/
JC>>
JC>> s/^([^.]*\.)+([^.]*)\.(com|net|org|edu|mil|biz|info|int|arpa|name|museum|coop|aero|pro)$/\2.\3/
                                                                  ^^^^
The .name TLD started life as a 3-level TLD. Many people have individual abc.def.name domains (e.g., my own robert.menschel.name).
If you strip that third level, then if someone registers spammer.menschel.name (which I have no control over, since I cannot register menschel.name), and spammer.menschel.name then gets added to your lists, my robert.menschel.name will be collateral damage.
I realize that since the .name TLD now accepts both 2-level and 3-level domains (2-level is OK if nobody owns a 3-level domain with that 2nd level), this may be a very complex issue.
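To make the concern concrete, here is a minimal sketch of what the regex with .name still in the list would do to both names. It assumes a sed that accepts -E for extended regexes (GNU and BSD sed both do) and that the dots in the pattern are written as \. so they match only a literal dot. The function name strip_with_name is hypothetical, used only for this illustration.

```shell
#!/bin/sh
# Sketch only: the regex with .name still listed, applied via sed -E.
# strip_with_name is a hypothetical helper, not part of any real tool.
strip_with_name() {
    printf '%s\n' "$1" |
        sed -E 's/^([^.]*\.)+([^.]*)\.(com|net|org|edu|mil|biz|info|int|arpa|name|museum|coop|aero|pro)$/\2.\3/'
}

# Both third-level names collapse to the same second-level key.
strip_with_name robert.menschel.name    # prints menschel.name
strip_with_name spammer.menschel.name   # prints menschel.name
```

Since both inputs reduce to menschel.name, a listing triggered by the spammer's name would also match the legitimate one.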
Thanks for the heads-up, Bob! To be safe, I've removed .name from this regex. This list should contain only TLDs where just the second level is directly registrable. Now it looks like:
s/^([^.]*\.)+([^.]*)\.(com|net|org|edu|mil|biz|info|int|arpa|museum|coop|aero|pro)$/\2.\3/
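For anyone who wants to try it, a minimal sketch of running the revised expression follows. It assumes sed -E (extended regex mode in GNU and BSD sed) and literal dots written as \. in the pattern; strip_subdomains is a hypothetical wrapper name, and the example hostnames are made up.

```shell
#!/bin/sh
# Sketch only: the revised regex (no .name) wrapped in a hypothetical helper.
strip_subdomains() {
    printf '%s\n' "$1" |
        sed -E 's/^([^.]*\.)+([^.]*)\.(com|net|org|edu|mil|biz|info|int|arpa|museum|coop|aero|pro)$/\2.\3/'
}

strip_subdomains www.example.com        # prints example.com
strip_subdomains a.b.example.biz        # prints example.biz
strip_subdomains example.com            # prints example.com (no subdomain, so no match)
strip_subdomains robert.menschel.name   # prints robert.menschel.name (.name is not listed)
```

Names that do not match the pattern pass through sed unchanged, which is why the last two lines come back as they went in.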
Any name not on this list may be processed at the third level, which of course includes many geographic TLDs.
Anyone know if .int or .arpa have any similar properties? OTOH, the more unusual TLDs don't seem to be used in spam very often; we see a lot of com and biz mostly.
<teaching mode for anyone interested>
I probably should have mentioned that this is POSIX "new-style" (extended) regex syntax as used with sed, and that it differs from Perl, for example, in referring to the memorized portions as \2 and \3 instead of $2 and $3 as they would be in a Perl regex.
To explain the regex: [^.] is the class of characters other than dot, so ([^.]*\.)+ means one or more repetitions of a sequence of zero or more non-dot characters followed by a dot. That is followed by ([^.]*)\. or zero or more non-dot characters followed by a dot, which is followed in turn by com, or net, or org, etc. Only the last two character sequences will be output, as \2 dot \3, where \1 can be compound. Caret ^ and dollar $ anchor the pattern to the start and end of line, probably unnecessarily. Those *s could probably be +s, where + means 1 or more and * means zero or more.
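The remark about * versus + can be checked with a small sketch. It assumes sed -E and literal dots written as \. in the pattern. The names TLDS, strip_star and strip_plus are hypothetical, and a..com is a deliberately malformed input made up to show the difference.

```shell
#!/bin/sh
# Sketch only: compare the * form with a + form of the same pattern.
TLDS='com|net|org|edu|mil|biz|info|int|arpa|museum|coop|aero|pro'

# Form as posted: each label may be empty ([^.]*).
strip_star() {
    printf '%s\n' "$1" | sed -E "s/^([^.]*\.)+([^.]*)\.($TLDS)\$/\2.\3/"
}

# Variant: each label must have at least one character ([^.]+).
strip_plus() {
    printf '%s\n' "$1" | sed -E "s/^([^.]+\.)+([^.]+)\.($TLDS)\$/\2.\3/"
}

strip_star www.example.com   # prints example.com
strip_plus www.example.com   # prints example.com
strip_star a..com            # prints .com (empty second-level label accepted)
strip_plus a..com            # prints a..com (no match, input left alone)
```

On well-formed names the two forms agree; the + form simply declines to rewrite names with empty labels.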
</teaching mode for anyone interested>
Jeff C.