On Wednesday, May 5, 2004, 2:10:36 PM, Chris Santerre wrote:
From: Jeff Chan [mailto:jeffc@surbl.org]
Hi Chris, Would you mind if I added a quick regex to remove any third or higher level domains under .com, .biz, .net, .info, etc. from domains before they go into be? It wouldn't be perfect but it could help some.
In other words trim down e.1asphost.com to 1asphost.com (etc) in my own data munging?
Jeff my friend, nothing would make me happier :)
OK I've added a "new"-style regex to remove any subdomains on generic TLD domains:
s/^([^.]*\.)+([^.]*)\.(com|net|org|edu|mil|biz|info|int|arpa|name|museum|coop|aero|pro)$/\2.\3/
It seems to do the right thing, both on test cases and the actual data, so it's now live on all the lists. If anyone sees any problems with this regex, please let me know.
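For anyone who wants to try the substitution outside the list-building scripts, here is a minimal Python sketch of the same idea. It assumes the separators in the pattern are meant as literal dots (written as \.), and the function name strip_subdomains and the sample hostnames other than e.1asphost.com are made up for illustration; this is not the actual code that feeds the lists.

```python
import re

# gTLD alternatives copied from the substitution in this message.
GTLDS = "com|net|org|edu|mil|biz|info|int|arpa|name|museum|coop|aero|pro"

# Group 1: one or more leading labels, each followed by a literal dot.
# Group 2: the registered label. Group 3: the gTLD.
SUBDOMAIN_RE = re.compile(r"^([^.]*\.)+([^.]*)\.(" + GTLDS + r")$")


def strip_subdomains(domain):
    """Return domain reduced to label.gTLD; other names pass through unchanged."""
    return SUBDOMAIN_RE.sub(r"\2.\3", domain)


if __name__ == "__main__":
    for name in ("e.1asphost.com", "a.b.example.net",
                 "example.com", "www.example.co.uk"):
        print(name, "->", strip_subdomains(name))
```

Names that are already two levels deep, and names under TLDs that are not in the list (country codes, for instance), do not match the pattern and are left as they are.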
Bill's domains from sa-blacklist are already in the correct form :-) and have no subdomains on these gTLD domains going into ws.surbl.org. I also added it to sc.surbl.org, where it did get rid of a few errant records, so I should probably announce the change. Subdomains are now properly removed in be and sc, as they should have been.
This should result in better matching on both be and sc since the clients are supposed to be doing similar things with message URIs.
Jeff C.