On Mon, Apr 19, 2004 at 01:08:11PM +1200, Simon Byrnand wrote:
At 12:43 19/04/2004, Jeff Chan wrote:
- Extract URIs from message bodies. (Extraction of URIs
from message bodies should ideally include full resolution of redirections into the final target domain name. This can be a non-trivial problem.)
Indeed :)
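For illustration, a minimal sketch of the extraction step only, in Python rather than the Perl that SpamCopURI uses. The pattern `URI_RE` and the helper `extract_hosts` are hypothetical names, not taken from SpamCopURI or SA. The sketch handles only plain http/https URIs and makes no attempt to resolve redirections, which is the non-trivial part mentioned above.

```python
import re
from urllib.parse import urlsplit

# Hypothetical pattern: http/https only, stopping at whitespace and at
# characters that commonly delimit a URI in text or HTML.
URI_RE = re.compile(r"https?://[^\s<>\"')]+", re.IGNORECASE)


def extract_hosts(body):
    """Return the lowercased host of each http(s) URI in body, in order."""
    hosts = []
    for match in URI_RE.finditer(body):
        try:
            host = urlsplit(match.group(0)).hostname
        except ValueError:
            # urlsplit rejects some malformed URIs (e.g. a broken IPv6 literal)
            continue
        if host:
            hosts.append(host)
    return hosts
```

Obfuscated URIs, other schemes and redirector links would all need extra handling on top of this.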
- Extract base (registrar) domains from those URIs. This
includes removing any and all leading host names, subdomains, www., randomized subdomains, etc. In order to determine the base domain it may be necessary to use a table of country code TLDs (ccTLDs) such as the partially incomplete one SURBL uses.
OK, now this one worries me a little bit - how well is this handled currently in SpamCopURI and SA 3.0? While I was looking through the SpamCopURI source code, I saw a comment that said:
# # take foo.bar.yahoo.com to yahoo.com
# # this kind of breaks for co.uk and
# # we could get false domain level matches
Here in New Zealand our domain hierarchy is delegated at the 3rd level, the same as .uk - the country code is .nz and the second level is one of only a few specifically allowed by the registrar - co, net, gen, school, govt and a few others... (I can't remember them all offhand, but there are fewer than 10)
It's the third level which is delegated to individual organisations. For example our email domain is igrin.co.nz.
If a spammer were to register a domain in NZ it would look like:
spammer.co.nz or spammer.net.nz or spammer.gen.nz etc.

Randomised subdomains that they could create on their own nameservers would look like a65423xyz.spammer.co.nz or awef3242.fssf342.spammer.co.nz etc.
Will the current code (of both SpamCopURI and, for that matter, the backend processing of the SURBL servers) incorrectly strip this down to co.nz? I ask because I have definitely seen DNS queries from SpamCopURI trying to look up co.nz.sc.surbl.org, which is wrong. That name would cover a large fraction of the websites under the NZ domain hierarchy. It should be looking up spammer.co.nz, never co.nz.
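To make the expected behaviour concrete, here is a minimal Python sketch of table-driven base domain extraction. It is not SpamCopURI or SURBL code. `TWO_LEVEL_SUFFIXES` and `base_domain` are hypothetical names, and the table holds only the second-level names mentioned in this message, so it is deliberately incomplete.

```python
# Assumed, deliberately incomplete table: second-level names under which
# organisations register at the third level. A real table would be far larger.
TWO_LEVEL_SUFFIXES = {
    "co.nz", "net.nz", "gen.nz", "school.nz", "govt.nz",
    "co.uk",
}


def base_domain(host):
    """Return the registered domain for host, or None for a bare suffix."""
    labels = host.lower().rstrip(".").split(".")
    suffix = ".".join(labels[-2:])
    if suffix in TWO_LEVEL_SUFFIXES:
        if len(labels) < 3:
            return None  # bare "co.nz": nothing registrable to look up
        return ".".join(labels[-3:])  # keep one label in front of the suffix
    return suffix  # ordinary case, e.g. foo.bar.yahoo.com -> yahoo.com
```

With this table, awef3242.fssf342.spammer.co.nz reduces to spammer.co.nz and foo.bar.yahoo.com to yahoo.com, while co.nz on its own is never returned as something to look up. IP address hosts and suffixes missing from the table are not handled.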
Currently SpamCopURI checks both the 2nd and 3rd level domain regardless of the TLD. I believe SA 3.0 does a little better job of this.
Worst case scenario is two queries instead of one.
--eric
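As a rough Python illustration of the behaviour Eric describes (my reading of his description, not the actual SpamCopURI code): with no ccTLD table, the client builds a query name for both the 2nd-level and the 3rd-level candidate. The function name `candidate_queries` is hypothetical. The default zone is the sc.surbl.org name from the queries quoted above. No DNS lookup is performed here.

```python
def candidate_queries(host, zone="sc.surbl.org"):
    """Build the 2nd- and 3rd-level query names for host, regardless of TLD."""
    labels = host.lower().rstrip(".").split(".")
    names = []
    for depth in (2, 3):
        if len(labels) >= depth:
            # e.g. depth 2 on x.spammer.co.nz gives co.nz.<zone>
            names.append(".".join(labels[-depth:]) + "." + zone)
    return names
```

For a65423xyz.spammer.co.nz this produces co.nz.sc.surbl.org and spammer.co.nz.sc.surbl.org, which would explain the co.nz lookups Simon saw. The extra query is harmless as long as co.nz itself is never listed.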
Is there any reliable way for the code to know what a base registrar domain is and how many tiers there are under that domain hierarchy? (May also be a non-trivial problem)
Regards, Simon
Discuss mailing list Discuss@lists.surbl.org http://lists.surbl.org/mailman/listinfo/discuss