I'm adding a very brief section to the Quick Start of the SURBL site:
Implementation guidelines
Here are some very brief instructions for folks writing software to use SURBL lists: Your code should:
- Extract URIs from message bodies
- (Extraction of URIs from message bodies should ideally
include full resolution of redirections into the final target domain name. This can be a non-trivial problem.)
- Extract base (registrar) domains from those URIs
- Not do name resolution on the domains
- Look up the domain name in the SURBL by prepending it to
the name of the SURBL, e.g., domainundertest.com.sc.surbl.org then doing Address record DNS resolution. A non-result indicates lack of inclusion in the list. A result of 127.0.0.2 represents inclusion.
- Handle numeric IPs in URIs similarly, but reverse the
octet ordering before comparison against the RBL. This is standard practice for RBLs. For example, http://1.2.3.4/ is checked as 4.3.2.1.sc.surbl.org
SURBL lists are unusual in having both names and numbers in the same list. For example, 2.0.0.127 and example.com, as well as actual spam domains and addresses, are in all SURBL lists. Numeric addresses in SURBLs should have occurred as numbers in spams.
Can anyone think of any additions, corrections, or suggestions before I announce it more broadly?
Jeff C.
I've updated the SURBL Implementation Guidelines page slightly:
http://www.surbl.org/implementation.html
Implementation Guidelines
Here are some very brief guidelines for folks writing software to use SURBL lists: Your code should:
1. Extract URIs from message bodies. (Extraction of URIs from message bodies should ideally include full resolution of redirections into the final target domain name. This can be a non-trivial problem.)
2. Extract base (registrar) domains from those URIs. This includes removing any and all leading host names, subdomains, www., randomized subdomains, etc. In order to determine the base domain it may be necessary to use a table of country code TLDs (ccTLDs) such as the partially incomplete one SURBL uses.
3. Not do name resolution on the domains.
4. Look up the domain name in the SURBL by prepending it to the name of the SURBL, e.g., domainundertest.com.sc.surbl.org, then doing Address record DNS resolution. A non-result indicates lack of inclusion in the list. A result of 127.0.0.2 represents inclusion, i.e., probable spam.
5. Handle numeric IPs in URIs similarly, but reverse the octet ordering before comparison against the RBL. This is standard practice for RBLs. For example, http://1.2.3.4/ is checked as 4.3.2.1.sc.surbl.org.
SURBL lists unusually have both names and numbers in the same list. For example, 2.0.0.127 and test.surbl.org and similar actual spam domains and addresses are both in all SURBL lists. Numbered addresses in SURBLs should have occurred in spams as numbers, e.g.: literally http://1.2.3.4/.
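The lookup steps above can be sketched in Python. This is only a minimal illustration, not code from any SURBL client: the function names surbl_query_name and is_listed are made up for this example, the zone defaults to sc.surbl.org as in the guidelines, and the host passed in is assumed to be already reduced to its base domain (or to be a literal IPv4 address).

```python
import ipaddress
import socket


def surbl_query_name(host, zone="sc.surbl.org"):
    """Build the DNS name to query for a base domain or a numeric IPv4 host.

    Hypothetical helper: the name and signature are assumptions for this sketch.
    """
    host = host.strip().rstrip(".").lower()
    try:
        addr = ipaddress.IPv4Address(host)
    except ValueError:
        # Not a dotted quad: treat it as a base domain and prepend it as-is.
        return f"{host}.{zone}"
    # Numeric IP: reverse the octet order, as is usual for RBLs.
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(host, zone="sc.surbl.org"):
    """Return True if the A record lookup answers 127.0.0.2 (needs network access)."""
    try:
        answer = socket.gethostbyname(surbl_query_name(host, zone))
    except socket.gaierror:
        # No answer (or a resolution failure): treat the name as not listed.
        return False
    return answer == "127.0.0.2"


print(surbl_query_name("domainundertest.com"))
print(surbl_query_name("1.2.3.4"))
```

Note that the only DNS query made is against the SURBL zone itself; the domain under test is never resolved, in line with step 3.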
Would still like comments about anything I may have left out or anything else before I announce it.
Thanks,
Jeff C.
At 12:43 19/04/2004, Jeff Chan wrote:
- Extract URIs from message bodies. (Extraction of URIs
from message bodies should ideally include full resolution of redirections into the final target domain name. This can be a non-trivial problem.)
Indeed :)
- Extract base (registrar) domains from those URIs. This
includes removing any and all leading host names, subdomains, www., randomized subdomains, etc. In order to determine the base domain it may be necessary to use a table of country code TLDs (ccTLDs) such as the partially incomplete one SURBL uses.
Ok, now this one worries me a little bit - how well is this handled currently in SpamCopURI and SA 3.0 ? Because while I was looking through the SpamCopURI source code, I saw a comment that said:
# take foo.bar.yahoo.com to yahoo.com
# this kind of breaks for co.uk and
# we could get false domain level matches
Here in New Zealand our domain hierarchy is three-level, the same as .uk: the country code is .nz and the second level is one of only a few specifically allowed by the registrar: co, net, gen, school, govt and a few others (I can't remember them all offhand, but there are fewer than 10).
It's the third level which is delegated to individual organisations. For example our email domain is igrin.co.nz.
If a spammer were to register a domain in NZ it would look like:
spammer.co.nz or spammer.net.nz or spammer.gen.nz etc.... randomised subdomains that they could create on their own nameservers would look like a65423xyz.spammer.co.nz or awef3242.fssf342.spammer.co.nz etc...
Will the current code (of both SpamCopURI and the backend processing of the SURBL servers, for that matter) incorrectly strip this down to co.nz? I ask because I have definitely seen DNS queries from SpamCopURI trying to look up co.nz.sc.surbl.org, which is wrong: that would cover a large fraction of the websites under the NZ domain hierarchy. It should be looking up spammer.co.nz, never co.nz.
Is there any reliable way for the code to know what a base registrar domain is and how many tiers there are under that domain hierarchy? (This may also be a non-trivial problem.)
Regards, Simon
On Sunday, April 18, 2004, 6:08:11 PM, Simon Byrnand wrote:
At 12:43 19/04/2004, Jeff Chan wrote:
- Extract base (registrar) domains from those URIs. This
includes removing any and all leading host names, subdomains, www., randomized subdomains, etc. In order to determine the base domain it may be necessary to use a table of country code TLDs (ccTLDs) such as the partially incomplete one SURBL uses.
[...]
If a spammer were to register a domain in NZ it would look like:
spammer.co.nz or spammer.net.nz or spammer.gen.nz etc.... randomised subdomains that they could create on their own nameservers would look like a65423xyz.spammer.co.nz or awef3242.fssf342.spammer.co.nz etc...
Will the current code (of both SpamCopURI and the backend processing of the SURBL servers, for that matter) incorrectly strip this down to co.nz? I ask because I have definitely seen DNS queries from SpamCopURI trying to look up co.nz.sc.surbl.org, which is wrong: that would cover a large fraction of the websites under the NZ domain hierarchy. It should be looking up spammer.co.nz, never co.nz.
Is there any reliable way for the code to know what a base registrar domain is and how many tiers there are under that domain hierarchy? (This may also be a non-trivial problem.)
The traditional solution to ccTLDs (Country Code TLDs) seems to be to make a table of them, and make sure any extracted domains are +1 domain levels longer. So for company.co.nz, don't take co.nz as the base domain, but instead use company.co.nz since we know from the table that co.nz is a two level country code TLD. My slightly incomplete table of ccTLDs is at:
http://spamcheck.freeapp.net/two-level-tlds
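The table approach described above can be sketched as follows. This is an illustration under stated assumptions rather than the code SURBL or SpamAssassin actually uses: the function name base_domain is hypothetical, and TWO_LEVEL_TLDS is a tiny hand-picked subset built from the suffixes mentioned in this thread, not the real table.

```python
# Tiny illustrative subset of a two-level TLD table; a real table is much longer.
TWO_LEVEL_TLDS = {"co.nz", "net.nz", "gen.nz", "school.nz", "govt.nz", "co.uk"}


def base_domain(hostname, two_level_tlds=TWO_LEVEL_TLDS):
    """Strip leading host names and subdomains down to the registered domain.

    Returns None when the name is too short to contain a registered domain,
    e.g. a bare two-level TLD such as co.nz.
    """
    labels = hostname.strip().rstrip(".").lower().split(".")
    # Default: keep the last two labels, e.g. foo.bar.yahoo.com -> yahoo.com.
    keep = 2
    # If the last two labels form a known two-level TLD, keep one more label.
    if ".".join(labels[-2:]) in two_level_tlds:
        keep = 3
    if len(labels) < keep:
        return None
    return ".".join(labels[-keep:])


print(base_domain("a65423xyz.spammer.co.nz"))
print(base_domain("foo.bar.yahoo.com"))
```

Returning None for a bare co.nz means a client built this way would never query co.nz.sc.surbl.org, which is the failure case raised earlier in the thread.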
I think SpamAssassin (3.0?) in general has code to do that. I'm sure SpamCop's internal processing of URIs also takes it into account. I'm not sure how Eric's SpamCopURI currently handles it. I do know that the current sc.surbl.org data engine will capture them correctly. I have somewhat of a kludge to get rid of the two-level ccTLDs that would otherwise get through: the engine does all the processing on them, then their output is suppressed with a whitelist that includes the two-level ccTLD domains. It would probably be better to increase the cutoff in my code from two levels to three whenever handling a two-level ccTLD such as co.nz. That would prevent the processing of two-level ccTLDs themselves in the first place while still leaving the processing of longer ccTLD domains (i.e., complete ones like company.co.nz) in place.
So the quick answer is that the data side of sc.surbl.org has it pretty much covered, and I'm not sure about the message parsing side of things in SA 2.63 and 3.0.
Jeff C.
At 13:49 19/04/2004, Jeff Chan wrote:
On Sunday, April 18, 2004, 6:08:11 PM, Simon Byrnand wrote:
At 12:43 19/04/2004, Jeff Chan wrote:
- Extract base (registrar) domains from those URIs. This
includes removing any and all leading host names, subdomains, www., randomized subdomains, etc. In order to determine the base domain it may be necessary to use a table of country code TLDs (ccTLDs) such as the partially incomplete one SURBL uses.
[...]
If a spammer were to register a domain in NZ it would look like:
spammer.co.nz or spammer.net.nz or spammer.gen.nz etc.... randomised subdomains that they could create on their own nameservers would look like a65423xyz.spammer.co.nz or awef3242.fssf342.spammer.co.nz etc...
Will the current code (of both SpamCopURI and the backend processing of the SURBL servers, for that matter) incorrectly strip this down to co.nz? I ask because I have definitely seen DNS queries from SpamCopURI trying to look up co.nz.sc.surbl.org, which is wrong: that would cover a large fraction of the websites under the NZ domain hierarchy. It should be looking up spammer.co.nz, never co.nz.
Is there any reliable way for the code to know what a base registrar domain is and how many tiers there are under that domain hierarchy? (This may also be a non-trivial problem.)
The traditional solution to ccTLDs (Country Code TLDs) seems to be to make a table of them, and make sure any extracted domains are +1 domain levels longer. So for company.co.nz, don't take co.nz as the base domain, but instead use company.co.nz since we know from the table that co.nz is a two level country code TLD. My slightly incomplete table of ccTLDs is at:
Hmm, well your list has .co.nz and .net.nz but not .school.nz (as an example)
What are the relative proportions of one-level to two-level country code TLDs?
Are there any other one-level hierarchies used by countries, apart from the generic .com .org .net .biz etc.? It might be easier (and safer?) to assume the other way around: assume it's a two-level country code unless listed. Then you only have to list the top level (.com, for example) rather than trying to keep track of things like .co.nz, .net.nz and so on, which are subject to change at the discretion of the local registrar...
Maybe I missed something :)
Regards, Simon
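For comparison, the inverse rule suggested in the message above (assume a two-level country code suffix unless the TLD is listed) might look like this. It is a sketch only: base_domain_inverse and the contents of ONE_LEVEL_TLDS are assumptions made for the example, not an existing table.

```python
# Hypothetical short list of TLDs that delegate directly at the second level.
ONE_LEVEL_TLDS = {"com", "org", "net", "biz", "info"}


def base_domain_inverse(hostname, one_level_tlds=ONE_LEVEL_TLDS):
    """Keep two labels for listed TLDs and three labels for everything else."""
    labels = hostname.strip().rstrip(".").lower().split(".")
    keep = 2 if labels[-1] in one_level_tlds else 3
    # Slicing tolerates short names: example.de stays example.de.
    return ".".join(labels[-keep:])


print(base_domain_inverse("a65423xyz.spammer.school.nz"))
print(base_domain_inverse("foo.bar.yahoo.com"))
print(base_domain_inverse("www.example.de"))
```

The last call shows the trade-off of this rule: for a country code that registers names directly at the second level, one subdomain label survives the stripping, so a listing for the base domain would be missed unless that TLD is added to the list.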
On Sunday, April 18, 2004, 6:58:14 PM, Simon Byrnand wrote:
At 13:49 19/04/2004, Jeff Chan wrote:
The traditional solution to ccTLDs (Country Code TLDs) seems to be to make a table of them, and make sure any extracted domains are +1 domain levels longer. So for company.co.nz, don't take co.nz as the base domain, but instead use company.co.nz since we know from the table that co.nz is a two level country code TLD. My slightly incomplete table of ccTLDs is at:
Hmm, well your list has .co.nz and .net.nz but not .school.nz (as an example)
OK, I added school.nz. Anyone know any others to add? Contact me off-list. :-) The list of ccTLDs came mostly from a registrar's:
http://www.bestregistrar.com/help/ccTLD.htm
What are the relative proportions of one-level to two-level country code TLDs?
See below. In terms of spam domains ccTLDs are not a major problem. .com, .biz, .net have far more spam domains.
Are there any other one-level hierarchies used by countries, apart from the generic .com .org .net .biz etc.? It might be easier (and safer?) to assume the other way around: assume it's a two-level country code unless listed. Then you only have to list the top level (.com, for example) rather than trying to keep track of things like .co.nz, .net.nz and so on, which are subject to change at the discretion of the local registrar...
Yes, that's part of the problem. Local TLD authorities seem to be able to add whatever TLDs they like under their own CC. Still I think ccTLDs should be regarded as minor. Certainly they are not a major destination for spam messages. Given that, handling the non-ccTLDs as a first priority is probably the most efficient.
Here are some relative rankings of the TLDs in domain reports I have from a couple weeks worth of SpamCop URI reports:
TLD    Count of reports
---    ----
com    1938
biz    424
net    322
info   90
org    79
us     39
ru     21
de     20
tv     13
nl     12
to     10
ph     8
cn     8
cc     7
br     7
tw     6
pl     6
ch     6
ws     5
it     5
fr     5
es     5
ro     4
jp     4
cl     4
nu     3
kr     3
cz     3
co     3
za     2
uk     2
se     2
pt     2
Jeff C.
On Mon, Apr 19, 2004 at 01:08:11PM +1200, Simon Byrnand wrote:
At 12:43 19/04/2004, Jeff Chan wrote:
- Extract URIs from message bodies. (Extraction of URIs
from message bodies should ideally include full resolution of redirections into the final target domain name. This can be a non-trivial problem.)
Indeed :)
- Extract base (registrar) domains from those URIs. This
includes removing any and all leading host names, subdomains, www., randomized subdomains, etc. In order to determine the base domain it may be necessary to use a table of country code TLDs (ccTLDs) such as the partially incomplete one SURBL uses.
Ok, now this one worries me a little bit - how well is this handled currently in SpamCopURI and SA 3.0 ? Because while I was looking through the SpamCopURI source code, I saw a comment that said:
# take foo.bar.yahoo.com to yahoo.com
# this kind of breaks for co.uk and
# we could get false domain level matches
Here in New Zealand our domain hierarchy is three-level, the same as .uk: the country code is .nz and the second level is one of only a few specifically allowed by the registrar: co, net, gen, school, govt and a few others (I can't remember them all offhand, but there are fewer than 10).
It's the third level which is delegated to individual organisations. For example our email domain is igrin.co.nz.
If a spammer were to register a domain in NZ it would look like:
spammer.co.nz or spammer.net.nz or spammer.gen.nz etc.... randomised subdomains that they could create on their own nameservers would look like a65423xyz.spammer.co.nz or awef3242.fssf342.spammer.co.nz etc...
Will the current code (of both SpamCopURI and the backend processing of the SURBL servers, for that matter) incorrectly strip this down to co.nz? I ask because I have definitely seen DNS queries from SpamCopURI trying to look up co.nz.sc.surbl.org, which is wrong: that would cover a large fraction of the websites under the NZ domain hierarchy. It should be looking up spammer.co.nz, never co.nz.
Currently SpamCopURI checks both the 2nd and 3rd level domain regardless of the TLD. I believe SA 3.0 does a little better job of this.
Worst case scenario is two queries instead of one.
--eric
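The "check both levels" behaviour described in the message above can be sketched like this. It is not SpamCopURI's actual code; candidate_domains is a hypothetical name, and the sketch only shows which names would be queried.

```python
def candidate_domains(hostname):
    """Return the 2nd-level and 3rd-level domains of a host name, when present."""
    labels = hostname.strip().rstrip(".").lower().split(".")
    candidates = []
    for depth in (2, 3):
        # Skip a depth the host name is too short to provide.
        if len(labels) >= depth:
            candidates.append(".".join(labels[-depth:]))
    return candidates


for name in candidate_domains("a65423xyz.spammer.co.nz"):
    print(name + ".sc.surbl.org")
```

For a65423xyz.spammer.co.nz this yields co.nz and spammer.co.nz, which is why queries for co.nz.sc.surbl.org show up: no TLD table is needed, at the cost of at most two queries per host name.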
Is there any reliable way for the code to know what a base registrar domain is and how many tiers there are under that domain hierarchy? (This may also be a non-trivial problem.)
Regards, Simon
Discuss mailing list Discuss@lists.surbl.org http://lists.surbl.org/mailman/listinfo/discuss
On Sunday, April 18, 2004, 7:53:46 PM, Eric Kolve wrote:
Currently SpamCopURI checks both the 2nd and 3rd level domain regardless of the TLD. I believe SA 3.0 does a little better job of this.
Sounds good. That should catch everything with few false positives, since we're filtering out most ccTLDs on the data side and not too many get reported in the first place.
Jeff C.
Jeff Chan wrote:
On Sunday, April 18, 2004, 7:53:46 PM, Eric Kolve wrote:
Currently SpamCopURI checks both the 2nd and 3rd level domain regardless of the TLD. I believe SA 3.0 does a little better job of this.
Sounds good. That should catch everything with few false positives, since we're filtering out most ccTLDs on the data side and not too many get reported in the first place.
Why not declare wildcard records in DNS to handle the randomness with a single DNS query?
Instead of declaring
spammer.com A 127.0.0.1
you could declare:
spammer.com A 127.0.0.2
*.spammer.com A 127.0.0.2
Does this work?
I run a home rbl and I do this to block an entire IP class.
Jose-Marcio
Jeff C.
On Tuesday, April 20, 2004, 8:39:50 AM, Jose Cruz wrote:
Jeff Chan wrote:
On Sunday, April 18, 2004, 7:53:46 PM, Eric Kolve wrote:
Currently SpamCopURI checks both the 2nd and 3rd level domain regardless of the TLD. I believe SA 3.0 does a little better job of this.
Sounds good. That should catch everything with few false positives, since we're filtering out most ccTLDs on the data side and not too many get reported in the first place.
Why not declare wildcard records in DNS to handle the randomness with a single DNS query?
Instead of declaring
spammer.com A 127.0.0.1
you could declare:
spammer.com A 127.0.0.2
*.spammer.com A 127.0.0.2
Does this work?
We're taking the opposite approach and removing the wildcards on the client side before comparing them to the SURBL. That way the SURBL only gets the base domains. The net results should be similar; the difference is where the randomness is resolved. We also considered doing wildcard DNS but like the former approach better.
Jeff C.
On Tuesday, April 20, 2004, 8:39:50 AM, Jose Cruz wrote:
Jeff Chan wrote:
On Sunday, April 18, 2004, 7:53:46 PM, Eric Kolve wrote:
Currently SpamCopURI checks both the 2nd and 3rd level domain regardless of the TLD. I believe SA 3.0 does a little better job of this.
Sounds good. That should catch everything with few false positives, since we're filtering out most ccTLDs on the data side and not too many get reported in the first place.
Why not declare wildcard records in DNS to handle the randomness with a single DNS query?
Instead of declaring
spammer.com A 127.0.0.1
you could declare:
spammer.com A 127.0.0.2
*.spammer.com A 127.0.0.2
Does this work?
We're taking the opposite approach and removing the wildcards on the client side before comparing them to the SURBL. That way the SURBL only gets the base domains. The net results should be similar; the difference is where the randomness is resolved. We also considered doing wildcard DNS but like the former approach better.
I'm not 100% sure on the behaviour of wildcard records, but if a client looks up a record that is actually a wildcard on the server, can the local nameserver cache it as a wildcard, or does it just cache the specific match ?
In other words, if SA looked up abc.spammer.com.sc.surbl.org and the abc part was actually a wildcard, would the local caching nameserver cache abc.spammer.com.sc.surbl.org or *.spammer.com.sc.surbl.org ?
If a second query came along for xyz.spammer.com.sc.surbl.org and abc was specifically cached, it couldn't be returned from the local cache.
With the current system of stripping the domains before making the query, the local caching nameserver should be able to do a better job of caching requests in that case...
So which way does it actually work in practice ? :)
Regards, Simon
Simon Byrnand wrote:
In other words, if SA looked up abc.spammer.com.sc.surbl.org and the abc part was actually a wildcard, would the local caching nameserver cache abc.spammer.com.sc.surbl.org or *.spammer.com.sc.surbl.org ?
abc.spammer.com.sc.surbl.org
The local DNS server never sees the wildcard.
David
On Tuesday, April 20, 2004, 6:14:31 PM, David Coulson wrote:
Simon Byrnand wrote:
In other words, if SA looked up abc.spammer.com.sc.surbl.org and the abc part was actually a wildcard, would the local caching nameserver cache abc.spammer.com.sc.surbl.org or *.spammer.com.sc.surbl.org ?
abc.spammer.com.sc.surbl.org
The local DNS server never sees the wildcard.
Good to know. So wildcards sound like they *don't* necessarily save on DNS traffic, right?
Jeff C.
Jeff Chan wrote:
Good to know. So wildcards sound like they *don't* necessarily save on DNS traffic, right?
Right, but if one has a local AXFR or rsync of the zone, that's not an issue. They DO massively save on memory on the DNS server if you've got a massive zone.
David
On Tuesday, April 20, 2004, 6:27:00 PM, David Coulson wrote:
Jeff Chan wrote:
Good to know. So wildcards sound like they *don't* necessarily save on DNS traffic, right?
Right, but if one has a local AXFR or rsync of the zone, that's not an issue. They DO massively save on memory on the DNS server if you've got a massive zone.
Going with the base domain on the client side sounds like it was a good move then... ;-)
Jeff C.
Jeff Chan wrote:
Good to know. So wildcards sound like they *don't* necessarily save on DNS traffic, right?
Right, but if one has a local AXFR or rsync of the zone, that's not an issue. They DO massively save on memory on the DNS server if you've got a massive zone.
Only if the wildcard is replacing multiple entries though ? Which is not the case with surbl.org. Here we're considering the difference between having one entry like:
spammer.com IN A 127.0.0.2
or one entry like:
*.spammer.com IN A 127.0.0.2
With the current approach, individual randomized (or not) subdomains aren't being separately listed anyway; they are stripped down and collated into their registrar-level domain names before going into the zone files. (Right, Jeff?)
Same number of records, just a different representation which requires the client end to do the same stripping down (slightly more work), but with the added bonus of much better caching on the client nameservers.
Regards, Simon
On Tuesday, April 20, 2004, 6:59:29 PM, Simon Byrnand wrote:
*.spammer.com IN A 127.0.0.2
With the current approach, individual randomized (or not) subdomains aren't being separately listed anyway; they are stripped down and collated into their registrar-level domain names before going into the zone files. (Right, Jeff?)
Yes, the wildcard case isn't really a factor in the way we're currently doing things with SURBLs. It was an issue we considered before and decided not to go with, so it's of historical interest.
Jeff C.
Simon Byrnand wrote:
Jeff Chan wrote:
Good to know. So wildcards sound like they *don't* necessarily save on DNS traffic, right?
Right, but if one has a local AXFR or rsync of the zone, that's not an issue. They DO massively save on memory on the DNS server if you've got a massive zone.
Only if the wildcard is replacing multiple entries though ? Which is not the case with surbl.org. Here we're considering the difference between having one entry like:
spammer.com IN A 127.0.0.2
or one entry like:
*.spammer.com IN A 127.0.0.2
With the current approach, individual randomized (or not) subdomains aren't being separately listed anyway; they are stripped down and collated into their registrar-level domain names before going into the zone files. (Right, Jeff?)
Same number of records, just a different representation which requires the client end to do the same stripping down (slightly more work), but with the added bonus of much better caching on the client nameservers.
No. You need to have both records. The first will match only the domain itself, "spammer.com", and the second will match everything else. The wildcard doesn't match the domain itself. So the number of records is doubled, but maybe I'm wrong.
It seems to me that wildcards are what spammers use to get hostname randomness.
But, IMHO, all this doesn't really matter. What's important is to optimize the global delay, which is the sum of:
- DNS query handling delay
- network delay
- client query handling delay
I don't have enough data but I'll surely have some benchmarks soon. It seems to me that network delay is much larger than the others. So the better approach is probably the one that generates less network traffic.
Best,
Jose-Marcio
Regards, Simon
On Wednesday, April 21, 2004, 12:02:26 AM, Jose Cruz wrote:
Simon Byrnand wrote:
With the current approach, individual randomized (or not) subdomains aren't being separately listed anyway; they are stripped down and collated into their registrar-level domain names before going into the zone files. (Right, Jeff?)
No. You need to have both records. The first will match only the domain itself, "spammer.com", and the second will match everything else. The wildcard doesn't match the domain itself. So the number of records is doubled, but maybe I'm wrong.
It seems to me that wildcards are what spammers use to get hostname randomness.
We're discarding the randomness on the client end by stripping off all the subdomains and host names, random or not.
Or at least that's what any code using SURBLs *should be doing*, because that's what's represented in the list data: base domains. We want to compare base domains extracted from the messages against the base domains in the SURBLs. One source of confusion is that sc.surbl.org data includes some of the more common randomized subdomains; future versions of the data won't. Only base domains will be included in future versions of sc.surbl.org, so only base domains should be compared by the client program from now on.
It also happens that this makes for more streamlined use of DNS for the RBL.
Jeff C.
Simon Byrnand wrote:
In other words, if SA looked up abc.spammer.com.sc.surbl.org and the abc part was actually a wildcard, would the local caching nameserver cache abc.spammer.com.sc.surbl.org or *.spammer.com.sc.surbl.org ?
abc.spammer.com.sc.surbl.org
The local DNS server never sees the wildcard.
OK, definitely no good from the point of view of local caching then, which is what I thought. The current method of stripping the domain back to the registrar level and querying that seems to be the most efficient from the client end's perspective, as randomized subdomains can be handled efficiently since only the base domain needs to be cached...
Regards, Simon
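The caching argument above can be illustrated with a toy model. This is not a real resolver: count_upstream_queries is a hypothetical helper that only models a cache keyed by the exact query name, which is the behaviour described in the thread, and the host names are the ones used in the discussion.

```python
def count_upstream_queries(query_names):
    """Count cache misses for a resolver that caches by exact query name."""
    cache = {}
    misses = 0
    for name in query_names:
        if name not in cache:
            misses += 1
            cache[name] = "127.0.0.2"  # pretend the list server answered
    return misses


ZONE = "sc.surbl.org"
hosts = ["abc.spammer.com", "xyz.spammer.com", "abc.spammer.com"]

# Wildcard style: the full host name is sent, so each new subdomain is a miss.
full_names = [f"{host}.{ZONE}" for host in hosts]

# Client-side stripping: every host collapses to the same base-domain query.
base_names = [".".join(host.split(".")[-2:]) + "." + ZONE for host in hosts]

print(count_upstream_queries(full_names))
print(count_upstream_queries(base_names))
```

With randomized subdomains the wildcard style produces one upstream query per distinct subdomain, while the stripped style produces one per base domain, however many subdomains appear in the messages.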