ccTLDs and multiple queries

Eric Kolve

21 Apr 2004

4:37 p.m.

Initially, when I released SpamCopURI, I decided to pretty much ignore whether the TLD was a country code or not. This results in about twice as many queries as necessary, but guarantees you get hits if the domain is listed.

Now that people are pointing this to other RBLs besides just SURBL, should we continue to do second- and third-level queries? Or just the query that we assume to be necessary? My concern is that not all RBLs will process the domains according to a list such as http://www.bestregistrar.com/help/ccTLD.htm. I suppose the worst-case scenario is that we end up getting a miss when we should be getting a hit, because one side presumes that, say, TLD .za has a subdomain 'foo' when the server doesn't. The server side would expect a second level, while the client would do a third-level query (this is why I wanted the wildcard records). I guess this really isn't that great a consequence considering the savings and the fact that this shouldn't occur very often.

I will go ahead and make this change if everyone is comfortable with the known risk.

thanks,

--eric
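
To make the trade-off above concrete, here is a minimal Python sketch of the "query both levels" behaviour. It is not SpamCopURI's actual code: the function name `candidate_queries` and the zone `rbl.example.org` are placeholders chosen for illustration.

```python
from typing import List


def candidate_queries(host: str, zone: str = "rbl.example.org") -> List[str]:
    """Build the DNS names to look up for `host` without consulting any ccTLD list.

    Both the two-label and the three-label form are queried, so a listed
    domain is found whichever form the list uses, at the cost of extra lookups.
    """
    labels = host.lower().rstrip(".").split(".")
    queries = []
    for depth in (2, 3):
        # Skip a depth the host is too short for (e.g. "foo.com" has no third level).
        if len(labels) >= depth:
            queries.append(".".join(labels[-depth:]) + "." + zone)
    return queries
```

For `www.foo.com` the three-label query is the unnecessary one; for `www.foo.co.uk` it is the two-label query (`co.uk`) that can never usefully match.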

John Fawcett

21 Apr

9:21 p.m.

If an RHSBL is listing a second-level registry domain (like .co.uk), then I think it's up to the list maintainer to implement the wildcard so that xxxxx.co.uk returns an A record. I wouldn't worry about taking such an extreme case into account, since I cannot imagine any list wanting to do such widespread blocking.

I believe there should be a mechanism which distinguishes whether a second- or third-level lookup is required, based on static lists of domains known to have or not to have subdomains. If nothing is known, then the default should be to check both second and third levels, as at present.

John
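
The three-way decision suggested above (known to have subdomains, known not to, unknown) could be sketched roughly as follows. The list contents and the name `domains_to_check` are hypothetical examples, not data or code from any real client.

```python
from typing import List

# Hypothetical excerpts; real lists would be much longer.
THIRD_LEVEL_SUFFIXES = {"co.uk", "com.za", "com.br"}  # known to have subdomains
SECOND_LEVEL_TLDS = {"com", "net", "org", "biz"}      # known not to have them


def domains_to_check(host: str) -> List[str]:
    """Return the domain forms to look up, falling back to both when unsure."""
    labels = host.lower().rstrip(".").split(".")
    if ".".join(labels[-2:]) in THIRD_LEVEL_SUFFIXES:
        depths = [3]
    elif labels[-1] in SECOND_LEVEL_TLDS:
        depths = [2]
    else:
        depths = [2, 3]  # nothing known: check both, as at present
    return [".".join(labels[-d:]) for d in depths if len(labels) >= d]
```

With these sample lists, `www.foo.co.uk` and `www.foo.com` each cost one lookup, while a host under an unlisted TLD such as `.bg` still costs two.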

Jeff Chan

22 Apr

midnight

On Wednesday, April 21, 2004, 12:21:16 PM, John Fawcett wrote:

If an RHSBL is listing a second-level registry domain (like .co.uk), then I think it's up to the list maintainer to implement the wildcard so that xxxxx.co.uk returns an A record. I wouldn't worry about taking such an extreme case into account, since I cannot imagine any list wanting to do such widespread blocking.

Yes, the two-level ccTLDs like co.uk should never get into a SURBL. Only registrar-type domains should, like foo.co.uk.

...

I believe there should be a mechanism which distinguishes whether a second- or third-level lookup is required, based on static lists of domains known to have or not to have subdomains. If nothing is known, then the default should be to check both second and third levels, as at present.

Aha, now I think I understand what's being proposed.

Currently SpamCopURI checks all domains at the second and third level against a given SURBL, regardless of whether the domain is in a ccTLD or not.

It sounds like Eric is proposing a change where, if a domain falls under an entry in the ccTLD list such as co.uk, the client should try to extract and check a three-level domain like foo.co.uk. Otherwise it should check two levels, like foo.com.

Is that right? If so it may be ok, though our list of ccTLDs is slightly underspecified (there are some ccTLDs not in it). Note that my ccTLD list:

http://spamcheck.freeapp.net/two-level-tlds

is (derived from but) slightly more complete than the one at http://www.bestregistrar.com/help/ccTLD.htm ....

Worst case is that we miss a few ccTLDs. Probably not too big a deal given that most of the spam domains are .com, .biz, etc.

I believe Eric is also making a finer point that other SURBL data sources may miss some unexpected geographic domains where foo.za occurred and only two-level base-ccTLDs like foo.com.za were expected. Not sure how to handle unusual cases like that. I suppose we'll need to rely on the country code authorities to be somewhat consistent with respect to what levels they will allow in their ccTLD.

Philosophical point: it's always possible that some spam domains slip through the cracks, but if that happens often enough and we spot them, we can always blacklist them manually. Perfection may not be possible, but we're certainly greatly increasing the spam detection rates with this approach overall.

BTW I'm using the ccTLD list to try to ensure that any two level ccTLDs do *not* get into any SURBLs.

P.S. ***If anyone has more-complete ccTLD lists or any updates or additions, please share them,*** else spammers may set up shop in some unknown Outer Mongolian ccTLD, and we may not catch them. :-)

Jeff C.
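
As a rough illustration of the two uses of the ccTLD list described above (choosing how many levels to extract on the client, and keeping bare two-level ccTLDs out of a list on the server), here is a minimal sketch. The set contents and the names `registered_domain` and `safe_to_list` are assumptions made for the example, not the actual list or code.

```python
# Hypothetical excerpt of a two-level-tlds list.
TWO_LEVEL_TLDS = {"co.uk", "com.za", "com.br"}


def registered_domain(host: str) -> str:
    """Client side: keep three labels under a two-level ccTLD, otherwise two."""
    labels = host.lower().rstrip(".").split(".")
    depth = 3 if ".".join(labels[-2:]) in TWO_LEVEL_TLDS else 2
    return ".".join(labels[-depth:])


def safe_to_list(domain: str) -> bool:
    """Server side: refuse to list a bare two-level ccTLD such as co.uk."""
    return domain.lower().rstrip(".") not in TWO_LEVEL_TLDS
```

A ccTLD missing from the set simply falls through to the two-level branch, which is the "worst case" mentioned above.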

Eric Kolve

12:12 a.m.

On Wed, Apr 21, 2004 at 03:00:52PM -0700, Jeff Chan wrote:

It sounds like Eric is proposing a change where, if a domain falls under an entry in the ccTLD list such as co.uk, the client should try to extract and check a three-level domain like foo.co.uk. Otherwise it should check two levels, like foo.com.

Is that right? If so it may be ok, though our list of ccTLDs is slightly underspecified (there are some ccTLDs not in it). Note that my ccTLD list:

Yes. This is exactly what I am proposing.

Philosophical point: it's always possible that some spam domains slip through the cracks, but if that happens often enough and we spot them, we can always blacklist them manually. Perfection may not be possible, but we're certainly greatly increasing the spam detection rates with this approach overall.

My only concern is that we leave a wide enough hole that we end up playing catch-up while spammers run through the various ccTLDs that we have misclassified, using them for links.

Discuss mailing list Discuss@lists.surbl.org http://lists.surbl.org/mailman/listinfo/discuss

Jeff Chan

12:43 a.m.

On Wednesday, April 21, 2004, 3:12:58 PM, Eric Kolve wrote:

...

On Wed, Apr 21, 2004 at 03:00:52PM -0700, Jeff Chan wrote:

It sounds like Eric is proposing a change where, if a domain falls under an entry in the ccTLD list such as co.uk, the client should try to extract and check a three-level domain like foo.co.uk. Otherwise it should check two levels, like foo.com.

Is that right? If so it may be ok, though our list of ccTLDs is slightly underspecified (there are some ccTLDs not in it). Note that my ccTLD list:

...

Yes. This is exactly what I am proposing.

Kewl. Sounds good to me. I'm cc'ing the SpamAssassin developers to compare notes on how they're handling ccTLDs in message body URI checks.

My only concern is that we leave a wide enough hole that we end up playing catch-up while spammers run through the various ccTLDs that we have misclassified, using them for links.

Aha, but if a domain is not in the ccTLD list, won't we check it on two levels on both the client and server sides and therefore catch it?

In other words, if somenewspamdomain.bg comes up and it's not in our ccTLD list, our client and server programs will automatically test it as a two-level domain and eventually catch it. In that case I think we're OK, and the only danger is blocking new legitimate two-level ccTLDs that we're not yet aware of, like newlegitimatetld.bg.

Jeff C.
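
The argument above, that an unknown ccTLD is reduced to two levels on both sides and therefore still matches, can be checked with a small simulation. This is only a sketch: the helper names and the sample list are hypothetical, and a plain Python set stands in for the DNS zone.

```python
from typing import Iterable, Set


def base_domain(host: str, two_level_tlds: Set[str]) -> str:
    """Reduce a host name to two labels, or three under a known two-level ccTLD."""
    labels = host.lower().rstrip(".").split(".")
    depth = 3 if ".".join(labels[-2:]) in two_level_tlds else 2
    return ".".join(labels[-depth:])


def build_blocklist(spam_hosts: Iterable[str], two_level_tlds: Set[str]) -> Set[str]:
    """Server side: reduce hosts seen in spam to the base domains that get listed."""
    return {base_domain(h, two_level_tlds) for h in spam_hosts}


def client_hit(host: str, blocklist: Set[str], two_level_tlds: Set[str]) -> bool:
    """Client side: reduce the host the same way, then test for membership."""
    return base_domain(host, two_level_tlds) in blocklist


KNOWN = {"co.uk", "com.br"}  # hypothetical excerpt; .bg is deliberately absent
blocklist = build_blocklist(["www.somenewspamdomain.bg"], KNOWN)
print(client_hit("pills.somenewspamdomain.bg", blocklist, KNOWN))  # True
```

The match depends on both sides using the same list. If the server knew `com.za` and the client did not, the client would reduce `www.foo.com.za` to `com.za` and miss a listed `foo.com.za`.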

Jeff Chan

21 Apr

11:36 p.m.

Not sure I'm understanding the proposal. Remember that the goal should be for the clients to check the registrar-type base domain against the RBL. If foo.co.uk is the registered domain then that's what the client should extract and it's what the RBL should have if there's to be a match.

Please clue me in. ;-)

Jeff C.
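
For readers unfamiliar with how such a check is performed, a minimal sketch of the lookup follows: the extracted base domain is prepended to the list's zone and resolved, and any A record counts as a hit. The zone `rbl.example.org` and the function `is_listed` are placeholders, and the resolver is injectable so the logic can be exercised without network access.

```python
import socket
from typing import Callable


def is_listed(base_domain: str, zone: str,
              resolve: Callable[[str], str] = socket.gethostbyname) -> bool:
    """Return True if `<base_domain>.<zone>` resolves to an A record."""
    query = f"{base_domain}.{zone}"
    try:
        resolve(query)
    except OSError:  # socket.gaierror (name not found) is a subclass of OSError
        return False
    return True


def fake_resolver(name: str) -> str:
    """Stand-in for DNS: only foo.co.uk is 'listed' in this toy zone."""
    if name == "foo.co.uk.rbl.example.org":
        return "127.0.0.2"
    raise socket.gaierror("name not found")
```

With the fake resolver, `is_listed("foo.co.uk", "rbl.example.org", fake_resolver)` is a hit, while querying `co.uk` or `www.foo.co.uk` is not, which is why client and list must agree on the extracted form.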

Jose-Marcio.Martins＠ensmp.fr

11:54 p.m.

Jeff Chan wrote:

Not sure I'm understanding the proposal. Remember that the goal should be for the clients to check the registrar-type base domain against the RBL. If foo.co.uk is the registered domain then that's what the client should extract and it's what the RBL should have if there's to be a match.

Yeah. But you are assuming that all the rules are defined on the pages mentioned earlier (bestregistrar, ...).

But there are exceptions. Take Brazilian domains, for example. Most Brazilian domains have three components, but not all, e.g. "cta.br" and "ita.br". These aren't spammers, but an engineering school and a research center of the Brazilian Air Force.

Your assumptions rest on the premise that you know all the rules, but this isn't true. On the other hand, I don't know how significant the exceptions are.

Best,

Jose-Marcio

--
Jose Marcio MARTINS DA CRUZ    Tel.: (33) 01.40.51.93.41
Ecole des Mines de Paris       http://j-chkmail.ensmp.fr
60, bd Saint Michel            http://www.ensmp.fr/~martins
75272 - PARIS CEDEX 06         mailto:Jose-Marcio.Martins@ensmp.fr

Jeff Chan

22 Apr

12:06 a.m.

On Wednesday, April 21, 2004, 2:54:01 PM, Jose-Marcio Martins wrote:

...

Yeah. But you are assuming that all the rules are defined on the pages mentioned earlier (bestregistrar, ...).

...

But there are exceptions. Take Brazilian domains, for example. Most Brazilian domains have three components, but not all, e.g. "cta.br" and "ita.br". These aren't spammers, but an engineering school and a research center of the Brazilian Air Force.

Thanks. I just added those to our two-level-tld list. Got any more? :-)

...

Your assumptions rest on the premise that you know all the rules, but this isn't true.

Yes we don't know everything, but we need to make certain assumptions to be able to write code. Hopefully those assumptions are not too far removed from reality. So far the results suggest that they are not.

That said, the handling of ccTLDs is certainly somewhat open-ended. Countries and registrars can change their policies at any time. So it will take some minor effort to watch for changes.

Jeff C.

Eric Kolve

12:17 a.m.

On Wed, Apr 21, 2004 at 03:06:52PM -0700, Jeff Chan wrote:

...

On Wednesday, April 21, 2004, 2:54:01 PM, Jose-Marcio Martins wrote:

...
Yeah. But you are assuming that all the rules are defined on the pages mentioned earlier (bestregistrar, ...).

...
But there are exceptions. Take Brazilian domains, for example. Most Brazilian domains have three components, but not all, e.g. "cta.br" and "ita.br". These aren't spammers, but an engineering school and a research center of the Brazilian Air Force.

Not to beat a dead horse, but one benefit of having a wildcard A record would be that we only have to keep this kind of logic in one place (surbl). As it stands, both the client and the server need to keep their rules in sync in order for queries to hit. With the wildcard a client could query for the entire domain without worrying about what constitutes a ccTLD or not.

--eric
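
One way to picture the wildcard idea: if the zone held both `domain` and `*.domain` for every listed domain, a client could query the full host name and let the server side decide where the registered domain starts. The sketch below only mimics that effect with suffix matching against a set; it is a simplification of real DNS wildcard semantics, and `LISTED` and `wildcard_lookup` are hypothetical names.

```python
# Hypothetical list entries; each is assumed to be published as both
# "domain" and "*.domain" in the zone.
LISTED = {"foo.co.uk", "spamdomain.com"}


def wildcard_lookup(host: str) -> bool:
    """Return True if the full host name, or any parent of it, is listed."""
    labels = host.lower().rstrip(".").split(".")
    # Walk from the full name up through each parent domain.
    return any(".".join(labels[i:]) in LISTED for i in range(len(labels)))
```

Here the client never consults a ccTLD list: `www.foo.co.uk` hits because its parent `foo.co.uk` is listed, while `www.bbc.co.uk` and bare `co.uk` do not.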

Jeff Chan

2:06 a.m.

On Wednesday, April 21, 2004, 3:17:50 PM, Eric Kolve wrote:

...

On Wed, Apr 21, 2004 at 03:06:52PM -0700, Jeff Chan wrote:

...
On Wednesday, April 21, 2004, 2:54:01 PM, Jose-Marcio Martins wrote:

...
Yeah. But you are assuming that all the rules are defined on the pages mentioned earlier (bestregistrar, ...).

...
But there are exceptions. Take Brazilian domains, for example. Most Brazilian domains have three components, but not all, e.g. "cta.br" and "ita.br". These aren't spammers, but an engineering school and a research center of the Brazilian Air Force.

Oh, I just realized I misread Jose-Marcio's comment: cta.br and ita.br are not TLDs but regular domains under an otherwise two-level ccTLD system (.br). So I've taken them off the two-level-tld list and moved them to the regular whitelist. (Functionally there's not much difference, but it's good to keep the spirit of the lists correct in usage.)
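
The distinction between the two lists can be sketched as follows, assuming the whitelist is consulted after the base domain has been extracted. The set contents and function names are illustrative only and do not reflect the real list files.

```python
TWO_LEVEL_TLDS = {"com.br", "co.uk"}  # hypothetical excerpt of the two-level-tld list
WHITELIST = {"cta.br", "ita.br"}      # regular domains that should never be flagged


def base_domain(host: str) -> str:
    """Reduce a host name to two labels, or three under a known two-level ccTLD."""
    labels = host.lower().rstrip(".").split(".")
    depth = 3 if ".".join(labels[-2:]) in TWO_LEVEL_TLDS else 2
    return ".".join(labels[-depth:])


def is_whitelisted(host: str) -> bool:
    """True if the host's base domain is on the whitelist."""
    return base_domain(host) in WHITELIST
```

Because `cta.br` is not in the two-level list, `www.cta.br` reduces to `cta.br` and is caught by the whitelist, while `www.foo.com.br` still reduces to `foo.com.br` and is checked normally.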

...

Not to beat a dead horse, but one benefit of having a wildcard A record would be that we only have to keep this kind of logic in one place (surbl). As it stands, both the client and the server need to keep their rules in sync in order for queries to hit. With the wildcard a client could query for the entire domain without worrying about what constitutes a ccTLD or not.

I'm not understanding how wildcarding would help. Can you give an example?

The only thing I can think of is that the lack of a wildcard would say that a (parent) domain was ok, whereas a wildcard would say that subdomains (child domains) are spammy. Probably that interpretation will seem wrong given some good examples or further explanation. :-)

Jeff C.
