Simon Byrnand, Eric Kolve and I were discussing which characters are legal in domain names, because junk showing up around URIs is apparently confusing some of the SpamAssassin URI parsing code. I wanted to share some research and ask whether anyone has other authoritative information on which characters are currently legal in domain names. This is relevant to anyone working with domain names.
Also, Eric, please share any bugs you find in the SA URI parsing code, preferably by opening a Bugzilla ticket, especially if you can isolate the module, etc.:
http://bugzilla.spamassassin.org/enter_bug.cgi
Here's a little research on the subject:
The original domain name RFC allowed only letters, digits and hyphens in names:
http://www.ietf.org/rfc/rfc1035.txt
<domain> ::= <subdomain> | " "
<subdomain> ::= <label> | <subdomain> "." <label>
<label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]
<ldh-str> ::= <let-dig-hyp> | <let-dig-hyp> <ldh-str>
<let-dig-hyp> ::= <let-dig> | "-"
<let-dig> ::= <letter> | <digit>
<letter> ::= any one of the 52 alphabetic characters A through Z in upper case and a through z in lower case
<digit> ::= any one of the ten digits 0 through 9
Note that while upper and lower case letters are allowed in domain names, no significance is attached to the case. That is, two names with the same spelling but different case are to be treated as if identical.
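To make the grammar above concrete, here is a minimal Python sketch of a checker for it. The names LABEL_RE, is_rfc1035_name and same_name are made up for illustration and are not taken from SpamAssassin; the sketch only covers the <subdomain> branch of the grammar (not the " " alternative) and ignores length limits.

```python
import re

# <label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]
# i.e. starts with a letter, ends with a letter or digit,
# with letters, digits and hyphens allowed in between.
LABEL_RE = re.compile(r"[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def is_rfc1035_name(name: str) -> bool:
    """Return True if every dot-separated label matches the RFC 1035 grammar."""
    return all(LABEL_RE.fullmatch(label) for label in name.split("."))


def same_name(a: str, b: str) -> bool:
    """Compare two names without attaching significance to case."""
    return a.lower() == b.lower()


if __name__ == "__main__":
    for candidate in ["spamassassin.org", "bad-.com", "exa_mple.com", "a..b"]:
        print(candidate, is_rfc1035_name(candidate))
```

Anything that fails this check (underscores, a trailing hyphen, empty labels) falls outside the RFC 1035 syntax, which is a stricter question than whether the DNS can carry it.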
But RFC 2181 leaves things wide open with respect to names:
http://www.ietf.org/rfc/rfc2181.txt
11. Name syntax
Occasionally it is assumed that the Domain Name System serves only the purpose of mapping Internet host names to data, and mapping Internet addresses to host names. This is not correct, the DNS is a general (if somewhat limited) hierarchical database, and can store almost any kind of data, for almost any purpose.
The DNS itself places only one restriction on the particular labels that can be used to identify resource records. That one restriction relates to the length of the label and the full name. The length of any one label is limited to between 1 and 63 octets. A full domain name is limited to 255 octets (including the separators). The zero length full name is defined as representing the root of the DNS tree, and is typically written and displayed as ".". Those restrictions aside, any binary string whatever can be used as the label of any resource record. Similarly, any binary string can serve as the value of any record that includes a domain name as some or all of its value (SOA, NS, MX, PTR, CNAME, and any others that may be added). Implementations of the DNS protocols must not place any restrictions on the labels that can be used. In particular, DNS servers must not refuse to serve a zone because it contains labels that might not be acceptable to some DNS client programs. A DNS server may be configurable to issue warnings when loading, or even to refuse to load, a primary zone containing labels that might be considered questionable, however this should not happen by default.
Note however, that the various applications that make use of DNS data can have restrictions imposed on what particular values are acceptable in their environment. For example, that any binary label can have an MX record does not imply that any binary name can be used as the host part of an e-mail address. Clients of the DNS can impose whatever restrictions are appropriate to their circumstances on the values they use as keys for DNS lookup requests, and on the values returned by the DNS. If the client has such restrictions, it is solely responsible for validating the data from the DNS to ensure that it conforms before it makes any use of that data.
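By contrast, the only checks RFC 2181 describes are on length. A rough Python sketch of those checks follows; the function name within_rfc2181_limits is hypothetical, labels are passed as raw bytes since any binary string is allowed, and the full-name count assumes one separator octet between adjacent labels, which is one reading of "including the separators" (counting the on-the-wire length octets would be slightly stricter).

```python
from typing import List

MAX_LABEL_OCTETS = 63
MAX_NAME_OCTETS = 255


def within_rfc2181_limits(labels: List[bytes]) -> bool:
    """Check only the length limits quoted from RFC 2181, section 11.

    The content of each label is deliberately not inspected.
    """
    if not labels:
        # The zero-length full name represents the root of the DNS tree.
        return True
    if any(not 1 <= len(label) <= MAX_LABEL_OCTETS for label in labels):
        return False
    # Assumption: one separator octet between each pair of adjacent labels.
    total = sum(len(label) for label in labels) + len(labels) - 1
    return total <= MAX_NAME_OCTETS


if __name__ == "__main__":
    print(within_rfc2181_limits([b"spamassassin", b"org"]))
    print(within_rfc2181_limits([b"\x00\xff any binary", b"org"]))
    print(within_rfc2181_limits([b"a" * 64, b"org"]))
```

A client such as a URI parser would then layer its own character restrictions on top of this, as the second quoted paragraph allows.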
After scanning the RFC descriptions that were linked from RFC 1035:
RFC 1035 Domain names - implementation and specification.
Authors: P.V. Mockapetris.
Date: Nov-01-1987
Formats: txt pdf
Obsoletes: RFC 0973, RFC 0882, RFC 0883
Updated by: RFC 1101, RFC 1183, RFC 1348, RFC 1876, RFC 1982, RFC 1995, RFC 1996, RFC 2065, RFC 2136, RFC 2181, RFC 2137, RFC 2308, RFC 2535, RFC 2845, RFC 3425, RFC 3658
Also: STD 0013
it appears that these may be the only two authoritative statements on what characters can be in domain names:
RFC 1035: letters, numbers, hyphen
RFC 2181: implementations should support anything
Does anyone have any more info on what characters are legal in domain names?
Jeff C.