- From: Alexander Dupuy <dupuy@smarts.com>
- Date: Mon, 4 Mar 1996 18:10:53 -0500
- To: uri@bunyip.com
A recently discovered bug in certain applications that didn't validate hostnames received from DNS has started a flurry of activity on the bind-workers mailing list concerning means of preventing things like youre-hosed-`rm -rf /`.crackers.org from being passed through DNS servers. This is basically a good thing; hostnames are limited to a restricted character set for good reasons of interoperability, and adding validation of these restrictions will close security holes.

However, the approaches implemented so far would restrict not only hostname records in DNS to the limited character set, but in fact all DNS domain names (though not the data associated with all RR types). Paul Vixie, the primary maintainer of BIND (the most popular DNS implementation), has expressed willingness to not impose the restrictions on all DNS domain names if and only if there is a standards-track RFC which specifies exactly which DNS names may be unrestricted, and how to deal with them.

Now I certainly don't want to reraise the issue of non-ASCII URLs; in any case, URLs use DNS for hostnames, and hostnames represented in DNS will certainly be restricted to the limited character set specified in RFCs, no matter what happens. On the other hand, recent URN proposals use DNS to manage the top-level parts of the URN namespace and use new DNS RR types to implement the URN->URL resolver location process.

Although I feel that "internationalized" URLs are probably a non-starter, for reasons of interoperability and the size of the installed base of URL-using applications, "internationalized" URNs (including the DNS-based part) are still possible, *if* one problem can be solved.

The problem is one that applies to both DNS and URNs; namely, identifiers in both namespaces are defined to be "case-insensitive" - a laudable human interface goal that makes URNs less likely to be mistranscribed. However, while "case-insensitive" is quite well-defined w.r.t. ASCII, it is not so well defined w.r.t. ISO 8859 codesets, and is poorly defined w.r.t. arbitrary 8-bit data.

The first issue that surfaces is that the various ISO 8859 codesets don't have any consistent grouping of alphabetic vs. non-alphabetic symbols in the upper half of the codeset. As a result, each different ISO 8859 codeset has a different case-folding mapping. Faced with 8-bit data in an unknown ISO 8859 codeset, there is no consistent case-folding which can be done. Introducing 16-bit codesets may even make things worse if they don't restrict the use of octets with the high bit cleared, since there would then be no way to distinguish between an ASCII character which should be case-folded and a part of a 16-bit character that shouldn't. This implies that any solution must identify the codeset in use; this may be difficult to do for DNS without introducing incompatibilities with one of the most widely deployed internet protocols.

Even if you can solve the issue of identifying the codeset in question, the problem isn't completely solved. Different cultures have different ideas about alphabetization of accented letters. In a culture where accented letters are alphabetized the same as unaccented letters (just as capitalized letters are alphabetized the same as lowercase), users might reasonably assume that a case-insensitive system would treat the accents as optional. It's not entirely clear to me what the correct behavior of a "case-insensitive" system should be, although I lean towards folding all accented characters together with unaccented characters.

Does anyone on this list have suggestions about how this problem could be dealt with? Or alternatively, convincing reasons why it's not an important problem, and can be ignored?

@alex
--
inet: dupuy@smarts.com
Member of the League for Programming Freedom -- write to lpf@uunet.uu.net
GCS d?@ H s++: !g p? !au a w v US+++$ C++$ P+ 3 L E++ N+(!N) K- W M V- po- Y+ t+ !5 j R G? tv-- b++ !D B- e>* u+(**)@ h--- f+ r++ n+ y+*
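To make the codeset dependence of case-folding concrete, here is a minimal Python sketch. It is an illustration only, not anything proposed on bind-workers. It assumes Python's standard `iso8859_1` and `iso8859_5` codecs and uses `str.lower()` as a stand-in for "case-folding"; the function names `fold_octet` and `fold_name` are hypothetical.

```python
def fold_octet(octet: int, codeset: str) -> int:
    """Lowercase one octet under the given 8-bit codeset (illustrative only).

    If the octet is undefined in the codeset, or its lowercase form is not a
    single octet in that codeset, the octet is returned unchanged.
    """
    raw = bytes([octet])
    try:
        lowered = raw.decode(codeset).lower().encode(codeset)
    except (UnicodeDecodeError, UnicodeEncodeError):
        return octet
    return lowered[0] if len(lowered) == 1 else octet


def fold_name(name: bytes, codeset: str) -> bytes:
    """Apply fold_octet to every octet of a name."""
    return bytes(fold_octet(o, codeset) for o in name)


if __name__ == "__main__":
    # The same two octets, interpreted under two different codesets.
    label = b"\xd0\xb0"
    for codeset in ("iso8859_1", "iso8859_5"):
        print(codeset, fold_name(label, codeset).hex())
```

Under ISO 8859-1 the octets 0xD0 0xB0 are a capital Eth and a degree sign, so they fold to 0xF0 0xB0. Under ISO 8859-5 they are a lowercase and a capital Cyrillic A, so they fold to 0xD0 0xD0. Without knowing the codeset, neither result can be called correct.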
Received on Monday, 4 March 1996 18:11:52 UTC