8-bit characters in DNS names (and URNs?)

A recently discovered bug in certain applications that fail to validate
hostnames received from DNS has set off a flurry of activity on the
bind-workers mailing list about ways of preventing names like
youre-hosed-`rm -rf /`.crackers.org from being passed through DNS servers.
This is basically a good thing; hostnames are limited to a restricted
character set for good reasons of interoperability, and enforcing these
restrictions will close security holes.
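
For concreteness, here is a rough sketch in C of the sort of check being
discussed, based on the RFC 952/1123 host name rules (labels of ASCII
letters, digits, and hyphens, separated by dots, with no hyphen at the
start or end of a label).  The function name and exact rules are mine,
not taken from any actual BIND patch:

    /* Return 1 if name looks like a legal host name, 0 otherwise.
     * Checks are done against literal ASCII ranges rather than with
     * isalnum(), so that locale settings can't sneak 8-bit
     * "alphanumerics" past the check. */
    static int
    valid_hostname(const char *name)
    {
        const char *label = name;

        if (*name == '\0')
            return 0;
        for (;;) {
            const char *p = label;

            while ((*p >= 'a' && *p <= 'z') || (*p >= 'A' && *p <= 'Z')
                   || (*p >= '0' && *p <= '9') || *p == '-')
                p++;
            if (p == label)
                return 0;               /* empty label */
            if (*p != '\0' && *p != '.')
                return 0;               /* character outside the legal set */
            if (label[0] == '-' || p[-1] == '-')
                return 0;               /* hyphen at edge of label */
            if (*p == '\0')
                return 1;
            label = p + 1;              /* step past the dot */
        }
    }

A name containing a backquote, space, or slash, like the one above, fails
the "character outside the legal set" test.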

However, the approaches implemented so far would restrict not only hostname
records in DNS to the limited character set, but in fact all DNS domain names
(though not the data associated with all RR types).  Paul Vixie, the primary
maintainer of BIND (the most popular DNS implementation), has expressed
willingness not to impose the restrictions on all DNS domain names if and only
if there is a standards-track RFC which specifies exactly which DNS names may
be unrestricted, and how they are to be handled.

Now I certainly don't want to re-raise the issue of non-ASCII URLs; in any
case, URLs use DNS for hostnames, and hostnames represented in DNS will
certainly be restricted to the limited character set specified in the RFCs,
no matter what happens.  On the other hand, recent URN proposals use DNS to
manage the top-level parts of the URN namespace and use new DNS RR types to
implement the URN->URL resolver location process.  Although I feel that
"internationalized" URLs are probably a non-starter, for reasons of
interoperability and the size of the installed base of URL-using applications,
"internationalized" URNs (including the DNS-based part) are still possible,
*if* one problem can be solved.

The problem is one that applies to both DNS and URNs: identifiers in both
namespaces are defined to be "case-insensitive" - a laudable human-interface
goal that makes URNs less likely to be mistranscribed.  However, while
"case-insensitive" is quite well-defined w.r.t. ASCII, it is less well
defined w.r.t. the ISO 8859 codesets, and poorly defined w.r.t. arbitrary
8-bit data.
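
For ASCII, the rule is simple enough to state in a few lines of C: fold
the octets 'A' through 'Z' and compare everything else verbatim.  This
sketch is illustrative, not code from any particular resolver:

    /* ASCII-only case-insensitive comparison: the octets 'A'..'Z'
     * fold to 'a'..'z'; all other octets, including those with the
     * high bit set, are compared verbatim. */
    static int
    ascii_tolower(int c)
    {
        return (c >= 'A' && c <= 'Z') ? c - 'A' + 'a' : c;
    }

    static int
    names_equal(const unsigned char *a, const unsigned char *b)
    {
        while (*a != '\0' && ascii_tolower(*a) == ascii_tolower(*b))
            a++, b++;
        return ascii_tolower(*a) == ascii_tolower(*b);
    }

The trouble, as described below, is that no comparably simple rule exists
once octets with the high bit set can themselves be letters.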

The first issue is that the various ISO 8859 codesets have no consistent
grouping of alphabetic vs. non-alphabetic symbols in the upper half of the
codeset.  As a result, each ISO 8859 codeset has a different case-folding
mapping, so faced with 8-bit data in an unknown ISO 8859 codeset, no
consistent case-folding can be done.  Introducing 16-bit codesets may even
make things worse if they don't restrict the use of octets with the high bit
cleared, since there would then be no way to distinguish between an ASCII
character that should be case-folded and part of a 16-bit character that
shouldn't.  This implies that any solution must identify the codeset in use,
which may be difficult to do for DNS without introducing incompatibilities
with one of the most widely deployed Internet protocols.
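
A concrete illustration of the divergence, using ISO 8859-1 and ISO 8859-2
(the octet values are real; the folding table is a hand-picked fragment,
not a complete one):

    /* The octet 0xA3 is the pound-sterling sign in ISO 8859-1
     * (nothing to fold), but capital L-with-stroke in ISO 8859-2,
     * whose lowercase form sits at 0xB3: an offset of 0x10, not the
     * 0x20 used for ASCII and for the Latin-1 accented letters. */
    enum charset { LATIN_1, LATIN_2 };

    static unsigned char
    fold_octet(unsigned char c, enum charset cs)
    {
        if (c >= 'A' && c <= 'Z')           /* ASCII: same everywhere */
            return c - 'A' + 'a';
        switch (cs) {
        case LATIN_1:
            if (c >= 0xC0 && c <= 0xDE && c != 0xD7)
                return c + 0x20;            /* A-grave .. Thorn */
            return c;                       /* 0xA3 is pound: leave it */
        case LATIN_2:
            if (c == 0xA3)
                return 0xB3;                /* L-stroke -> l-stroke */
            /* ... rest of the Latin-2 table omitted ... */
            return c;
        }
        return c;
    }

Here fold_octet(0xA3, LATIN_1) returns 0xA3 while fold_octet(0xA3, LATIN_2)
returns 0xB3, so two names containing that octet match under one assumption
and differ under the other; without knowing the codeset, a server can't even
tell whether 0xA3 is a letter.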

Even if the codeset in use can be identified, the problem isn't completely
solved.  Different cultures have different ideas about the alphabetization
of accented letters.  In a culture where accented letters are alphabetized
the same as unaccented ones (just as capitalized letters are alphabetized
the same as lowercase ones), users might reasonably assume that a
case-insensitive system would treat the accents as optional.  It's not
entirely clear to me what the correct behavior of a "case-insensitive"
system should be, although I lean towards folding accented characters
together with their unaccented counterparts.
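
To make that concrete, the folding I lean towards might look something
like this for ISO 8859-1 (a hand-built partial table, for illustration
only; a real one would cover all of 0xC0-0xFF and would need checking
against each culture's expectations):

    /* Fold case, then strip accents, so that e.g. "Resume" with
     * accented e's and "RESUME" without them compare equal. */
    static int
    base_letter(unsigned char c)
    {
        if (c >= 'A' && c <= 'Z')
            c += 'a' - 'A';                 /* ASCII upper -> lower */
        else if (c >= 0xC0 && c <= 0xDE && c != 0xD7)
            c += 0x20;                      /* Latin-1 upper -> lower */
        if (c >= 0xE0 && c <= 0xE5) return 'a';  /* a-grave .. a-ring */
        if (c == 0xE7)              return 'c';  /* c-cedilla */
        if (c >= 0xE8 && c <= 0xEB) return 'e';  /* e-grave .. e-umlaut */
        if (c >= 0xEC && c <= 0xEF) return 'i';
        if (c == 0xF1)              return 'n';  /* n-tilde */
        if ((c >= 0xF2 && c <= 0xF6) || c == 0xF8) return 'o';
        if (c >= 0xF9 && c <= 0xFC) return 'u';
        return c;
    }

    static int
    accent_insensitive_equal(const unsigned char *a,
                             const unsigned char *b)
    {
        while (*a != '\0' && base_letter(*a) == base_letter(*b))
            a++, b++;
        return base_letter(*a) == base_letter(*b);
    }

Of course, a Spanish speaker might object that n-tilde is a distinct
letter rather than an accented n, which is exactly the cultural problem
described above.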

Does anyone on this list have suggestions about how this problem could be
dealt with?  Or, alternatively, convincing reasons why it's not an important
problem and can be ignored?

@alex
--
inet: dupuy@smarts.com
Member of the League for Programming Freedom -- write to lpf@uunet.uu.net
GCS d?@ H s++: !g p? !au a w v US+++$ C++$ P+ 3 L E++ N+(!N) K- W M V- po- Y+
     t+ !5 j R G? tv-- b++ !D B- e>* u+(**)@ h--- f+ r++ n+ y+*
