Re: 8 bit characters in DNS names (and URNs?)

Alexander Dupuy (dupuy@smarts.com)
Tue, 5 Mar 1996 15:26:03 -0500


Date: Tue, 5 Mar 1996 15:26:03 -0500
From: dupuy@smarts.com (Alexander Dupuy)
Message-Id: <9603052026.AA01258@just.smarts.com>
To: martin@terena.nl, wg-i18n@terena.nl, keld@dkuug.dk
Subject: Re: 8 bit characters in DNS names (and URNs?)
Cc: uri@bunyip.com

> I am the editor of an ISO standard where we are defining
> a format for cultural conventions building on the POSIX
> locales and charmaps. Included will be a standard locale with
> mapping tables between lower and upper case for the whole of 10646. 
> This locale will be freely available on the net together with
> charmaps for more than 100 coded character sets. Data is already
> available that is similar to this, but not yet complete over the full 10646.

I'm glad to hear that this is being done; it is a useful effort.

> Alexander also writes that the uppercase mapping is culturally sensitive.
> This is correct, but there is a great majority of cultures
> that have the same toupper() specifications. In most cultures a
> Latin small e with acute is capitalized into a capital e with acute.
> Likewise with a small Greek omega - it is capitalized into a capital
> Greek omega. The only exception I can think of is in Turkish:
> <i without dot> has uppercase <I>, and <i> is capitalized into <I with dot>.
> Then some say that in French they never use capitalized accented letters,
> but that seems not to be the rule, according to official French sources.

One definition of "case-insensitive" comparison is "comparison after strings
have been converted to uppercase".  I'm not sure that this is the best
definition for the purposes of DNS domain names or URNs.  For DNS, the initial
motivation for case-insensitive naming may simply have been to avoid fights
between people who believe in proper capitalization of organizations vs.
all-lowercase fanatics.  For URNs, case-insensitivity was primarily introduced to
avoid transcription errors; this motivation applies to DNS names as well.
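
To make the "compare after uppercasing" definition concrete, here is a rough
Python sketch.  The names same_name, upper_default and upper_turkish, and the
small TURKISH_UPPER table, are made up for illustration; they are not part of
any DNS or URN specification.  The point is only that the answer depends on
which uppercase mapping you pick.

```python
# Hypothetical Turkish uppercase rules (an assumption for illustration):
# dotted i -> dotted capital I (U+0130), dotless i (U+0131) -> plain I.
TURKISH_UPPER = {"i": "\u0130", "\u0131": "I"}


def upper_default(s):
    """Uppercase using Python's locale-independent Unicode mapping."""
    return s.upper()


def upper_turkish(s):
    """Uppercase using the Turkish rules above, default mapping otherwise."""
    return "".join(TURKISH_UPPER.get(ch, ch.upper()) for ch in s)


def same_name(a, b, to_upper=upper_default):
    """Case-insensitive comparison defined as 'equal after uppercasing'."""
    return to_upper(a) == to_upper(b)


if __name__ == "__main__":
    print(same_name("smarts.com", "SMARTS.COM"))
    # Same pair of strings, two different answers depending on the mapping:
    print(same_name("bilgi", "BILGI"))
    print(same_name("bilgi", "BILGI", to_upper=upper_turkish))
```

With the default mapping "bilgi" and "BILGI" match; with the Turkish mapping
they do not, because "bilgi" uppercases to a string with dotted capital I.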

The example of Turkish capitalization of i and dotless-i to dotted-I and I
leads me to think that ignoring accent marks is the best thing to do.
Remember that the canonical representation of the URN or DNS name is in the
original (correct) capitalization, so readability shouldn't suffer.  The
tradeoff is increased ability to recognize names even in the face of incorrect
capitalization or accents vs. the ability to have distinct names which differ
only in their capitalization or accents.

For most languages using accent marks, the loss of this distinction shouldn't
be a problem; the only exception I could imagine would be Vietnamese, where
the loss of tonal markings might create an unacceptable number of "false
homonyms".  In the case of Vietnamese, there is a dedicated code set, so it
could be specified that case mapping for this code set (only) would take
accent marks into account.
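
One possible way to sketch the accent- and case-insensitive comparison
described above, in Python.  This assumes Unicode (10646) strings and uses
canonical decomposition to separate base letters from accent marks; fold_key
is a made-up name, and this is only one of several ways the folding could be
defined, not a proposal for the exact rule.

```python
import unicodedata


def fold_key(name):
    """Return a comparison key that ignores case and accent marks.

    Sketch only: uppercase, decompose (NFD), then drop combining marks.
    """
    decomposed = unicodedata.normalize("NFD", name.upper())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    # All four Turkish i variants collapse to the same key.
    print({fold_key(ch) for ch in ["i", "I", "\u0131", "\u0130"]})
    # Vietnamese syllables that differ only in tone marks also collapse,
    # which is the "false homonym" problem.
    print({fold_key(w) for w in ["ma", "m\u00e1", "m\u00e0", "m\u00e3"]})
```

The canonical name itself is never changed; the key is used only for lookup,
so the original capitalization and accents are still what gets displayed.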

> I am confident that the uppercase mapping should not be a problem.
> But I am not sure that we should do this just as an enhancement 
> in DNS. Anyway one way to do it would be to say that 
> the entry should be in UTF-8, and we could define a new RR type  to
> do this. URLs could then first look there and if not found look
> in the normal RRs. I am not sure it is the right time to make
> such specifications, though.

Creating a new RR type doesn't have anything to do with this problem, which is
about the case-insensitive comparison of DNS *names*, regardless of the RR
type or the encoding of the contents of RRs.  DNS names are full binary types,
and can be encoded in any way you like (although the only encoding in use
today is ASCII).
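
As a small illustration of names being binary: if the comparison folds case
only for ASCII letters (my understanding of how current DNS comparison
behaves, stated here as an assumption), then 8-bit octets pass through
untouched and accented upper/lower pairs stay distinct.  The function name
dns_names_equal is made up for this sketch.

```python
def dns_names_equal(a: bytes, b: bytes) -> bool:
    """Compare two names as raw octets, folding case for ASCII letters only.

    bytes.lower() changes only the octets for A-Z; any octet >= 0x80 is
    left exactly as it is.
    """
    return a.lower() == b.lower()


if __name__ == "__main__":
    print(dns_names_equal(b"Smarts.COM", b"smarts.com"))
    # 0xC9 and 0xE9 are capital and small e-acute in Latin-1, but to an
    # ASCII-only comparison they are just two different octets.
    print(dns_names_equal(b"caf\xc9.fr", b"caf\xe9.fr"))
```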

UTF-8 encoding might still be helpful here, to avoid problems with multiple
8-bit encodings.  For example, an Icelandic domain name containing Latin-1
thorn might appear on a Turkish Latin-3 display as s^; a naive Turkish user
might leave the accent off the s (knowing that accent marks are ignored) and
be surprised that the name can't be found.  With a single encoding, the
thorn character, which couldn't be displayed properly on the Turkish screen,
would show up as a hex code instead.
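
The thorn example can be reproduced with Python's standard codecs.  This is
just a sketch of the octet-level ambiguity; the variable names are made up,
and "iso8859_3" is Python's codec name for Latin-3.

```python
# The single octet that Latin-1 uses for small thorn.
thorn_octet = bytes([0xFE])

as_latin1 = thorn_octet.decode("latin-1")    # what the Icelandic owner meant
as_latin3 = thorn_octet.decode("iso8859_3")  # what a Latin-3 display shows

# In UTF-8 the same character has one unambiguous two-octet encoding.
thorn_utf8 = as_latin1.encode("utf-8")

if __name__ == "__main__":
    print(hex(ord(as_latin1)))  # code point of thorn
    print(hex(ord(as_latin3)))  # code point of s with circumflex
    print(thorn_utf8.hex())     # UTF-8 octets for thorn
```

The same octet decodes to two different letters depending on the display's
code set, whereas the UTF-8 form identifies thorn regardless of the display.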

@alex
--
inet: dupuy@smarts.com
Member of the League for Programming Freedom -- write to lpf@uunet.uu.net
GCS d?@ H s++: !g p? !au a w v US+++$ C++$ P+ 3 L E++ N+(!N) K- W M V- po- Y+
     t+ !5 j R G? tv-- b++ !D B- e>* u+(**)@ h--- f+ r++ n+ y+*