Re: [charmod-norm] Provide descriptions of Unicode case folding from John C Klensin on 2015-08-06 (www-international@w3.org from July to September 2015)

From: John C Klensin <john+w3c@jck.com>
Date: Thu, 06 Aug 2015 11:48:08 -0400
To: Andrew Sullivan <ajs@anvilwalrusden.com>, www-international@w3.org
Message-ID: <45D3E1B8E143202396C6A505@JcK-HP8200.jck.com>

--On Wednesday, August 05, 2015 15:36 -0400 Andrew Sullivan
<ajs@anvilwalrusden.com> wrote:

>> Long term, if the majority of text in Cherokee is in the
>> (new) lowercase, would it be awkward to force them to use the
>> uppercase for idns?
> 
> Well, at the moment, IDNA2008 is frozen in pre-Unicode-7
> because of a different problem, so the issue will be academic
> until then.

It is perhaps worth remembering that we have been through almost
the same thing before, ending up in a disagreement between the
strong preference of the user community (who were more numerous
than speakers of Cherokee) and advocates of what I hope I'm not
mischaracterizing as "stability and forward and backward
compatibility no matter what".

For those who don't know the example, the earlier version of
IDNA supported (and required) case mapping, based on the Unicode
"language independent case-folding" algorithm.  Because there
was no upper case version of the character "Sharp S" (Eszett,
U+00DF), that mapping process turned it into the common basic
Latin representation, "ss", effectively making Eszett unusable in
IDNA domain names even though users could successfully type it
in many contexts.  During the time IDNA2008 was being developed,
mandatory mappings were dropped to guarantee a one-to-one mapping
between natural character ("U-label") and ASCII-encoded
("A-label", Punycode-encoded) forms and a code point was
assigned to an uppercase representation of Eszett.  The latter
could have been used to case-fold U+00DF to itself but was not,
for stability reasons.  With considerable guidance (one might
even say "pressure") from the German-speaking community
including both users and DNS registrars and registries in
Germany, the IDNA WG decided to allow Eszett as a permitted
character in IDN labels, thereby creating an incompatibility
with strings that apparently contained Eszett but where it was
mapped to "ss" under IDNA2003 and a consequent transition
problem.  
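
The Eszett behavior described above can be reproduced roughly as
follows.  This is only a minimal sketch in Python, assuming that
the standard library's "idna" codec follows the IDNA2003 (RFC
3490) rules and that str.casefold() applies Unicode full case
folding; the variable names are illustrative and not taken from
any specification.  The last step shows only the Punycode
encoding of the label, not full IDNA2008 validation.

```python
# Illustrative sketch; names below are hypothetical, not from any spec.
ESZETT = "\u00DF"          # LATIN SMALL LETTER SHARP S
CAPITAL_ESZETT = "\u1E9E"  # LATIN CAPITAL LETTER SHARP S

# Unicode case folding still maps Eszett to "ss", not to itself.
folded = ESZETT.casefold()

# The later-assigned capital form lowercases to U+00DF.
lowered_capital = CAPITAL_ESZETT.lower()

label = "stra\u00DFe"

# IDNA2003-style processing (assumed: Python's "idna" codec) maps
# Eszett away, so the resulting label is plain ASCII.
idna2003_label = label.encode("idna")

# Without the mapping, the label keeps Eszett and gets its own
# Punycode form, which is a different name in the DNS.
a_label = b"xn--" + label.encode("punycode")

print(folded, lowered_capital, idna2003_label, a_label)
```

The two results for the same typed string differ, which is the
transition problem in miniature.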

At least in part because one of the recommendations about how to
handle that transition has been widely interpreted as "just
don't do it, continue to map Eszett to 'ss' forever", describing
that change as "awkward" or "disruptive" would probably be
understatements.

> But a major change to case folding behaviour between Unicode
> versions would be pretty disruptive to any identifier system,
> yeah.

Given the above difficulties caused by a single character
change, the consequences of a change for an entire script if the
same pattern were repeated are hard to contemplate.  While I
hope we can do better, the odds are that the same process would
play out: the Cherokee user community would want lower-case for
consistency with familiar patterns and everyone else, the DNS
community would be likely to listen to the demands of their
likely customers (the Cherokee user community and those trying
to appeal to them) and would note that the number of present
registrations in Cherokee is quite low relative to their
projections and expectations, and the parts of the web browser
and developer communities who believe in absolute stability
would apply that view.

So, "pretty disruptive" indeed.

    john

Received on Thursday, 6 August 2015 15:48:39 UTC