Re: Standardizing on IDNA 2003 in the URL Standard from John C Klensin on 2013-08-20 (public-iri@w3.org from August 2013)

From: John C Klensin <klensin@jck.com>
Date: Tue, 20 Aug 2013 15:33:45 -0400
To: Marcos Sanz <sanz@denic.de>, Anne van Kesteren <annevk@annevk.nl>
cc: Shawn Steele <Shawn.Steele@microsoft.com>, public-iri@w3.org, uri@w3.org, Peter Saint-Andre <stpeter@stpeter.im>, Mark Davis ☕ <mark@macchiato.com>, idna-update@alvestrand.no, Vint Cerf <vint@google.com>, "www-tag.w3.org" <www-tag@w3.org>
Message-ID: <9B194C30AF1839167A9EA490@JcK-HP8200.jck.com>
--On Tuesday, August 20, 2013 15:55 +0200 Marcos Sanz
<sanz@denic.de> wrote:

> idna-update-bounces@alvestrand.no wrote on 20/08/2013 14:32:23:
> 
>> On Mon, Aug 19, 2013 at 9:32 PM, Shawn Steele
>> <Shawn.Steele@microsoft.com> wrote:
>> > I concur.  We use the IDNA2008 + TR46 behavior.
>> 
>> Interesting. Last I checked Internet Explorer that was not
>> the case.
> 
> At this side of the keyboard, ß is still not supported in
> IE10/Win7-SP1

But that is completely consistent with IDNA2008 + UTR46 when the
most IDNA2003-like profile (or, if you prefer, stage of
transition) of UTR46 is used.  One can debate endlessly whether
UTR46 is a good idea (and the IDNABIS WG did), but ultimately
[1] it was intended to provide an environment as much like that
of IDNA2003 as possible.  That includes:
 
-- strict backward compatibility with the interpretation
 of strings that are valid with either IDNA2003 or
 IDNA2008, and
 
-- continued support for strings that were valid in
 IDNA2003 but that mapped into other strings before being
 converted to ASCII strings using Punycode, where those
 target strings are valid under IDNA2008

If one accepts that kind of compatibility as a primary goal,
then the fact that "ß" was mapped to "ss" in IDNA2003 means
that mapping must be preserved forever and one will never [2]
actually be able to store an Eszett in the DNS.  
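
A minimal sketch of that mapping, assuming Python's standard-library
"idna" codec (which implements IDNA2003 via Nameprep).  The second
value only shows the Punycode step an IDNA2008-style lookup would
perform on the unmapped label; it skips the IDNA2008 validity checks,
and the variable names are illustrative.

```python
# Python's built-in "idna" codec implements IDNA2003 (RFC 3490 + Nameprep).
label = "straße"

# IDNA2003 ToASCII: Nameprep maps U+00DF to "ss" before any Punycode step,
# so the result is plain ASCII and the Eszett never reaches the DNS.
idna2003 = label.encode("idna")

# IDNA2008-style encoding (sketch only): the label is left unmapped and
# Punycode-encoded behind the "xn--" prefix. No IDNA2008 validation is done.
idna2008_alabel = b"xn--" + label.encode("punycode")

print(idna2003)         # b'strasse'
print(idna2008_alabel)  # b'xn--strae-oqa'
```

The two results are different DNS names, which is why preserving the
old mapping and storing an Eszett are mutually exclusive.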

The bottom line, at least IMO, is that one can adopt either of
two philosophical models.   In one, whatever decisions were made
in building the IDNA2003 standard and the name strings those
decisions allowed are inviolable.  Arguments that errors were
made, that those strings create risks, or that the rules
prohibit orthographically-reasonable strings are simply
irrelevant if they conflict with absolute compatibility.  The
other (at the risk of showing my biases) is to assume that we are
human, that mistakes will get made, and that, if they are
significant, we should figure out how to correct them and move
on.  

As others have suggested, the latter includes realizing that
some labels and practices that were allowed under IDNA2003 were
simply a bad idea and we should move away from them as soon as
possible rather than encouraging their use in even more
contexts.  Coming back to the comment that started this note, it
also means that, if the relevant language communities decide,
for example, that Eszett is important as a character or that
zero-width joiners and non-joiners are critical, we need to
figure out how to accommodate them even if the accommodation is
not perfect and doesn't solve all problems.  And, in each case,
we need to remember that the Internet is growing and reaching
more communities and more people within almost every community,
making transition now, even if painful, much less painful than
transition in the future.

FWIW, without at least some measure of the latter model, we
would be stuck with HTTP 1.0, HTML 1 (or at least 3), and ISO
8859-1 forever.  The decision to interpret a string of non-ASCII
octets in content as, by default, a good candidate for UTF-8
rather than Latin-1 is, at least IMO, ultimately an incompatible
change of far more sweeping impact and consequences than this
IDNA2003 -> IDNA2008 transition.

In an odd way, while I would have preferred to see a much more
rapid transition, I think that exactly what should be happening
is happening.  The various registries --both the
ICANN-supervised ones and many others at the root and various
other levels-- are prohibiting (and not renewing) strings that
do not conform with IDNA2008.  Registries that want to support
labels that are problematic from a transition standpoint have
devised, or are devising, procedures to lower the odds of
strings that pose difficulties falling into hostile hands, just
as many of them do for potentially-confusing strings.  The right
time to transition systems that look up names involves tricky
questions including the "pain now or more pain later"
considerations mentioned above.   And where UTR 46 and/or RFC
5895 fit into transition strategies (as distinct from localized
mapping strategies), or not, is obviously part of that
transition question.

Anne, coming back to your original question, I don't know what
question you and your colleagues asked that got the "everyone is
still on IDNA2003" answer.  Especially given the information
from Microsoft, I suspect it was close to "are you fully
supporting IDNA2008" for which a "no" answer might lead to a
"using IDNA2003" answer despite their telling us that they are
running IDNA2008 with UTR 46.  Others have pointed out that
"IDNA2003 with the version restriction eliminated" may be a
sensible statement in individual cases but, because the Nameprep
profile of Stringprep is not simply Unicode Case Folding plus
NFKC, it leaves enough open to local interpretation that it is
not a plausible candidate for a statement in a standard that is
intended to promote interoperability. 
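
One concrete difference can be sketched as follows, assuming
Python's standard library, whose encodings.idna.nameprep helper
implements the Nameprep profile.  It shows only a single point of
divergence (Nameprep's "map to nothing" step, here applied to a
zero-width joiner); it is not a complete comparison, and the
sample label is made up for illustration.

```python
import unicodedata
from encodings.idna import nameprep

# Hypothetical label: "a", ZERO WIDTH JOINER (U+200D), "b".
label = "a\u200db"

# Naive approximation: Unicode case folding followed by NFKC.
approx = unicodedata.normalize("NFKC", label.casefold())

# Nameprep also applies Stringprep's "map to nothing" table (B.1),
# which deletes U+200D before normalization.
prepped = nameprep(label)

print(repr(prepped))        # 'ab'
print("\u200d" in approx)   # True: the joiner survives casefold + NFKC
```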

Against that backdrop, I believe you should interpret what you
are seeing, not as "everyone is committed to IDNA2003"
(obviously not true as soon as exceptions are introduced) or
"IDNA2003 with exceptions forever" but as a slow transition.  If
you want a standard that works going forward, make the
assumption that the folks who designed IDNA2008 were not fools
and that browsers should be moving, and eventually will move
(unless you discourage them) in the IDNA2008 direction.  Whether
you want to discuss transition or not is up to you.  If you want
to follow Mark's recommendation (and Microsoft's lead) and
suggest IDNA2008 plus UTR 46, I suggest you do so in a way that
really constitutes a transition strategy rather than an "IDNA
2003 forever" one, i.e., that you address the issues of when
"transition processing" gets turned off and the localization
issues (especially about case folding) mentioned by others.  If
not, you and your working group put us all at risk of many
internationalized email applications working differently than
web browsers do, in a fork between IETF and W3C i18n standards,
divergence between assumptions and norms used by those who
create DNS names and those who look them up, and so on.  I hope
we can agree that those would be bad outcomes.

regards,
    john

 -----------

[1] I hope Mark will more or less agree with this
characterization; it is as accurate and neutral as I know how to
make it.

[2] This is associated with one of the key criticisms of UTR 46
that has not been discussed so far:  It has been described as a
transition strategy, but there is really no mechanism in it for
deciding when to adopt the IDNA2008 model and rules in favor of
strict backward-compatibility with as many names that were valid
under IDNA2003 as possible.   In reality, saying "we use UTR 46"
or "we conform to UTR 46" is somewhat underspecified because UTR
46 can be used strictly for local mapping, with what it calls
"transition processing" (which is where Eszett disappears),
and/or with other optional features such as flagging, but
continuing to look up, strings that contain punctuation or
symbol characters.  Either of those latter options makes a
so-called "IDNA2008 + UTR46" implementation non-conforming with
IDNA2008.
Received on Tuesday, 20 August 2013 19:34:25 UTC