Re: Standardizing on IDNA 2003 in the URL Standard

--On Friday, August 23, 2013 12:19 +0200 Mark Davis ☕
<mark@macchiato.com> wrote:

> There are two different issues.
> 
> A. The mapping is purely a client-side issue, and is allowed
> by IDNA2008. So that is not a problem for compatibility.

Agreed, with a few qualifications.   First, for reasons
explained by others in this thread, IDNA2008 allows mapping to
correspond to well-understood local needs.  Global and
non-selective use of the same mapping in every instance of a
particular browser, or by all browsers, is inconsistent with
that intent.  That distinction is purely philosophical in the
vast majority of cases but may be quite important to the
exceptions; we should not lose track of it.  Second, UTR46 uses
the terms "ToASCII" and "toUnicode" to describe operations that
are subtly different from the "ToASCII" and "ToUnicode" of
IDNA2003.  That invites a different type of confusion and claims
of compatibility where interoperation doesn't exist.  IMO, UTR46
and our general situation would benefit from changes in that
terminology.  In addition, while a large and important fraction
of IDNA2003's Nameprep profile of StringPrep is identical to
NFKC compatibility mapping, that is NFKC mapping as of Unicode
3.2.  Even if one uses UTR46 or some other set of rules to
preserve the Unicode 3.2-based NFKC mappings, it would probably
be appropriate to have a serious discussion of whether the needs
of the user and implementer communities are better served by
applying NFKC (and potentially Case Folding) to characters added
after Unicode 3.2.  In addition to the purely theoretical
concerns, NFKC maps certain little-used Han characters onto
others.  IDNA2008 disallows those characters, leaving the option
of permitting some of them (with little disruption) open in the
future if the language communities are convinced that they are
important.  Mapping them out as soon as they appear in Unicode
would then leave us with a new version of the Eszett problem as
well as the risk that IDNA and UTR46 would diverge on how they
are handled.
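
A minimal sketch of the version-skew point above, assuming Python's
standard unicodedata module (which ships both current Unicode data and
a frozen Unicode 3.2 copy).  The three sample characters are
illustrative choices only; in particular, U+F900 is just one example of
a Han compatibility character that normalization replaces, not
necessarily one of the characters the paragraph has in mind.

```python
import unicodedata

# Current Unicode data shipped with the interpreter, and the frozen
# Unicode 3.2 data that the module keeps for applications such as IDNA.
current = unicodedata
frozen = unicodedata.ucd_3_2_0

def nfkc_report(ch):
    """Return (category in 3.2, category now, NFKC result now) for one character."""
    return (
        frozen.category(ch),
        current.category(ch),
        current.normalize("NFKC", ch),
    )

# U+FF21 FULLWIDTH LATIN CAPITAL LETTER A: assigned in 3.2; NFKC maps it to "A".
fullwidth_a = nfkc_report("\uff21")

# U+1E9E LATIN CAPITAL LETTER SHARP S: added after 3.2, so the 3.2 data
# reports it as unassigned ("Cn") while current data knows it as a letter.
capital_sharp_s = nfkc_report("\u1e9e")

# U+F900 CJK COMPATIBILITY IDEOGRAPH-F900: normalization replaces it
# with the unified ideograph U+8C48.
han_compat = nfkc_report("\uf900")

for name, row in [("U+FF21", fullwidth_a),
                  ("U+1E9E", capital_sharp_s),
                  ("U+F900", han_compat)]:
    print(name, row)
```

Whether a post-3.2 character should pick up the current NFKC result or
stay outside the mapping entirely is exactly the policy question
raised above; the sketch only shows that the two data sets disagree
about such characters.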

> The most important feature of 'no mapping' IMO is on the
> registry side: to make certain that registries either disallow
> mapping during the registration process, or that they very
> clearly show that the resulting domain name is different than
> what the user typed. While an orthogonal issue to the
> client-side we're discussing here, it is worth a separate
> initiative.

Agreed.  Most of that initiative has been underway since before
IDNA2008 was approved, although application to FQDNs raises other
issues (see below).

> B. The transitional incompatibilities are:
> 
>    1. Non-letter support
>    2. 4 deviation characters
> 
> Both of these are just dependent on registry adoption. The
> faster that happens, the shorter the transition period can be.
> Note the transition for each of these is independent, and can
> proceed on a different timescale. Moreover, terminating the
> transition period doesn't need all registries to buy in.

Good.  The question is how many.  I wish, probably more often
than most, that the situation was still as it was when (and
before) RFC 1591 was published in 1994 and it was realistic to
believe that a requirement could be imposed, top-down and
recursively, on all DNS nodes.  That situation no longer exists:
some decisions are made on the basis of short-term economic
interests (including the costs of trying to monitor and enforce
rules) and others are made because "registries" (zone
administrators) are too busy with other priorities to pay
attention.  That, in turn, leaves us with a nasty
chicken-and-egg problem:  from one point of view, it is easy to
say "transition ends when most of the registries enforce the
IDNA2008 rules".  From another, the problem looks more like
"most registries will enforce the rules only when not doing so
becomes painful, i.e., when their users/customers complain that
the names they are using are not predictably accessible".  If we
end up with an environment in which everyone is waiting for
everyone else, the losers are the users of the Internet.

>    1. The TR46 non-letter support can be dropped in clients once
>    the major registries disallow non-IDNA2008 URLs. I say URLs,
>    because the registries need to not only disallow them in SLDs
>    (eg http://☃.com), they *also* need to forbid their
>    subregistries from having them in Nth-level domains (that is,
>    disallow http://☃.blogspot.ch/ = xn--n3h.blogspot.ch).

See above.  It is a reality of our current situation that
"forbidding" for the DNS is ineffective, just as an effort by
IETF to "require" conformance to its standards or one by the
Unicode consortium to "forbid" applications from designing and
quietly adopting and applying a fifth normalization form would
be ineffective.  We can, at most, try to persuade.

Also, as part of my mini-campaign for consistent terminology and
its consistent use, the DNS community would describe what you
are talking about as fully-qualified domain names (FQDNs) in the
domain-part of URLs.   When you use the term "URL" instead, you
include the path, query, and fragment parts of URLs.   As others
have pointed out, the use of non-ASCII characters is popular in
those tail elements in many parts of the world and queries can,
and often do, contain domain names.  To the extent that is a
problem, it is not our problem -- neither IDNA2008 (including
RFC 5895) nor UTR 46 addresses it.
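
To make the FQDN-versus-URL distinction concrete, here is a rough
sketch using Python's urllib.parse and the built-in "punycode" codec.
The helper names (to_a_label, host_to_ascii) and the sample path and
query are hypothetical, and the conversion is raw Punycode with the
"xn--" prefix: it deliberately performs no IDNA2003, IDNA2008, or
UTR 46 mapping or validity checking.

```python
from urllib.parse import urlsplit

def to_a_label(label):
    """Raw Punycode plus ACE prefix for a non-ASCII label (no validity checks)."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")

def host_to_ascii(url):
    """Convert only the host (FQDN) of a URL; return host, path, and query.

    The path and query are returned exactly as given, because nothing in
    the domain-name rules applies to them.
    """
    parts = urlsplit(url)
    host = ".".join(to_a_label(label) for label in parts.hostname.split("."))
    return host, parts.path, parts.query

# Hypothetical URL: the host label comes from the quoted example; the
# path and query are made up to show non-ASCII text in the tail.
host, path, query = host_to_ascii("http://☃.blogspot.ch/雪?site=☃.com")
print(host)   # the only part converted
print(path)   # left untouched
print(query)  # left untouched, even though it contains a domain name
```

The domain name inside the query string passes through unchanged,
which is the case described above as outside the scope of these
specifications.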

>    2. The TR46 deviation character support can be dropped in
>    clients once the major registries that allow them provide a
>    bundle or block approach to labels that include them, so that
>    new clients can be guaranteed that URLs won't go to a
>    different location than they would under IDNA2003. The
>    bundle/block needs to last while there are a significant
>    number of IDNA2003 clients out in the world. Because newer
>    browsers have automatic updates, this can be far faster than
>    it would have been a few years ago.

As a strategy, I believe that "bundle or block" is the right
thing to do and that it would be better to not have similar
FQDNs that identify different systems ("go to different
locations" is a little web-specific for my taste).  However,
that is part of the far more general set of "similarity",
"confusability", and "variant" problems that continue to tie
ICANN in knots.  Viewing the handful of "deviation characters"
as special involves picking out a tiny fraction of the problem
and assuming it is worth solving separately.  Many of the
entities that have to deal with the whole system, including
ICANN and many "major registries", just don't see things that
way because they see any general adoption of "bundle or block"
rules as involving important economic and user demand tradeoffs,
not as a technical matter associated with IDNA2003 -> IDNA2008
transition.
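
For readers who want to see why the deviation characters get singled
out at all, a small sketch follows.  It assumes Python's built-in
"idna" codec (which implements IDNA2003) and uses raw Punycode as a
stand-in for an IDNA2008-style lookup that keeps the Eszett; the
function names and the needs_bundle_or_block test are hypothetical and
are not taken from any registry's actual policy.

```python
def idna2003_lookup(domain):
    """Name an IDNA2003 client looks up (Python's "idna" codec is IDNA2003)."""
    return domain.encode("idna").decode("ascii")

def unmapped_lookup(domain):
    """Stand-in for an IDNA2008-style lookup: lowercase, then raw Punycode.

    No IDNA2008 validity rules are applied; this only shows what happens
    when a deviation character is kept instead of being mapped away.
    """
    labels = []
    for label in domain.lower().split("."):
        if label.isascii():
            labels.append(label)
        else:
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(labels)

def needs_bundle_or_block(domain):
    """True when old and new clients would query different names."""
    return idna2003_lookup(domain) != unmapped_lookup(domain)

for name in ["faß.de", "example.de"]:
    print(name, idna2003_lookup(name), unmapped_lookup(name),
          needs_bundle_or_block(name))
```

The same comparison says nothing about the much larger set of
similar-looking or variant names, which is the reason for doubting
that this slice of the problem can be solved on its own.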
 
I have one other major concern about UTR 46.  More because of
the way it is written, with its own tables and operations and
use of IDNA2003 terminology, rather than its intent, it can
easily be interpreted as a substitute for IDNA2008 (with the
latter used only as a final check on label validity) rather than
a mapping and transitional add-on for it.  Since many of us seem
to be in agreement that it should ultimately be a collection of
IDNA2008-conformant mapping rules, it seems to me that the
specification would be stronger if it were constructed more as a
"migrating to IDNA2008" one than as a "migrating [reluctantly?]
away from IDNA2003" one.  Changing the terminology and tone a
bit could go a long way in that direction.

Again, I see most of these issues as being more about details
and presentation than about fundamentals.  If Mark were
interested in forming a small editorial group to make changes
along the lines I've outlined, and thought it would be useful,
I'd be happy to join in the effort.

best,
    john

Received on Friday, 23 August 2013 15:13:43 UTC