- From: John C Klensin <klensin@jck.com>
- Date: Fri, 23 Aug 2013 11:13:09 -0400
- To: Mark Davis ☕ <mark@macchiato.com>, Vint Cerf <vint@google.com>
- cc: Anne van Kesteren <annevk@annevk.nl>, IDNA update work <idna-update@alvestrand.no>, "PUBLIC-IRI@W3.ORG" <public-iri@w3.org>, uri@w3.org, "www-tag.w3.org" <www-tag@w3.org>
--On Friday, August 23, 2013 12:19 +0200 Mark Davis ☕ <mark@macchiato.com> wrote:

> There are two different issues.
>
> A. The mapping is purely a client-side issue, and is allowed
> by IDNA2008. So that is not a problem for compatibility.

Agreed, with a few qualifications.

First, for reasons explained by others in this thread, IDNA2008 allows mapping to correspond to well-understood local needs. Global and non-selective use of the same mapping in every instance of a particular browser, or by all browsers, is inconsistent with that intent. That distinction is purely philosophical in the vast majority of cases but may be quite important to the exceptions; we should not lose track of it.

Second, UTR46 uses the terms "ToASCII" and "ToUnicode" to describe operations that are subtly different from the "ToASCII" and "ToUnicode" of IDNA2003. That invites a different type of confusion and claims of compatibility where interoperation doesn't exist. IMO, UTR46 and our general situation would benefit from changes in that terminology.

In addition, while a large and important fraction of IDNA2003's Nameprep profile of StringPrep is identical to NFKC compatibility mapping, that is NFKC mapping as of Unicode 3.2. Even if one uses UTR46 or some other set of rules to preserve the Unicode 3.2-based NFKC mappings, it would probably be appropriate to have a serious discussion of whether the needs of the user and implementer communities are better served by applying NFKC (and potentially Case Folding) to characters added after Unicode 3.2. Beyond the purely theoretical concerns, NFKC maps certain little-used Han characters onto others. IDNA2008 disallows those characters, leaving open the option of permitting some of them (with little disruption) in the future if the language communities are convinced that they are important. Mapping them out as soon as they appear in Unicode would then leave us with a new version of the Eszett problem, as well as the risk that IDNA and UTR46 would diverge on how they are handled.
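(Purely as an illustration of the version point, not a claim about what any client actually does: CPython happens to ship a frozen Unicode 3.2 character database alongside its current one precisely because StringPrep/Nameprep require it, so the difference is easy to poke at.)

    # Illustration only: NFKC as Nameprep sees it (Unicode 3.2 tables)
    # versus NFKC under whatever Unicode version the interpreter ships.
    import unicodedata

    ucd32 = unicodedata.ucd_3_2_0   # Unicode 3.2 database, kept for StringPrep

    # Compatibility mappings Nameprep inherits from NFKC, e.g. fullwidth 'A':
    print(unicodedata.normalize("NFKC", "\uFF21"))   # -> 'A' (current tables)
    print(ucd32.normalize("NFKC", "\uFF21"))         # -> 'A' (3.2 tables agree here)

    # A CJK compatibility ideograph is folded onto another Han character by
    # normalization; IDNA2008 disallows it rather than mapping it:
    print(unicodedata.normalize("NFKC", "\uF900"))   # U+F900 -> U+8C48

    # Characters added after Unicode 3.2 are untouched by the 3.2 tables, so a
    # mapping frozen at 3.2 and one tracking current NFKC can drift apart.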
> The most important feature of 'no mapping' IMO is on the
> registry side: to make certain that registries either disallow
> mapping during the registration process, or that they very
> clearly show that the resulting domain name is different than
> what the user typed. While an orthogonal issue to the
> client-side we're discussing here, it is worth a separate
> initiative.

Agreed. Most of that initiative has been underway since before IDNA2008 was approved, although application to FQDNs raises other issues (see below).

> B. The transitional incompatibilities are:
>
> 1. Non-letter support
> 2. 4 deviation characters
>
> Both of these are just dependent on registry adoption. The
> faster that happens, the shorter the transition period can be.
> Note the transition for each of these is independent, and can
> proceed on a different timescale. Moreover, terminating the
> transition period doesn't need all registries to buy in.

Good. The question is how many. I wish, probably more often than most, that the situation were still as it was when (and before) RFC 1591 was published in 1994, when it was realistic to believe that a requirement could be imposed, top-down and recursively, on all DNS nodes. That situation no longer exists: some decisions are now made on the basis of short-term economic interests (including the costs of trying to monitor and enforce rules), and others are made because "registries" (zone administrators) are too busy with other priorities to pay attention.

That, in turn, leaves us with a nasty chicken-and-egg problem: from one point of view, it is easy to say "transition ends when most of the registries enforce the IDNA2008 rules". From another, the problem looks more like "most registries will enforce the rules only when not doing so becomes painful, i.e., when their users/customers complain that the names they are using are not predictably accessible". If we end up with an environment in which everyone is waiting for everyone else, the losers are the users of the Internet.

> 1. The TR46 non-letter support can be dropped in clients
> once the major registries disallow non-IDNA2008 URLs. I say
> URLs, because the registries need to not only disallow them
> in SLDs (eg http://☃.com), they *also* need to forbid their
> subregistries from having them in Nth-level domains
> (that is, disallow http://☃.blogspot.ch/ =
> xn--n3h.blogspot.ch).

See above. It is a reality of our current situation that "forbidding" for the DNS is ineffective, just as an effort by the IETF to "require" conformance to its standards, or one by the Unicode Consortium to "forbid" applications from designing and quietly adopting and applying a fifth normalization form, would be ineffective. We can, at most, try to persuade.

Also, as part of my mini-campaign for consistent terminology and its consistent use, the DNS community would describe what you are talking about as fully-qualified domain names (FQDNs) in the domain part of URLs. When you use the term "URL" instead, you include the path, query, and fragment parts of URLs. As others have pointed out, the use of non-ASCII characters is popular in those tail elements in many parts of the world, and queries can, and often do, contain domain names. To the extent that is a problem, it is not our problem -- neither IDNA2008 (including RFC 5895) nor UTR 46 addresses it.

> 2. The TR46 deviation character
> support can be dropped in clients once the major registries
> that allow them provide a bundle or block approach to
> labels that include them, so that new clients can be
> guaranteed that URLs won't go to a different location than
> they would under IDNA2003. The bundle/block needs to last
> while there are a significant number of IDNA2003 clients
> out in the world. Because newer browsers have automatic
> updates, this can be far faster than it would have been a
> few years ago.

As a strategy, I believe that "bundle or block" is the right thing to do and that it would be better not to have similar FQDNs that identify different systems ("go to different locations" is a little web-specific for my taste). However, that is part of the far more general set of "similarity", "confusability", and "variant" problems that continue to tie ICANN in knots. Viewing the handful of "deviation characters" as special involves picking out a tiny fraction of the problem and assuming it is worth solving separately. Many of the entities that have to deal with the whole system, including ICANN and many "major registries", just don't see things that way, because they see any general adoption of "bundle or block" rules as involving important economic and user-demand tradeoffs, not as a technical matter associated with the IDNA2003 -> IDNA2008 transition.
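To make the two transition cases above concrete, here is a small sketch (it assumes the third-party Python "idna" package, which implements the IDNA2008 rules plus the UTS 46 mapping; the names used are only examples, and what any particular browser actually does is a separate question):

    # Sketch of the two transition cases, assuming "pip install idna".
    import idna

    # Case 1: non-letters. Under IDNA2003 the snowman had an A-label
    # ('xn--' plus the Punycode of U+2603); under IDNA2008 it is disallowed.
    print(b"xn--" + "\u2603".encode("punycode"))      # b'xn--n3h'
    try:
        idna.encode("\u2603.example")                 # IDNA2008 lookup rules
    except idna.IDNAError as exc:
        print("IDNA2008 rejects it:", exc)

    # Case 2: deviation characters. The same user input can name two
    # different FQDNs depending on which protocol generation the client uses.
    print(idna.encode("faß.de", uts46=True, transitional=True))  # b'fass.de'
    print(idna.encode("faß.de"))                                  # b'xn--fa-hia.de'

That divergence is exactly where "bundle or block" matters: unless the registry treats the two resulting names as a single registration (or refuses one of them), older and newer clients given the same string can end up at different places.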
I have one other major concern about UTR 46. More because of the way it is written (with its own tables and operations, and its use of IDNA2003 terminology) than because of its intent, it can easily be interpreted as a substitute for IDNA2008 (with the latter used only as a final check on label validity) rather than as a mapping and transitional add-on to it. Since many of us seem to be in agreement that it should ultimately be a collection of IDNA2008-conformant mapping rules, it seems to me that the specification would be stronger if it were constructed more as a "migrating to IDNA2008" document than as a "migrating [reluctantly?] away from IDNA2003" one. Changing the terminology and tone a bit could go a long way in that direction.

Again, I see most of these issues as being more about details and presentation than about fundamentals. If Mark were interested in forming a small editorial group to make changes along the lines I've outlined, and thought it would be useful, I'd be happy to join in the effort.

    best,
      john
Received on Friday, 23 August 2013 15:13:43 UTC