- From: John C Klensin <klensin@jck.com>
- Date: Fri, 23 Aug 2013 08:25:05 -0400
- To: Andrew Sullivan <ajs@anvilwalrusden.com>, Anne van Kesteren <annevk@annevk.nl>
- cc: IDNA update work <idna-update@alvestrand.no>, "PUBLIC-IRI@W3.ORG" <public-iri@w3.org>, uri@w3.org, Peter Saint-Andre <stpeter@stpeter.im>, Marcos Sanz <sanz@denic.de>, "Mark Davis ?" <mark@macchiato.com>, Vint Cerf <vint@google.com>, "www-tag.w3.org" <www-tag@w3.org>
--On Thursday, August 22, 2013 12:26 -0400 Andrew Sullivan
<ajs@anvilwalrusden.com> wrote:

> On Thu, Aug 22, 2013 at 04:11:15PM +0100, Anne van Kesteren
> wrote:
>> discussion here which makes matters confusing. What matters is
>> IDNA2003 as implemented and deployed throughout the DNS.
>
> Except it's _not_ deployed throughout the DNS. The ASCII-form
> is what's in the DNS. For the overwhelming majority of cases
> of valid, actually deployed IDNA2003 labels that we have ever
> found, there will be no change. And the applications are
> still doing the work of translating those labels to Unicode.
>...

Let me add a bit to this and see if I can make a useful suggestion.

When the IDNA2003 discussions were occurring, the main rationale for the various mappings (CaseFolding, NFKC, etc.) was precisely what Anne mentioned early in the thread -- to give the users what they would expect if, e.g., they typed FöO.example.com rather than föo.example.com. IDNA2008 (especially RFC 5895 and arguably UTR 46) is consistent with that view about user typing and the user experience.

The place where this gets knotty is that, whether it got written down or not, there was a general expectation among most of the IDNA2003 participants that "real" canonical-form URLs -- the stuff that gets transmitted between systems, would appear in arefs, etc. -- would have their domain components in ASCII-encoded form, matching what, as Andrew notes, is deployed in the DNS. From that ASCII-encoded and DNS perspective, things like Eszett are non-problems because it simply could not be encoded under IDNA2003: it could be mapped to "ss" from user input, but there was no way that ToUnicode(string) could even produce a label containing one. Punycode-encoded strings that could include a representation of an Eszett character could not exist prior to IDNA2008, so, from the DNS point of view, their addition wasn't even an incompatible change.

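As a rough illustration of the mapping behaviour described above, here is a minimal sketch using Python's built-in "idna" codec, which implements IDNA2003 (RFC 3490 with Nameprep), not IDNA2008. The helper names and the "straße.example" name are assumptions introduced for illustration only.

```python
# Minimal sketch: IDNA2003 mapping as implemented by Python's stdlib
# "idna" codec. The helper names below are hypothetical.

def to_ascii_2003(domain: str) -> str:
    """Apply IDNA2003 ToASCII to each label of a domain name."""
    return domain.encode("idna").decode("ascii")


def to_unicode_2003(domain: str) -> str:
    """Apply IDNA2003 ToUnicode to each label of an ASCII domain name."""
    return domain.encode("ascii").decode("idna")


if __name__ == "__main__":
    # Case folding: both spellings end up as the same ASCII-encoded name.
    print(to_ascii_2003("FöO.example.com"))
    print(to_ascii_2003("föo.example.com"))

    # Eszett is mapped to "ss" before encoding, so no Punycode form
    # containing it can result and the original spelling is not recoverable.
    ascii_form = to_ascii_2003("straße.example")
    print(ascii_form)
    print(to_unicode_2003(ascii_form))
```

The point of the sketch is only that the mapping is one-way: what ends up in the DNS carries no trace of the form the user typed.
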
Again from that perspective, where we got into trouble was that browsers, presumably responding to the demands of page authors, not only allowed native-character domain name labels in URLs but even allowed the non-canonical forms. People took advantage of that, as they will, and we ended up where we are today. But that isn't an IDNA2008 problem because, from a good practices standpoint, that practice, especially having non-canonical forms and depending on mapping, was a bad idea even for IDNA2003.

On the DNS registration side, several parties took advantage of the mappings and sold/delegated native-character labels that could not be mapped back from their Punycode-encoded forms -- another thing that was clearly a bad practice at the time, but they were no more deterred than some page authors (and email users, btw) were.

Suggestions, at least as a starting point for some discussion:

(1) Move toward IDNA2008 terminology. We got rid of the IDNA2003 terminology because it just got too clumsy when people tried to be unambiguous about what they were talking about. In the process, stop thinking about "IDNA2003 without Unicode version restrictions". While the intent is clear, as others have pointed out, that phrase can be used to describe enough different things to be a potential source of interoperability problems. As noted below, while IDNA2008 terminology is necessary, it may not be sufficient. Note that this suggestion doesn't require that anyone do anything different, only that we change how we talk about it.

(2) For those who don't already, try to understand the reasons for moving away from IDNA2003 rather than just saying "lots of people are still using it" (whether that is correct or not). Several of those reasons have been pointed out in this discussion.
For the benefit of those who didn't see it in this multiple-list discussion, Olaf Kolkman recently reminded those on the IDNA-update list about the discussion in RFC 4690, especially Section 5.3, http://tools.ietf.org/html/rfc4690#section-5.3.

(3) For strings that are valid under both IDNA2003 and IDNA2008, try to remember in our various conversations that what has often been called "preserving backward compatibility" or "preserving IDNA2003 behavior" is also "ignoring what the document or user specified and doing something else instead".

(4) Define a canonical form for the domain name part of a URL and specify its use wherever that is feasible from a production and user interface standpoint. For closeness to the DNS and what actually appears there, that means that IDNs appear as A-labels. If you decide you need to support native character forms (as encoded UTF-8 or in IRIs) for whatever reason, possibly including the considerations of RFC 6055, the canonical form should allow IDNs only as U-labels. Noting that things like certificates and their DNS analogues aren't, in general, going to work with strings that require mappings to get to labels, U-labels (and A-labels) are always safe and unambiguous, even where other things might be plausible.

(5) For input from users, existing documents, etc., you will almost certainly need support for a certain amount of mapping (even if only case folding where that is appropriate). Encourage designs that keep that as local as possible, i.e., that involve early conversions to U-labels and retention of the U-labels. Then borrow from some of this thread or the comment about flags in UTR46 and consider when and how aggressively to warn whomever is relevant that depending on those mappings is dangerous and may lead to trouble. Personally, I'd favor being much more aggressive with page authors than with users and would leave those who don't have much control over what is actually going on to their own devices.
Gerv and others may have better ideas.

(6) Search engines and other things that return links should return only canonical forms as discussed in (4) when those are possible. Obviously, that isn't possible for strings that are disallowed entirely, but this is important as a "get the users used to it" transition step for strings that map into valid U-labels. There is little reason for them to try to preserve forms that require mapping, even if they found a particular resource by going through a link that did. Similarly, when a domain name is displayed back to a user, it should be displayed in canonical form with either A-labels or U-labels. If that isn't what the user typed, the difference can be a small security clue and source of education for users who are paying attention. I believe that some systems are doing those things already.

(7) IMO, UTR46 needs some work. The suggestions above lay the foundation for what I believe is the most important substantive piece of that work, and complement Mark's recent notes. I believe that UTR46 is in need of serious discussion of when it is plausible to shut off the "transition" machinery. Mark's recent notes provide most of the information and text that I believe need to be in the spec itself. It is almost trivial by comparison, but I think it should contain some strong language explaining why it is unreasonable to claim conformance with or application of UTR46 without a statement as to which (if any) transition mechanisms are being applied (e.g., whether a domain name containing Eszett, ZWJ, or ZWNJ will be looked up or changed into something else that the user didn't specify).

I'll respond separately to some of the details of those notes, but want to start with the observation that my thinking, at least, has evolved considerably in the last three or four years and that I think we are now quibbling about details rather than having major disagreements.

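To make suggestions (4) through (6) a little more concrete, here is a minimal sketch of "convert early, keep the canonical form, and notice when mapping was needed". It is an assumption-laden illustration, not a proposed implementation: it uses Python's built-in "idna" codec, which applies IDNA2003 rules, so it treats Eszett as something to be mapped, whereas under IDNA2008 a label containing Eszett is valid as it stands. A real implementation would need an IDNA2008-aware library. The names Canonical and canonicalize are hypothetical.

```python
# Minimal sketch, assuming IDNA2003 rules via the stdlib "idna" codec.
# Under IDNA2008 the set of valid labels differs (e.g. Eszett is allowed).
from typing import NamedTuple


class Canonical(NamedTuple):
    a_form: str       # domain with IDN labels in ASCII-encoded ("xn--") form
    u_form: str       # domain with IDN labels in native-character form
    was_mapped: bool  # True if the input was neither of the two forms above


def canonicalize(domain: str) -> Canonical:
    """Convert input to both canonical forms and flag reliance on mapping."""
    # The codec leaves pure-ASCII labels untouched, so fold case explicitly.
    a_form = domain.encode("idna").decode("ascii").lower()
    u_form = a_form.encode("ascii").decode("idna")
    # If the input matches neither canonical form, some mapping was applied;
    # a caller could use this to warn a page author or to show the user
    # the name that will actually be looked up.
    was_mapped = domain not in (a_form, u_form)
    return Canonical(a_form, u_form, was_mapped)


if __name__ == "__main__":
    for name in ("FöO.example.com", "föo.example.com", "straße.example"):
        print(name, canonicalize(name))
```

Storing and displaying only a_form or u_form, and treating was_mapped as the trigger for a warning, is one way to keep the mapping "as local as possible" in the sense of (5).
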
best,
   john

> IDNA2008 is supposed not only to reduce the number of code
> points that are permitted by the protocol. Among other
> things, it's also designed to improve the underlying
> normalization (NFC, which is better for these purposes than
> NFKC according to UTC documents); to permit the use of certain
> joiners that our Arabic-script using colleagues insist are
> extremely important to them (you should hear the reaction when
> I tell Arabic-using people that browsers aren't planning to do
> IDNA2008 yet); to ensure that every U-label has exactly one
> A-label and conversely (which is not true under IDNA2003); and
> still to make possible the kind of mapping that is required in
> IDNA2003 while yet permitting more locale-sensitive treatment
> in the unusual cases where that is appropriate.
>
> Given the places the Internet is growing, and if we assume
> that domain names will continue to be at all important, the
> number of IDNs actually deployed today is a tiny percentage of
> what it will be in the near future, especially as more IDN
> TLDs come online. We need to fix the known issues before it
> really is absolutely too late to do anything.
>
> Best,
>
> A

Received on Friday, 23 August 2013 12:25:42 UTC