Re: Standardizing on IDNA 2003 in the URL Standard from John C Klensin on 2013-08-23 (public-iri@w3.org from August 2013)

From: John C Klensin <klensin@jck.com>
Date: Fri, 23 Aug 2013 08:25:05 -0400
To: Andrew Sullivan <ajs@anvilwalrusden.com>, Anne van Kesteren <annevk@annevk.nl>
cc: IDNA update work <idna-update@alvestrand.no>, "PUBLIC-IRI@W3.ORG" <public-iri@w3.org>, uri@w3.org, Peter Saint-Andre <stpeter@stpeter.im>, Marcos Sanz <sanz@denic.de>, "Mark Davis" <mark@macchiato.com>, Vint Cerf <vint@google.com>, "www-tag.w3.org" <www-tag@w3.org>
Message-ID: <0E13633FCAF195774B71D936@JcK-HP8200.jck.com>
--On Thursday, August 22, 2013 12:26 -0400 Andrew Sullivan
<ajs@anvilwalrusden.com> wrote:

> On Thu, Aug 22, 2013 at 04:11:15PM +0100, Anne van Kesteren
> wrote:
>> discussion here which makes matters confusing. What matters is
>> IDNA2003 as implemented and deployed throughout the DNS.
> 
> Except it's _not_ deployed throughout the DNS.  The ASCII-form
> is what's in the DNS.  For the overwhelming majority of cases
> of valid, actually deployed IDNA2003 labels that we have ever
> found, there will be no change.  And the applications are
> still doing the work of translating those labels to Unicode.
>...

Let me add a bit to this and see if I can make a useful
suggestion.

When the IDNA2003 discussions were occurring, the main rationale
for the various mappings (CaseFolding, NFKC, etc.) was precisely
what Anne mentioned early in the thread -- to give the users
what they would expect if, e.g., they typed FöO.example.com
rather than föo.example.com.  IDNA2008 (especially RFC 5895 and
arguably UTR 46) are consistent with that view about user typing
and the user experience.
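
A minimal sketch of that typing case, using Python's built-in "idna"
codec, which applies the IDNA2003 ToASCII operation (RFC 3490 with
Nameprep).  The helper name to_ascii_2003 is made up for illustration
and does not come from any of the specifications discussed here.

```python
def to_ascii_2003(domain: str) -> bytes:
    """Apply IDNA2003 ToASCII to each label via the stdlib "idna" codec."""
    return domain.encode("idna")

typed = "FöO.example.com"
canonical = "föo.example.com"

# Nameprep case-folds the non-ASCII label before Punycode encoding,
# so both spellings yield the same ASCII form for the DNS lookup.
assert to_ascii_2003(typed) == to_ascii_2003(canonical)
print(to_ascii_2003(typed))
```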

The place where this gets knotty is that, whether it got written
down or not, there was a general expectation among most of the
IDNA2003 participants that "real" canonical-form URLs -- the
stuff that gets transmitted between systems, would appear in
hrefs, etc. -- would have their domain components in
ASCII-encoded form, matching what, as Andrew notes, is deployed
in the DNS.  From that ASCII-encoded and DNS perspective, things
like Eszett are non-problems because the character simply could
not be encoded under IDNA2003.  It could be mapped to "ss" from
user input, but there was no way that ToUnicode(string) could
even produce a label containing one.  Punycode-encoded strings
that could include a representation of an Eszett character could
not exist prior to IDNA2008, so, from the DNS point of view,
their addition wasn't even an incompatible change.
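
The Eszett point can be sketched with the same stdlib "idna" codec
(IDNA2003 rules).  The label "straße" is an illustrative choice, not one
taken from this thread, and prefixing raw Punycode output with "xn--" is
only an approximation of an A-label, since it skips all validity checks.
Note also that Python's codec raises an error where RFC 3490 ToUnicode
would simply return its input unchanged.

```python
ESZETT_LABEL = "straße"  # illustrative label, an assumption for this sketch

# IDNA2003 ToASCII: Nameprep maps Eszett to "ss", so the result is
# plain ASCII with no ACE prefix at all.
ascii_2003 = ESZETT_LABEL.encode("idna")
print(ascii_2003)  # b'strasse'

# Raw Punycode plus the ACE prefix, skipping Nameprep.  This stands in
# for the kind of label that only becomes possible once Eszett is a
# permitted code point.
a_label = b"xn--" + ESZETT_LABEL.encode("punycode")
print(a_label)

# The IDNA2003 codec will not decode that label: its ToUnicode step
# re-applies ToASCII to the decoded text and rejects any mismatch.
try:
    a_label.decode("idna")
    round_trips = True
except UnicodeError:
    round_trips = False
print(round_trips)
```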

Again from that perspective, where we got into trouble was that
browsers, presumably responding to the demands of page authors,
not only allowed native-character domain name labels in URLs but
even allowed the non-canonical forms.  People took advantage of
that, as they will, and we ended up where we are today.  But
that isn't an IDNA2008 problem because, from a good-practices
standpoint, allowing those forms (especially the non-canonical
ones that depend on mapping) was a bad idea even for IDNA2003.
On the DNS registration side, several parties took advantage of
the mappings and sold/delegated native-character labels that
could not be mapped back from their Punycode-encoded forms --
another thing that was clearly a bad practice at the time, but
they were no more deterred than some page authors (and email
users, btw) were.

Suggestions, at least as a starting point for some discussion:

(1) Move toward IDNA2008 terminology.  We got rid of the
IDNA2003 terminology because it just got too clumsy when people
tried to be unambiguous about what they were talking about.  In
the process, stop thinking about "IDNA2003 without Unicode
version restrictions".  While the intent is clear, as others
have pointed out, that phrase can be used to describe enough
different things to be a potential source of interoperability
problems.  As noted below, while IDNA2008 terminology is
necessary, it may not be sufficient.  Note that this suggestion
doesn't require that anyone do anything different, only that we
change how we talk about it.

(2) For those who don't already, try to understand the reasons
for moving away from IDNA2003 rather than just saying "lots of
people are still using it" (whether that is correct or not).
Several of those reasons have been pointed out in this
discussion.  For the benefit of those who didn't see it in this
multiple-list discussion, Olaf Kolkman recently reminded those
on the IDNA-update list about the discussion in RFC 4690,
especially Section 5.3,
http://tools.ietf.org/html/rfc4690#section-5.3. 

(3) For strings that are valid under both IDNA2003 and IDNA2008,
try to remember in our various conversations that what has often
been called "preserving backward compatibility" or "preserving
IDNA2003 behavior" is also "ignoring what the document or user
specified and doing something else instead".  

(4) Define a canonical form for the domain name part of a URL
and specify its use wherever that is feasible from a production
and user interface standpoint.  For closeness to the DNS and
what actually appears there, that means that IDNs appear as
A-labels.  If you decide you need to support native character
forms (as encoded UTF-8 or in IRIs) for whatever reason,
possibly including the considerations of RFC 6055, the canonical
form should allow IDNs only as U-labels.  Note that things like
certificates and their DNS analogues aren't, in general, going
to work with strings that require mappings to get to labels;
U-labels (and A-labels) are always safe and unambiguous, even
where other things might be plausible.

(5) For input from users, existing documents, etc., you will
almost certainly need support for a certain amount of mapping
(even if only case folding where that is appropriate).
Encourage designs that keep that as local as possible, i.e.,
that involve early conversions to U-labels and retention of the
U-labels.  Then borrow from some of this thread or the comment
about flags in UTR46 and consider when and how aggressively to
warn whomever is relevant that depending on those mappings is
dangerous and may lead to trouble.   Personally, I'd favor being
much more aggressive with page authors than with users and would
leave those who don't have much control over what is actually
going on to their own devices.  Gerv and others may have better
ideas.

(6) Search engines and other things that return links should
return only canonical forms as discussed in (4) when those are
possible.  Obviously, it isn't for strings that are disallowed
entirely, but this is important as a "get the users used to it"
transition step for strings that map into valid U-labels.  There
is little reason for them to try to preserve forms that require
mapping, even if they found a particular resource by going
through a link that did.  Similarly, when a domain name is
displayed back to a user, it should be displayed in canonical
form with either A-labels or U-labels.  If that isn't what the
user typed, the difference can be a small security clue and
source of education for users who are paying attention.  I
believe that some systems are doing those things already.

(7) IMO, UTR46 needs some work.  The suggestions above lay the
foundation for what I believe is the most important substantive
piece of that work, and complement Mark's recent notes.  I
believe that UTR46 is in need of serious discussion of when it
is plausible to shut off the "transition" machinery.  Mark's
recent notes provide most of the information and text that I
believe need to be in the spec itself.   It is almost trivial by
comparison, but I think it should contain some strong language
explaining why it is unreasonable to claim conformance with or
application of UTR46 without a statement as to which (if any)
transition mechanisms are being applied (e.g., whether a domain
name containing Eszett, ZWJ, or ZWNJ will be looked up or
changed into something else that the user didn't specify).  I'll
respond separately to some of the details of those notes, but
want to start with the observation that my thinking, at least,
has evolved considerably in the last three or four years and
that I think we are now quibbling about details rather than
that I think we are now quibbling about details rather than
having major disagreements.
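
As a rough illustration of suggestions (4) through (6), the sketch below
converts typed input to a canonical form early, keeps both the A-label
and U-label forms, and flags input that only worked because of mapping.
The function name canonicalize is hypothetical, and the stdlib "idna"
codec it relies on applies IDNA2003 rules, so it is only a stand-in for
an IDNA2008-aware library.

```python
def canonicalize(typed: str) -> tuple[str, str, bool]:
    """Return (A-label form, U-label form, True if mapping altered the input)."""
    # Stand-in only: the stdlib "idna" codec applies IDNA2003 rules.
    a_form = typed.encode("idna").decode("ascii")
    u_form = a_form.encode("ascii").decode("idna")
    return a_form, u_form, u_form != typed

for host in ("föo.example.com", "FöO.example.com"):
    a_form, u_form, mapped = canonicalize(host)
    note = "depends on mapping, worth a warning" if mapped else "already canonical"
    print(f"{host!r}: store {a_form!r}, display {u_form!r} ({note})")
```

Displaying u_form rather than the typed string is what gives the user
the small clue described in (6) when the two differ.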

best,
   john





> 
> IDNA2008 is supposed not only to reduce the number of code
> points that are permitted by the protocol.  Among other
> things, it's also designed to improve the underlying
> normalization (NFC, which is better for these purposes than
> NFKC according to UTC documents); to permit the use of certain
> joiners that our Arabic-script using colleagues insist are
> extremely important to them (you should hear the reaction when
> I tell Arabic-using people that browsers aren't planning to do
> IDNA2008 yet); to ensure that every U-label has exactly one
> A-label and conversely (which is not true under IDNA2003); and
> still to make possible the kind of mapping that is required in
> IDNA2003 while yet permitting more locale-sensitive treatment
> in the unusual cases where that is appropriate.
> 
> Given the places the Internet is growing, and if we assume
> that domain names will continue to be at all important, the
> number of IDNs actually deployed today is a tiny percentage of
> what it will be in the near future, especially as more IDN
> TLDs come online.  We need to fix the known issues before it
> really is absolutely too late to do anything. 
> 
> Best,
> 
> A
Received on Friday, 23 August 2013 12:25:43 UTC