Re: Standardizing on IDNA 2003 in the URL Standard

From: Mark Davis ☕ <mark@macchiato.com>
Date: Wed, 21 Aug 2013 17:01:42 +0200
Message-ID: <CAJ2xs_Fu8YXtYv99mJ6ASHpCdqmM-J_XVB3G8To65Voad3ihGw@mail.gmail.com>
To: John C Klensin <klensin@jck.com>
Cc: Marcos Sanz <sanz@denic.de>, Anne van Kesteren <annevk@annevk.nl>, Shawn Steele <Shawn.Steele@microsoft.com>, "PUBLIC-IRI@W3.ORG" <public-iri@w3.org>, "uri@w3.org" <uri@w3.org>, Peter Saint-Andre <stpeter@stpeter.im>, IDNA update work <idna-update@alvestrand.no>, Vint Cerf <vint@google.com>, "www-tag.w3.org" <www-tag@w3.org>
Mark <https://plus.google.com/114199149796022210033>
*— Il meglio è l’inimico del bene —*

On Tue, Aug 20, 2013 at 9:33 PM, John C Klensin <klensin@jck.com> wrote:

> --On Tuesday, August 20, 2013 15:55 +0200 Marcos Sanz
> <sanz@denic.de> wrote:
> > idna-update-bounces@alvestrand.no wrote on 20/08/2013 14:32:23:
> >
> >> On Mon, Aug 19, 2013 at 9:32 PM, Shawn Steele
> >> <Shawn.Steele@microsoft.com> wrote:
> >> > I concur.  We use the IDNA2008 + TR46 behavior.
> >>
> >> Interesting. Last I checked Internet Explorer that was not
> >> the case.
> >
> > At this side of the keyboard, ß is still not supported in
> > IE10/Win7-SP1
> But that is completely consistent with IDNA2008 + UTR46 when the
> most IDNA2003-like profile (or, if you prefer, stage of
> transition) of UTR46 is used.  One can debate endlessly whether
> UTR46 is a good idea (and the IDNABIS WG did), but ultimately
> [1] it was intended to provide an environment as much like that
> of IDNA2003 as possible.  That includes:
> -- strict backward compatibility with the interpretation
>         of strings that are valid under either IDNA2003 or
>         IDNA2008, and
> -- continued support for strings that were valid in
>         IDNA2003 but that mapped into other strings before being
>         converted to ASCII strings using Punycode, where those
>         target strings are valid under IDNA2008
> If one accepts that kind of compatibility as a primary goal,
> then the fact that "ß" was mapped to "ss" in IDNA2003 means
> that mapping must be preserved forever and one will never [2]
> actually be able to store an Eszett in the DNS.
> The bottom line, at least IMO, is that one can adopt either of
> two philosophical models.   In one, whatever decisions were made
> in building the IDNA2003 standard and the name strings those
> decisions allowed are inviolable.  Arguments that errors were
> made, that those strings create risks, or that the rules
> prohibit orthographically-reasonable strings are simply
> irrelevant if they conflict with absolute compatibility.  The
> other (at the risk of showing my biases) is to assume that we are
> human, that mistakes will get made, and that, if they are
> significant, we should figure out how to correct them and move
> on.
> As others have suggested, the latter includes realizing that
> some labels and practices that were allowed under IDNA2003 were
> simply a bad idea and we should move away from them as soon as
> possible rather than encouraging their use in even more
> contexts.  Coming back to the comment that started this note, it
> also means that, if the relevant language communities decide,
> for example, that Eszett is important as a character or that
> zero-width joiners and non-joiners are critical, we need to
> figure out how to accommodate them even if the accommodation is
> not perfect and doesn't solve all problems.  And, in each case,
> we need to remember that the Internet is growing and reaching
> more communities and more people within almost every community,
> making transition now, even if painful, much less painful than
> transition in the future.

The key migration issue is whether people are comfortable having
implementations go to different IP addresses for IDNs containing 'ß' (or
the other 3 related characters). The transitional form in TR46 is for those
who are concerned with that problem. If the registries either bundled 'ss'
with 'ß' or blocked (once either was registered, the other could not be), then
the ambiguous addressing issue would not be a problem. So it is a matter of
waiting for the significant registries to do that.
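
To make the ambiguity concrete, here is a minimal Python sketch. It assumes
only the standard library: Python's built-in "idna" codec implements IDNA2003
(RFC 3490 with Nameprep), so it maps 'ß' to 'ss'. The helper
unmapped_ace_label is hypothetical and only Punycode-encodes the label as
typed; it is not a full IDNA2008 implementation and performs no validity
checks. The label "straße" is just an illustration.

```python
def idna2003_lookup_label(label: str) -> str:
    """Label as looked up under IDNA2003 (stdlib "idna" codec, with mapping)."""
    return label.encode("idna").decode("ascii")


def unmapped_ace_label(label: str) -> str:
    """Hypothetical helper: Punycode-encode the label without any mapping.

    Approximates what a lookup that keeps 'ß' would query; it does not
    apply the IDNA2008 validity rules.
    """
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


label = "straße"
print(idna2003_lookup_label(label))  # strasse
print(unmapped_ace_label(label))     # xn--strae-oqa
```

The two results are distinct DNS names, so unless a registry bundles or
blocks them, they can be registered by different parties and resolve to
different addresses.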

> FWIW, without at least some measure of the latter model, we
> would be stuck with HTTP 1.0, HTML 1 (or at least 3), and ISO
> 8859-1 forever.  The decision to interpret a string of non-ASCII
> octets in content as, by default, a good candidate for UTF-8
> rather than Latin-1 is, at least IMO, ultimately an incompatible
> change of far more sweeping impact and consequences than this
> IDNA2003 -> IDNA2008 transition.

That's not a particularly good analogy. ASCII is and remains ASCII in
UTF-8; that's one of its virtues. Latin 1 was just one of many encodings
that used the high bit for different purposes, so UTF-8 was simply one of
many such encodings. It did not represent a backwards incompatibility with
existing standards.

> In an odd way, while I would have preferred to see a much more
> rapid transition, I think that exactly what should be happening
> is happening.  The various registries --both the
> ICANN-supervised ones and many others at the root and various
> other levels-- are prohibiting (and not renewing) strings that
> do not conform with IDNA2008.  Registries that want to support
> labels that are problematic from a transition standpoint have
> devised, or are devising, procedures to lower the odds of
> strings that pose difficulties falling into hostile hands, just
> as many of them do for potentially-confusing strings.  The right
> time to transition systems that look up names involves tricky
> questions including the "pain now or more pain later"
> considerations mentioned above.   And where UTR 46 and/or RFC
> 5895 fit into transition strategies (as distinct from localized
> mapping strategies), or not, is obviously part of that
> transition question.

I agree with that, and it is the scenario envisioned for TR46. That is,
once all (significant) registries move to IDNA2008, then clients can
impose stricter controls on the characters, excluding the characters that
are disallowed in IDNA2008. Because the registries will have moved, the
number of failing URLs would be acceptable.

> Anne, coming back to your original question, I don't know what
> question you and your colleagues asked that got the "everyone is
> still on IDNA2003" answer.  Especially given the information
> from Microsoft, I suspect it was close to "are you fully
> supporting IDNA2008" for which a "no" answer might lead to a
> "using IDNA2003" answer despite their telling us that they are
> running IDNA2008 with UTR 46.  Others have pointed out that
> "IDNA2003 with the version restriction eliminated" may be a
> sensible statement in individual cases but, because the Nameprep
> profile of Stringprep is not simply Unicode Case Folding plus
> NFKC, it leaves enough open to local interpretation that it is
> not a plausible candidate for a statement in a standard that is
> intended to promote interoperability.
> Against that backdrop, I believe you should interpret what you
> are seeing, not as "everyone is committed to IDNA2003"
> (obviously not true as soon as exceptions are introduced) and
> "IDNA2003 with exceptions forever" but as slow transition.  If
> you want a standard that works going forward, make the
> assumption that the folks who designed IDNA2008 were not fools
> and that browsers should be moving, and eventually will move
> (unless you discourage them) in the IDNA2008 direction.  Whether
> you want to discuss transition or not is up to you.  If you want
> to follow Mark's recommendation (and Microsoft's lead) and
> suggest IDNA2008 plus UTR 46, I suggest you do so in a way that
> really constitutes a transition strategy rather than an "IDNA
> 2003 forever" one, i.e., that you address the issues of when
> "transition processing" gets turned off and the localization
> issues (especially about case folding) mentioned by others.  If
> not, you and your working group put us all at risk of many
> internationalized email applications working differently than
> web browsers do, in a fork between IETF and W3C i18n standards,
> divergence between assumptions and norms used by those who
> create DNS names and those who look them up, and so on.  I hope
> we can agree that those would be bad outcomes.
> regards,
>     john
>  -----------
> [1] I hope Mark will more or less agree with this
> characterization; it is as accurate and neutral as I know how to
> make it.

Yes, thanks.

> [2] This is associated with one of the key criticisms of UTR 46
> that has not been discussed so far:  It has been described as a
> transition strategy, but there is really no mechanism in it for
> deciding when to adopt the IDNA2008 model and rules in favor of
> strict backward-compatibility with as many names that were valid
> under IDNA2003 as possible.   In reality, saying "we use UTR 46"
> or "we conform to UTR 46" is somewhat underspecified because UTR
> 46 can be used strictly for local mapping, with what it calls
> "transition processing" (which is where Eszett disappears),
> and/or with other optional features such as flagging, but
> continuing to look up, strings that contain punctuation or
> symbol characters.  Either of those latter options makes a
> so-called "IDNA2008 + UTR46" implementation non-conforming with
> IDNA2008.

Yes, it is the latter two options that can disappear under the right
conditions (as above).
Received on Wednesday, 21 August 2013 15:02:19 UTC
