Re: Standardizing on IDNA 2003 in the URL Standard from Andrew Sullivan on 2013-08-20 (www-tag@w3.org from August 2013)

From: Andrew Sullivan <ajs@anvilwalrusden.com>
Date: Tue, 20 Aug 2013 12:06:20 -0400
To: Anne van Kesteren <annevk@annevk.nl>
Cc: Mark Davis <mark@macchiato.com>, Shawn Steele <Shawn.Steele@microsoft.com>, Vint Cerf <vint@google.com>, "public-iri@w3.org" <public-iri@w3.org>, "uri@w3.org" <uri@w3.org>, "idna-update@alvestrand.no" <idna-update@alvestrand.no>, "www-tag.w3.org" <www-tag@w3.org>, Peter Saint-Andre <stpeter@stpeter.im>
Message-ID: <20130820160620.GD21439@mx1.yitter.info>

I'm pretty sure I'm not on many of these lists, so I bet this mail
won't go everywhere.  Nevertheless,

On Tue, Aug 20, 2013 at 01:32:23PM +0100, Anne van Kesteren wrote:
> (Aside: ToASCII in IDNA2003 applies to domain labels. It applying to
> domain names in UTS #46 is somewhat confusing.)

Or "broken".  It can't apply to domain names, of course, because
that's not how the DNS works; but one might be forgiven for wondering
whether not understanding the details of an underlying technical
problem is a barrier to having an opinion in this space.
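
To make the label-versus-name distinction concrete, here is a minimal
sketch using Python's standard encodings.idna module, which implements
the IDNA2003 ToASCII operation.  The helper name to_ascii_per_label is
hypothetical, and the sketch assumes a name with no empty labels (no
trailing dot).  The point is only that the operation runs once per
label, and the dots are handled outside it.

```python
from encodings import idna


def to_ascii_per_label(domain: str) -> str:
    """Apply IDNA2003 ToASCII to each label of a dotted name separately.

    Hypothetical helper for illustration only: it assumes the name has
    no empty labels, because ToASCII rejects an empty label.
    """
    labels = domain.split(".")
    # ToASCII is defined on a single label and returns bytes.
    encoded = [idna.ToASCII(label).decode("ascii") for label in labels]
    return ".".join(encoded)


result = to_ascii_per_label("bücher.example")
print(result)
```

Splitting on the dot and re-joining is done by the caller here, not by
ToASCII itself.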

> I don't think the committee has carefully considered the compatibility
> impact. Deployed domains would become invalid.

The IDNABIS wg did not take that decision lightly.  In my opinion, we
concluded that some deployed domains were just _broken_, and that we
were eventually going to endure this pain, and that it would be better
to do it earlier rather than later.

> Long-standing practice
> of case folding (e.g. the idea that http://EXAMPLE.COM/ and
> http://example.com/ are identical) is suddenly something that is no
> longer decided upon by IDNA but needs to be decided somehow at the
> application-level. 

Well, sort of.  There's nothing in IDNA2008 that prevents the OS from
providing a generic facility for this (which is apparently what the
current generation of Windows does).  

The point was to take this mapping out of the _protocol_ and put it
into local rules that could be made locale-sensitive.  The reason for
this is that, while it is impossible in general to provide case
folding rules that work when lower-case accented characters get mapped
to upper case without accents and then get case folded again (thereby
losing data), it _might_ be possible to do this in a locale-sensitive
way if one knew enough about the environment.  For instance, in some writing
systems for French, it is standard practice to fold LATIN SMALL LETTER
E WITH ACUTE to LATIN CAPITAL LETTER E (not all French systems, of
course; some fold to LATIN CAPITAL LETTER E WITH ACUTE).  Now, if the
LATIN CAPITAL LETTER E is next downcased, what should you get?  The
general rule will of course be LATIN SMALL LETTER E, but if you had a
clever program that could do intelligent things with the string
"ECOLE", the folding might be LATIN SMALL LETTER E WITH ACUTE, or the
folding might try both and see what happens.  This example is a little
contrived -- the French example seems silly -- but examples in other
scripts and languages are in my view considerably more compelling.  I
don't think that UTS#46 is actually different in this regard, although
it proposes uniform mapping rules in all cases.  
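
The data loss in the "ECOLE" example can be sketched in a few lines of
Python.  This is an illustration under stated assumptions, not
anything IDNA2008 or UTS#46 specifies: strip_accents,
upcase_without_accents, FRENCH_WORDS and downcase_for_locale are all
hypothetical names, and the one-entry word list stands in for whatever
locale knowledge a "clever program" might really have.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def upcase_without_accents(text: str) -> str:
    """Upper-case the way some French systems do: drop the accent."""
    return strip_accents(text).upper()


# Hypothetical locale data: unaccented lower-case form -> preferred form.
FRENCH_WORDS = {"ecole": "école"}


def downcase_for_locale(text: str, locale: str) -> str:
    """Lower-case, consulting the (hypothetical) word list for French."""
    lowered = text.lower()
    if locale == "fr":
        return FRENCH_WORDS.get(lowered, lowered)
    return lowered


original = "école"
upper_plain = upcase_without_accents(original)   # accent is dropped here
generic = upper_plain.lower()                    # general rule: plain "e"
localized = downcase_for_locale(upper_plain, "fr")

print(upper_plain, generic, localized)
```

The general rule cannot recover the accent once it is gone; only the
locale-aware path, which knows something about the environment, gets
back to the original string.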

IDNA2003 doesn't handle this case really well, because it can't
possibly do so.  There's simply no room for locale in IDNA2003.

> And when the Unicode consortium provided such
> profiling for applications in the form of
> http://unicode.org/reports/tr46/ that was frowned upon.

I think the history is a little more complicated than that.

Best regards,

A

-- 
Andrew Sullivan
ajs@anvilwalrusden.com

Received on Tuesday, 20 August 2013 16:06:49 UTC