W3C home > Mailing lists > Public > uri@w3.org > August 2013

Re: Standardizing on IDNA 2003 in the URL Standard

From: Gervase Markham <gerv@mozilla.org>
Date: Thu, 22 Aug 2013 12:02:23 +0100
Message-ID: <5215EFBF.10706@mozilla.org>
To: Anne van Kesteren <annevk@annevk.nl>
CC: Mark Davis ☕ <mark@macchiato.com>, Shawn Steele <Shawn.Steele@microsoft.com>, IDNA update work <idna-update@alvestrand.no>, "PUBLIC-IRI@W3.ORG" <public-iri@w3.org>, "uri@w3.org" <uri@w3.org>, John C Klensin <klensin@jck.com>, Peter Saint-Andre <stpeter@stpeter.im>, Marcos Sanz <sanz@denic.de>, Vint Cerf <vint@google.com>, "www-tag.w3.org" <www-tag@w3.org>
On 22/08/13 11:37, Anne van Kesteren wrote:
>> Shame for them. The writing has been on the wall here for long enough
>> that they should not be at all surprised when this stops working.
> 
> I don't think that's at all true. I doubt anyone realizes this. I
> certainly didn't until I put long hours into investigating the IDNA
> situation.

It's not been possible to register names like ☺☺☺.com for some time now;
that's a big clue. The fact that Firefox (and other browsers, AFAIAA)
refuses to render such names as Unicode is another one. (Are your
friends really using http://xn--74h.example.com/ ?)
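(As an aside for readers unfamiliar with the xn-- form: it is just the IDNA ACE prefix followed by a Punycode payload per RFC 3492, so it is easy to check what such a label hides. A quick illustration using Python's built-in punycode codec, not part of the original mail:)

```python
# Decode the Punycode payload of an ACE label such as "xn--74h".
# The "xn--" prefix is the IDNA ACE marker; the rest is Punycode (RFC 3492).
label = "xn--74h"
payload = label[len("xn--"):]

# Python ships a 'punycode' codec implementing RFC 3492.
decoded = payload.encode("ascii").decode("punycode")
print(decoded)  # the single character U+263A WHITE SMILING FACE

# Round-trip: encoding the smiley yields the same payload.
assert "\u263a".encode("punycode") == b"74h"
```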

Those two things, plus the difficulty of typing such names, means that
their use is going to be pretty limited. (Even the guy trying to
flog http://xn--19g.com/ , whose selling point is that this particular
name is actually easy to type on some computers, has not in the past
few years managed to find a "Macintosh company with a vision" to take
it off his hands.)

> Furthermore, we generally preserve compatibility on the web so URLs
> and documents remain working.
> http://www.w3.org/Provider/Style/URI.html It's one of the more
> important parts of this platform.

(The domain name system is about more than just the web.)

IIRC, we must have broken a load of URLs when we decided that %-encoding
in URLs should always be interpreted as UTF-8 (in RFC 3986), whereas
beforehand it depended on the charset of the page or form producing the
link. Why did we do that? Because the new way was better for the future,
and some breakage was acceptable to attain that goal.
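(To make that breakage concrete: a link produced from a Latin-1 page would percent-encode é as %E9, while a UTF-8 page would produce %C3%A9, so the same character had two different encoded forms depending on the page charset. A small illustration using Python's urllib, my example rather than anything from the original mail:)

```python
from urllib.parse import quote, unquote

# The same character percent-encoded under the two interpretations:
latin1 = quote("é", encoding="latin-1")   # charset-of-the-page behaviour
utf8 = quote("é", encoding="utf-8")       # the always-UTF-8 rule

print(latin1)  # %E9
print(utf8)    # %C3%A9

# A consumer that always assumes UTF-8 cannot recover the Latin-1 form:
# %E9 is not valid UTF-8, so it decodes to U+FFFD REPLACEMENT CHARACTER.
print(unquote(latin1, errors="replace"))
print(unquote(utf8))  # é
```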

So what is the justification for removal of non-letter characters?
Reduction of attack surface. When characters are divided into scripts,
we can enforce no-script-mixing rules that keep the number of possible
spoofs, lookalikes and substitutions tractable for humans to reason
about, given a particular TLD and its allowed characters. If we allowed
3,254 extra random glyphs in every TLD, that would no longer be true.
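(A no-script-mixing check of the sort described can be sketched in a few lines. The script lookup below is a rough heuristic based on Unicode character names; real checkers use the Unicode Script property from UAX #24 and the UTS #39 mechanisms instead. The "pаypal" example, with a Cyrillic а, is mine:)

```python
import unicodedata

def script_of(ch: str) -> str:
    # Rough heuristic: the first word of the Unicode character name is
    # usually the script ("LATIN", "CYRILLIC", ...). Real implementations
    # use the Unicode Script property (UAX #24) rather than names.
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def mixes_scripts(label: str) -> bool:
    # A label that draws its letters from more than one script is the
    # kind of spoof candidate a no-script-mixing rule rejects.
    scripts = {script_of(ch) for ch in label if ch.isalpha()}
    return len(scripts) > 1

# The second "a" below is U+0430 CYRILLIC SMALL LETTER A -- visually
# identical to Latin "a" but from a different script.
print(mixes_scripts("paypal"))       # False
print(mixes_scripts("p\u0430ypal"))  # True
```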

Gerv
Received on Thursday, 22 August 2013 11:02:57 UTC
