Re: Standardizing on IDNA 2003 in the URL Standard from Mark Davis ☕ on 2014-01-16 (www-tag@w3.org from January 2014)

From: Mark Davis ☕ <mark@macchiato.com>
Date: Thu, 16 Jan 2014 16:32:25 +0100
To: Anne van Kesteren <annevk@annevk.nl>
Cc: Gervase Markham <gerv@mozilla.org>, John C Klensin <klensin@jck.com>, yaojk <yaojk@cnnic.cn>, Paul Hoffman <paul.hoffman@vpnc.org>, "PUBLIC-IRI@W3.ORG" <public-iri@w3.org>, "uri@w3.org" <uri@w3.org>, IDNA update work <idna-update@alvestrand.no>, "www-tag.w3.org" <www-tag@w3.org>
Message-ID: <CAJ2xs_GFZX-SCRHycuMPRkBDxGS5BMYQj1WaA3bPF80Nva2e7A@mail.gmail.com>

I will be brief, because I don't have much time for this topic this week.
(It should teach me to be quiet...)

Mark <https://google.com/+MarkDavis>

*— Il meglio è l’inimico del bene —*

On Thu, Jan 16, 2014 at 3:27 PM, Anne van Kesteren <annevk@annevk.nl> wrote:

> On Thu, Jan 16, 2014 at 1:24 PM, Mark Davis ☕ <mark@macchiato.com> wrote:
> > It is not unlikely that an implementation that you think is following
> > IDNA2003 (with a non-standard, larger repertoire) is actually following
> > UTS 46.
>
> I know for a fact that Gecko has not changed its implementation (but
> has updated Unicode since the release of IDNA2003, doh). It "passes"
> the Pile of Poo Test™:
>
> <a href="http://💩.com/">test</a>
> <script>alert(document.querySelector("a").host)</script>
>

The problem is, as Andrew and others have said, IDNA2003 does not specify
*how* one would update to a new version of Unicode: that is, exactly which
new characters would be accepted and which not, and how to case-map them.
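
A small Python sketch of that gap, assuming CPython's unicodedata module,
which ships a frozen Unicode 3.2 database as ucd_3_2_0 (the version the
IDNA2003 Stringprep tables are based on). The helper name "categories" is
made up for illustration. U+1F4A9 is unassigned in 3.2, and IDNA2003 does not
say what to do with it once a later Unicode version assigns it, so each
implementation ends up deciding on its own:

```python
import unicodedata

POO = "\U0001F4A9"  # the character used in the test quoted above


def categories(ch: str) -> tuple[str, str]:
    """Return (general category in Unicode 3.2, category in this Python's Unicode)."""
    return unicodedata.ucd_3_2_0.category(ch), unicodedata.category(ch)


old, new = categories(POO)
# "Cn" means unassigned; "So" means "Symbol, other".
print(unicodedata.unidata_version, old, new)
```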

> Alerts: xn--ls8h.com
>
> Chrome alerts the same and reportedly has updated to UTS46 (compatible
> mode), so as you point out the differences are probably minor and
> require checking of some obscurer code points.
>
>
> > There is a table in
> > http://unicode.org/reports/tr46/#Table_IDNA_Comparisons
>
> That is an interesting table. Ⅎ (line c) seems indeed disallowed in
> Chrome, yet 㛼 (line d) which should also be disallowed per that table
> works fine. Both work fine in Firefox. Both Chrome and Firefox map ！
> (line b) to ! and do not cause parsing to fail because of it, even
> though the table suggests it should. (Presumably do it making
> assumptions about ASCII that browsers do not share.)
>

I'd have to look at those cases.

>
> Firefox and Safari map ؂ (line i) and Chrome does not.
>
>
> > One way to look at UTS 46 is as a migration layer to support client
> > implementations during the transition of registries from IDNA2003 to
> > IDNA2008, plus a mapping layer that can be used with straight IDNA2008.
>
> I'm not sure what this means. Do you think we will ever stop mapping
> U+3002 to U+002E?

> Or A to a?
>

I'm assuming that you mean the ASCII characters (I'm not going to check
whether they are just look-alikes). ASCII case mapping is covered at a
different level.

I don't think clients would stop mapping, and IDNA2008 permits it. That's
why I said "plus a mapping layer that can be used with straight IDNA2008".
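
As a rough illustration of a mapping step that runs before the
encoding step, here is a Python sketch. It is not UTS 46: NFKC normalization
plus lowercasing stand in for the real mapping table, and validity checks,
bidi rules and length limits are skipped entirely. The function name
map_then_encode is hypothetical.

```python
import unicodedata

# Non-ASCII label separators that IDNA2003 (RFC 3490) treats like U+002E.
DOTS = "\u3002\uff0e\uff61"


def map_then_encode(domain: str) -> str:
    """Simplified mapping (dots, NFKC, lowercase), then Punycode per label."""
    for dot in DOTS:
        domain = domain.replace(dot, ".")
    # Assumption: NFKC + lower() approximates the mapping; UTS 46 uses its own table.
    mapped = unicodedata.normalize("NFKC", domain).lower()

    labels = []
    for label in mapped.split("."):
        if label.isascii():
            labels.append(label)
        else:
            # The "punycode" codec implements RFC 3492; the ACE prefix is added here.
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(labels)


print(map_then_encode("\U0001F4A9\u3002COM"))
```

With U+3002 as the separator and an uppercase TLD, this yields the same
xn--ls8h.com that the browsers alert in the test quoted earlier.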

>
> > > I think I did mention earlier on UTS46 might be okay, depending on the
> > > details. I am hoping to hear from Mark on the matter.
> >
> > I'm not sure what specific questions you have about UTS 46. Can you
> > reiterate them?
>
> You keep talking about UTS 46 as if it were a migration layer, which
> suggests it might go away. That does not really seem acceptable to me.
>

UTS 46 will stay around, if only for the mapping layer.

Whether the rest would be used by clients really depends on the progress
made by registries. As for the deviation-character support, I think
implementations could stop supporting them if the affected registries
enforced bundle-or-block. As to the additional symbols, implementations
could stop supporting them if the registries forbade them.

>
> It enforces DNS length restrictions on domain names (IDNA2003 did the
> same), which does not appear to be implemented in browsers. They're
> fine with a label longer than a hundred code points. I don't think
> this should be outlawed at the parsing layer because the name might be
> used outside the DNS.
>

That never came up as a topic in any of the standards discussions.

>
> I wish it contained the actual ASCII restrictions we need in practice
> rather than deferring those to the application, but I suppose I can
> define those in the URL Standard and use UseSTD3ASCIIRules=false.
>
> Another wish I have is that the algorithms are a bit clearer in terms
> of input and output. What argument does ToASCII take? What about
> ToUnicode?
>
> E.g. how would you replace "domain to ASCII" and "domain to Unicode"
> in http://url.spec.whatwg.org/#concept-host-parser with UTS46 and
> ensure the algorithm still has the same kind of expected output?

http://unicode.org/reports/tr46/#ToASCII

If there are specific areas where you find the spec unclear, I suggest that
you provide feedback as instructed at the top of the spec. Subsequent
versions can then clarify those points.

> It's
> not entirely clear to me how to make use of your work.
>

You may not have meant a singular 'you', but just for clarity: it's not
"my" work; it is the work of the Unicode consortium, with many individuals
and companies involved.

>
>
> --
> http://annevankesteren.nl/
>

Received on Thursday, 16 January 2014 15:32:58 UTC