Re: IDNA reference (Issue #16) from John C Klensin on 2010-09-30 (public-iri@w3.org from September 2010)

From: John C Klensin <john-ietf@jck.com>
Date: Thu, 30 Sep 2010 09:31:57 -0400
To: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>
cc: Wil Tan <wil@dready.org>, Julian Reschke <julian.reschke@gmx.de>, public-iri@w3.org
Message-ID: <1580D642D916D6E5E82D84A7@PST.JCK.COM>
--On Tuesday, September 28, 2010 20:27 +0900 "\"Martin J.
Dürst\"" <duerst@it.aoyama.ac.jp> wrote:

>> Actually, 5890/91/92/93 and arguably the still unpublished RFC
>> 5895.  RFC 5894 is not normative, but contains the
>> explanations that might be more useful to some people as well
>> as a discussion of the transition issues.
> 
> I have added (at first unused) references to RFC 5890 and
> 5891. I have referenced RFC 5890 here. I think it should be
> obvious to the reader of that document that they have to look
> at the others, too. I don't think we want to have a whole list
> of RFCs every time there is something about IDNA, but of
> course if there is something specific regarding one of the
> other documents (e.g. bidi,...), we'll also add a direct
> reference to that.

FWIW, this works for me.

>>>> What's the right reference for ToASCII now?
>> 
>>> The closest thing would be sections 5.1 to 5.5 of RFC 5891,
> 
> Again, we are not looking for the actual operation, but for
> the validity check it provides. I think that therefore
> U-Labels at
> http://tools.ietf.org/html/rfc5890#section-2.3.2.1
> are the right point to reference.

Ok.   I suggested the operation only because ToASCII was used,
it is definitely an operation, and the question was for a
reference for ToASCII.
 
>>> but simply referencing them will lead to incompatibility
>>> (e.g. producing different A-labels from the IDNA2003
>>> version.)
> 
> Does it produce different A-labels? My understanding is that
> it produces either the same A-label or no A-label, with the
> very specific exceptions of the ς (final sigma) and ß
> (sharp-s) only.

That is correct unless we made a very serious mistake somewhere.
In some sense, the difference between IDNA2003 and IDNA2008 is
that the number of strings that can be processed to produce what
we now call A-labels has decreased significantly.  But that is
consistent with your "no A-label" case above.
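
As a rough illustration of the two exceptions discussed above, here is a minimal Python sketch. It assumes that Python's built-in "idna" codec follows IDNA2003 (RFC 3490), and it approximates the IDNA2008 result by Punycode-encoding the label unchanged, with none of the RFC 5891 validity checks. The variable names are made up for the example.

```python
# Sketch only: the standard-library "idna" codec implements IDNA2003
# (Nameprep, then Punycode). The IDNA2008 side is approximated by raw
# Punycode with an "xn--" prefix and performs no validity checking.

label = "faß"

# IDNA2003: Nameprep maps sharp-s to "ss", so the result is plain ASCII.
idna2003 = label.encode("idna")

# IDNA2008 (approximation): sharp-s is kept and Punycode-encoded as is.
idna2008_sketch = b"xn--" + label.encode("punycode")

print(idna2003)         # b'fass'
print(idna2008_sketch)  # b'xn--fa-hia'

# Final sigma: under IDNA2003 it is folded to the ordinary sigma, so the
# two spellings yield one and the same A-label.
sigma_2003 = "βόλος".encode("idna")
sigma_folded = "βόλοσ".encode("idna")
sigma_2008_sketch = b"xn--" + "βόλος".encode("punycode")

print(sigma_2003 == sigma_folded)       # True
print(sigma_2003 == sigma_2008_sketch)  # False
```

For labels outside these two cases, the expectation stated above is that the same A-label comes out, or none at all.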

>...
>>> http://unicode.org/reports/tr46/ details a good transition
>>> strategy, but I wonder how one could work that into iri-bis.
>> 
>> TR46 (which is not yet a stable reference since the text is
>> still under review and may change yet again), details a
>> transition strategy.  But it is one that does not have IETF
>> consensus, partially because it posits a much slower
>> transition to allow for circumstances that are either very
>> low frequency or that represented abuses even under
>> pre-IDNA2008 standards and best practices.   Let's not make
>> things more confusing by trying to reference it as if it were
>> the only reasonable approach to the situation.
> 
> See Michel's mail for some details. I think we have to look
> into whether and how we can use TR46 for describing additional
> normalization at least in the normalization section (some
> applications such as spiders prefer to normalize as
> aggressively as possible to reduce the possibility of fetching
> the same thing twice).

Well, RFC 5895 certainly permits normalizing as aggressively as
one likes; it is just quite deliberately not normative.  The
difficulty is that, once one moves beyond canonical
normalization (NFC or NFD), and becomes more aggressive, one
starts running into edge cases in which some names that users
and registrants believe are different become the same,
effectively making one of them completely inaccessible.  The two
cases that the IDNABIS WG quite deliberately created (final
sigma and sharp-s) after long debate are examples of this, but
so are the notorious dotless-i problem, a number of Han
characters that are safe to map away except when they are used
in personal names (the latter are not PVALID today, but it is
easy to imagine a strong case being made in the future for
reclassifying them), the Arabic and Farsi Yeh character, a
number of characters that represent numerals, and so on.
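
The collapse described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the mapping of RFC 5895 or of TR46: "aggressive" here just means NFKC followed by Unicode case folding, and the function names are hypothetical.

```python
import unicodedata


def conservative(label: str) -> str:
    """Canonical normalization only (NFC)."""
    return unicodedata.normalize("NFC", label)


def aggressive(label: str) -> str:
    """Illustrative aggressive mapping: NFKC plus full case folding."""
    return unicodedata.normalize("NFKC", label).casefold()


pairs = [
    ("straße", "strasse"),  # sharp-s versus "ss"
    ("βόλος", "βόλοσ"),     # final sigma versus ordinary sigma
]

for a, b in pairs:
    # Under NFC the two names stay distinct; under the aggressive
    # mapping they become the same string, so one becomes unreachable.
    print(a, b, conservative(a) == conservative(b), aggressive(a) == aggressive(b))

# Dotless i: an upper-case/lower-case round trip silently turns it
# into the dotted "i".
dotless_roundtrip = "ı".upper().lower()
print(dotless_roundtrip == "i")
```

A spider that only wants to avoid fetching the same thing twice can tolerate such collisions; a registry or a resolver cannot.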

For the IRI spec to assume or require aggressive mapping that
goes well beyond the very conservative assumptions of RFC 5895
(you will recall that much of the relevant descriptive text was
moved into what is now RFC 5894) risks creating disconnects and
inappropriate restrictions on user and registrant behavior and
on the future evolution of IDNA.

    john
Received on Thursday, 30 September 2010 13:33:12 UTC