RE: Unicode Normalization in XML 1.0 5e from Phillips, Addison on 2009-05-21 (public-i18n-core@w3.org from April to June 2009)

From: Phillips, Addison <addison@amazon.com>
Date: Wed, 20 May 2009 19:50:26 -0700
To: "public-i18n-core@w3.org" <public-i18n-core@w3.org>, "Grosso, Paul" <pgrosso@ptc.com>, "public-xml-core-wg@w3.org" <public-xml-core-wg@w3.org>
CC: "w3c-html-cg@w3.org" <w3c-html-cg@w3.org>
Message-ID: <4D25F22093241741BC1D0EEBC2DBB1DA01A08CA21E@EX-SEA5-D.ant.amazon.com>

Dear Paul,

The Internationalization Core WG discussed the problem with normalization during our teleconference today [1] and, in response to your email of 30 April [2], the WG decided the following.

We are okay with the general idea of issuing an erratum on XML 1.0 5e to address this problem. However, we were not quite satisfied with the wording proposed. We would like to propose that you replace it with the following:

--
_Unicode_ (rule C06) says that canonically equivalent sequences of characters ought to be treated as identical. However, XML _parsed entities_ (including _document entities_) that are canonically equivalent according to Unicode but which use distinct code point (character) sequences are considered distinct by XML processors. Therefore, all XML parsed entities SHOULD be created in a "fully normalized" form per _[CharMod-Norm]_. Otherwise the user might unknowingly create canonically equivalent but unequal sequences that appear identical to the user but which are treated as distinct by XML processors.

A document is still well-formed, even if it is not in a normalized form. XML processors MAY verify that the document being processed is in a fully-normalized form and report to the application whether it is or not.
--
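
To illustrate the situation the proposed text describes, here is a minimal Python sketch. It assumes the standard library's `xml.etree.ElementTree` (an expat-based, non-validating parser) stands in for "an XML processor"; the names `COMPOSED`, `DECOMPOSED`, `child_names` and `is_nfc` are made up for this example. NFC is used only as a convenient stand-in check, since it is narrower than "fully normalized" as defined in [CharMod-Norm].

```python
import unicodedata
import xml.etree.ElementTree as ET

COMPOSED = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
DECOMPOSED = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT


def child_names(xml_text):
    """Return the element names of the root's children, as parsed."""
    return [child.tag for child in ET.fromstring(xml_text)]


def is_nfc(text):
    """True if text is already in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


# Two child elements whose names look identical when rendered.
doc = "<root><{0}/><{1}/></root>".format(COMPOSED, DECOMPOSED)
first, second = child_names(doc)

# The parser compares names code point by code point.
print("equal as parsed:", first == second)

# Unicode considers the two names canonically equivalent.
print("equal after NFC:",
      unicodedata.normalize("NFC", first) == unicodedata.normalize("NFC", second))

# A processor that chose to verify normalization could report this.
print("already NFC:", is_nfc(first), is_nfc(second))
```

The document parses without error in both forms, which matches the statement that a non-normalized document is still well-formed. The `is_nfc` check is the kind of optional report to the application that the second paragraph of the proposed text allows.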

In our discussion of this issue, we were concerned about the appropriate terminology to use here. We think that it may be appropriate in some cases for content to be in a non-normalized form. For example, one might have an element <foo> that contains a single Unicode combining mark, like so:

<foo>&#x301;</foo>

This sequence is not "fully normalized", but we think that both you and we intend it to be valid, with the element 'foo' containing the character U+0301 even though U+0301 is a combining mark. In considering our proposed text above, we are concerned that the term "parsed entity" might be too broad if it is taken to include attribute and element content (and not just the names of XML document structures). Please consider this when implementing our proposed text and/or advise us whether or not "parsed entity" is the right choice for the meaning intended here.
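
The 'foo' case can be inspected with a short Python sketch, again using `xml.etree.ElementTree` as a stand-in XML processor. The helpers `starts_with_combining_mark` and `describe_content` are hypothetical names for this example. Treating "non-zero canonical combining class" as the test for a leading combining mark is a simplifying assumption, not the full [CharMod-Norm] definition of a composing character.

```python
import unicodedata
import xml.etree.ElementTree as ET


def starts_with_combining_mark(text):
    """Approximation: first character has a non-zero canonical combining class."""
    return bool(text) and unicodedata.combining(text[0]) != 0


def describe_content(xml_text):
    """Parse a one-element document and describe its character content."""
    content = ET.fromstring(xml_text).text or ""
    return {
        "code_points": ["U+%04X" % ord(ch) for ch in content],
        "is_nfc": unicodedata.normalize("NFC", content) == content,
        "starts_with_combining_mark": starts_with_combining_mark(content),
    }


# Element 'foo' whose only content is U+0301 COMBINING ACUTE ACCENT.
report = describe_content("<foo>&#x301;</foo>")
print(report)
```

Under these assumptions the content is reported as the single code point U+0301. It is unchanged by NFC, yet it begins with a combining mark, which is the property that keeps it from being "fully normalized" even though the document is well-formed.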

Kind regards,

Addison (for I18N)

[1] http://www.w3.org/2009/05/06-core-minutes.html

[2] http://lists.w3.org/Archives/Public/public-i18n-core/2009AprJun/0043.html

Addison Phillips
Globalization Architect -- Lab126
Chair -- W3C Internationalization WG

Internationalization is not a feature.
It is an architecture.

Received on Thursday, 21 May 2009 02:51:04 UTC