Unicode Normalization in XML 1.0 5e from Phillips, Addison on 2009-02-25 (public-xml-core-wg@w3.org from February 2009)

From: Phillips, Addison <addison@amazon.com>
Date: Tue, 24 Feb 2009 22:17:13 -0800
To: "public-xml-core-wg@w3.org" <public-xml-core-wg@w3.org>
CC: "public-i18n-core@w3.org" <public-i18n-core@w3.org>, "w3c-html-cg@w3.org" <w3c-html-cg@w3.org>
Message-ID: <4D25F22093241741BC1D0EEBC2DBB1DA019DDAD22D@EX-SEA5-D.ant.amazon.com>

Dear XML Core WG,

I am writing on behalf of both the Internationalization Core WG and the HTML Coordination Group (HCG).

Recently there has been an extensive discussion of normalization in W3C specifications, mainly related to handling of element and attribute names and values (as in CSS3 Selectors). Some of this discussion revolves around how Unicode normalization should work with XML and XML-derived specifications, hence I was actioned by HCG [0] to contact you folks.

I produced a general summary of the Unicode normalization problem at [1] for the HCG. Those unfamiliar with Unicode normalization may wish to review that message.

The basic question is whether XML can (or should?) take a clearer stance on Unicode normalization. At present, XML 1.0 5e, like its predecessors, does not require any particular normalization form; it says nothing about whether canonical equivalents in Unicode are "equal" from an XML point of view; and thus implies that Unicode canonical equivalence does *not* apply when considering an XML document's formation. The recommendations in Appendix J (which does include normalization among its suggestions) further suggest that this is true.

On the other hand, it seems reasonable to suppose that Unicode canonical equivalence might apply to XML. Processes such as transcoding legacy charsets to Unicode might result in canonically-equivalent-but-unequal code point sequences, for example.
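
To make "canonically-equivalent-but-unequal" concrete, here is a minimal sketch in Python using the standard `unicodedata` module. The sample strings and the helper name `canonically_equivalent` are illustrative assumptions, not anything defined by XML or by the discussion cited here.

```python
import unicodedata

# Two spellings of the same text. A transcoder might emit either one.
precomposed = "caf\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Code point by code point, the two strings differ...
print(precomposed == decomposed)                        # False
# ...but they are canonically equivalent under Unicode normalization.
print(canonically_equivalent(precomposed, decomposed))  # True
```

A comparison that looks only at code points, which is what XML currently implies, sees two different strings here.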

At I18N's behest, our Unicode liaison (Mark Davis) produced a survey of Web content, as well as a summary on performance [2]. The survey found that 99.98% of Web HTML content was, in fact, in Unicode form NFC. It seems reasonable to suppose that XML content and documents would follow a similar pattern.

Our questions to XML Core WG, thus, are:

What, precisely, should XML say with regard to Unicode canonical equivalence?

Would it be possible to require or allow canonical equivalents to be treated as identical directly in XML (and not merely as a side effect of other specifications)?

Is there a problem if XML permits/requires canonically-equivalent-yet-different sequences to be treated as distinct if other specifications require/allow canonical equivalence to be recognized?
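
As a rough illustration of what is at stake in these questions, the sketch below feeds a small document to Python's standard `xml.etree.ElementTree` parser. The document and its element names are made up for the example, and the behavior shown is that of this one parser, not a statement of what XML requires. The two child elements have names that are canonically equivalent but spelled with different code points.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Hypothetical document: the first child uses precomposed U+00E9, the second
# uses "e" followed by U+0301 COMBINING ACUTE ACCENT.
doc = "<root><caf\u00e9/><cafe\u0301/></root>"

root = ET.fromstring(doc)
first, second = (child.tag for child in root)

# The parser applies no normalization, so the names compare as reported.
names_equal_as_parsed = first == second

# An application that wants canonical equivalence must normalize itself.
names_equal_after_nfc = (
    unicodedata.normalize("NFC", first) == unicodedata.normalize("NFC", second)
)

print(names_equal_as_parsed, names_equal_after_nfc)
```

If a specification layered on XML (a selector language, say) matched names under canonical equivalence while the parser kept them distinct, the two layers would disagree about whether these are the same element name.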

The Internationalization Core WG would be happy to work with you on these thorny issues. Please advise if you need more information, consultation, participation, or just need to vent :-).

Kind Regards,

Addison (for I18N/HCG)

[0] http://lists.w3.org/Archives/Member/w3c-html-cg/2009JanMar/0061.html (see ACTION-29)
[1] http://lists.w3.org/Archives/Public/public-i18n-core/2009JanMar/0259.html
[2] http://www.macchiato.com/unicode/nfc-faq

Addison Phillips
Globalization Architect -- Lab126
Chair -- W3C Internationalization WG

Internationalization is not a feature.
It is an architecture.

Received on Wednesday, 25 February 2009 06:17:55 UTC