XML 5e, Unicode Normalization, and CharMod: Your thoughts sought...

Dear Martin and François,

Recently the I18N WG has been discussing the problem of normalization in XML 1.0 5e with the XML Core WG. (François, you are probably aware of this conversation ;-) ). Internally [1] we are considering sending a response to the latest email from the XML Core WG proposing text for a new minor version of XML 1.0.

In particular, we are thinking of proposing this text:

--
Although _Unicode_ (rule C06) says that canonically equivalent sequences of characters ought to be treated as identical, XML _parsed entities_ (including _document entities_) that are canonically equivalent according to Unicode but which use distinct code point (character) sequences are considered distinct by XML processors. Therefore, all XML parsed entities SHOULD be "fully normalized" per _[CharMod-Norm]_. Otherwise, entities that appear to be identical can be treated as distinct, even though this might not be the intention of the user.

A document is still well-formed, even if it is not fully normalized. XML processors MAY verify that the document being processed is in fully-normalized form and report to the application whether it is or not.
--
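
For illustration only (this is not part of the proposed text), the following Python sketch shows the behaviour the first paragraph describes, using the standard `unicodedata` and `xml.etree.ElementTree` modules. It assumes NFC as a stand-in for "fully normalized", which CharMod-Norm defines more strictly; the element name and the helper `is_nfc` are hypothetical.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Two canonically equivalent spellings of the same name (hypothetical example).
composed = "caf\u00e9"      # ends in U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT


def is_nfc(text: str) -> bool:
    """Return True if text is already in NFC.

    Assumption: NFC only approximates "fully normalized", which
    CharMod-Norm defines more strictly.
    """
    return unicodedata.normalize("NFC", text) == text


# Parse a document whose element name uses the decomposed spelling.
doc = f"<root><{decomposed}/></root>"
root = ET.fromstring(doc)
tag = root[0].tag

# The parser reports the name code point for code point, so comparing it
# with the composed spelling fails even though the two look identical.
names_match_raw = (tag == composed)
names_match_nfc = (unicodedata.normalize("NFC", tag) == composed)

# The document is well-formed (it parsed), yet it is not in NFC.
doc_is_normalized = is_nfc(doc)

print(names_match_raw, names_match_nfc, doc_is_normalized)
```

The last check corresponds loosely to the MAY in the second paragraph: a processor could report the normalization status to the application without rejecting the document.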

I have been tasked [1] by the I18N WG to ask you for your opinion on the foregoing text as well as on the following conundrums:

1. Requirement C312 in Charmod-Norm [4] requires string identity matching to be done on normalized text. Introducing this quite explicit text into XML would seem to make that requirement untenable whenever the source of the comparison is an XML document's parsed entity.

2. We think that "fully normalized" might be the right normalization to specify in XML, but would like to ensure that you agree.
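
To make the tension in item 1 concrete, here is a minimal Python sketch contrasting the two comparison strategies. The function names are hypothetical, and NFC is assumed as an approximation of the normalization that Charmod-Norm has in mind.

```python
import unicodedata


def match_code_points(a: str, b: str) -> bool:
    # Comparison as the proposed XML text describes it:
    # code point by code point, with no normalization step.
    return a == b


def match_normalized(a: str, b: str) -> bool:
    # Comparison on normalized text, approximated here with NFC.
    nfc_a = unicodedata.normalize("NFC", a)
    nfc_b = unicodedata.normalize("NFC", b)
    return nfc_a == nfc_b


angstrom_sign = "\u212b"  # U+212B ANGSTROM SIGN
a_with_ring = "\u00c5"    # U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE

raw_result = match_code_points(angstrom_sign, a_with_ring)
normalized_result = match_normalized(angstrom_sign, a_with_ring)

print(raw_result, normalized_result)
```

The two strategies give different answers for the same pair of strings, which is the conflict we would like your opinion on.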

Please let us know what you think.

Kind regards (for I18N),

Addison

[1] http://www.w3.org/2009/05/06-core-minutes.html

[2] http://lists.w3.org/Archives/Public/public-i18n-core/2009AprJun/0047.html

[3] http://lists.w3.org/Archives/Public/public-i18n-core/2009AprJun/0043.html

[4] http://www.w3.org/TR/charmod-norm/#sec-IdentityMatching 



Addison Phillips
Globalization Architect -- Lab126

Internationalization is not a feature.
It is an architecture.

Received on Thursday, 7 May 2009 05:11:30 UTC