RE: Unicode Normalization in XML 1.0 5e

 

> -----Original Message-----
> From: Phillips, Addison [mailto:addison@amazon.com] 
> Sent: Wednesday, 2009 May 20 21:50
> To: public-i18n-core@w3.org; Grosso, Paul; public-xml-core-wg@w3.org
> Cc: w3c-html-cg@w3.org
> Subject: RE: Unicode Normalization in XML 1.0 5e
> 
> Dear Paul,
> 
> The Internationalization Core WG discussed the problem with 
> normalization during our teleconference today [1] and, in 
> response to your email of 30 April [2], the WG decided the following.
> 
> We are okay with the general idea of issuing an erratum on 
> XML 1.0 5e to address this problem. However, we were not 
> quite satisfied with the wording proposed. We would like to 
> propose that you replace it with the following:
> 
> --
> _Unicode_ (rule C06) says that canonically equivalent 
> sequences of characters ought to be treated as identical. 
> However, XML _parsed entities_ (including _document 
> entities_) that are canonically equivalent according to 
> Unicode but which use distinct code point (character) 
> sequences are considered distinct by XML processors. 
> Therefore, all XML parsed entities SHOULD be created in a 
> "fully normalized" form per _[CharMod-Norm]_. Otherwise the 
> user might unknowingly create canonically equivalent but 
> unequal sequences that appear identical to the user but which 
> are treated as distinct by XML processors.
> 
> A document is still well-formed, even if it is not in a 
> normalized form. XML processors MAY verify that the document 
> being processed is in a fully-normalized form and report to 
> the application whether it is or not.
> --

I've asked the XML Core WG to review your proposed wording.
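
As an illustration of the first paragraph of the proposed note, here is
a minimal Python sketch (not part of the original exchange). It assumes
Python's standard xml.etree.ElementTree parser and unicodedata module;
the element name "word" and the sample text are made up for the example.
It parses two documents whose content is canonically equivalent but
spelled with different code point sequences, and shows that the parsed
text compares unequal until both are converted to NFC.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Hypothetical sample documents: same visible text, different code points.
composed = "<word>caf\u00e9</word>"      # e-acute as one code point, U+00E9
decomposed = "<word>cafe\u0301</word>"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

text_a = ET.fromstring(composed).text
text_b = ET.fromstring(decomposed).text

# The parser hands back the code points as written, so the strings differ.
distinct_to_parser = text_a != text_b

# After NFC normalization the two strings are identical.
equal_after_nfc = (
    unicodedata.normalize("NFC", text_a) == unicodedata.normalize("NFC", text_b)
)

print(distinct_to_parser, equal_after_nfc)
```

The comparison is done on element content only; the same mismatch would
apply to names and attribute values, since they are also part of the
parsed entity.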

> 
> In our discussion of this issue, we are concerned about the 
> appropriate terminology to use here. We think that it may be 
> appropriate in some cases for content to be in a 
> non-normalized form. For example, one might have an element 
> <foo> that contains a single Unicode combining mark, like so:
> 
>   <foo>&#x301;</foo>
> 
> This sequence is not "fully normalized", but we think it is 
> both your and our intention that it be valid and that the 
> element 'foo' contain the character U+0301, even though 
> U+0301 is a combining mark.

Substituting "well-formed" for "valid" above, yes, that is
certainly our intention.  

However, we also intend the note to imply that such content
would not be considered fully normalized, so an XML processor
that is verifying that the document is in fully-normalized
form would report, in this case, that it is not.
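
The distinction can be sketched in Python (an illustration only, not
part of the original exchange). The helper names below are hypothetical,
and the check is a deliberate simplification: it treats text as "fully
normalized" when it is in NFC and does not begin with a character whose
canonical combining class is non-zero. The actual definition in
[CharMod-Norm] and XML 1.1 is broader (it also covers include
normalization and a slightly larger set of composing characters), so
this is an approximation under those stated assumptions.

```python
import unicodedata
import xml.etree.ElementTree as ET


def starts_with_combining_mark(text):
    # Approximation: a non-zero canonical combining class marks a
    # combining character. The spec's "composing character" set is wider.
    return bool(text) and unicodedata.combining(text[0]) != 0


def looks_fully_normalized(text):
    # Hypothetical helper: NFC plus "does not start with a combining mark".
    is_nfc = unicodedata.normalize("NFC", text) == text
    return is_nfc and not starts_with_combining_mark(text)


# The document from the quoted example parses without error, and the
# element content is the single character U+0301.
element = ET.fromstring("<foo>&#x301;</foo>")
content = element.text

print(content == "\u0301")              # content is exactly U+0301
print(looks_fully_normalized(content))  # the simplified check rejects it
```

Note that a lone U+0301 is already in NFC, so an NFC-only test would not
flag this content; it is the leading combining mark that makes the
simplified check report it as not fully normalized.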

> In considering our proposed text 
> above, we are concerned that the term "parsed entity" might 
> be too broad, if it is considered to include attribute and 
> element content (and not just the names of XML document 
> structures).

"Parsed entity" is clearly defined in the XML spec, and yes, 
it includes all the content.

Please see
http://www.w3.org/TR/xml11/#sec-normalization-checking
for what the XML 1.1 spec says in this area.  Note, it clearly
includes all text (replacement text, character data, etc.).

It is our intention that the note we are adding to XML 1.0
match what XML 1.1 says about normalization (except that we
cannot require a user option as XML 1.1 does).

paul

> Please consider this when implementing our 
> proposed text and/or advise us whether or not parsed entity 
> is the right choice for the meaning imputed here.
> 
> Kind regards,
> 
> Addison (for I18N)
> 
> [1] http://www.w3.org/2009/05/06-core-minutes.html
> [2] http://lists.w3.org/Archives/Public/public-i18n-core/2009AprJun/0043.html
> 
> Addison Phillips
> Globalization Architect -- Lab126
> Chair -- W3C Internationalization WG
> 
> Internationalization is not a feature.
> It is an architecture.
> 
> 

Received on Thursday, 21 May 2009 13:17:26 UTC