RE: Unicode Normalization in XML 1.0 5e

For comparison, my proposed wording is at
http://lists.w3.org/Archives/Public/public-xml-core-wg/2009Apr/0019

I'd like to be able to finalize this by our next telcon.

I'd particularly like to ask that Richard and John review
the suggested wording below and let us know their opinion.

Of course, I'd like to hear from everyone.  I just figured
if I mentioned a couple names in particular, there was more
of a chance I'd get at least some response.

paul

> -----Original Message-----
> From: Phillips, Addison [mailto:addison@amazon.com] 
> Sent: Wednesday, 2009 May 20 21:50
> To: public-i18n-core@w3.org; Grosso, Paul; public-xml-core-wg@w3.org
> Cc: w3c-html-cg@w3.org
> Subject: RE: Unicode Normalization in XML 1.0 5e
> 
> Dear Paul,
> 
> The Internationalization Core WG discussed the problem with 
> normalization during our teleconference today [1] and, in 
> response to your email of 30 April [2], the WG decided the following.
> 
> We are okay with the general idea of issuing an erratum on 
> XML 1.0 5e to address this problem. However, we were not 
> quite satisfied with the wording proposed. We would like to 
> propose that you replace it with the following:
> 
> --
> _Unicode_ (rule C06) says that canonically equivalent 
> sequences of characters ought to be treated as identical. 
> However, XML _parsed entities_ (including _document 
> entities_) that are canonically equivalent according to 
> Unicode but which use distinct code point (character) 
> sequences are considered distinct by XML processors. 
> Therefore, all XML parsed entities SHOULD be created in a 
> "fully normalized" form per _[CharMod-Norm]_. Otherwise the 
> user might unknowingly create canonically equivalent but 
> unequal sequences that appear identical to the user but which 
> are treated as distinct by XML processors.
> 
> A document is still well-formed, even if it is not in a 
> normalized form. XML processors MAY verify that the document 
> being processed is in a fully-normalized form and report to 
> the application whether it is or not.
> --
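
[Illustrative aside, not part of the proposed wording. A minimal
Python sketch of the situation the text describes: two strings that
are canonically equivalent under Unicode yet differ as code point
sequences. The helper names are hypothetical. The NFC check covers
only the Unicode-normalized part of "fully normalized"; CharMod-Norm
imposes further conditions that this sketch does not test.]

```python
import unicodedata

# Two spellings of "e with acute accent" (chosen for illustration).
composed = "\u00e9"      # single precomposed code point U+00E9
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT U+0301


def same_code_points(a: str, b: str) -> bool:
    """Code-point-by-code-point comparison, with no normalization."""
    return a == b


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def is_nfc(text: str) -> bool:
    """True if the text is already in Normalization Form C.

    This is necessary, but not sufficient, for "fully normalized"
    in the CharMod-Norm sense.
    """
    return unicodedata.normalize("NFC", text) == text


if __name__ == "__main__":
    print(same_code_points(composed, decomposed))
    print(canonically_equivalent(composed, decomposed))
    print(is_nfc(composed), is_nfc(decomposed))
```

[A processor taking up the MAY clause could run a check like is_nfc()
and report the result to the application without rejecting the
document, since well-formedness is not affected.]
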
> 
> In our discussion of this issue, we are concerned about the 
> appropriate terminology to use here. We think that it may be 
> appropriate in some cases for content to be in a 
> non-normalized form. For example, one might have an element 
> <foo> that contains a single Unicode combining mark, like so:
> 
>   <foo>&#x301;</foo>
> 
> This sequence is not "fully normalized", but we think it is 
> both your and our intention that it be valid and that the 
> element 'foo' contain the character U+0301, even though 
> U+0301 is a combining mark. In considering our proposed text 
> above, we are concerned that the term "parsed entity" might 
> be too broad, if it is considered to include attribute and 
> element content (and not just the names of XML document 
> structures). Please consider this when implementing our 
> proposed text and/or advise us whether or not parsed entity 
> is the right choice for the meaning imputed here.
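
[Illustrative aside. A minimal Python sketch of the <foo> example,
using the standard library's xml.etree.ElementTree parser. It shows
that the document parses, that the element content is the single
character U+0301, and that this content is in NFC yet begins with a
combining mark. Treating "non-zero canonical combining class" as the
test for a composing character is an assumption made for this sketch;
it is only one part of the CharMod-Norm definition. The function name
is hypothetical.]

```python
import unicodedata
import xml.etree.ElementTree as ET

# The example document from the message above.
DOC = "<foo>&#x301;</foo>"


def starts_with_combining_mark(text: str) -> bool:
    """True if the first character has a non-zero canonical combining class.

    Assumption for this sketch: this approximates "begins with a
    composing character"; CharMod-Norm's definition is broader.
    """
    return bool(text) and unicodedata.combining(text[0]) != 0


# Parsing succeeds: the document is well-formed.
root = ET.fromstring(DOC)
content = root.text

# The character reference expands to the single code point U+0301.
is_single_u0301 = content == "\u0301"

# A lone combining mark is unchanged by NFC ...
is_nfc = unicodedata.normalize("NFC", content) == content

# ... but the element content still begins with a combining mark.
leading_mark = starts_with_combining_mark(content)

if __name__ == "__main__":
    print(root.tag, is_single_u0301, is_nfc, leading_mark)
```

[The same check could be applied to names only, or to names and
content, depending on how broadly "parsed entity" is meant to apply.]
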
> 
> Kind regards,
> 
> Addison (for I18N)
> 
> [1] http://www.w3.org/2009/05/06-core-minutes.html
> [2] http://lists.w3.org/Archives/Public/public-i18n-core/2009AprJun/0043.html
> 
> Addison Phillips
> Globalization Architect -- Lab126
> Chair -- W3C Internationalization WG
> 
> Internationalization is not a feature.
> It is an architecture.
> 
> 

Received on Thursday, 21 May 2009 13:04:04 UTC