Re: XML 5e, Unicode Normalization, and CharMod: Your thoughts sought...

Hello Addison,

On 2009/05/07 14:10, Phillips, Addison wrote:
> Dear Martin and François,
>
> Recently the I18N WG has been discussing with XML Core WG the problem of normalization in XML 1.0 5e. (François, you are probably aware of this conversation ;-) ). Internally [1] we are considering sending a response to the latest email from XML Core WG proposing text for a new minor version of XML 1.0.
>
> In particular, we are thinking of proposing this text:
>
> --
> Although _Unicode_ (rule C06) says that canonically equivalent sequences of characters ought to be treated as identical, XML _parsed entities_ (including _document entities_) that are canonically equivalent according to Unicode but which use distinct code point (character) sequences are considered distinct by XML processors. Therefore, all XML parsed entities SHOULD be "fully normalized" per _[CharMod-Norm]_. Otherwise, entities that appear to be identical can be treated as distinct, even though this might not be the intention of the user.

- I suggest splitting the first sentence into two: Unicode says... 
However, XML parsed entities...

- Full normalization is the right thing in general, but it is not always 
appropriate. In particular, we found some cases with SVG and fonts where 
it may not work. But that's okay, because the text says SHOULD rather 
than MUST.

- The problem that CharMod-Norm isn't in a stable state is still with us.
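
To make the first point of the proposed text concrete, here is a minimal sketch in Python, assuming only the standard unicodedata module. It shows two canonically equivalent spellings of the same letter comparing as distinct code point sequences, and then comparing equal after normalization. NFC is used only as an approximation: "fully normalized" in CharMod-Norm asks for more than NFC, and the helper name code_points is just for illustration.

```python
import unicodedata

# Two spellings of "é" that Unicode treats as canonically equivalent.
composed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT


def code_points(text):
    """Return the code points of text in U+XXXX notation (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Compared code point by code point, the two strings are distinct,
# which is how the proposed text describes XML processors behaving.
print(code_points(composed), code_points(decomposed))
print(composed == decomposed)

# After normalizing both to NFC (a stand-in for full normalization),
# they become the same code point sequence.
print(unicodedata.normalize("NFC", composed) == unicodedata.normalize("NFC", decomposed))
```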


> A document is still well-formed, even if it is not fully normalized. XML processors MAY verify that the document being processed is in fully-normalized form and report to the application whether it is or not.
> --
>
> I have been tasked [1] by I18N WG to ask you for your opinion on the foregoing text as well as the following conundrums:
>
> 1. Introducing this quite explicit text into XML would seem to make untenable requirement C312 in Charmod-Norm [4], which requires string identity matching to be done on normalized text, if the source of the comparison is an XML document's parsed entity.

As CharMod-Norm is still being worked on, we can (and will have to) 
adjust that as necessary. However, except for moving from MUST to 
SHOULD, I don't see a conflict. Note that step 1 of C312 says "MUST be 
performed by the producers of the strings to be compared".
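
A rough sketch of that division of labour, again in Python and again with NFC standing in for full normalization: the producer normalizes, and the comparison itself stays a plain code point comparison. The function names normalize_for_exchange and identity_match are hypothetical and are not taken from CharMod-Norm.

```python
import unicodedata


def normalize_for_exchange(text):
    """Producer side: normalize before the string is handed on.

    NFC is an assumption here, used as a stand-in for full normalization.
    """
    return unicodedata.normalize("NFC", text)


def identity_match(a, b):
    """Consumer side: plain code point comparison, no normalization."""
    return a == b


name_in_document = "re\u0301sume\u0301"   # decomposed spelling
name_in_query = "r\u00e9sum\u00e9"        # precomposed spelling

# Without producer-side normalization the match fails.
print(identity_match(name_in_document, name_in_query))

# With both producers normalizing first, the same comparison succeeds.
print(identity_match(normalize_for_exchange(name_in_document),
                     normalize_for_exchange(name_in_query)))
```

If the producers only SHOULD normalize, the comparison step does not change; it just cannot rely on its inputs being normalized.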

> 2. We think that "fully normalized" might be the right normalization to specify in XML, but would like to ensure that you agree.

I think so.

Regards,    Martin.

> Please let us know what you think.
>
> Kind regards (for I18N),
>
> Addison
>
> [1] http://www.w3.org/2009/05/06-core-minutes.html
> [2] http://lists.w3.org/Archives/Public/public-i18n-core/2009AprJun/0047.html
> [3] http://lists.w3.org/Archives/Public/public-i18n-core/2009AprJun/0043.html
> [4] http://www.w3.org/TR/charmod-norm/#sec-IdentityMatching
>
>
>
> Addison Phillips
> Globalization Architect -- Lab126
>
> Internationalization is not a feature.
> It is an architecture.
>
>

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp

Received on Friday, 8 May 2009 07:59:22 UTC