RE: Unicode Normalization in XML 1.0 5e

Addison et al.,

The XML Core WG has discussed your message during several
telcons, and we are still in the process of determining
just what we might do in response.

At this time, we are quite sure we do not want to change
the XML spec so that canonical equivalents could be treated 
as identical directly in XML.  Aside from being a serious
change to parser behavior, this would make some previously
ill-formed (non-XML) documents well-formed XML as well as
make some previously well-formed XML ill-formed (non-XML).
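
To illustrate both directions, here is a minimal Python
sketch. It assumes the expat-based parser in the Python
standard library as a stand-in for a conforming XML 1.0
processor; the name "café" and the element "doc" are made
up for the example.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Two canonically equivalent spellings of the same (made-up) name.
NFC_NAME = unicodedata.normalize("NFC", "caf\u00e9")  # ends in U+00E9
NFD_NAME = unicodedata.normalize("NFD", "caf\u00e9")  # ends in "e" + U+0301


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Start tag and end tag differ only in normalization form. Today the
# names do not match, so this is not XML; if canonical equivalents
# were treated as identical, it would become well-formed.
mismatched_tags = f"<{NFC_NAME}></{NFD_NAME}>"

# Two attributes whose names differ only in normalization form. Today
# they are distinct attributes; if canonical equivalents were treated
# as identical, this would be a duplicate attribute, hence not XML.
distinct_attrs = f'<doc {NFC_NAME}="1" {NFD_NAME}="2"/>'

if __name__ == "__main__":
    print(is_well_formed(mismatched_tags))
    print(is_well_formed(distinct_attrs))
```

With parsers as they behave today, the first document is
rejected and the second is accepted; the change under
discussion would reverse both results.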

We are also pretty sure it would be a good idea to add at
least a note to the XML 1.0 spec saying that XML producers
SHOULD produce normalized output.
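
As a rough sketch of what that could mean for a producer,
the Python below normalizes names and character data to NFC
before serializing. The helper names are hypothetical, and
NFC is an assumption here; the note itself has not been
drafted.

```python
import unicodedata
import xml.etree.ElementTree as ET


def nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


def normalize_tree(root: ET.Element) -> None:
    """Normalize names, attribute values and character data in place."""
    for element in root.iter():
        # Comments and processing instructions have a non-string tag.
        if isinstance(element.tag, str):
            element.tag = nfc(element.tag)
        if element.text:
            element.text = nfc(element.text)
        if element.tail:
            element.tail = nfc(element.tail)
        normalized = {nfc(k): nfc(v) for k, v in element.attrib.items()}
        element.attrib.clear()
        element.attrib.update(normalized)


def serialize_normalized(root: ET.Element) -> str:
    """Serialize root after normalizing each string it holds."""
    normalize_tree(root)
    return ET.tostring(root, encoding="unicode")
```

Normalizing each string separately, rather than the whole
serialized document, avoids one pitfall: a combining
character at the start of content can compose with the
preceding markup character (">" followed by U+0338 composes
to U+226F under NFC), which would damage the markup.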

We are considering whether we should add (some version of)
what the XML 1.1 spec says about normalization checking [1]
to the XML 1.0 spec.  We haven't made a decision here yet,
and given our biweekly telcon schedule and the upcoming AC
meeting, we are not likely to do so until some time in April.
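
For a sense of what such checking involves, here is an
approximate Python sketch. It is much simpler than the
definitions in [1]: it tests only for NFC and for a leading
character with a non-zero canonical combining class, which
is an assumed stand-in for the XML 1.1 notion of a
"composing character". The function names are hypothetical.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """Return True if text is already in NFC (needs Python 3.8+)."""
    return unicodedata.is_normalized("NFC", text)


def starts_with_combining(text: str) -> bool:
    """Return True if the first character has a non-zero combining class."""
    return bool(text) and unicodedata.combining(text[0]) != 0


def passes_simple_check(text: str) -> bool:
    """Approximate check: NFC, and no leading combining character."""
    return is_nfc(text) and not starts_with_combining(text)
```

The second test matters because a string can be in NFC by
itself and still stop being normalized once it is placed
after other text.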

I will, of course, let you know when we have a further status
update to give you.

regards,

paul

for the XML Core WG

[1] http://www.w3.org/TR/xml11/#sec-normalization-checking

> -----Original Message-----
> From: public-xml-core-wg-request@w3.org 
> [mailto:public-xml-core-wg-request@w3.org] On Behalf Of 
> Phillips, Addison
> Sent: Wednesday, 2009 February 25 0:17
> To: public-xml-core-wg@w3.org
> Cc: public-i18n-core@w3.org; w3c-html-cg@w3.org
> Subject: Unicode Normalization in XML 1.0 5e
> 
> Dear XML Core WG,
> 
> I am writing on behalf of both the Internationalization Core 
> WG and the HTML Coordination Group (HCG).
> 
> Recently there has been an extensive discussion of 
> normalization in W3C specifications, mainly related to 
> handling of element and attribute names and values (as in 
> CSS3 Selectors). Some of this discussion revolves around how 
> Unicode normalization should work with XML and XML-derived 
> specifications, hence I was actioned by HCG [0] to contact you folks.
> 
> I produced a general summary of the Unicode normalization 
> problem at [1] for the HCG. Those unfamiliar with Unicode 
> normalization may wish to review that message.
> 
> The basic question is whether XML can (or should?) take a 
> clearer stance on Unicode normalization. At present, XML 1.0 
> 5e, like its predecessors, does not require any particular 
> normalization form; it says nothing about whether canonical 
> equivalents in Unicode are "equal" from an XML point of view; 
> and thus implies that Unicode canonical equivalence does 
> *not* apply when considering an XML document's formation. The 
> recommendations in Appendix J (which does include 
> normalization among its suggestions) further suggest that 
> this is true.
> 
> On the other hand, it seems reasonable to suppose that 
> Unicode canonical equivalence might apply to XML. Processes 
> such as transcoding legacy charsets to Unicode might result 
> in canonically-equivalent-but-unequal code point sequences, 
> for example. 
> 
> At I18N's behest, our Unicode liaison (Mark Davis) produced 
> a survey of Web content, as well as a summary on 
> performance [2]. The survey found that 99.98% of Web HTML 
> content was, in fact, in Unicode form NFC. It seems 
> reasonable to suppose that XML content and documents would 
> follow a similar pattern. 
> 
> Our questions to XML Core WG, thus, are:
> 
>    What, precisely, should XML say with regard to Unicode 
> canonical equivalence?
> 
>    Would it be possible to require or allow canonical 
> equivalents to be treated as identical directly in XML (and 
> not merely as a side effect of other specifications)?
> 
>    Is there a problem if XML permits/requires 
> canonically-equivalent-yet-different sequences to be treated 
> as distinct if other specifications require/allow canonical 
> equivalence to be recognized?
> 
> The Internationalization Core WG would be happy to work with 
> you on these thorny issues. Please advise if you need more 
> information, consultation, participation, or just need to vent :-).
> 
> Kind Regards,
> 
> Addison (for I18N/HCG)
> 
> 
> [0] 
> http://lists.w3.org/Archives/Member/w3c-html-cg/2009JanMar/0061.html
>     See ACTION-29
> [1] 
> http://lists.w3.org/Archives/Public/public-i18n-core/2009JanMar/0259.html
> [2] http://www.macchiato.com/unicode/nfc-faq
> 
> 
> Addison Phillips
> Globalization Architect -- Lab126
> Chair -- W3C Internationalization WG
> 
> Internationalization is not a feature.
> It is an architecture.
> 
> 
> 

Received on Wednesday, 11 March 2009 16:33:53 UTC