RE: Unicode Normalization in XML 1.0 5e

Addison et al.,

Regarding this issue, the XML Core WG plans to issue 
an erratum to XML 1.0 5th Edition that adds a note
as follows (where things delimited by underscores should 
be links to the appropriate definition or reference) 
to the end of section 2.2 Characters in XML 1.0:

 Note:

 All XML _parsed entities_ (including _document entities_) SHOULD
 be fully normalized as per _[CharMod]_.

 However, a document is still well-formed even if it is not fully
 normalized. XML processors MAY verify that the document being
 processed is in fully normalized form and report to the application
 whether it is or not.

Then we would also add to A.2 Other References in XML 1.0:

 CharMod
    W3C. Character Model for the World Wide Web 1.0.
    Martin J. Dürst, François Yergeau, Richard Ishida, Misha Wolf,
    Tex Texin. (See http://www.w3.org/TR/2005/REC-charmod-20050215/.)
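
As an illustration of the optional check the note allows, here is a
minimal sketch in Python. It is only an assumption about how a
processor might do this, not proposed spec text. It tests plain
Unicode NFC on already-decoded text, which is necessary but not
sufficient for "fully normalized" as CharMod defines it (CharMod also
covers character escapes, inclusions, and constructs that begin with
a composing character). The helper names is_nfc and report_entity are
hypothetical.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """Return True if text is already in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


def report_entity(text: str) -> str:
    """Hypothetical reporting hook: it informs the application only.

    Well-formedness is not affected, so nothing is rejected here.
    """
    return "normalized" if is_nfc(text) else "not normalized"


# Two canonically equivalent but unequal code point sequences.
composed = "<name>\u00e9</name>"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "<name>e\u0301</name>"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

print(report_entity(composed))
print(report_entity(decomposed))
```

Both strings would remain well-formed XML under the proposed note;
only the report handed to the application differs.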

Please let us know if this resolution of your issue is acceptable.

regards,

paul

Paul Grosso for the XML Core WG

> -----Original Message-----
> From: public-xml-core-wg-request@w3.org 
> [mailto:public-xml-core-wg-request@w3.org] On Behalf Of Grosso, Paul
> Sent: Wednesday, 2009 March 11 11:32
> To: Phillips, Addison; public-xml-core-wg@w3.org
> Cc: public-i18n-core@w3.org; w3c-html-cg@w3.org
> Subject: RE: Unicode Normalization in XML 1.0 5e
> 
> Addison et al.,
> 
> The XML Core WG has discussed your message during several
> telcons, and we are still in the process of determining
> just what we might do in response.
> 
> At this time, we are quite sure we do not want to change
> the XML spec so that canonical equivalents could be treated 
> as identical directly in XML.  Aside from being a serious
> change to parser behavior, this would make some previously
> ill-formed (non-XML) documents well-formed XML as well as
> make some previously well-formed XML ill-formed (non-XML).
> 
> We are also pretty sure it would be a good idea to add at
> least a note to the XML 1.0 spec saying that XML producers
> SHOULD produce normalized output.
> 
> We are considering whether we should add (some version of)
> what the XML 1.1 spec says about normalization checking [1]
> to the XML 1.0 spec.  We haven't made a decision here yet,
> and given our biweekly telcon schedule and the upcoming AC
> meeting, we are not likely to do so until some time in April.
> 
> I will, of course, let you know when we have a further status
> update to give you.
> 
> regards,
> 
> paul
> 
> for the XML Core WG
> 
> [1] http://www.w3.org/TR/xml11/#sec-normalization-checking
> 
> > -----Original Message-----
> > From: public-xml-core-wg-request@w3.org 
> > [mailto:public-xml-core-wg-request@w3.org] On Behalf Of 
> > Phillips, Addison
> > Sent: Wednesday, 2009 February 25 0:17
> > To: public-xml-core-wg@w3.org
> > Cc: public-i18n-core@w3.org; w3c-html-cg@w3.org
> > Subject: Unicode Normalization in XML 1.0 5e
> > 
> > Dear XML Core WG,
> > 
> > I am writing on behalf of both the Internationalization Core 
> > WG and the HTML Coordination Group (HCG).
> > 
> > Recently there has been an extensive discussion of 
> > normalization in W3C specifications, mainly related to 
> > handling of element and attribute names and values (as in 
> > CSS3 Selectors). Some of this discussion revolves around how 
> > Unicode normalization should work with XML and XML-derived 
> > specifications, hence I was actioned by HCG [0] to contact 
> > you folks.
> > 
> > I produced a general summary of the Unicode normalization 
> > problem at [1] for the HCG. Those unfamiliar with Unicode 
> > normalization may wish to review that message.
> > 
> > The basic question is whether XML can (or should?) take a 
> > clearer stance on Unicode normalization. At present, XML 1.0 
> > 5e, like its predecessors, does not require any particular 
> > normalization form; it says nothing about whether canonical 
> > equivalents in Unicode are "equal" from an XML point of view; 
> > and thus implies that Unicode canonical equivalence does 
> > *not* apply when considering an XML document's formation. The 
> > recommendations in Appendix J (which does include 
> > normalization among its suggestions) further suggest that 
> > this is true.
> > 
> > On the other hand, it seems reasonable to suppose that 
> > Unicode canonical equivalence might apply to XML. Processes 
> > such as transcoding legacy charsets to Unicode might result 
> > in canonically-equivalent-but-unequal code point sequences, 
> > for example. 
> > 
> > At I18N's behest, our Unicode liaison (Mark Davis) produced 
> > a survey of the content of the Web, as well as a 
> > summary on performance [2], which found that 99.98% of Web 
> > HTML content was, in fact, in Unicode form NFC. It seems 
> > reasonable to suppose that XML content and documents would 
> > follow a similar pattern. 
> > 
> > Our questions to XML Core WG, thus, are:
> > 
> >    What, precisely, should XML say with regard to Unicode 
> > canonical equivalence?
> > 
> >    Would it be possible to require or allow canonical 
> > equivalents to be treated as identical directly in XML (and 
> > not merely as a side effect of other specifications)?
> > 
> >    Is there a problem if XML permits/requires 
> > canonically-equivalent-yet-different sequences to be treated 
> > as distinct if other specifications require/allow canonical 
> > equivalence to be recognized?
> > 
> > The Internationalization Core WG would be happy to work with 
> > you on these thorny issues. Please advise if you need more 
> > information, consultation, participation, or just need to vent :-).
> > 
> > Kind Regards,
> > 
> > Addison (for I18N/HCG)
> > 
> > 
> > [0] 
> > http://lists.w3.org/Archives/Member/w3c-html-cg/2009JanMar/0061.html
> >     See ACTION-29
> > [1] 
> > http://lists.w3.org/Archives/Public/public-i18n-core/2009JanMar/0259.html
> > [2] http://www.macchiato.com/unicode/nfc-faq
> > 
> > 
> > Addison Phillips
> > Globalization Architect -- Lab126
> > Chair -- W3C Internationalization WG
> > 
> > Internationalization is not a feature.
> > It is an architecture.
> > 
> > 
> > 
> 
> 

Received on Wednesday, 22 April 2009 17:04:48 UTC