RE: Unicode Normalization in XML 1.0 5e from Phillips, Addison on 2009-04-30 (public-xml-core-wg@w3.org from April 2009)

From: Phillips, Addison <addison@amazon.com>
Date: Thu, 30 Apr 2009 09:42:18 -0700
To: "Grosso, Paul" <pgrosso@ptc.com>, "public-xml-core-wg@w3.org" <public-xml-core-wg@w3.org>
CC: "public-i18n-core@w3.org" <public-i18n-core@w3.org>, "w3c-html-cg@w3.org" <w3c-html-cg@w3.org>
Message-ID: <4D25F22093241741BC1D0EEBC2DBB1DA019FE3420D@EX-SEA5-D.ant.amazon.com>

Hello Paul & XML WG,

At our most recent teleconference [1], the Internationalization WG discussed your email below regarding normalization in XML. We have set aside time in our next teleconference (6 May 2009) to finalize a response for you.

Our initial reaction is that we are not quite satisfied with the proposed text: we think a stronger health warning is probably called for here and would like to suggest one. Also, please note that the reference(s) to CharMod need to be updated, as Martin Dürst kindly pointed out in [2].

Kind regards,

Addison (for I18N)

Addison Phillips
Globalization Architect -- Lab126
Chair -- W3C Internationalization WG

Internationalization is not a feature.
It is an architecture.


[1] http://www.w3.org/2009/04/29-core-minutes.html

[2] http://lists.w3.org/Archives/Public/public-i18n-core/2009AprJun/0037.html 
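
For readers unfamiliar with the canonical-equivalence issue discussed in the thread below, here is a minimal sketch in Python. It assumes only the standard-library unicodedata module; the helper name is_nfc is hypothetical and used purely for illustration. Note that an NFC check is only an approximation of "fully normalized" as CharMod defines it, which imposes additional conditions.

```python
import unicodedata

# Two canonically equivalent spellings of "é":
composed = "\u00e9"       # single code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT


def is_nfc(text):
    """Hypothetical helper: True if text is already in Unicode form NFC.

    This approximates, but does not fully implement, the CharMod notion
    of "fully normalized" text.
    """
    return unicodedata.normalize("NFC", text) == text


# A plain code point comparison treats the two spellings as different,
# which is how XML 1.0 compares names and values today.
print(composed == decomposed)

# After normalizing to NFC, the two spellings compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)

# A processor that chose to check normalization could report this
# status to the application without affecting well-formedness.
print(is_nfc(composed), is_nfc(decomposed))
```

A check of this kind reports on the input without altering it, which matches the spirit of the proposed note: a document that is not normalized is still well-formed.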


> -----Original Message-----
> From: Grosso, Paul [mailto:pgrosso@ptc.com]
> Sent: Wednesday, April 22, 2009 10:03 AM
> To: Phillips, Addison; public-xml-core-wg@w3.org
> Cc: public-i18n-core@w3.org; w3c-html-cg@w3.org
> Subject: RE: Unicode Normalization in XML 1.0 5e
> 
> Addison et al.,
> 
> Regarding this issue, the XML Core WG plans to issue
> an erratum to XML 1.0 5th Edition that adds a note
> as follows (where things delimited by underscores should
> be links to the appropriate definition or reference)
> to the end of section 2.2 Characters in XML 1.0:
> 
>  Note:
> 
>  All XML _parsed entities_ (including _document entities_) SHOULD
>  be fully normalized as per _[CharMod]_.
> 
>  However, a document is still well-formed even if it is not fully
>  normalized. XML processors MAY verify that the document being
>  processed is in fully normalized form and report to the
>  application whether it is or not.
> 
> Then we would also add to A.2 Other References in XML 1.0:
> 
>  Charmod
>     W3C. Character Model for the World Wide Web 1.0.
>     Martin J. Dürst, François Yergeau, Richard Ishida, Misha Wolf,
>     Tex Texin. (See http://www.w3.org/TR/2005/REC-charmod-20050215/.)
> 
> Please let us know if this resolution of your issue is acceptable.
> 
> regards,
> 
> paul
> 
> Paul Grosso for the XML Core WG
> 
> > -----Original Message-----
> > From: public-xml-core-wg-request@w3.org
> > [mailto:public-xml-core-wg-request@w3.org] On Behalf Of Grosso, Paul
> > Sent: Wednesday, 2009 March 11 11:32
> > To: Phillips, Addison; public-xml-core-wg@w3.org
> > Cc: public-i18n-core@w3.org; w3c-html-cg@w3.org
> > Subject: RE: Unicode Normalization in XML 1.0 5e
> >
> > Addison et al.,
> >
> > The XML Core WG has discussed your message during several
> > telcons, and we are still in the process of determining
> > just what we might do in response.
> >
> > At this time, we are quite sure we do not want to change
> > the XML spec so that canonical equivalents could be treated
> > as identical directly in XML.  Aside from being a serious
> > change to parser behavior, this would make some previously
> > ill-formed (non-XML) documents well-formed XML as well as
> > make some previously well-formed XML ill-formed (non-XML).
> >
> > We are also pretty sure it would be a good idea to add at
> > least a note to the XML 1.0 spec saying that XML producers
> > SHOULD produce normalized output.
> >
> > We are considering whether we should add (some version of)
> > what the XML 1.1 spec says about normalization checking [1]
> > to the XML 1.0 spec.  We haven't made a decision here yet,
> > and given our biweekly telcon schedule and the upcoming AC
> > meeting, we are not likely to do so until some time in April.
> >
> > I will, of course, let you know when we have a further status
> > update to give you.
> >
> > regards,
> >
> > paul
> >
> > for the XML Core WG
> >
> > [1] http://www.w3.org/TR/xml11/#sec-normalization-checking
> >
> > > -----Original Message-----
> > > From: public-xml-core-wg-request@w3.org
> > > [mailto:public-xml-core-wg-request@w3.org] On Behalf Of
> > > Phillips, Addison
> > > Sent: Wednesday, 2009 February 25 0:17
> > > To: public-xml-core-wg@w3.org
> > > Cc: public-i18n-core@w3.org; w3c-html-cg@w3.org
> > > Subject: Unicode Normalization in XML 1.0 5e
> > >
> > > Dear XML Core WG,
> > >
> > > I am writing on behalf of both the Internationalization Core
> > > WG and the HTML Coordination Group (HCG).
> > >
> > > Recently there has been an extensive discussion of
> > > normalization in W3C specifications, mainly related to
> > > handling of element and attribute names and values (as in
> > > CSS3 Selectors). Some of this discussion revolves around how
> > > Unicode normalization should work with XML and XML-derived
> > > specifications, hence I was actioned by HCG [0] to contact
> > > you folks.
> > >
> > > I produced a general summary of the Unicode normalization
> > > problem at [1] for the HCG. Those unfamiliar with Unicode
> > > normalization may wish to review that message.
> > >
> > > The basic question is whether XML can (or should?) take a
> > > clearer stance on Unicode normalization. At present, XML 1.0
> > > 5e, like its predecessors, does not require any particular
> > > normalization form; it says nothing about whether canonical
> > > equivalents in Unicode are "equal" from an XML point of view;
> > > and thus implies that Unicode canonical equivalence does
> > > *not* apply when considering an XML document's formation. The
> > > recommendations in Appendix J (which does include
> > > normalization among its suggestions) further suggest that
> > > this is true.
> > >
> > > On the other hand, it seems reasonable to suppose that
> > > Unicode canonical equivalence might apply to XML. Processes
> > > such as transcoding legacy charsets to Unicode might result
> > > in canonically-equivalent-but-unequal code point sequences,
> > > for example.
> > >
> > > At I18N's behest, our Unicode liaison (Mark Davis) produced
> > > a survey of content on the Web, as well as a summary on
> > > performance [2], which found that 99.98% of Web HTML
> > > content was, in fact, in Unicode form NFC. It seems
> > > reasonable to suppose that XML content and documents would
> > > follow a similar pattern.
> > >
> > > Our questions to XML Core WG, thus, are:
> > >
> > >    What, precisely, should XML say with regard to Unicode
> > > canonical equivalence?
> > >
> > >    Would it be possible to require or allow canonical
> > > equivalents to be treated as identical directly in XML (and
> > > not merely as a side effect of other specifications)?
> > >
> > >    Is there a problem if XML permits/requires
> > > canonically-equivalent-yet-different sequences to be treated
> > > as distinct if other specifications require/allow canonical
> > > equivalence to be recognized?
> > >
> > > The Internationalization Core WG would be happy to work with
> > > you on these thorny issues. Please advise if you need more
> > > information, consultation, participation, or just need to
> > > vent :-).
> > >
> > > Kind Regards,
> > >
> > > Addison (for I18N/HCG)
> > >
> > >
> > > [0] http://lists.w3.org/Archives/Member/w3c-html-cg/2009JanMar/0061.html
> > >     See ACTION-29
> > > [1] http://lists.w3.org/Archives/Public/public-i18n-core/2009JanMar/0259.html
> > > [2] http://www.macchiato.com/unicode/nfc-faq
> > >
> > >
> > > Addison Phillips
> > > Globalization Architect -- Lab126
> > > Chair -- W3C Internationalization WG
> > >
> > > Internationalization is not a feature.
> > > It is an architecture.
> > >
> > >
> > >
> >
> >
Received on Thursday, 30 April 2009 16:42:56 UTC