RE: Unicode Normalization in XML 1.0 5e from Phillips, Addison on 2009-04-30 (public-xml-core-wg@w3.org from April 2009)

From: Phillips, Addison <addison@amazon.com>
Date: Thu, 30 Apr 2009 11:00:06 -0700
To: "Grosso, Paul" <pgrosso@ptc.com>, "public-xml-core-wg@w3.org" <public-xml-core-wg@w3.org>
CC: "public-i18n-core@w3.org" <public-i18n-core@w3.org>, "w3c-html-cg@w3.org" <w3c-html-cg@w3.org>
Message-ID: <4D25F22093241741BC1D0EEBC2DBB1DA019FE343A6@EX-SEA5-D.ant.amazon.com>
Hi Paul,

Thanks for the note and additional background information. It's quite helpful. 

We appreciate the restrictions on XML 1.0. I think our concern is not that we want new features, but rather to clarify specifically what the "old features" are and, as necessary, provide useful health warnings for end users.

The unpleasant task before our WG is this: given that XML treats two (Unicode) "canonically equivalent" items represented by different code point sequences as distinct, how or when should we encourage or insist that other specs built upon XML normalize items for string identity operations? Clearly specs like XPath are "in trouble" if they normalize (they can't select certain distinct elements individually) and "in trouble" in a different way if they don't (a user's request made in one place doesn't match the XML document being processed, even though the two are canonically equivalent).
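
To make the two kinds of "trouble" concrete, here is a minimal sketch in Python. It is not taken from any spec: the `menu`/`item` document and the helpers `exact_match` and `normalized_match` are hypothetical, and attribute values stand in for any string-identity operation.

```python
import unicodedata
import xml.etree.ElementTree as ET

COMPOSED = "caf\u00e9"     # "é" as the single code point U+00E9
DECOMPOSED = "cafe\u0301"  # "e" followed by U+0301 COMBINING ACUTE ACCENT

# Hypothetical document: two items whose names are canonically equivalent
# but are represented by different code point sequences.
doc = ET.fromstring(
    f'<menu><item id="a" name="{COMPOSED}"/>'
    f'<item id="b" name="{DECOMPOSED}"/></menu>'
)

def exact_match(root, name):
    """Select items by code point identity, as XML itself compares strings."""
    return [el.get("id") for el in root.iter("item") if el.get("name") == name]

def normalized_match(root, name):
    """Select items after converting both sides to NFC (assumed policy)."""
    target = unicodedata.normalize("NFC", name)
    return [
        el.get("id")
        for el in root.iter("item")
        if unicodedata.normalize("NFC", el.get("name", "")) == target
    ]

# Without normalization, a request for the composed form misses item "b".
print(exact_match(doc, COMPOSED))
# With normalization, "a" and "b" can no longer be selected separately.
print(normalized_match(doc, COMPOSED))
```

Neither helper is "right": the first silently misses an equivalent item, and the second cannot tell apart two items that XML itself considers distinct.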

This suggests that early normalization is a requirement for certain kinds of XML operation to be reliable, an idea that is already unpopular with implementers :-).
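
As a rough sketch of what "early normalization" could look like in practice, assuming NFC is the chosen form: the helper `normalize_early` and the sample document are hypothetical, and an NFC check is only an approximation of CharMod's "fully normalized", which imposes further conditions.

```python
import unicodedata
import xml.etree.ElementTree as ET

def normalize_early(text: str) -> str:
    """Hypothetical helper: convert text to NFC once, where it enters the system."""
    return unicodedata.normalize("NFC", text)

# Decomposed input (e + U+0301), e.g. the result of transcoding a legacy charset.
incoming = '<menu><item name="cafe\u0301"/></menu>'

# A processor could report whether the incoming document was already in NFC.
already_nfc = unicodedata.is_normalized("NFC", incoming)  # requires Python 3.8+

# Normalize the serialized document once, before it is parsed or stored.
stored = normalize_early(incoming)
root = ET.fromstring(stored)

# The user's request is normalized at its own entry point as well.
query = normalize_early("caf\u00e9")

# From here on, plain code point comparison is sufficient.
matches = [el for el in root.iter("item") if el.get("name") == query]
print(already_nfc, len(matches))
```

The cost is that every producer and every entry point has to take part, which is presumably why the idea is unpopular.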

Regards,

Addison

Addison Phillips
Globalization Architect -- Lab126

Internationalization is not a feature.
It is an architecture.


> -----Original Message-----
> From: Grosso, Paul [mailto:pgrosso@ptc.com]
> Sent: Thursday, April 30, 2009 10:35 AM
> To: Phillips, Addison; public-xml-core-wg@w3.org
> Cc: public-i18n-core@w3.org; w3c-html-cg@w3.org
> Subject: RE: Unicode Normalization in XML 1.0 5e
> 
> Addison,
> 
> Thanks for your update.  Please allow me to give you some
> more background for your WG's discussion.
> 
> Whereas XML 1.1 does include a normalization checking
> option, we cannot add such a feature to XML 1.0.  At
> http://www.w3.org/TR/xml11/#sec-normalization-checking
> XML 1.1 starts with a sentence that is basically the
> first paragraph of the note we propose below (with a
> reference to CharMod).
> 
> Then it follows with what is basically the first sentence
> of the second paragraph of the note proposed below.  That
> paragraph in XML 1.1 goes on to talk about a user option.
> 
> Our staff contact has informed us that we cannot do something
> that is effectively introducing a new feature into the
> language, and a user option is a new feature.  Hence the
> rest of the second paragraph in our proposed note suggests
> that processors may do what, in XML 1.1, is allowed by
> user option.
> 
> regards,
> 
> paul
> 
> > -----Original Message-----
> > From: Phillips, Addison [mailto:addison@amazon.com]
> > Sent: Thursday, 2009 April 30 11:42
> > To: Grosso, Paul; public-xml-core-wg@w3.org
> > Cc: public-i18n-core@w3.org; w3c-html-cg@w3.org
> > Subject: RE: Unicode Normalization in XML 1.0 5e
> >
> > Hello Paul & XML WG,
> >
> > At our most recent teleconference [1], the
> > Internationalization WG discussed your email below regarding
> > normalization in XML. We have scheduled time in our next
> > teleconference (scheduled for 6 May 2009) to finalize a
> > response for you.
> >
> > Our initial reaction is that we are not quite satisfied with
> > the proposed text: we think a stronger health warning is
> > probably called for here and would like to suggest one. Also,
> > please note that the reference(s) to CharMod need to be
> > updated, as Martin Dürst kindly pointed out in [2].
> >
> > Kind regards,
> >
> > Addison (for I18N)
> >
> > Addison Phillips
> > Globalization Architect -- Lab126
> > Chair -- W3C Internationalization WG
> >
> > Internationalization is not a feature.
> > It is an architecture.
> >
> >
> > [1] http://www.w3.org/2009/04/29-core-minutes.html
> > [2] http://lists.w3.org/Archives/Public/public-i18n-core/2009AprJun/0037.html
> >
> >
> > > -----Original Message-----
> > > From: Grosso, Paul [mailto:pgrosso@ptc.com]
> > > Sent: Wednesday, April 22, 2009 10:03 AM
> > > To: Phillips, Addison; public-xml-core-wg@w3.org
> > > Cc: public-i18n-core@w3.org; w3c-html-cg@w3.org
> > > Subject: RE: Unicode Normalization in XML 1.0 5e
> > >
> > > Addison et al.,
> > >
> > > Regarding this issue, the XML Core WG plans to issue
> > > an erratum to XML 1.0 5th Edition that adds a note
> > > as follows (where things delimited by underscores should
> > > be links to the appropriate definition or reference)
> > > to the end of section 2.2 Characters in XML 1.0:
> > >
> > >  Note:
> > >
> > >  All XML _parsed entities_ (including _document entities_) SHOULD
> > >  be fully normalized as per _[CharMod]_.
> > >
> > >  However, a document is still well-formed even if it is not fully
> > >  normalized. XML processors MAY verify that the document being
> > >  processed is in fully normalized form and report to the application
> > >  whether it is or not.
> > >
> > > Then we would also add to A.2 Other References in XML 1.0:
> > >
> > >  Charmod
> > >     W3C. Character Model for the World Wide Web 1.0.
> > >     Martin J. Dürst, François Yergeau, Richard Ishida, Misha Wolf,
> > >     Tex Texin. (See http://www.w3.org/TR/2005/REC-charmod-20050215/.)
> > >
> > > Please let us know if this resolution of your issue is acceptable.
> > >
> > > regards,
> > >
> > > paul
> > >
> > > Paul Grosso for the XML Core WG
> > >
> > > > -----Original Message-----
> > > > From: public-xml-core-wg-request@w3.org
> > > > [mailto:public-xml-core-wg-request@w3.org] On Behalf Of Grosso, Paul
> > > > Sent: Wednesday, 2009 March 11 11:32
> > > > To: Phillips, Addison; public-xml-core-wg@w3.org
> > > > Cc: public-i18n-core@w3.org; w3c-html-cg@w3.org
> > > > Subject: RE: Unicode Normalization in XML 1.0 5e
> > > >
> > > > Addison et al.,
> > > >
> > > > The XML Core WG has discussed your message during several
> > > > telcons, and we are still in the process of determining
> > > > just what we might do in response.
> > > >
> > > > At this time, we are quite sure we do not want to change
> > > > the XML spec so that canonical equivalents could be treated
> > > > as identical directly in XML.  Aside from being a serious
> > > > change to parser behavior, this would make some previously
> > > > ill-formed (non-XML) documents well-formed XML as well as
> > > > make some previously well-formed XML ill-formed (non-XML).
> > > >
> > > > We are also pretty sure it would be a good idea to add at
> > > > least a note to the XML 1.0 spec saying that XML producers
> > > > SHOULD produce normalized output.
> > > >
> > > > We are considering whether we should add (some version of)
> > > > what the XML 1.1 spec says about normalization checking [1]
> > > > to the XML 1.0 spec.  We haven't made a decision here yet,
> > > > and given our biweekly telcon schedule and the upcoming AC
> > > > meeting, we are not likely to do so until some time in April.
> > > >
> > > > I will, of course, let you know when we have a further status
> > > > update to give you.
> > > >
> > > > regards,
> > > >
> > > > paul
> > > >
> > > > for the XML Core WG
> > > >
> > > > [1] http://www.w3.org/TR/xml11/#sec-normalization-checking
> > > >
> > > > > -----Original Message-----
> > > > > From: public-xml-core-wg-request@w3.org
> > > > > [mailto:public-xml-core-wg-request@w3.org] On Behalf Of
> > > > > Phillips, Addison
> > > > > Sent: Wednesday, 2009 February 25 0:17
> > > > > To: public-xml-core-wg@w3.org
> > > > > Cc: public-i18n-core@w3.org; w3c-html-cg@w3.org
> > > > > Subject: Unicode Normalization in XML 1.0 5e
> > > > >
> > > > > Dear XML Core WG,
> > > > >
> > > > > I am writing on behalf of both the Internationalization Core
> > > > > WG and the HTML Coordination Group (HCG).
> > > > >
> > > > > Recently there has been an extensive discussion of
> > > > > normalization in W3C specifications, mainly related to
> > > > > handling of element and attribute names and values (as in
> > > > > CSS3 Selectors). Some of this discussion revolves around how
> > > > > Unicode normalization should work with XML and XML-derived
> > > > > specifications, hence I was actioned by HCG [0] to contact
> > > > > you folks.
> > > > >
> > > > > I produced a general summary of the Unicode normalization
> > > > > problem at [1] for the HCG. Those unfamiliar with Unicode
> > > > > normalization may wish to review that message.
> > > > >
> > > > > The basic question is whether XML can (or should?) take a
> > > > > clearer stance on Unicode normalization. At present, XML 1.0
> > > > > 5e, like its predecessors, does not require any particular
> > > > > normalization form; it says nothing about whether canonical
> > > > > equivalents in Unicode are "equal" from an XML point of view;
> > > > > and thus implies that Unicode canonical equivalence does
> > > > > *not* apply when considering an XML document's formation. The
> > > > > recommendations in Appendix J (which does include
> > > > > normalization among its suggestions) further suggest that
> > > > > this is true.
> > > > >
> > > > > On the other hand, it seems reasonable to suppose that
> > > > > Unicode canonical equivalence might apply to XML. Processes
> > > > > such as transcoding legacy charsets to Unicode might result
> > > > > in canonically-equivalent-but-unequal code point sequences,
> > > > > for example.
> > > > >
> > > > > At I18N's behest, our Unicode liaison (Mark Davis) produced
> > > > > a survey of Web content, as well as a summary on
> > > > > performance [2], which found that 99.98% of Web HTML content
> > > > > was, in fact, in Unicode form NFC. It seems reasonable to
> > > > > suppose that XML content and documents would follow a
> > > > > similar pattern.
> > > > >
> > > > > Our questions to XML Core WG, thus, are:
> > > > >
> > > > >    What, precisely, should XML say with regard to Unicode
> > > > > canonical equivalence?
> > > > >
> > > > >    Would it be possible to require or allow canonical
> > > > > equivalents to be treated as identical directly in XML (and
> > > > > not merely as a side effect of other specifications)?
> > > > >
> > > > >    Is there a problem if XML permits/requires
> > > > > canonically-equivalent-yet-different sequences to be treated
> > > > > as distinct if other specifications require/allow canonical
> > > > > equivalence to be recognized?
> > > > >
> > > > > The Internationalization Core WG would be happy to work with
> > > > > you on these thorny issues. Please advise if you need more
> > > > > information, consultation, participation, or just need to
> > > > > vent :-).
> > > > >
> > > > > Kind Regards,
> > > > >
> > > > > Addison (for I18N/HCG)
> > > > >
> > > > >
> > > > > [0] http://lists.w3.org/Archives/Member/w3c-html-cg/2009JanMar/0061.html
> > > > >     See ACTION-29
> > > > > [1] http://lists.w3.org/Archives/Public/public-i18n-core/2009JanMar/0259.html
> > > > > [2] http://www.macchiato.com/unicode/nfc-faq
> > > > >
> > > > >
> > > > > Addison Phillips
> > > > > Globalization Architect -- Lab126
> > > > > Chair -- W3C Internationalization WG
> > > > >
> > > > > Internationalization is not a feature.
> > > > > It is an architecture.
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> >
Received on Thursday, 30 April 2009 18:00:50 UTC