RE: Unicode Normalization in XML 1.0 5e

Addison,

Will you be sending any further input in time for the XML Core WG
telcon this Wednesday?

paul 

> -----Original Message-----
> From: Phillips, Addison [mailto:addison@amazon.com] 
> Sent: Thursday, 2009 April 30 13:00
> To: Grosso, Paul; public-xml-core-wg@w3.org
> Cc: public-i18n-core@w3.org; w3c-html-cg@w3.org
> Subject: RE: Unicode Normalization in XML 1.0 5e
> 
> Hi Paul,
> 
> Thanks for the note and additional background information. 
> It's quite helpful. 
> 
> We appreciate the restrictions on XML 1.0. I think our 
> concern is not that we want new features, but rather to 
> clarify specifically what the "old features" are and, as 
> necessary, provide useful health warnings for end users.
> 
> The unpleasant question before our WG is this: given that XML
> considers two (Unicode) "canonically equivalent" elements
> represented by different code point sequences to be distinct,
> how or when should we encourage or insist that other specs
> built upon XML normalize items for string identity
> operations? Clearly specs like XPath are "in trouble" if
> they normalize (they can't select certain distinct elements
> individually) and "in trouble" in a different way if they
> don't (a user's request in one place, even though
> canonically equivalent to that in the XML document being
> processed, doesn't match).
> 
> This suggests that early normalization is a requirement for 
> certain kinds of XML operation to be reliable, an idea that 
> is already unpopular with implementers :-).
> 
> Regards,
> 
> Addison
> 
> Addison Phillips
> Globalization Architect -- Lab126
> 
> Internationalization is not a feature.
> It is an architecture.
> 
> 
> > -----Original Message-----
> > From: Grosso, Paul [mailto:pgrosso@ptc.com]
> > Sent: Thursday, April 30, 2009 10:35 AM
> > To: Phillips, Addison; public-xml-core-wg@w3.org
> > Cc: public-i18n-core@w3.org; w3c-html-cg@w3.org
> > Subject: RE: Unicode Normalization in XML 1.0 5e
> > 
> > Addison,
> > 
> > Thanks for your update.  Please allow me to give you some
> > more background for your WG's discussion.
> > 
> > Whereas XML 1.1 does include a normalization checking
> > option, we cannot add such a feature to XML 1.0.  At
> > http://www.w3.org/TR/xml11/#sec-normalization-checking
> > XML 1.1 starts with a sentence that is basically the
> > first paragraph of the note we propose below (with a
> > reference to CharMod).
> > 
> > Then it follows with what is basically the first sentence
> > of the second paragraph of the note proposed below.  That
> > paragraph in XML 1.1 goes on to talk about a user option.
> > 
> > Our staff contact has informed us that we cannot do something
> > that is effectively introducing a new feature into the
> > language, and a user option is a new feature.  Hence the
> > rest of the second paragraph in our proposed note suggests
> > that processors may do what, in XML 1.1, is allowed by
> > user option.
> > 
> > regards,
> > 
> > paul
> > 
> > > -----Original Message-----
> > > From: Phillips, Addison [mailto:addison@amazon.com]
> > > Sent: Thursday, 2009 April 30 11:42
> > > To: Grosso, Paul; public-xml-core-wg@w3.org
> > > Cc: public-i18n-core@w3.org; w3c-html-cg@w3.org
> > > Subject: RE: Unicode Normalization in XML 1.0 5e
> > >
> > > Hello Paul & XML WG,
> > >
> > > At our most recent teleconference [1], the
> > > Internationalization WG discussed your email below regarding
> > > normalization in XML. We have scheduled time in our next
> > > teleconference (scheduled for 6 May 2009) to finalize a
> > > response for you.
> > >
> > > Our initial reaction is that we are not quite satisfied with
> > > the proposed text: we think a stronger health warning is
> > > probably called for here and would like to suggest one. Also,
> > > please note that the reference(s) to CharMod need to be
> > > updated, as Martin Dürst kindly pointed out in [2].
> > >
> > > Kind regards,
> > >
> > > Addison (for I18N)
> > >
> > > Addison Phillips
> > > Globalization Architect -- Lab126
> > > Chair -- W3C Internationalization WG
> > >
> > > Internationalization is not a feature.
> > > It is an architecture.
> > >
> > >
> > > [1] http://www.w3.org/2009/04/29-core-minutes.html
> > > [2] http://lists.w3.org/Archives/Public/public-i18n-core/2009AprJun/0037.html
> > >
> > >
> > > > -----Original Message-----
> > > > From: Grosso, Paul [mailto:pgrosso@ptc.com]
> > > > Sent: Wednesday, April 22, 2009 10:03 AM
> > > > To: Phillips, Addison; public-xml-core-wg@w3.org
> > > > Cc: public-i18n-core@w3.org; w3c-html-cg@w3.org
> > > > Subject: RE: Unicode Normalization in XML 1.0 5e
> > > >
> > > > Addison et al.,
> > > >
> > > > Regarding this issue, the XML Core WG plans to issue
> > > > an erratum to XML 1.0 5th Edition that adds a note
> > > > as follows (where things delimited by underscores should
> > > > be links to the appropriate definition or reference)
> > > > to the end of section 2.2 Characters in XML 1.0:
> > > >
> > > >  Note:
> > > >
> > > >  All XML _parsed entities_ (including _document entities_) SHOULD
> > > >  be fully normalized as per _[CharMod]_.
> > > >
> > > >  However, a document is still well-formed even if it is not fully
> > > >  normalized. XML processors MAY verify that the document being
> > > >  processed is in fully normalized form and report to the application
> > > >  whether it is or not.
> > > >
> > > > Then we would also add to A.2 Other References in XML 1.0:
> > > >
> > > >  Charmod
> > > >     W3C. Character Model for the World Wide Web 1.0.
> > > >     Martin J. Dürst, François Yergeau, Richard Ishida, Misha Wolf,
> > > >     Tex Texin. (See http://www.w3.org/TR/2005/REC-charmod-20050215/.)
> > > >
> > > > Please let us know if this resolution of your issue is acceptable.
> > > >
> > > > regards,
> > > >
> > > > paul
> > > >
> > > > Paul Grosso for the XML Core WG
> > > >
> > > > > -----Original Message-----
> > > > > From: public-xml-core-wg-request@w3.org
> > > > > [mailto:public-xml-core-wg-request@w3.org] On Behalf Of Grosso, Paul
> > > > > Sent: Wednesday, 2009 March 11 11:32
> > > > > To: Phillips, Addison; public-xml-core-wg@w3.org
> > > > > Cc: public-i18n-core@w3.org; w3c-html-cg@w3.org
> > > > > Subject: RE: Unicode Normalization in XML 1.0 5e
> > > > >
> > > > > Addison et al.,
> > > > >
> > > > > The XML Core WG has discussed your message during several
> > > > > telcons, and we are still in the process of determining
> > > > > just what we might do in response.
> > > > >
> > > > > At this time, we are quite sure we do not want to change
> > > > > the XML spec so that canonical equivalents could be treated
> > > > > as identical directly in XML.  Aside from being a serious
> > > > > change to parser behavior, this would make some previously
> > > > > ill-formed (non-XML) documents well-formed XML as well as
> > > > > make some previously well-formed XML ill-formed (non-XML).
> > > > >
> > > > > We are also pretty sure it would be a good idea to add at
> > > > > least a note to the XML 1.0 spec saying that XML producers
> > > > > SHOULD produce normalized output.
> > > > >
> > > > > We are considering whether we should add (some version of)
> > > > > what the XML 1.1 spec says about normalization checking [1]
> > > > > to the XML 1.0 spec.  We haven't made a decision here yet,
> > > > > and given our biweekly telcon schedule and the upcoming AC
> > > > > meeting, we are not likely to do so until some time in April.
> > > > >
> > > > > I will, of course, let you know when we have a further status
> > > > > update to give you.
> > > > >
> > > > > regards,
> > > > >
> > > > > paul
> > > > >
> > > > > for the XML Core WG
> > > > >
> > > > > [1] http://www.w3.org/TR/xml11/#sec-normalization-checking
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: public-xml-core-wg-request@w3.org
> > > > > > [mailto:public-xml-core-wg-request@w3.org] On Behalf Of
> > > > > > Phillips, Addison
> > > > > > Sent: Wednesday, 2009 February 25 0:17
> > > > > > To: public-xml-core-wg@w3.org
> > > > > > Cc: public-i18n-core@w3.org; w3c-html-cg@w3.org
> > > > > > Subject: Unicode Normalization in XML 1.0 5e
> > > > > >
> > > > > > Dear XML Core WG,
> > > > > >
> > > > > > I am writing on behalf of both the Internationalization Core
> > > > > > WG and the HTML Coordination Group (HCG).
> > > > > >
> > > > > > Recently there has been an extensive discussion of
> > > > > > normalization in W3C specifications, mainly related to
> > > > > > handling of element and attribute names and values (as in
> > > > > > CSS3 Selectors). Some of this discussion revolves around how
> > > > > > Unicode normalization should work with XML and XML-derived
> > > > > > specifications, hence I was actioned by HCG [0] to contact
> > > > > > you folks.
> > > > > >
> > > > > > I produced a general summary of the Unicode normalization
> > > > > > problem at [1] for the HCG. Those unfamiliar with Unicode
> > > > > > normalization may wish to review that message.
> > > > > >
> > > > > > The basic question is whether XML can (or should?) take a
> > > > > > clearer stance on Unicode normalization. At present, XML 1.0
> > > > > > 5e, like its predecessors, does not require any particular
> > > > > > normalization form; it says nothing about whether canonical
> > > > > > equivalents in Unicode are "equal" from an XML point of view;
> > > > > > and thus implies that Unicode canonical equivalence does
> > > > > > *not* apply when considering an XML document's formation. The
> > > > > > recommendations in Appendix J (which does include
> > > > > > normalization among its suggestions) further suggest that
> > > > > > this is true.
> > > > > >
> > > > > > On the other hand, it seems reasonable to suppose that
> > > > > > Unicode canonical equivalence might apply to XML. Processes
> > > > > > such as transcoding legacy charsets to Unicode might result
> > > > > > in canonically-equivalent-but-unequal code point sequences,
> > > > > > for example.
> > > > > >
> > > > > > At I18N's behest, our Unicode liaison (Mark Davis)
> > > > > > produced a survey of Web content, as well as a summary
> > > > > > on performance [2], which found that 99.98% of Web
> > > > > > HTML content was, in fact, in Unicode form NFC. It seems
> > > > > > reasonable to suppose that XML content and documents would
> > > > > > follow a similar pattern.
> > > > > >
> > > > > > Our questions to XML Core WG, thus, are:
> > > > > >
> > > > > >    What, precisely, should XML say with regard to Unicode
> > > > > > canonical equivalence?
> > > > > >
> > > > > >    Would it be possible to require or allow canonical
> > > > > > equivalents to be treated as identical directly in XML (and
> > > > > > not merely as a side effect of other specifications)?
> > > > > >
> > > > > >    Is there a problem if XML permits/requires
> > > > > > canonically-equivalent-yet-different sequences to be treated
> > > > > > as distinct if other specifications require/allow canonical
> > > > > > equivalence to be recognized?
> > > > > >
> > > > > > The Internationalization Core WG would be happy to work with
> > > > > > you on these thorny issues. Please advise if you need more
> > > > > > information, consultation, participation, or just need to
> > > > > > vent :-).
> > > > > >
> > > > > > Kind Regards,
> > > > > >
> > > > > > Addison (for I18N/HCG)
> > > > > >
> > > > > >
> > > > > > [0] http://lists.w3.org/Archives/Member/w3c-html-cg/2009JanMar/0061.html
> > > > > >     See ACTION-29
> > > > > > [1] http://lists.w3.org/Archives/Public/public-i18n-core/2009JanMar/0259.html
> > > > > > [2] http://www.macchiato.com/unicode/nfc-faq
> > > > > >
> > > > > >
> > > > > > Addison Phillips
> > > > > > Globalization Architect -- Lab126
> > > > > > Chair -- W3C Internationalization WG
> > > > > >
> > > > > > Internationalization is not a feature.
> > > > > > It is an architecture.
> 
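To make the matching problem discussed in this thread concrete, here is
a minimal Python sketch. It is not taken from any of the messages above.
It shows two canonically equivalent spellings of the same word comparing
as distinct, a lookup that misses unless both sides are normalized first,
and a simple "is this normalized?" report of the kind the proposed note
says processors MAY provide. The names COMPOSED, DECOMPOSED, is_nfc and
select, and the sample <item name="..."/> markup, are made up for
illustration. The check covers NFC only, which is narrower than what
CharMod calls "fully normalized".

```python
import unicodedata
import xml.etree.ElementTree as ET

COMPOSED = "caf\u00e9"      # "é" as one precomposed code point (NFC)
DECOMPOSED = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT


def is_nfc(text):
    """Report whether text is already in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


def select(root, value, normalize=False):
    """Return <item> elements whose name attribute matches value.

    normalize=False compares code point for code point, which is how
    XML itself treats strings. normalize=True converts both sides to
    NFC before comparing (an assumed "early normalization" policy).
    """
    def key(s):
        return unicodedata.normalize("NFC", s) if normalize else s

    wanted = key(value)
    return [el for el in root.iter("item") if key(el.get("name", "")) == wanted]


# The document uses the decomposed spelling; the request uses the composed one.
doc = ET.fromstring(f'<root><item name="{DECOMPOSED}"/></root>')

print(COMPOSED == DECOMPOSED)                      # False
print(len(select(doc, COMPOSED)))                  # 0
print(len(select(doc, COMPOSED, normalize=True)))  # 1
print(is_nfc(COMPOSED), is_nfc(DECOMPOSED))        # True False
```

Normalizing at comparison time fixes the missed match, but it also means
two items that differ only in normalization form can no longer be
selected separately, which is the other half of the trade-off described
in the thread.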

Received on Monday, 18 May 2009 15:33:20 UTC