Re: Unicode Normalization in XML 1.0 5e

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Sun, 26 Apr 2009 19:38:15 +0900
Message-ID: <49F43997.4040900@it.aoyama.ac.jp>
To: "Grosso, Paul" <pgrosso@ptc.com>
CC: "Phillips, Addison" <addison@amazon.com>, public-xml-core-wg@w3.org, public-i18n-core@w3.org, w3c-html-cg@w3.org

Hello Paul,

This looks good, except that instead of
http://www.w3.org/TR/2005/REC-charmod-20050215/, the reference has to be 
to http://www.w3.org/TR/charmod-norm/ (which is still a WD).

Regards,   Martin.

On 2009/04/23 2:03, Grosso, Paul wrote:
> Addison et al.,
> Regarding this issue, the XML Core WG plans to issue
> an erratum to XML 1.0 5th Edition that adds a note
> as follows (where things delimited by underscores should
> be links to the appropriate definition or reference)
> to the end of section 2.2 Characters in XML 1.0:
>   Note:
>   All XML _parsed entities_ (including _document entities_) SHOULD
>   be fully normalized as per _[CharMod]_.
>   However, a document is still well-formed even if it is not fully
>   normalized. XML processors MAY verify that the document being
>   processed is in fully normalized form and report to the application
>   whether it is or not.
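
The optional check described in the note could be sketched roughly as follows. This is only an illustration under stated assumptions: it tests plain NFC using Python's standard `unicodedata` module, which is one ingredient of "fully normalized" but not the complete CharMod definition, and the helper name `is_nfc` is hypothetical.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """Return True if text is already in Unicode Normalization Form C.

    Assumption: NFC only. This does not cover the remaining conditions
    of "fully normalized" text as defined by the Character Model.
    """
    return unicodedata.normalize("NFC", text) == text


# The same empty element, spelled with a precomposed U+00E9 and with
# "e" followed by U+0301 COMBINING ACUTE ACCENT.
precomposed = "<caf\u00e9/>"
decomposed = "<cafe\u0301/>"

# A processor would report the result to the application rather than
# reject the document: both spellings remain well-formed.
for entity in (precomposed, decomposed):
    print(repr(entity), "NFC" if is_nfc(entity) else "not NFC")
```

The check only reports; it does not change the text, which matches the note's statement that an unnormalized document is still well-formed.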
> Then we would also add to A.2 Other References in XML 1.0:
>   CharMod
>      W3C. Character Model for the World Wide Web 1.0.
>      Martin J. Dürst, François Yergeau, Richard Ishida, Misha Wolf,
>      Tex Texin. (See http://www.w3.org/TR/2005/REC-charmod-20050215/.)
> Please let us know if this resolution of your issue is acceptable.
> regards,
> paul
> Paul Grosso for the XML Core WG
>> -----Original Message-----
>> From: public-xml-core-wg-request@w3.org
>> [mailto:public-xml-core-wg-request@w3.org] On Behalf Of Grosso, Paul
>> Sent: Wednesday, 2009 March 11 11:32
>> To: Phillips, Addison; public-xml-core-wg@w3.org
>> Cc: public-i18n-core@w3.org; w3c-html-cg@w3.org
>> Subject: RE: Unicode Normalization in XML 1.0 5e
>> Addison et al.,
>> The XML Core WG has discussed your message during several
>> telcons, and we are still in the process of determining
>> just what we might do in response.
>> At this time, we are quite sure we do not want to change
>> the XML spec so that canonical equivalents could be treated
>> as identical directly in XML.  Aside from being a serious
>> change to parser behavior, this would make some previously
>> ill-formed (non-XML) documents well-formed XML as well as
>> make some previously well-formed XML ill-formed (non-XML).
>> We are also pretty sure it would be a good idea to add at
>> least a note to the XML 1.0 spec saying that XML producers
>> SHOULD produce normalized output.
>> We are considering whether we should add (some version of)
>> what the XML 1.1 spec says about normalization checking [1]
>> to the XML 1.0 spec.  We haven't made a decision here yet,
>> and given our biweekly telcon schedule and the upcoming AC
>> meeting, we are not likely to do so until some time in April.
>> I will, of course, let you know when we have a further status
>> update to give you.
>> regards,
>> paul
>> for the XML Core WG
>> [1] http://www.w3.org/TR/xml11/#sec-normalization-checking
>>> -----Original Message-----
>>> From: public-xml-core-wg-request@w3.org
>>> [mailto:public-xml-core-wg-request@w3.org] On Behalf Of
>>> Phillips, Addison
>>> Sent: Wednesday, 2009 February 25 0:17
>>> To: public-xml-core-wg@w3.org
>>> Cc: public-i18n-core@w3.org; w3c-html-cg@w3.org
>>> Subject: Unicode Normalization in XML 1.0 5e
>>> Dear XML Core WG,
>>> I am writing on behalf of both the Internationalization Core
>>> WG and the HTML Coordination Group (HCG).
>>> Recently there has been an extensive discussion of
>>> normalization in W3C specifications, mainly related to
>>> handling of element and attribute names and values (as in
>>> CSS3 Selectors). Some of this discussion revolves around how
>>> Unicode normalization should work with XML and XML-derived
>>> specifications, hence I was actioned by HCG [0] to contact
>>> you folks.
>>> I produced a general summary of the Unicode normalization
>>> problem at [1] for the HCG. Those unfamiliar with Unicode
>>> normalization may wish to review that message.
>>> The basic question is whether XML can (or should?) take a
>>> clearer stance on Unicode normalization. At present, XML 1.0
>>> 5e, like its predecessors, does not require any particular
>>> normalization form; it says nothing about whether canonical
>>> equivalents in Unicode are "equal" from an XML point of view;
>>> and thus implies that Unicode canonical equivalence does
>>> *not* apply when considering an XML document's formation. The
>>> recommendations in Appendix J (which does include
>>> normalization among its suggestions) further suggest that
>>> this is true.
>>> On the other hand, it seems reasonable to suppose that
>>> Unicode canonical equivalence might apply to XML. Processes
>>> such as transcoding legacy charsets to Unicode might result
>>> in canonically-equivalent-but-unequal code point sequences,
>>> for example.
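
To make "canonically-equivalent-but-unequal" concrete, here is a small sketch using Python's standard `unicodedata` module. The sample string is an assumption chosen for illustration and does not come from the thread.

```python
import unicodedata

# Two spellings of the same visible text: precomposed letters versus
# base letters followed by U+0301 COMBINING ACUTE ACCENT.
composed = "r\u00e9sum\u00e9"
decomposed = "re\u0301sume\u0301"

# Compared code point by code point (as an XML processor compares
# names), the two strings differ.
same_code_points = composed == decomposed

# After normalizing both to NFC, they are identical, which is what
# canonical equivalence means.
canonically_equivalent = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

print(len(composed), len(decomposed))  # 6 8
print(same_code_points)                # False
print(canonically_equivalent)          # True
```

Used as element names, the two spellings would therefore name different elements under a code-point comparison, even though they render identically.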
>>> At I18N's behest, our Unicode liaison (Mark Davis) produced a
>>> survey of Web content, as well as a summary on performance [2].
>>> The survey found that 99.98% of Web HTML content was, in fact,
>>> in Unicode form NFC. It seems reasonable to suppose that XML
>>> content and documents would follow a similar pattern.
>>> Our questions to XML Core WG, thus, are:
>>>     What, precisely, should XML say with regard to Unicode
>>> canonical equivalence?
>>>     Would it be possible to require or allow canonical
>>> equivalents to be treated as identical directly in XML (and
>>> not merely as a side effect of other specifications)?
>>>     Is there a problem if XML permits/requires
>>> canonically-equivalent-yet-different sequences to be treated
>>> as distinct if other specifications require/allow canonical
>>> equivalence to be recognized?
>>> The Internationalization Core WG would be happy to work with
>>> you on these thorny issues. Please advise if you need more
>>> information, consultation, participation, or just need to vent :-).
>>> Kind Regards,
>>> Addison (for I18N/HCG)
>>> [0]
>>> http://lists.w3.org/Archives/Member/w3c-html-cg/2009JanMar/0061.html
>>>      See ACTION-29
>>> [1]
>>> http://lists.w3.org/Archives/Public/public-i18n-core/2009JanMar/0259.html
>>> [2] http://www.macchiato.com/unicode/nfc-faq
>>> Addison Phillips
>>> Globalization Architect -- Lab126
>>> Chair -- W3C Internationalization WG
>>> Internationalization is not a feature.
>>> It is an architecture.

#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
Received on Sunday, 26 April 2009 10:39:16 UTC
