Unicode character normalization when parse="text"

From: Elliotte Rusty Harold <elharo@metalab.unc.edu>
Date: Sun, 2 Sep 2001 11:04:42 -0400
Message-Id: <p04330104b7b7f941b2a2@[192.168.254.4]>
To: www-xml-xinclude-comments@w3.org

I would like to suggest eliminating the requirement for Unicode character normalization when including text documents. This is not required by the Infoset or even by Canonical XML, which arguably has a much greater need for this sort of thing. To quote from the Canonical XML spec, http://www.w3.org/TR/xml-c14n#NoCharModelNorm :

The Unicode standard [Unicode] allows multiple different representations
of certain "precomposed characters" (a simple example is "ç"). Thus two
XML documents with content that is equivalent for the purposes of most
applications may contain differing character sequences. The W3C is
preparing a normalized representation [CharModel]. The C14N-20000119
Canonical XML draft used this normalized form. However, many XML 1.0
processors do not perform this normalization. Furthermore, applications
that must solve this problem typically enforce character model
normalization at all times starting when character content is created in
order to avoid processing failures that could otherwise result (e.g. see
example from Cowan). Therefore, character model normalization has been
moved out of scope for XML canonicalization. However, the XML processor
used to prepare the XPath data model input is required (by the Data
Model) to use Normalization Form C [NFC, NFC-Corrigendum] when
converting an XML document to the UCS character domain from any encoding
that is not UCS-based (currently, UCS-based encodings include UTF-8,
UTF-16, UTF-16BE, and UTF-16LE, UCS-2, and UCS-4).

I suggest that XInclude follow the lead of Canonical XML here, and not perform Unicode normalization of documents that already exist in a Unicode form. Honestly, I'd prefer it to go a little further and not require any form of Unicode normalization at any time, including when converting from a non-Unicode format. The implementation burden just seems too high for the benefits achieved.
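
To make the issue in the quoted passage concrete, here is a minimal sketch in Python of one character with two Unicode representations, and of what Normalization Form C does to them. The choice of c with cedilla and the helper names to_nfc and is_nfc are assumptions made for illustration; nothing here comes from the XInclude or Canonical XML specs. It relies only on the standard library's unicodedata module.

```python
import unicodedata

# Two representations of the same visible character, c with cedilla.
precomposed = "\u00E7"   # U+00E7 LATIN SMALL LETTER C WITH CEDILLA
decomposed = "c\u0327"   # U+0063 followed by U+0327 COMBINING CEDILLA


def to_nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (hypothetical helper)."""
    return unicodedata.normalize("NFC", text)


def is_nfc(text: str) -> bool:
    """Report whether text is already in Normalization Form C."""
    return to_nfc(text) == text


# The two strings render alike but are different code point sequences,
# and therefore different byte sequences once encoded as UTF-8.
print(precomposed == decomposed)    # False
print(precomposed.encode("utf-8"))  # b'\xc3\xa7'
print(decomposed.encode("utf-8"))   # b'c\xcc\xa7'

# NFC composes the base letter and combining mark into one code point.
print(to_nfc(decomposed) == precomposed)  # True
```

A processor that skips normalization would pass either sequence through unchanged, which is the behaviour suggested above for included text that is already in a Unicode encoding.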
-- 

+-----------------------+------------------------+-------------------+
| Elliotte Rusty Harold | elharo@metalab.unc.edu | Writer/Programmer |
+-----------------------+------------------------+-------------------+ 
|          The XML Bible, 2nd Edition (Hungry Minds, 2001)           |
|              http://www.ibiblio.org/xml/books/bible2/              |
|   http://www.amazon.com/exec/obidos/ISBN=0764547607/cafeaulaitA/   |
+----------------------------------+---------------------------------+
|  Read Cafe au Lait for Java News:  http://www.cafeaulait.org/      | 
|  Read Cafe con Leche for XML News: http://www.ibiblio.org/xml/     |
+----------------------------------+---------------------------------+
Received on Sunday, 2 September 2001 11:15:10 UTC