RE: Unicode character normalization when parse="text"

Thanks for the feedback on this issue. Others have complained about this requirement as well. However, we feel that we must support the Character Model that emerges. We expect continuing debate on whether the Character Model should mandate normalization, and in what circumstances. The Second Last Call Character Model is coming out soon, and we hope you will comment on that draft.
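
As a minimal sketch of what is at stake (not part of the XInclude draft; Python and its standard unicodedata module are assumptions chosen purely for illustration), the "ç" example from the Canonical XML text quoted below can be written as two different code point sequences, which Normalization Form C maps to the same string:

```python
import unicodedata

# Two representations of "ç" that most applications treat as equivalent.
precomposed = "\u00e7"   # LATIN SMALL LETTER C WITH CEDILLA, one code point
decomposed = "c\u0327"   # "c" followed by COMBINING CEDILLA, two code points


def nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (hypothetical helper name)."""
    return unicodedata.normalize("NFC", text)


# Without normalization, a code-point comparison treats the strings as different.
print(precomposed == decomposed)

# After NFC, both collapse to the single precomposed code point.
print(nfc(decomposed) == nfc(precomposed))
```

Whether an including processor should apply such a step to text that is already in a Unicode encoding is exactly the question under debate.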

Thanks, 
Jonathan Marsh

> -----Original Message-----
> From: Elliotte Rusty Harold [mailto:elharo@metalab.unc.edu]
> Sent: Sunday, September 02, 2001 8:05 AM
> To: www-xml-xinclude-comments@w3.org
> Subject: Unicode character normalization when parse="text"
> 
> I would like to suggest eliminating the requirement for Unicode character
> normalization when including text documents. This is not required by the
> Infoset or even canonical XML, which arguably has much higher needs for
> this sort of thing. To quote from the Canonical XML spec,
> http://www.w3.org/TR/xml-c14n#NoCharModelNorm
> 
> The Unicode standard [Unicode] allows multiple different representations
> of certain "precomposed characters" (a simple example is "ç"). Thus two
> XML documents with content that is equivalent for the purposes of most
> applications may contain differing character sequences. The W3C is
> preparing a normalized representation [CharModel]. The C14N-20000119
> Canonical XML draft used this normalized form. However, many XML 1.0
> processors do not perform this normalization. Furthermore, applications
> that must solve this problem typically enforce character model
> normalization at all times starting when character content is created in
> order to avoid processing failures that could otherwise result (e.g. see
> example from Cowan). Therefore, character model normalization has been
> moved out of scope for XML canonicalization. However, the XML processor
> used to prepare the XPath data model input is required (by the Data
> Model) to use Normalization Form C [NFC, NFC-Corrigendum] when
> converting an XML document to the UCS character domain from any encoding
> that is not UCS-based (currently, UCS-based encodings include UTF-8,
> UTF-16, UTF-16BE, and UTF-16LE, UCS-2, and UCS-4).
> 
> I suggest that XInclude follow the lead of Canonical XML here, and not
> perform Unicode normalization of documents that already exist in a
> Unicode form. Honestly, I'd prefer it to go a little further and not
> require any form of Unicode normalization at any time including when
> converting from a non-Unicode format. The implementation burden just seems
> too high for the benefits achieved.
> --
> 
> +-----------------------+------------------------+-------------------+
> | Elliotte Rusty Harold | elharo@metalab.unc.edu | Writer/Programmer |
> +-----------------------+------------------------+-------------------+
> |          The XML Bible, 2nd Edition (Hungry Minds, 2001)           |
> |              http://www.ibiblio.org/xml/books/bible2/              |
> |   http://www.amazon.com/exec/obidos/ISBN=0764547607/cafeaulaitA/   |
> +----------------------------------+---------------------------------+
> |  Read Cafe au Lait for Java News:  http://www.cafeaulait.org/      |
> |  Read Cafe con Leche for XML News: http://www.ibiblio.org/xml/     |
> +----------------------------------+---------------------------------+

Received on Wednesday, 9 January 2002 13:45:43 UTC