Re: Request for clarification on Canonical XML

I think what would help here would be to define a (referencable) canonical 
character sequence form, and use that to define the canonical octet 
sequence representation.

#g
--

At 16:04 24/07/03 -0400, Martin Duerst wrote:

>Hello Joseph, dear XML Signature specialists,
>
>I would like to request a clarification on Canonical XML
>(http://www.w3.org/TR/2001/REC-xml-c14n-20010315).
>
>At http://www.w3.org/TR/2001/REC-xml-c14n-20010315#Terminology,
>Canonical XML says:
>
> >>>>
>The canonical form of an XML document is physical representation of the 
>document produced by the method described in this specification. The 
>changes are summarized in the following list:
>
>* The document is encoded in UTF-8
> >>>>
>
>There are numerous applications (parser testing, digital signatures,
>encryption) where it is important to have an actual physical representation
>for simple octet comparison or for input to cryptographic algorithms
>that usually take octet streams as inputs.
>
>However, it has recently come to my attention that there are also some
>attempts to use Canonical XML (or Exclusive XML Canonicalisation, which
>inherits this aspect of its definition from Canonical XML) in other
>situations, such as purely abstract modeling and comparison of XML
>documents or XML fragments, or for API definitions.
>
>For abstract modeling, the encoding in UTF-8 irrelevant and confusing.
>For API definitions, specifying the encoding is crucial, but it may
>be counterproductive to use UTF-8 in a context where UTF-16 is
>widely used (e.g. Java, Python).
>
>Although not appropriate in these cases, it seems to be difficult for
>users of Canonical XML to abstract them from UTF-8 where necessary.
>I therefore propose to add some clarification. As a first actual
>text proposal (may need some additional work), I propose to
>add a note at the end of Section 1.1, Terminology:
>
>Note: Canonical XML is defined here in terms of a physical (octet-based)
>representation. This is appropriate for many applications, ranging from
>digital signatures to parser testing. However, there are cases where a
>physical representation is not needed, and there are cases where another
>physical representation is appropriate. As an example, it may be
>aproriate to choose UTF-16 rather than UTF-8 as the encoding of an
>API in a programming language using UTF-16 to represent Unicode strings,
>such as Java or Python. In such cases, users of Canonical XML should
>abstract from the physical character encoding if they note this
>appropriately.
>
>I'm sure there is a better way to word this.
>
>
>Regards,    Martin.

-------------------
Graham Klyne
<GK@NineByNine.org>
PGP: 0FAA 69FF C083 000B A2E9  A131 01B9 1C7A DBCA CB5E

Received on Friday, 25 July 2003 05:12:48 UTC