Request for clarification on Canonical XML from Martin Duerst on 2003-07-24 (w3c-rdfcore-wg@w3.org from July 2003)

From: Martin Duerst <duerst@w3.org>
Date: Thu, 24 Jul 2003 16:04:57 -0400
To: w3c-ietf-xmldsig@w3.org
Cc: w3c-i18n-ig@w3.org, w3c-rdfcore-wg@w3.org, "Peter F. " Patel-Schneider <pfps@research.bell-labs.com>
Message-Id: <4.2.0.58.J.20030724151248.051c7618@localhost>

Hello Joseph, dear XML Signature specialists,

I would like to request a clarification on Canonical XML
(http://www.w3.org/TR/2001/REC-xml-c14n-20010315).

At http://www.w3.org/TR/2001/REC-xml-c14n-20010315#Terminology,
Canonical XML says:

 >>>>
The canonical form of an XML document is physical representation of the 
document produced by the method described in this specification. The 
changes are summarized in the following list:

* The document is encoded in UTF-8
 >>>>

There are numerous applications (parser testing, digital signatures,
encryption) where it is important to have an actual physical representation
for simple octet comparison or for input to cryptographic algorithms
that usually take octet streams as inputs.

However, it has recently come to my attention that there are also some
attempts to use Canonical XML (or Exclusive XML Canonicalisation, which
inherits this aspect of its definition from Canonical XML) in other
situations, such as purely abstract modeling and comparison of XML
documents or XML fragments, or for API definitions.

For abstract modeling, the encoding in UTF-8 irrelevant and confusing.
For API definitions, specifying the encoding is crucial, but it may
be counterproductive to use UTF-8 in a context where UTF-16 is
widely used (e.g. Java, Python).

Although not appropriate in these cases, it seems to be difficult for
users of Canonical XML to abstract them from UTF-8 where necessary.
I therefore propose to add some clarification. As a first actual
text proposal (may need some additional work), I propose to
add a note at the end of Section 1.1, Terminology:

Note: Canonical XML is defined here in terms of a physical (octet-based)
representation. This is appropriate for many applications, ranging from
digital signatures to parser testing. However, there are cases where a
physical representation is not needed, and there are cases where another
physical representation is appropriate. As an example, it may be
aproriate to choose UTF-16 rather than UTF-8 as the encoding of an
API in a programming language using UTF-16 to represent Unicode strings,
such as Java or Python. In such cases, users of Canonical XML should
abstract from the physical character encoding if they note this
appropriately.

I'm sure there is a better way to word this.


Regards,    Martin.

Received on Thursday, 24 July 2003 16:06:27 UTC