RE: Request for clarification on Canonical XML from John Boyer on 2003-07-29 (w3c-rdfcore-wg@w3.org from July 2003)

From: John Boyer <JBoyer@PureEdge.com>
Date: Tue, 29 Jul 2003 12:44:27 -0700
To: "Martin Duerst" <duerst@w3.org>, "Joseph Reagle" <reagle@w3.org>, <w3c-ietf-xmldsig@w3.org>
Cc: <w3c-i18n-ig@w3.org>, <w3c-rdfcore-wg@w3.org>, "Peter F. Patel-Schneider" <pfps@research.bell-labs.com>
Message-ID: <7874BFCCD289A645B5CE3935769F0B5245366A@tigger.pureedge.com>

Hi Martin and others,

Whether the c14n spec is "incorrect or needs further qualification" regarding this character sequence issue is debatable to me.  It depends somewhat on the connotation of 'further qualification', which in the sentence comes across as 'almost incorrect' or 'partly mistaken'.

There is nothing mistaken about what c14n does.  It produces a canonical form for XML.  Canonicalization in general is about picking one specific way of doing everything.  Regarding encoding, we picked one way: UTF-8.  We made this choice in part because encoding is part of the definition of XML 1.0 according to the Recommendation itself, so we wanted to pick one way of doing it.  We chose UTF-8 because it was the most supported encoding, as guaranteed by XML 1.0 itself.  We also made this choice because we wanted, as a requirement, to be able to use the output of a canonicalizer as the input to a canonicalizer, thereby allowing statements like c14n(doc)==c14n(c14n(doc)).
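
A minimal sketch of the idempotence property c14n(doc)==c14n(c14n(doc)). It assumes Python's standard-library `xml.etree.ElementTree.canonicalize` as a stand-in canonicalizer; that function implements C14N 2.0 rather than the Canonical XML 1.0 discussed in this thread, and the helper name `c14n` and the sample document are illustrative only.

```python
import xml.etree.ElementTree as ET


def c14n(xml_text: str) -> bytes:
    """Canonicalize xml_text and return the result as UTF-8 octets.

    Assumption: ET.canonicalize (C14N 2.0) stands in for a
    Canonical XML 1.0 implementation.
    """
    return ET.canonicalize(xml_text).encode("utf-8")


# Sample input with irregular attribute spacing and an empty-element tag.
doc = '<root  b="2"   a="1"><empty/></root>'

once = c14n(doc)
# Feed the canonical output back in as the next input.
twice = c14n(once.decode("utf-8"))

# The output of the canonicalizer is a fixed point of the canonicalizer.
assert once == twice
```

Because the output is always UTF-8, the second pass needs no encoding detection; it can decode the first pass's octets directly.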

Right now, you are discussing the use of other encodings because it is logically equivalent in your application context yet more efficient to do so, not because there is anything incorrect, in whole or in part, with the method by which the current c14n achieves its primary goal.

The fact that comparing the UTF-16 encodings of two XML documents might be equivalent to comparing their UTF-8 encodings is an implementation detail.  As long as the implementer can guarantee that what they are doing is completely equivalent within the application context to comparing the results produced by C14N, then the implementer can choose to implement in this fashion.
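
A small sketch of why such a comparison can be equivalent, assuming the two canonical forms are already available as well-formed Unicode strings. The function name `same_as_octet_comparison` and the sample strings are hypothetical. UTF-8 and UTF-16 each map a given sequence of Unicode scalar values to exactly one code sequence, so equality under one implies equality under the other.

```python
def same_as_octet_comparison(a: str, b: str) -> bool:
    """Compare two canonical character sequences under both encodings.

    Assumption: a and b contain no lone surrogates, so both encodings
    are defined for them.
    """
    utf8_equal = a.encode("utf-8") == b.encode("utf-8")
    # A fixed byte order is used so that no byte-order mark is emitted.
    utf16_equal = a.encode("utf-16-be") == b.encode("utf-16-be")
    # All three notions of equality agree for well-formed strings.
    assert utf8_equal == utf16_equal == (a == b)
    return utf8_equal


# Includes a non-ASCII character and a character outside the BMP.
x = "<p>D\u00fcrst \U0001d11e</p>"
y = "<p>D\u00fcrst \U0001d11e</p>"
z = "<p>Durst</p>"

print(same_as_octet_comparison(x, y))
print(same_as_octet_comparison(x, z))
```

This only shows that the encodings agree on equality; it says nothing about whether the strings being compared were produced by a correct canonicalization in the first place.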

John Boyer, Ph.D.
Senior Product Architect and Research Scientist
PureEdge Solutions Inc.

-----Original Message-----
From: Martin Duerst [mailto:duerst@w3.org]
Sent: Monday, July 28, 2003 10:53 AM
To: Joseph Reagle; w3c-ietf-xmldsig@w3.org
Cc: w3c-i18n-ig@w3.org; w3c-rdfcore-wg@w3.org; Peter F. Patel-Schneider
Subject: Re: Request for clarification on Canonical XML

At 11:35 03/07/28 -0400, Joseph Reagle wrote:

>On Thursday 24 July 2003 16:04, Martin Duerst wrote:
> > The canonical form of an XML document is physical representation of the
> > document produced by the method described in this specification. The
> > changes are summarized in the following list:
>
>Hi Martin, had this issue come up while we were writing the spec I'm
>confident we could have provided the clarity, or maybe even an additional
>definition of a "canonical character sequence form" as Graham suggested,
>that you are seeking.

I'm not seeking the addition of a new definition. I don't think it would
be appropriate to add new definitions without starting a new WD-REC
cycle, and I don't think this is important enough to do this.

>However, I think it would be inappropriate to do such
>a definition now, and I'm not sure how to even add a "note" as an erratum.
>It doesn't quite fit into "a Caveat where subsequent experience has shown
>that a recommendation of the specification was incorrect or needs further
>qualification." [1]

I guess it comes sufficiently close, the 'further qualification'
seems quite adequate.

>I don't object to the spirit of your text, and have tweaked it below:
>
>[[[
>Note: Canonical XML is an octet sequence resulting from characters, from the
>UCS character domain, encoded in UTF-8. This is necessary for the purposes
>of XML Signature and other applications. However, some applications may
>require a canonical form of XML that is a sequence of characters, without
>concern for its encoding and representation as octets. As an example, it
>may be appropriate to choose UTF-16 rather than UTF-8 as the encoding of an
>API in a programming language using UTF-16 to represent Unicode strings,
>such as Java or Python. In such cases, applications are not prohibited from
>defining and using a canonical character sequence that corresponds to the
>characters of a Canonical XML instance.
>]]]

The current text is slightly problematic because it says 'without concern
for its encoding' and then goes straight on to mention UTF-16. UTF-16
indeed does not deal with octets, but it is still an encoding.
Also, this version of the text doesn't mention abstract modeling anymore.
It might also be better to replace 'may require' with 'may be better
served with'.

Regards,    Martin.

>I'm not sure if this is any better, and I'm not confident it should be an
>erratum, but perhaps you could use this thread in your discussions with
>others about whether the octet representation is really needed?
>
>[1] http://www.w3.org/2001/03/C14N-errata

Received on Tuesday, 29 July 2003 15:44:34 UTC