Re: Request for clarification on Canonical XML


        Here's my try at the wording:
Note: Canonical XML is an octet sequence resulting from characters, from 
the UCS character domain, encoded in UTF-8. Creating a deterministic octet 
sequence is necessary for XML Signature and other applications. However, 
some applications might want a canonical form of XML in a different 
encoding, or one that is simply a sequence of characters, without concern 
for its encoding. The "canonical character form" of Canonical XML consists 
of the sequence of characters resulting when the UTF-8 format defined in 
this document is converted to characters.  The "canonical UCS-4 form" 
consists of the sequence of octets produced by the conversion of the 
canonical character form to UCS-4.  The "canonical UTF-16 form" consists 
of the sequence of octets produced by the conversion of the canonical 
character form to UTF-16.
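
To make the three derived forms concrete, here is a minimal sketch in 
Python. It assumes the input octets have already been produced by a 
Canonical XML implementation; the sample document and the function names 
are hypothetical, and the choice of big-endian byte order without a byte 
order mark is my assumption, since the wording above does not fix one.

```python
# Assumed input: octets already emitted by a Canonical XML implementation.
# The sample document is hypothetical; it mixes ASCII, a Latin-1 range
# character and a character outside the Basic Multilingual Plane.
canonical_utf8 = '<doc attr="caf\u00e9">\U00010400</doc>'.encode("utf-8")


def canonical_character_form(c14n_octets: bytes) -> str:
    """Decode the UTF-8 canonical octets into a sequence of characters."""
    # Strict decoding raises UnicodeDecodeError on malformed input.
    return c14n_octets.decode("utf-8")


def canonical_ucs4_form(c14n_octets: bytes) -> bytes:
    """Re-encode the character form as UCS-4 (four octets per character).

    Assumption: big-endian, no byte order mark.
    """
    return canonical_character_form(c14n_octets).encode("utf-32-be")


def canonical_utf16_form(c14n_octets: bytes) -> bytes:
    """Re-encode the character form as UTF-16.

    Assumption: big-endian, no byte order mark.
    """
    return canonical_character_form(c14n_octets).encode("utf-16-be")


chars = canonical_character_form(canonical_utf8)
ucs4 = canonical_ucs4_form(canonical_utf8)
utf16 = canonical_utf16_form(canonical_utf8)
```

Whatever byte order is chosen would have to be stated in the definition, 
since otherwise the UCS-4 and UTF-16 forms are not deterministic octet 
sequences.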

        I have one substantive question, however.  Is there any need to 
produce a canonical form with less escaping than the current ones?
        If we define canonical forms in other encodings, do those 
canonicalizations need their own tags?

                Tom Gindin

Joseph Reagle <>
Sent by:
07/28/2003 04:39 PM

        To:     Martin Duerst <>,
        Subject:        Re: Request for clarification on Canonical XML

On Monday 28 July 2003 13:53, Martin Duerst wrote:
> The current text is slightly problematic because it says 'without 
> concern for its encoding' and then goes straight on to mention UTF-16.
> UTF-16 indeed does not deal with octets, but it is still an encoding.

So your point is that the UTF-8 encoding can be restrictive because (1) 
one may not want to use any encoding (what you mean by "abstract 
modeling"?) and (2) one may want to use a different encoding.

> Also, this version of the text doesn't mention abstract modeling.
> It might also be better to replace 'may require' with 'may be better
> served with'.

I tweaked it to "might want" so as to avoid the "MAY", but be terse and 
not presume to tell them what they are better served with. <smile/> Ok, 
how about:

Note: Canonical XML is an octet sequence resulting from characters, from 
the UCS character domain, encoded in UTF-8. Creating a deterministic octet 
sequence is necessary for XML Signature and other applications. However, 
some applications might want a canonical form of XML in a different 
encoding, or one that is simply a sequence of characters, without concern 
for its encoding. For example, it may be appropriate to choose UTF-16 
rather than UTF-8 as the encoding of an API in a programming language 
that uses UTF-16 to represent Unicode strings, such as Java or Python. Or, 
one might want to abstractly describe an XML document as an Infoset that 
includes sequences of characters. In such cases, applications are not 
prohibited from defining and using a canonical character sequence that 
corresponds to the characters of a Canonical XML instance.

Received on Wednesday, 30 July 2003 11:28:11 UTC