Re: Canonical XML error from Frederick.Hirsch@nokia.com on 2011-09-07 (w3c-ietf-xmldsig@w3.org from July to September 2011)

From: <Frederick.Hirsch@nokia.com>
Date: Wed, 7 Sep 2011 14:51:15 +0000
To: <steve.derose@openamplify.com>
CC: <Frederick.Hirsch@nokia.com>, <jboyer@PureEdge.com>, <w3c-ietf-xmldsig@w3.org>, <public-xmlsec@w3.org>, <cmsmcq@blackmesatech.com>, <ht@cogsci.ed.ac.uk>, <chris@w3.org>
Message-ID: <3246CB32-A5AF-456F-A121-522959A27E0A@nokia.com>
Steve

You are asking good questions but I think the concern is out of scope for Canonical XML, as it isn't designed to do what you are talking about.

It is the job of an XML document author to produce well-formed XML before any considerations of signing/encryption and XML Canonicalization. Any required escaping happens before security processing, and there are a variety of choices that can be made for such escaping, as well as for other representations of information. Canonical XML is agnostic to these choices.

This is pointed out in the document as a limitation:

[[
1.3 Limitations

Two XML documents may have differing information content that is nonetheless logically equivalent within a given application context. Although two XML documents are equivalent (aside from limitations given in this section) if their canonical forms are identical, it is not a goal of this work to establish a method such that two XML documents are equivalent if and only if their canonical forms are identical. Such a method is unachievable, in part due to application-specific rules such as those governing unimportant whitespace and equivalent data (e.g. <color>black</color> versus <color>rgb(0,0,0)</color>). There are also equivalencies established by other W3C Recommendations and Working Drafts. Accounting for these additional equivalence rules is beyond the scope of this work. They can be applied by the application or become the subject of future specifications.

]]

Thus, if the CDATA end is escaped differently, these will be different documents and have different hash results as far as Canonical XML/XML Signature are concerned, and that is what we would expect. Canonical XML is not intended to produce a single canonical representation for all input XML documents that are logically the same, but it does allow signature generation and verification to succeed for a given XML document round trip.

regards, Frederick

Frederick Hirsch
Nokia



On Sep 7, 2011, at 10:24 AM, ext Steve DeRose wrote:

Thank you for your reply. I believe you have missed the critical case. The case of interest has nothing to do with *actual* CDATA marked sections; it also has nothing to do with escaping the string that normally *starts* a marked section, in order to make it literal (those are the two cases you addressed).

Instead, the relevant case has to do with the particular string which XML reserves as the delimiter that *ends* a marked section, and how that string must be represented when occurring as literal text. Specifically: what is the canonical form of a document containing that string? That is a different question from either of the ones you addressed.

A relevant example is where the user wants a paragraph with this *literal* content:

In XML, the end of a marked section is indicated by "]]>".

It is an XML well-formedness error to have the literal string "]]>" as text content as just shown, because it is an XML delimiter. This is completely analogous to how it is illegal to have "<" or "&" as literal text content, because they are all XML delimiters that are recognized in text content.

In other words, this constitutes an XML well-formedness error:

<p>In XML, the end of a marked section is indicated by "]]>".</p>

The Linux 'xmlparse' command (as just one example) correctly returns:

2011-09-07 10:03:42.172 xmlparse[20298] WARNING foo is not a valid document
2011-09-07 10:03:42.174 xmlparse[20298] Errors: at line: 1 column: 57 ... Sequence ']]>' not allowed in content

To include that literal text content, an encoder must "escape" at least one of the 3 characters. Such escaping is obviously possible (if it were not, we would have dealt with it in the XML spec). For example:

<p>In XML, the end of a marked section is indicated by "]]&gt;".</p>
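Both observations (the literal form is rejected, the escaped form is accepted) can be checked with other parsers as well. A minimal sketch, assuming Python's standard xml.etree.ElementTree module, which is not a tool mentioned in this thread; the helper name is_well_formed is made up for illustration:

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Literal "]]>" in text content.
raw = '<p>In XML, the end of a marked section is indicated by "]]>".</p>'
# Same paragraph with the ">" written as an entity reference.
escaped = '<p>In XML, the end of a marked section is indicated by "]]&gt;".</p>'

print("raw form accepted:    ", is_well_formed(raw))
print("escaped form accepted:", is_well_formed(escaped))
```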

HOWEVER, there are *many* distinct ways to "escape" that literal string. Another example is:

<p>In XML, the end of a marked section is indicated by "&#x5d;]>".</p>

Consider: a user hands the following document to an XML Canonicalizer:

<xhtml><body><p>In XML, the end of a marked section is indicated by "&#x5d;]>".</p></body></xhtml>

Please tell me, based on Canonical XML as it is presently specified, precisely what the canonical form of that document is. Is it like the first escaped form above, or the second, or something else? What clause in Canonical XML justifies your choice? What clause ensures that if I run that document through 2 different canonicalization applications, they will produce the same "canonical" result?
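
One way to probe the question empirically is to run both escaped forms through an existing canonicalizer and compare the output. A minimal sketch, assuming Python 3.8+ and its xml.etree.ElementTree.canonicalize function, which implements Canonical XML 2.0 rather than the 2001 Recommendation under discussion; TEMPLATE and the variable names are illustrative. It shows what one implementation does, not what the specification requires:

```python
from xml.etree.ElementTree import canonicalize

# The document from the question, with a slot for the escaped delimiter.
TEMPLATE = ('<xhtml><body><p>In XML, the end of a marked section '
            'is indicated by "{}".</p></body></xhtml>')

first_form = TEMPLATE.format("]]&gt;")      # ">" as an entity reference
second_form = TEMPLATE.format("&#x5d;]>")   # first "]" as a character reference

canonical_first = canonicalize(first_form)
canonical_second = canonicalize(second_form)

print(canonical_second)
print("same canonical form:", canonical_first == canonical_second)
```

If the two outputs are equal, this implementation at least settles on a single form; a second implementation would need the same test before drawing any conclusion about interoperability.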

It seems to me that if two conforming canonicalization applications can produce different results from the same input without violating the Canonical XML specification, that amounts to a bug in the spec.

Clear?

Steven J. DeRose



On Tue, 2011-09-06 at 15:43 +0000, Frederick.Hirsch@nokia.com wrote:
Steve


The Canonical XML Recommendation [1] states in section 1.1 and details in section 2.1 that "CDATA sections are replaced with their character content". This means the characters to mark the end of a CDATA section are removed as part of replacing that section with its character content.
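
The replacement of CDATA sections with their character content can be observed directly. A minimal sketch, assuming Python 3.8+ and xml.etree.ElementTree.canonicalize (an implementation of Canonical XML 2.0, not of the 2001 Recommendation [1]); the sample strings are made up for illustration:

```python
from xml.etree.ElementTree import canonicalize

# The same character content, once inside a CDATA section and once
# written with entity references.
with_cdata = "<p><![CDATA[a < b & c]]></p>"
with_escapes = "<p>a &lt; b &amp; c</p>"

canonical_cdata = canonicalize(with_cdata)
canonical_escapes = canonicalize(with_escapes)

print(canonical_cdata)
print("same canonical form:", canonical_cdata == canonical_escapes)
```

In this implementation the CDATA delimiters do not survive canonicalization; only the character content remains, escaped as ordinary text.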


If you are asking how to present what looks like a CDATA section so that it is retained as text without replacement, then this is not a canonicalization question: the characters will be treated as ordinary text and not recognized as a CDATA section. If the start of the CDATA section had its < escaped as &lt;, for example, no CDATA section would be present, and canonical character encoding would occur in a uniform manner.


As a consequence, no encoding need be specified and no erratum is needed.


Does this make sense?


regards, Frederick


Frederick Hirsch, Nokia
Chair XML Security WG


[1] http://www.w3.org/TR/2001/REC-xml-c14n-20010315


For tracker this should complete ACTION-833

On Aug 30, 2011, at 9:20 AM, ext Steve DeRose wrote:

I recently discovered that the Canonical XML spec does not appear to specify which of several possible options to use to encode the literal string "]]>" in content. I have also checked the errata and cannot find this mentioned there.

This string marks the end of an XML CDATA marked section, so it must be escaped somehow when needed literally. It seems to me that the best choice, given other decisions in Canonical XML, is to express it as "]]&gt;". That is the method used in the source for the current edition of the XML Recommendation. But of course there are multiple alternatives, including at least:


    &#x5D;]>
    ]&#x5D;>
    ]]&#x3E;
    &#x5D;&#x5D;>
    &#x5D;]&#x3E;
    &#x5D;&#x5D;&#x3E;
    &#x5D;]&gt;
    &#x5D;&#x5D;&gt;
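
That these encodings all denote the same content can be checked by parsing each one and comparing the decoded text. A minimal sketch, assuming Python's standard xml.etree.ElementTree module; ENCODINGS and decoded_text are illustrative names:

```python
import xml.etree.ElementTree as ET

# The "]]&gt;" form plus the eight alternatives listed above.
ENCODINGS = [
    "]]&gt;",
    "&#x5D;]>",
    "]&#x5D;>",
    "]]&#x3E;",
    "&#x5D;&#x5D;>",
    "&#x5D;]&#x3E;",
    "&#x5D;&#x5D;&#x3E;",
    "&#x5D;]&gt;",
    "&#x5D;&#x5D;&gt;",
]

def decoded_text(encoded):
    """Parse <p>encoded</p> and return the text content the parser reports."""
    return ET.fromstring("<p>" + encoded + "</p>").text

# Collect the distinct decoded strings across all encodings.
decoded = {decoded_text(e) for e in ENCODINGS}
print(decoded)
```

A single element in the resulting set means every encoding decodes to the same three-character string, which is why the choice among them is a question of serialization only.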


Clearly, if different users or applications encode the same intended content in different ways, that's a problem in the context of Canonical XML. Whether the string is common is irrelevant. Yet, there are contexts where this string naturally occurs: the most obvious are documents describing XML, and documents containing program code examples such as "a[b[0]]>1".

Please specify a specific encoding for this string in Canonical XML documents.

Steve DeRose
sderose@acm.org
Received on Wednesday, 7 September 2011 14:54:18 UTC