Re: Canonical XML error from Steve DeRose on 2011-09-07 (w3c-ietf-xmldsig@w3.org from July to September 2011)

From: Steve DeRose <steve.derose@openamplify.com>
Date: Wed, 07 Sep 2011 10:24:28 -0400
To: Frederick.Hirsch@nokia.com
Cc: jboyer@PureEdge.com, w3c-ietf-xmldsig@w3.org, public-xmlsec@w3.org, "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, Henry Thompson <ht@cogsci.ed.ac.uk>, chris lilley <chris@w3.org>
Message-ID: <1315405468.19616.38.camel@sderose-ThinkPad-T400>

Thank you for your reply. I believe you have missed the critical case.
The case of interest has nothing to do with *actual* CDATA marked
sections; it also has nothing to do with escaping the string that
normally *starts* a marked section, in order to make it literal (those
are the two cases you addressed).

Instead, the relevant case has to do with the particular string which
XML reserves as the delimiter that *ends* a marked section, and how that
string must be represented when occurring as literal text. Specifically,
what is the canonical form of a document containing that string? That is
a different question from either of the two that you addressed.

A relevant example is where the user wants a paragraph with this
*literal* content:


        In XML, the end of a marked section is indicated by "]]>".


It is an XML well-formedness error to have the literal string "]]>" as
text content as just shown, because it is an XML delimiter. This is
completely analogous to how it is illegal to have "<" or "&" as literal
text content, because they are all XML delimiters that are recognized in
text content.

In other words, this constitutes an XML well-formedness error:


        <p>In XML, the end of a marked section is indicated by
        "]]>".</p>


The Linux 'xmlparse' command (as just one example) correctly returns:


        2011-09-07 10:03:42.172 xmlparse[20298] WARNING foo is not a
        valid document
        2011-09-07 10:03:42.174 xmlparse[20298] Errors: at line: 1
        column: 57 ... Sequence ']]>' not allowed in content


To include that literal text content, an encoder must "escape" at least
one of the 3 characters. Such escaping is obviously possible (if it were
not, we would have dealt with it in the XML spec). For example:


        <p>In XML, the end of a marked section is indicated by
        "]]&gt;".</p>


HOWEVER, there are *many* distinct ways to "escape" that literal string.
Another example is:


        <p>In XML, the end of a marked section is indicated by
        "&#x5d;]>".</p>


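To make the point concrete, here is a minimal sketch (not part of any
spec) that parses several of the escaped forms with Python's standard
xml.etree.ElementTree module and collects the resulting text. It also
checks that the unescaped form is rejected by the parser. The names
LITERAL, escaped_forms, parsed_text and unescaped_is_rejected are
illustrative only.

```python
import xml.etree.ElementTree as ET

# The text the author actually wants the paragraph to contain.
LITERAL = 'In XML, the end of a marked section is indicated by "]]>".'

# A few of the many ways to escape the three-character string "]]>".
escaped_forms = [
    "]]&gt;",
    "&#x5d;]>",
    "]&#x5D;>",
    "]]&#x3E;",
    "&#x5D;&#x5D;&#x3E;",
]


def parsed_text(escaped):
    """Parse a <p> element using the given escaping and return its text."""
    doc = ('<p>In XML, the end of a marked section is indicated by '
           '"%s".</p>' % escaped)
    return ET.fromstring(doc).text


def unescaped_is_rejected():
    """Return True if the parser refuses a bare "]]>" in text content."""
    try:
        parsed_text("]]>")
    except ET.ParseError:
        return True
    return False


# Every escaped form should decode to the same character data.
texts = [parsed_text(e) for e in escaped_forms]

if __name__ == "__main__":
    for form, text in zip(escaped_forms, texts):
        print(form, "->", repr(text))
    print("bare ]]> rejected:", unescaped_is_rejected())
```

All of these inputs carry the same character data, which is why a
canonicalizer has to pick exactly one of them as the output form.
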
Consider: a user hands the following document to an XML Canonicalizer:


        <xhtml><body><p>In XML, the end of a marked section is indicated
        by "&#x5d;]>".</p></body></xhtml>
        

Please tell me, based on Canonical XML as it is presently specified,
precisely what the canonical form of that document is. Is it like the
first escaped form above, or the second, or something else? What clause
in Canonical XML justifies your choice? What clause ensures that if I
run that document through 2 different canonicalization applications,
they will produce the same "canonical" result?

It seems to me that if two conforming canonicalization applications can
produce different results from the same input without violating the
Canonical XML specification, that amounts to a bug in the spec.
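
One way to probe this empirically is to feed differently escaped
versions of the document to a canonicalizer and compare the outputs.
The sketch below assumes Python 3.8 or later, whose
xml.etree.ElementTree.canonicalize() function is documented as
implementing C14N 2.0, not the 2001 Recommendation under discussion, so
it shows what one implementation does rather than what the spec
requires. TEMPLATE, canonical_forms and distinct_outputs are
illustrative names, and the document is written on a single line here.

```python
from xml.etree.ElementTree import canonicalize, fromstring

# The document from the message above, with a slot for the escaping.
TEMPLATE = (
    '<xhtml><body><p>In XML, the end of a marked section is indicated '
    'by "%s".</p></body></xhtml>'
)


def canonical_forms(escapings):
    """Map each source escaping to the canonicalizer's output string."""
    return {e: canonicalize(TEMPLATE % e) for e in escapings}


forms = canonical_forms(["&#x5d;]>", "]]&gt;", "]]&#x3E;"])

# If the implementation is deterministic about escaping, every input
# collapses to a single output string.
distinct_outputs = set(forms.values())

if __name__ == "__main__":
    for source, output in forms.items():
        print(source, "->", output)
    print("distinct canonical outputs:", len(distinct_outputs))
```

Running the same inputs through a second, independent canonicalizer and
comparing the strings byte for byte would show whether the two
implementations agree on this case.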

Clear?

Steven J. DeRose



On Tue, 2011-09-06 at 15:43 +0000, Frederick.Hirsch@nokia.com wrote:

> Steve 
> 
> 
> 
> The Canonical XML Recommendation [1] states in section 1.1 and details
> in section 2.1 that "CDATA sections are replaced with their character
> content". This means the characters to mark the end of a CDATA section
> are removed as part of replacing that section with its character
> content. 
> 
> 
> If you are asking how to present what looks like a CDATA section so it
> can be retained as text without having replacement occur then this is
> not a canonicalization question, as the characters will be treated as
> ordinary text and not recognized as a CDATA section.   If the start of
> CDATA were to have < escaped as &lt; , for example, no CDATA section
> would be present, and canonical character encoding would occur in a
> uniform manner.
> 
> 
> As a consequence no encoding need be specified and no erratum is
> needed.
> 
> 
> Does this make sense?
> 
> 
> regards, Frederick
> 
> 
> Frederick Hirsch, Nokia
> Chair XML Security WG
> 
> 
> [1] http://www.w3.org/TR/2001/REC-xml-c14n-20010315
> 
> 
> For tracker this should complete ACTION-833
> 
> 
> 
> On Aug 30, 2011, at 9:20 AM, ext Steve DeRose wrote:
> 
> 
> 
> > I recently discovered that the Canonical XML spec does not appear to
> > specify  which of several possible options to use, to encode the
> > literal string "]]>" in content. I have also checked the errata, and
> > cannot find this mentioned there.
> > 
> > This string marks the end of an XML CDATA marked section, so must
> > be escaped somehow when needed literally. It seems to me that the
> > best choice given other decisions in Canonical XML, is to express it
> > as  "]]&gt;". That is the method used in the source for the current
> > edition of the XML Recommendation. But of course there are multiple
> > alternatives, including at least:
> > 
> > 
> >     &#x5D;]>
> >     ]&#x5D;>
> >     ]]&#x3E;
> >     &#x5D;&#x5D;>
> >     &#x5D;]&#x3E;
> >     &#x5D;&#x5D;&#x3E;
> >     &#x5D;]&gt;
> >     &#x5D;&#x5D;&gt;
> > 
> > 
> > Clearly, if different users or applications encode the same intended
> > content in different ways, that's a problem in the context of
> > Canonical XML. Whether the string is common is irrelevant. Yet,
> > there are contexts where this string naturally occurs: the most
> > obvious are documents describing XML, and documents containing
> > program code examples such as "a[b[0]]>1".
> > 
> > Please specify a specific encoding for this string in Canonical XML
> > documents.
> > 
> > Steve DeRose
> > sderose@acm.org
> > 
> > 
> > 
> 
> 
>
Received on Wednesday, 7 September 2011 14:25:02 UTC