XML and canonicalization

I've been mulling this over and studying the standards starting with
the basic XML 1.0 standard.  Probably lots of members of this WG are
very familiar with XML processing but perhaps what I say below will be
helpful for others...

No canonicalization makes sense for binary things.  Binary things,
like images or executables, can reasonably be expected to be truly
fixed.

The Minimal canonicalization we are defining (canonicalize character
set and line endings) makes sense for text.  Text comes in a variety
of character sets and not uncommonly gets its line endings changed
from platform to platform.  Of course, if something is handled as
binary, even though it's text, you can avoid canonicalization.  But if
it is going to be interoperably processed as text, you would want at
least something like Minimal canonicalization.
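
To make the idea concrete, here is a rough Python sketch of what a
Minimal canonicalization for text might look like.  The function name is
hypothetical, and the choice of UTF-8 as the target character set and LF
as the target line ending are my assumptions for illustration, not
something fixed above.

```python
def minimal_canonicalize(raw: bytes, source_encoding: str) -> bytes:
    """Sketch: canonicalize character set and line endings only.

    Assumptions (not from the text above): the canonical character
    encoding is UTF-8 and the canonical line ending is a single LF.
    """
    text = raw.decode(source_encoding)
    # Convert CR-LF pairs first so they do not turn into two LFs,
    # then convert any remaining lone CR.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.encode("utf-8")


# Two platform-specific copies of the same text yield identical octets.
dos_copy = "one\r\ntwo\r\n".encode("latin-1")
unix_copy = "one\ntwo\n".encode("utf-16")
same = (minimal_canonicalize(dos_copy, "latin-1")
        == minimal_canonicalize(unix_copy, "utf-16"))
```

Nothing here looks inside the text, which is why this is enough for
plain text but not for anything processed as XML.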

The basic process of reading XML and presenting it to an application
(whether from an external file or a buffer in memory) is inherent in any
XML processing and destroys information that XML considers
insignificant.  Very explicitly, attribute value white space
is normalized (unless the attribute is declared as CDATA).  In
particular, all leading and trailing white space is stripped from
attribute values and all internal runs of white space are converted to
a single space.  While I have not found it stated explicitly anywhere,
XML experts seem to take it as axiomatic that attribute ordering is
insignificant and that white space between items inside start/end tags is
insignificant.  The XML Infoset says that a CR-LF is converted to an
LF as is a CR not followed by an LF.  There are additional areas where
significance gets more murky, like white space between elements, which
in XSLT for example, is stripped unless you have specifically declared
it to be preserved.  Namespaces are also a somewhat murky area but
XPath and other specs treat a namespace declaration as distributing
its information across all child nodes unless they are shielded by
another namespace declaration with the same prefix.
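
The line-ending and attribute value rules just described can be sketched
in a few lines of Python.  The function names and the declared_type
parameter are hypothetical, and the sketch ignores entity and character
references, which a real XML processor must also handle.

```python
import re


def normalize_line_endings(text: str) -> str:
    # CR-LF becomes LF, as does a CR not followed by an LF.
    return text.replace("\r\n", "\n").replace("\r", "\n")


def normalize_attribute_value(value: str, declared_type: str = "CDATA") -> str:
    """Sketch of attribute value white space normalization."""
    value = normalize_line_endings(value)
    # Each remaining white space character becomes a single space.
    value = re.sub(r"[\t\n]", " ", value)
    if declared_type != "CDATA":
        # Non-CDATA: strip leading/trailing spaces, collapse internal runs.
        value = re.sub(r" +", " ", value.strip(" "))
    return value
```

With a non-CDATA declaration, normalize_attribute_value(" a,  b,    c ",
"NMTOKENS") gives "a, b, c", while with the CDATA default the leading,
trailing, and repeated spaces survive.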

What this means is that if you have a hunk of XML like

<Element	z=" a,  b,    c "
a="a"	xmlns:Prefix="data:1234"
	>	<A>1</A><B>
<C
Prefix:m="n" >
</C>
	</B>	</Element
>

and you did any XML processing with it, it would be nonconformant (in
the presence of a DTD declaration other than CDATA) not to convert the
value of the z attribute to "a, b, c".  And it would be entirely
reasonable to get an internal representation which, if you output it,
was something like

<Element xmlns:Prefix="data:1234" a="a" z="a, b, c">
       <A xmlns:Prefix="data:1234">1</A><B xmlns:Prefix="data:1234">
<C xmlns:Prefix="data:1234" Prefix:m="n">
</C>
	</B>	</Element>

or something with any intermediate amount of normalization between this
and the input.  All would be conformant to the XML rules.

The above isn't the canonical printout according to the current W3C
canonical XML proposal but will give you an idea of the normalization
that can occur to the internal data structure just from reading XML
for normal conformant XML processing.

Sure, if you have XML but are treating it as binary data, you may not
need any canonicalization.  And if you have XML and treat it just as
text, you may need only minimal canonicalization.  But if you are
going to process it as XML and want signatures over it that are
interoperable, I don't see how you can escape the need for XML
canonicalization.
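
As an illustration only, the following Python sketch shows two
serializations that differ in attribute order and in white space inside
tags, and that become identical after an XML canonicalization.  It uses
canonicalize() from Python's standard xml.etree.ElementTree module
(Python 3.8 or later), which implements the later C14N 2.0 algorithm and
not the W3C proposal under discussion here; the sample documents and the
helper name are made up for the example.

```python
from xml.etree.ElementTree import canonicalize  # available in Python 3.8+

# Same information, different octets: attribute order and the white
# space between items inside the tags differ.
messy = (
    '<Element\tz="1"\n'
    'a="a"\txmlns:Prefix="data:1234"\n'
    '\t><C\nPrefix:m="n" ></C></Element\n>'
)
tidy = (
    '<Element xmlns:Prefix="data:1234" a="a" z="1">'
    '<C Prefix:m="n"></C></Element>'
)


def same_after_c14n(first: str, second: str) -> bool:
    """True if both documents canonicalize to the same string."""
    return canonicalize(xml_data=first) == canonicalize(xml_data=second)


octets_differ = messy != tidy
canonical_forms_match = same_after_c14n(messy, tidy)
```

A digest computed over the raw strings would differ, while a digest over
the canonical forms would match, which is the interoperability point.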

SignedInfo is XML, is signed, and I would think we would want those
signatures to be interoperable.  Thus I conclude that at least the
default and quite possibly the fixed canonicalization for SignedInfo
must be an XML canonicalization.  Because we control the syntax of
SignedInfo, we can make additional choices.  Although I'm not
proposing any decision at this time, we do not make any use of XML
Comments in SignedInfo, for example, so if we decided that it was
reasonable never to do so and if we also decided there was utility in
allowing unsecured comments to be sprinkled into and removed from
SignedInfo, we could specify a default XML canonicalization (or
transform if you are worried about my use of the c14n word stepping on
the toes of the W3C official c14n effort) that stripped out all XML
comments.
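
A comment-stripping canonicalization of that kind can be sketched with
the same standard-library canonicalize() function from Python's
xml.etree.ElementTree (Python 3.8 or later, C14N 2.0), which omits
comments when with_comments is False.  The SignedInfo content below is
simplified and hypothetical, and this is not a proposal for the actual
algorithm.

```python
from xml.etree.ElementTree import canonicalize  # available in Python 3.8+


def c14n_without_comments(xml_text: str) -> str:
    # with_comments=False (also the default) drops all XML comments.
    return canonicalize(xml_data=xml_text, with_comments=False)


plain = '<SignedInfo><Reference URI="#a"></Reference></SignedInfo>'
annotated = (
    '<SignedInfo><!-- unsecured note -->'
    '<Reference URI="#a"></Reference><!-- another --></SignedInfo>'
)

# Comments can be sprinkled in or removed without changing the
# canonical form that would be signed.
comments_do_not_matter = (
    c14n_without_comments(plain) == c14n_without_comments(annotated)
)
```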

Thanks,
Donald
=====================================================================
 Donald E. Eastlake 3rd   +1 914-276-2668     dee3@torque.pothole.com
 65 Shindegan Hill Road, RR#1  +1 914-784-7913(work)  dee3@us.ibm.com
 Carmel, NY 10512 USA

Received on Sunday, 24 October 1999 23:07:50 UTC