XML Canonicalization and Syntax Constraint Considerations from dee3@us.ibm.com on 1999-12-20 (w3c-ietf-xmldsig@w3.org from October to December 1999)

From: <dee3@us.ibm.com>
Date: Mon, 20 Dec 1999 16:57:15 -0500
To: w3c-ietf-xmldsig@w3.org
Message-ID: <8525684D.00790951.00@D51MTA05.pok.ibm.com>
Since my presentation on canonicalization at the Washington IETF meeting was
fairly well received, I thought I would write it up with some more detail as
a proposed section of the Syntax and Processing draft.  The material I've
written is below.  My current feeling is that this should be a new top
level section although there are other places it could go...

Thanks,
Donald

Donald E. Eastlake, 3rd
IBM, 17 Skyline Drive, Hawthorne, NY 10532 USA
dee3@us.ibm.com   tel: 1-914-784-7913, fax: 1-914-784-3833

home: 65 Shindegan Hill Road, RR#1, Carmel, NY 10512 USA
dee3@torque.pothole.com   tel: 1-914-276-2668


X.0  XML Canonicalization and Syntax Constraint Considerations

Digital signatures only work if the verification calculations are performed
on exactly the same bits as the signing calculations.  If the surface
representation of the signed data can change between signing and
verification, then some way to standardize the changeable aspect must be
used before signing and verification.  For example, even with something as
simple as ASCII text, there are at least three different line ending
sequences in wide use.  If it is possible for signed text to be modified
from one line ending convention to another between the time of signing and
signature verification, then the line endings need to be canonicalized to a
standard form before signing and verification or signatures will break.
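The line ending example above can be sketched in a few lines of Python.
This is only an illustration: the choice of #xA as the canonical line
ending, the use of SHA-1 as the digest, and the function names are
assumptions made for the sketch, not part of the proposed text.

```python
import hashlib


def canonicalize_line_endings(data: bytes) -> bytes:
    """Map CR LF pairs and lone CR characters to a single LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def digest(data: bytes) -> str:
    """Hex SHA-1 digest of the exact bytes given."""
    return hashlib.sha1(data).hexdigest()


# The same two lines of text under three line ending conventions.
variants = (
    b"line one\nline two\n",      # LF
    b"line one\r\nline two\r\n",  # CR LF
    b"line one\rline two\r",      # CR
)

# Digests over the raw bytes disagree; digests over the canonical form agree.
raw_digests = {digest(v) for v in variants}
canonical_digests = {digest(canonicalize_line_endings(v)) for v in variants}
```

Here raw_digests holds three distinct values while canonical_digests holds
one, which is the property a signature needs in order to survive a change
of line ending convention.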

XML is subject to surface representation changes and to processing which
discards some surface information in typical applications.  For this
reason, XML digital signatures have provision for indicating
canonicalization methods in the signature so that a verifier can use the
same canonicalization before its verification calculations as was used by
the signer.

It is useful to distinguish the Signature element from separate signed XML
items.  It is possible for an isolated XML document to be treated as if it
were binary data so that no changes can occur.  In that case, the digest of
the document will not change and it need not be canonicalized if it is
signed and verified as data.  On the other hand, XML which is read and
processed using standard XML parsing and processing techniques is thereby
changed so that some of its surface representation information is lost or
modified.  In particular, this will occur in many cases for the Signature
and enclosed SignedInfo elements since they, and possibly an encompassing
XML document, will be processed as XML.

Similarly, these considerations apply to Manifest, Package, Object, and
SignatureProperties elements if those elements have been digested, their
DigestValue is to be checked, and they are being processed as XML.

The kinds of changes in XML which may need to be canonicalized can be
divided into three categories.  First, there are those related to the basic
XML 1.0 standard, as described in X.1 below.  Second, there are those
related to DOM, SAX, or similar processing, as described in X.2 below.
Third, there is the possibility of character set conversion, such as
between UTF-8 and UTF-16, both of which all XML standards compliant
processors are required to support.  Any canonicalization algorithm should
yield output in a specific fixed character set.  For both the minimal
canonicalization defined in this document and the W3C standard XML
canonicalization, that character set is UTF-8.
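To illustrate the character set point, here is a rough Python sketch.  It
assumes the source encoding is already known to the caller, and it ignores
any encoding named in an XML declaration, which a real implementation would
also have to deal with.  The function name and the sample element are made
up for the example.

```python
import hashlib


def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode data from a known source encoding into UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


text = '<Doc note="caf\u00e9"/>'
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16")  # Python prepends a byte order mark here

# Same characters, different bytes, hence different digests.
raw_match = hashlib.sha1(as_utf8).digest() == hashlib.sha1(as_utf16).digest()

# After conversion to the fixed output encoding the digests agree.
fixed_match = (
    hashlib.sha1(to_utf8(as_utf8, "utf-8")).digest()
    == hashlib.sha1(to_utf8(as_utf16, "utf-16")).digest()
)
```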

X.1 XML 1.0, Syntax Constraints, and Canonicalization

The XML 1.0 Standard defines an interface where a conformant application
reading XML is given certain information from that XML and not other
information.  In particular, (1) line endings are normalized to the single
character #xA by dropping #xD characters if they are immediately followed
by a #xA and replacing them with #xA in all other cases, (2) missing
attributes declared to have default values are provided to the application
as if present with the default value, (3) character references are replaced
with the corresponding character, (4) entity references are replaced with
the corresponding declared entity, (5) attribute values are normalized by
(5A) replacing character and entity references as above, (5B) replacing
occurrences of #x9, #xA, and #xD with #x20 (space) except that the sequence
#xD#xA is replaced by a single space, and (5C) if the attribute is not
declared to be CDATA, stripping all leading and trailing spaces and
replacing all interior runs of spaces with a single space, and (6) for
elements declared to have element content, white space that appears within
their content but not within the content of any enclosed element is
eliminated.
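Items (1), (5B), and (5C) above are simple enough to sketch directly.  The
Python below is a simplified illustration, not a parser: the function names
are my own, reference expansion (5A) is left out, and the attribute function
treats every white space character in its input as literal source text.

```python
import re


def normalize_line_endings(text: str) -> str:
    """Item (1): #xD#xA becomes #xA, then any remaining #xD becomes #xA."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def normalize_attribute_whitespace(value: str, is_cdata: bool = True) -> str:
    """Items (5B) and (5C), applied to an attribute value's literal text."""
    # (5B) the pair #xD#xA becomes one space; other #x9, #xA, #xD
    # characters each become one space.
    value = value.replace("\r\n", " ")
    value = re.sub(r"[\t\n\r]", " ", value)
    if not is_cdata:
        # (5C) strip leading and trailing spaces and collapse interior runs.
        value = re.sub(r" +", " ", value.strip(" "))
    return value


sample = " x\r\n\ty  "
as_cdata = normalize_attribute_whitespace(sample, is_cdata=True)
as_non_cdata = normalize_attribute_whitespace(sample, is_cdata=False)
```

With this input, as_cdata keeps its leading, trailing, and doubled interior
spaces, while as_non_cdata is reduced to "x y".  The two results differ only
because of the declared attribute type, which is the dependence on
declarations discussed next.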

Note that items (2), (4), (5C), and (6) depend on specific Schema, DTD, or
similar declarations. In the general case, such declarations will not be
available to or used by the signature verifier.  Thus, for
interoperability, it is RECOMMENDED that the following syntax constraints
be observed when generating any material to be signed and processed as XML,
such as the SignedInfo element: (1) attributes having default values be
explicitly present, (2) all entity references (except "amp", "lt", "gt",
"apos", and "quot" which are pre-defined) be expanded, (3) attribute value
white space be normalized, and (4) insignificant white space not be
generated within elements having element content.
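As a rough aid for constraint (2), a signer could scan generated text for
entity references other than the five pre-defined ones.  The sketch below
is an assumption-laden illustration: it matches ASCII entity names only,
does not skip CDATA sections or comments, and the function name is
hypothetical.

```python
import re

# The five entity names pre-defined by XML 1.0.
PREDEFINED_ENTITIES = {"amp", "lt", "gt", "apos", "quot"}

# Matches &name; but not character references such as &#x41; or &#65;.
_ENTITY_REF = re.compile(r"&([A-Za-z_:][A-Za-z0-9_.:\-]*);")


def unexpanded_entity_refs(xml_text: str) -> list:
    """Return the names of entity references that are not pre-defined."""
    return [
        name
        for name in _ENTITY_REF.findall(xml_text)
        if name not in PREDEFINED_ENTITIES
    ]


leftover = unexpanded_entity_refs('<a b="&quot;x&quot;">&copy; &#x41; &lt;</a>')
```

In this example only "copy" is reported, since it is the one reference a
verifier could not expand without access to the declarations.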

X.2 DOM/SAX Processing and Canonicalization

In addition to the canonicalization and syntax constraints discussed above,
most XML applications use the DOM standard or SAX interface for XML input.
DOM maps XML into a tree structure of nodes and typically assumes it will
be used on an entire document with subsequent processing being done on this
tree.  SAX converts XML into a series of events such as a start tag, text,
etc.  In either case, many surface characteristics such as the ordering of
attributes and insignificant white space within start/end tags are lost.  In
addition, namespace declarations are mapped over the nodes to which they
apply, losing the namespace prefixes in the source text and, in most cases,
losing the information as to exactly where namespace declarations appeared
in the original.
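The loss of surface information is easy to observe with any tree-building
parser.  The sketch below uses Python's xml.etree.ElementTree as a stand-in
for DOM-style processing (it is not the W3C DOM itself); the element name,
attributes, and namespace URI are invented for the example.

```python
import hashlib
import xml.etree.ElementTree as ET

# Two surface forms that differ in prefix, attribute order, quoting,
# white space inside the start tag, and empty-element syntax.
form_a = b'<p:Item xmlns:p="urn:example" id="1" kind="x"/>'
form_b = b"<q:Item   kind='x'\n   id='1'  xmlns:q='urn:example' ></q:Item>"


def parsed_view(xml_bytes: bytes):
    """What an application sees after parsing: name, attributes, content."""
    elem = ET.fromstring(xml_bytes)
    return (elem.tag, dict(elem.attrib), elem.text, len(elem))


# The raw bytes, and therefore their digests, differ.
raw_differs = hashlib.sha1(form_a).digest() != hashlib.sha1(form_b).digest()

# After parsing, the two forms can no longer be told apart.
same_after_parsing = parsed_view(form_a) == parsed_view(form_b)
```

Both forms yield the expanded name "{urn:example}Item" with the same two
attributes, so an application that only holds the parsed tree cannot
recover which byte sequence was originally signed.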

If an XML digital signature is to be produced or verified on a system using
the common DOM or SAX processing, the need is actually for a canonical
method to serialize the relevant part of a DOM tree or relevant sequence of
SAX events.  XML canonicalization specifications, such as the W3C standard,
are based only on information which is preserved by DOM and SAX.  For an
XML digital signature to be verifiable by an implementation using DOM or
SAX, not only must the syntax constraints given in X.1 be followed but an
appropriate XML canonicalization must be specified so that the verifier can
re-serialize DOM/SAX mediated input into the same byte sequence that was
signed.
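
As an illustration of re-serializing parsed input into a fixed byte
sequence, the sketch below uses the canonicalize() function in Python's
xml.etree.ElementTree (available from Python 3.8).  That function
implements the later W3C Canonical XML 2.0 specification, so it is only a
stand-in for whichever canonicalization method a signature actually names.
The element content shown is made up and is not a valid SignedInfo.

```python
import hashlib
from xml.etree.ElementTree import canonicalize


def canonical_digest(xml_text: str) -> str:
    """Hex SHA-1 digest over the UTF-8 bytes of the canonical serialization."""
    canonical = canonicalize(xml_data=xml_text)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


# The form the signer produced and a form a verifier might reconstruct
# after the document has passed through an XML processor.
signed_form = '<SignedInfo  b="2" a="1"><Ref  URI="#x" /></SignedInfo>'
received_form = "<SignedInfo a='1' b='2'><Ref URI='#x'></Ref></SignedInfo>"

raw_equal = signed_form == received_form
digests_equal = canonical_digest(signed_form) == canonical_digest(received_form)
```

The two strings differ as text, yet their canonical digests agree, which is
what allows the verifier to check a signature computed by the signer.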
Received on Monday, 20 December 1999 17:02:23 UTC