Revised current Section 7 from Donald E. Eastlake 3rd on 2000-01-25 (w3c-ietf-xmldsig@w3.org from January to March 2000)

From: Donald E. Eastlake 3rd <dee3@torque.pothole.com>
Date: Tue, 25 Jan 2000 16:18:30 -0500
To: w3c-ietf-xmldsig@w3.org
Message-Id: <200001252118.QAA25295@torque.pothole.com>
<H2>7.0 <A name=sec-XML-Canonicalization>XML Canonicalization</A> and Syntax 
Constraint Considerations</H2>
<P>Digital signatures only work if the verification calculations are performed 
on exactly the same bits as the signing calculations. If the surface 
representation of the signed data can change between signing and verification, 
then some way to standardize the changeable aspect must be used before signing 
and verification. For example, even for simple ASCII text there are at least 
three widely used line ending sequences. If it is possible for signed text to be 
modified from one line ending convention to another between the time of signing 
and signature verification, then the line endings need to be canonicalized to a 
standard form before signing and verification or the signatures will break. </P>
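<P>The line ending example can be sketched as follows. This is only an illustration: the helper names, the sample text, the choice of #xA as the standard form, and the use of SHA-256 as the digest are assumptions made for the sketch, not requirements of this document.</P>

```python
import hashlib


def canonicalize_line_endings(data: bytes) -> bytes:
    """Map CR LF pairs and bare CR to LF (one possible standard form)."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def digest(data: bytes) -> str:
    """Hex digest of the exact bytes given (SHA-256 chosen arbitrarily)."""
    return hashlib.sha256(data).hexdigest()


# The same two lines of text under three line ending conventions.
variants = [
    b"line one\nline two\n",      # LF
    b"line one\r\nline two\r\n",  # CR LF
    b"line one\rline two\r",      # CR
]

raw_digests = {digest(v) for v in variants}
canonical_digests = {digest(canonicalize_line_endings(v)) for v in variants}

print(len(raw_digests))        # 3: every variant digests differently
print(len(canonical_digests))  # 1: one digest after canonicalization
```

<P>A signature computed over the raw bytes of one variant would fail to verify against either of the others, while a signature over the canonicalized bytes survives the conversion.</P>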
<P>XML is subject to surface representation changes and to processing which 
discards some surface information. For this reason, XML digital signatures have 
a provision for indicating canonicalization methods in the signature so that a 
verifier can use the same canonicalization as the signer. </P>
<P>Throughout this document we distinguish between the canonicalization of a 
<TT>Signature</TT> data object and other signed XML data objects. It is possible 
for an isolated XML document to be treated as if it were binary data so that no 
changes can occur. In that case, the digest of the document will not change and 
it need not be canonicalized if it is signed and verified as such. However, XML 
that is read and processed using standard XML parsing and processing techniques 
is frequently changed such that some of its surface representation information 
is lost or modified. In particular, this will occur in many cases for the 
<TT>Signature</TT> and enclosed <TT>SignedInfo</TT> elements since they, and 
possibly an encompassing XML document, will be processed as XML. </P>
<P>Similarly, these considerations apply to <TT>Manifest</TT>, <TT>Object</TT>, 
and <TT>SignatureProperties</TT> elements if those elements have been digested, 
their <TT>DigestValue</TT> is to be checked, and they are being processed as 
XML.</P>
<P>The kinds of changes in XML that may need to be canonicalized can be divided 
into three categories. There are those related to the basic [XML], as described 
in 7.1 below. There are those related to [DOM], [SAX], or similar processing as 
described in 7.2 below. And, third, there is the possibility of character set 
conversion, such as between UTF-8 and UTF-16, both of which all 
standards-compliant XML processors are required to support. Any canonicalization algorithm 
should yield output in a specific fixed character set. For both the minimal 
canonicalization defined in this document and the W3C Canonical XML [<A 
href="http://www.w3.org/Signature/Drafts/WD-xmldsig-core-20000114/Overview.html#ref-XML-c14n">XML-c14n</A>], 
that character set is UTF-8. </P>
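<P>The effect of character set conversion on a digest can be illustrated with a short sketch. The sample document, the helper name <TT>digest_as_utf8</TT>, and the use of SHA-256 are assumptions made for illustration only.</P>

```python
import hashlib

# One sequence of characters, stored in two encodings.
text = "<doc>caf\u00e9</doc>"
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16")  # Python prepends a byte order mark

same_characters = utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16")
same_bytes = utf8_bytes == utf16_bytes


def digest_as_utf8(raw: bytes, encoding: str) -> str:
    """Decode from the stated encoding, re-encode as UTF-8, then digest."""
    return hashlib.sha256(raw.decode(encoding).encode("utf-8")).hexdigest()


print(same_characters)  # True
print(same_bytes)       # False
print(digest_as_utf8(utf8_bytes, "utf-8") == digest_as_utf8(utf16_bytes, "utf-16"))  # True
```

<P>Fixing the output character set is what lets the two stored forms yield a single digest.</P>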
<H3>7.1 <A name=sec-XML-1>XML 1.0</A>, Syntax Constraints, and 
Canonicalization</H3>
<P>XML 1.0 [<A 
href="http://www.w3.org/Signature/Drafts/WD-xmldsig-core-20000114/Overview.html#ref-XML">XML</A>] 
defines an interface where a conformant application reading XML is given certain 
information from that XML and not other information. In particular, 
<OL>
  <LI>line endings are normalized to the single character #xA by dropping #xD 
  characters if they are immediately followed by a #xA and replacing them with 
  #xA in all other cases, 
  <LI>missing attributes declared to have default values are provided to the 
  application as if present with the default value, 
  <LI>character references are replaced with the corresponding character, 
  <LI>entity references are replaced with the corresponding declared entity, 
  <LI>attribute values are normalized by 
  <OL type=A>
    <LI>replacing character and entity references as above, 
    <LI>replacing occurrences of #x9, #xA, and #xD with #x20 (space) except that 
    the sequence #xD#xA is replaced by a single space, and 
    <LI>if the attribute is not declared to be CDATA, stripping all leading and 
    trailing spaces and replacing all interior runs of spaces with a single 
    space, and </LI></OL>
  <LI>for elements declared to have element content, white space that appears 
  within their content, but not within the content of any enclosed element, is 
  eliminated. </LI></OL>
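<P>Several of the items above can be observed with an ordinary parser. The sketch below assumes Python's <TT>xml.etree.ElementTree</TT> module (which is backed by the expat parser) and an invented sample document with an invented entity <TT>co</TT>. It exercises only items (1), (3), (4), (5A), and (5B); the remaining items are not shown.</P>

```python
import xml.etree.ElementTree as ET

# Invented sample: the internal DTD subset declares one entity, the
# attribute value holds a tab, a CR LF pair, and a character reference,
# and the content mixes CR LF, a bare CR, and both kinds of reference.
source = (
    '<!DOCTYPE doc [<!ENTITY co "Example Corp">]>\n'
    '<doc note="x\ty\r\nz &#x41;">first\r\nsecond\rthird &#66; &co;</doc>'
)

root = ET.fromstring(source)

# Items (5A) and (5B): references replaced; tab and CR LF each become one space.
print(repr(root.get("note")))  # 'x y z A'

# Items (1), (3), (4): line endings become #xA and references are replaced.
print(repr(root.text))  # 'first\nsecond\nthird B Example Corp'
```

<P>The application never sees the original tab, carriage returns, or references, so a digest computed from what the parser delivers cannot match one computed over the source bytes.</P>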
<P>Note that items (2), (4), (5C), and (6) depend on specific Schema, DTD, or 
similar declarations. In the general case, such declarations will not be 
available to or used by the signature verifier. Thus, to interoperate between
different XML implementations, the following syntax constraints MUST be
observed when generating any signed material to be processed as XML,
including the <TT>SignedInfo</TT> element:
<OL>
  <LI>attributes having default values be explicitly present, 
  <LI>all entity references (except "amp", "lt", "gt", "apos", and "quot" which 
  are pre-defined) be expanded, 
  <LI>attribute value white space be normalized, and 
  <LI>insignificant white space not be generated within elements having element 
  content. </LI></OL>
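<P>As a rough illustration of the second constraint, a generator might scan its output for entity references other than the five pre-defined ones. The helper name <TT>unexpanded_entity_refs</TT> and the regular expression are hypothetical, and the pattern is a simplification: it accepts only ASCII names and does not skip comments or CDATA sections.</P>

```python
import re

# The five entities pre-defined by XML 1.0.
PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}

# Matches &name; but not character references such as &#65; or &#x41;.
ENTITY_REF = re.compile(r"&([A-Za-z_:][A-Za-z0-9_.:\-]*);")


def unexpanded_entity_refs(xml_text):
    """Return the names of entity references that are not pre-defined."""
    return [name for name in ENTITY_REF.findall(xml_text) if name not in PREDEFINED]


sample = '<a b="x &amp; y">&co; &#65; &lt;</a>'
print(unexpanded_entity_refs(sample))  # ['co']
```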
<H3>7.2 <A name=sec-DOM-SAX>DOM/SAX</A> Processing and Canonicalization</H3>
<P>In addition to the canonicalization and syntax constraints discussed above, 
many XML applications use the Document Object Model [<A 
href="http://www.w3.org/Signature/Drafts/WD-xmldsig-core-20000114/Overview.html#ref-DOM">DOM</A>] 
or the Simple API for XML [<A 
href="http://www.w3.org/Signature/Drafts/WD-xmldsig-core-20000114/Overview.html#ref-SAX">SAX</A>]. 
DOM maps XML into a tree structure of nodes and typically assumes it will be 
used on an entire document with subsequent processing being done on this tree. 
SAX converts XML into a series of events such as a start tag, content, etc. In 
either case, many surface characteristics, such as the ordering of attributes and 
insignificant white space within start/end tags, are lost. In addition, namespace 
declarations are mapped over the nodes to which they apply, losing the namespace 
prefixes in the source text and, in most cases, losing where namespace 
declarations appeared in the original instance.</P>
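<P>The loss of surface information can be seen with any tree-building parser. The sketch below uses Python's <TT>xml.etree.ElementTree</TT> as a stand-in for DOM or SAX processing; the sample documents, the namespace name <TT>urn:example</TT>, and the helper <TT>parsed_view</TT> are invented for illustration.</P>

```python
import xml.etree.ElementTree as ET

# Two surface forms that differ in attribute order, quoting, white space
# inside the start tag, and empty-element syntax.
variant_a = '<doc b="2" a="1"/>'
variant_b = "<doc   a='1'\n     b='2' ></doc>"


def parsed_view(xml_text):
    """What the application sees: name, attributes, text, child count."""
    root = ET.fromstring(xml_text)
    return (root.tag, dict(root.attrib), root.text, len(root))


print(variant_a == variant_b)                            # False
print(parsed_view(variant_a) == parsed_view(variant_b))  # True

# A namespace prefix in the source text is replaced by the namespace name.
prefixed = '<x:doc xmlns:x="urn:example"/>'
defaulted = '<doc xmlns="urn:example"/>'
print(ET.fromstring(prefixed).tag)   # {urn:example}doc
print(ET.fromstring(defaulted).tag)  # {urn:example}doc
```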
<P>If an XML Signature is to be produced or verified on a system using DOM 
or SAX processing, a canonical method is needed to serialize the relevant 
part of a DOM tree or sequence of SAX events. XML canonicalization 
specifications, such as [<A 
href="http://www.w3.org/Signature/Drafts/WD-xmldsig-core-20000114/Overview.html#ref-XML-c14n">XML-c14n</A>], 
are based only on information which is preserved by DOM and SAX. For an XML 
Signature to be verifiable by an implementation using DOM or SAX, not only must 
the syntax constraints given in <A 
href="http://www.w3.org/Signature/Drafts/WD-xmldsig-core-20000114/Overview.html#sec-XML-1">section-7.1</A> 
be followed but an appropriate XML canonicalization MUST be specified so that 
the verifier can re-serialize DOM/SAX mediated input into the same byte sequence 
that was signed.</P>
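<P>A minimal sketch of such re-serialization follows. It assumes Python 3.8 or later, whose <TT>xml.etree.ElementTree.canonicalize</TT> function implements Canonical XML 2.0. That is a later specification than the [XML-c14n] draft cited here and is used only as a stand-in. The sample documents, the helper name <TT>canonical_digest</TT>, and the use of SHA-256 are likewise assumptions.</P>

```python
import hashlib
import xml.etree.ElementTree as ET

# Two surface forms of the same element, as a signer and a verifier
# might each hold after independent XML processing.
variant_a = '<doc b="2" a="1"/>'
variant_b = "<doc   a='1'\n     b='2' ></doc>"


def canonical_digest(xml_text):
    """Canonicalize, encode as UTF-8, and digest the resulting bytes."""
    canonical = ET.canonicalize(xml_text)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


print(ET.canonicalize(variant_a))  # <doc a="1" b="2"></doc>
print(canonical_digest(variant_a) == canonical_digest(variant_b))  # True
```

<P>Both parties arrive at the same byte sequence only because they apply the same canonicalization, which is why the method has to be identified in the signature.</P>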
Received on Tuesday, 25 January 2000 16:18:34 UTC