Re: Serialization and canonicalization

C14N makes irreversible changes to the document.  If we canonicalize
before encrypting, we cannot recover the original document upon
decryption.

According to http://www.w3.org/TR/2000/CR-xml-c14n-20001026, canonicalization
includes such steps as:

        Character and parsed entity references are replaced 
        CDATA sections are replaced with their character content 
        The XML declaration and document type declaration (DTD)
           are removed
        Empty elements are converted to start-end tag pairs 
        Attribute value delimiters are set to double quotes 
        Special characters in attribute values and character content
           are replaced by character references
        Superfluous namespace declarations are removed from each element
        Default attributes are added to each element 
        Lexicographic order is imposed on the namespace declarations
           and attributes of each element
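
To make a few of these concrete, here is a minimal sketch using Python's
standard library.  Note the assumption: xml.etree.ElementTree.canonicalize()
implements the later C14N 2.0 rules rather than the candidate recommendation
cited above, so it only illustrates the steps the two have in common.  The
sample document is made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Made-up input: single-quoted attributes in non-alphabetical order, a
# character reference, an empty-element tag and a CDATA section.
original = (
    "<?xml version='1.0'?>"
    "<doc b='2' a='&#65;'><empty/><![CDATA[x < y]]></doc>"
)

# With no `out` argument, canonicalize() returns the canonical form as a str.
canonical = ET.canonicalize(original)

print(original)
print(canonical)
```

Nothing in the canonical string records which quote style, attribute order
or CDATA usage the author chose, so those choices cannot be restored from it.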

I think it would be desirable to retain the DTD and XML declarations
across the encryption/decryption transform (if we do include those
parts within the encrypted region).  Also I don't think we should add
default attributes to each element, or reorder attributes and namespace
declarations to lexicographic order, or do most of these other changes.

XML is more than a machine readable format.  The creator of the
document may have made decisions about the use of entities or character
encodings, quote style and ordering of attributes based on readability
and cleanliness.  C14N considers these aspects unimportant for functional
purposes and will change them.  That's fine for signature verification,
but not for encryption/decryption.

I am somewhat confused about the processing model that is envisioned
for XML encryption.  It appears that it may be something like:

  1. Parse XML into node-set
  2. Select node(s) to encrypt
  3. Serialize selected nodes into UTF-8 byte stream (along the lines of
     the C14N process)
  4. Encrypt the resulting byte stream using standard methods
  5. Package the encrypted byte stream with XML wrappers
  6. Insert the resulting XML node-set back into the original document
     in place of original node-set (as one possibility at least)
  7. Re-serialize modified document to produce output
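
As I read them, the seven steps could be sketched roughly as follows.
Everything here is an assumption for illustration: the EncryptedData wrapper
name is hypothetical, toy_encrypt() is an insecure XOR stand-in for a real
cipher, and ElementTree's C14N 2.0 canonicalize() stands in for the
serialization in step 3.

```python
import base64
import xml.etree.ElementTree as ET


def toy_encrypt(data, key):
    """Insecure XOR placeholder for a real cipher; it is its own inverse."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def encrypt_node(doc, tag, key):
    root = ET.fromstring(doc)                                  # 1. parse
    # 2. select: first element that has a direct child named `tag`
    parent = next(p for p in root.iter() if p.find(tag) is not None)
    target = parent.find(tag)
    index = list(parent).index(target)
    tail, target.tail = target.tail, None    # text after the node stays put
    fragment = ET.tostring(target, encoding="unicode")
    serialized = ET.canonicalize(fragment).encode("utf-8")     # 3. serialize
    ciphertext = toy_encrypt(serialized, key)                  # 4. encrypt
    wrapper = ET.Element("EncryptedData")                      # 5. package
    wrapper.text = base64.b64encode(ciphertext).decode("ascii")
    wrapper.tail = tail
    parent.remove(target)                                      # 6. insert
    parent.insert(index, wrapper)
    return ET.tostring(root, encoding="unicode")               # 7. re-serialize


key = b"not-a-real-key"
source = "<?xml version='1.0'?><order id='7'><card type='visa' brand='x'/><note>hi</note></order>"
output = encrypt_node(source, "card", key)

# Decrypting yields the canonical form of the node, not the author's text.
encoded = ET.fromstring(output).find("EncryptedData").text
recovered = toy_encrypt(base64.b64decode(encoded), key).decode("utf-8")
print(output)
print(recovered)
```

In this sketch both the surrounding document (step 7) and the recovered node
(step 3) come back in the serializer's form rather than the form the author
wrote.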

I was thinking in terms of an alternative, based more on the original
XML document.  In this model the parsing serves as a guide to identify
substrings of the original document which are targets for encryption.
These are encrypted, the data is wrapped in XML format, and the plaintext
substring is replaced with the serialized form of the XML-wrapped
encrypted ciphertext.
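
A minimal sketch of this substring approach, under several assumptions: the
parser is only used to find byte offsets (expat's CurrentByteIndex), the
target element has separate start and end tags, the input is UTF-8,
EncryptedData is a hypothetical wrapper name, and toy_encrypt() is an
insecure XOR stand-in for a real cipher.

```python
import base64
import xml.parsers.expat

WRAP_OPEN, WRAP_CLOSE = b"<EncryptedData>", b"</EncryptedData>"


def toy_encrypt(data, key):
    """Insecure XOR placeholder for a real cipher; it is its own inverse."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def find_element_span(doc, name):
    """Return (start, end) byte offsets of the first element called `name`."""
    span = {}
    depth = 0  # number of open elements nested inside the target
    parser = xml.parsers.expat.ParserCreate()

    def on_start(tag, attrs):
        nonlocal depth
        if "start" not in span:
            if tag == name:
                span["start"] = parser.CurrentByteIndex  # offset of '<'
        elif "end" not in span:
            depth += 1

    def on_end(tag):
        nonlocal depth
        if "start" in span and "end" not in span:
            if depth == 0:
                # The event points at '</'; the tag ends at the next '>'.
                span["end"] = doc.index(b">", parser.CurrentByteIndex) + 1
            else:
                depth -= 1

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.Parse(doc, True)
    return span["start"], span["end"]


def encrypt_substring(doc, name, key):
    start, end = find_element_span(doc, name)
    body = base64.b64encode(toy_encrypt(doc[start:end], key))
    return doc[:start] + WRAP_OPEN + body + WRAP_CLOSE + doc[end:]


def decrypt_substring(doc, key):
    start, end = find_element_span(doc, "EncryptedData")
    body = doc[start + len(WRAP_OPEN):end - len(WRAP_CLOSE)]
    return doc[:start] + toy_encrypt(base64.b64decode(body), key) + doc[end:]


key = b"not-a-real-key"
original = (b"<?xml version='1.0'?>\n<order id='7'>"
            b"<card type='visa'><![CDATA[4111 & 1111]]></card><note/></order>")
encrypted = encrypt_substring(original, "card", key)
restored = decrypt_substring(encrypted, key)
```

Because only the selected bytes are touched, the XML declaration, quote
style and CDATA section of the original survive the round trip unchanged.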

This is perhaps functionally the same as the node-set based model,
except that it uses minimal canonicalization as defined by XML Signature,
or even no canonicalization at all.
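
For comparison, minimal canonicalization as I understand it from the XML
Signature drafts only converts the character encoding to UTF-8 and normalizes
line endings.  A rough sketch under that assumption (the function name is
hypothetical, and it does not rewrite any encoding declaration inside the
document):

```python
def minimal_canonicalize(data, encoding="utf-8"):
    """Re-encode as UTF-8 without a BOM and normalize line ends to LF."""
    text = data.decode(encoding)
    if text.startswith("\ufeff"):
        text = text[1:]  # drop a byte order mark if one survived decoding
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.encode("utf-8")


sample = b"<a b='1'/>\r\n<!-- quotes and order untouched -->\r\n"
print(minimal_canonicalize(sample))
```

Quote style, attribute order, entities and declarations pass through as
written.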

Hal Finney
PGP Security

> From: "Hiroshi Maruyama" <MARUYAMA@jp.ibm.com>
> Date: Mon, 13 Nov 2000 09:25:31 +0900
>
> When encrypting a substructure of an XML document, we need to
> preserve the data model so that it will be decrypted into exactly
> the same substructure.  XML Canonicalization (or C14N) is one
> way to serialize an XML substructure without losing any information.
> As long as the data model (or information set) is preserved, any
> serialization method will do.  C14N satisfies this property and
> is implemented for XML Signature anyway, so I think it is reasonable
> to reuse the C14N standard.
> By the way, I believe this discussion is exactly why I insist that
> the processing model of XML Encryption should be defined using
> the XML InfoSet (or equivalent data model).  It may free us from
> confusing questions such as character encoding, default
> attribute values, external entities, data types, and so on.
>
> Hiroshi

Received on Sunday, 12 November 2000 22:11:56 UTC