Re: Serialization and canonicalization

It is true that C14N makes irreversible changes to XML documents.
However, it is also true that you cannot exactly preserve an
XML document (I mean, as a character string) if you use an XML
processor as described in the XML 1.0 specification.  A conformant
processor MUST normalize attribute values, for example.
As another example, a conformant processor may discard information
on how many white space characters appeared between attributes.
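
A minimal sketch of the point above, assuming Python's standard
xml.etree.ElementTree (which parses with expat); the element and attribute
names are made up for illustration.  Two documents that differ as character
strings (quote style, attribute order, spacing between attributes, a literal
tab inside an attribute value) reach the application as the same data:

```python
import xml.etree.ElementTree as ET

# Same logical content, different text: extra white space between
# attributes, a literal tab inside a value, different quotes and order.
a = ET.fromstring('<item  id="1"\n      note="two\twords"/>')
b = ET.fromstring("<item note='two words' id='1'></item>")

# Attribute-value normalization turns the literal tab into a space.
print(repr(a.get("note")))

# The application cannot tell the two source strings apart.
print(a.tag == b.tag and a.attrib == b.attrib)
```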

In other words, applications rely on XML processors to extract
the logical information expressed in XML.  This logical information
is collectively called the Information Set.  It is unfortunate that
the Information Set was not defined PRIOR TO XML 1.0, but I still
believe that subsequent XML-related specifications should
be defined in terms of the Information Set.  When I say "preserve
information", I mean "preserve the information set".

If we assume that XML documents are processed by conformant
XML processors before being passed to an application, it is the
Information Set that the application sees.  Therefore, preserving
the textual representation is not important here.

Hiroshi

--
Hiroshi Maruyama
Manager, Internet Technology, Tokyo Research Laboratory
+81-46-215-4576
maruyama@jp.ibm.com



From: hal@finney.org on 2000/11/13 10:16

Please respond to hal@finney.org

To:   Hiroshi Maruyama/Japan/IBM@IBMJP, xml-encryption@w3.org
cc:
Subject:  Re: Serialization and canonicalization



The C14N canonicalization makes irreversible changes to the document.
If we canonicalize before encrypting, there is no way we can recover
the original document upon decryption.

According to http://www.w3.org/TR/2000/CR-xml-c14n-20001026,
canonicalization includes such steps as:

        Character and parsed entity references are replaced
        CDATA sections are replaced with their character content
        The XML declaration and document type declaration (DTD)
           are removed
        Empty elements are converted to start-end tag pairs
        Attribute value delimiters are set to double quotes
        Special characters in attribute values and character content
           are replaced by character references
        Superfluous namespace declarations are removed from each element
        Default attributes are added to each element
        Lexicographic order is imposed on the namespace declarations
           and attributes of each element
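
Several of the steps listed above can be observed with a short sketch.
It assumes Python 3.8 or later, whose xml.etree.ElementTree.canonicalize()
implements Canonical XML 2.0 rather than the 2000 Candidate Recommendation
cited here, so it only approximates that draft.  The sample document is made
up, and default attributes are not shown because they would need a DTD:

```python
import xml.etree.ElementTree as ET

# Made-up input using single quotes, an empty-element tag, a CDATA
# section, a character reference, and an XML declaration.
original = (
    "<?xml version='1.0'?>\n"
    "<doc b='2' a=\"1\"><empty/><![CDATA[x < y]]>&#65;</doc>"
)

canonical = ET.canonicalize(original)
print(canonical)

# Things to look for in the printed result: no XML declaration, no CDATA
# section, attributes in sorted order with double quotes, and the empty
# element written as a start-end tag pair.
```

Canonicalizing the output again should leave it unchanged, which is the
property that matters for signatures; the original spelling of the document
is not recoverable from it.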

I think it would be desirable to retain the DTD and XML declarations
across the encryption/decryption transform (if we do include those
parts within the encrypted region).  Also I don't think we should add
default attributes to each element, or reorder attributes and namespace
declarations to lexicographic order, or do most of these other changes.

XML is more than a machine readable format.  The creator of the
document may have made decisions about the use of entities or character
encodings, quote style and ordering of attributes based on readability
and cleanliness.  C14N considers these aspects unimportant for functional
purposes and will change them.  That's fine for signature verification,
but not for encryption/decryption.

I am somewhat confused about the processing model which is envisioned
for XML encryption.  It appears that it may be something like:

  1. Parse XML into node-set
  2. Select node(s) to encrypt
  3. Serialize selected nodes into UTF-8 byte stream (along the lines of
     the C14N process)
  4. Encrypt the resulting byte stream using standard methods
  5. Package the encrypted byte stream with XML wrappers
  6. Insert the resulting XML node-set back into the original document
     in place of original node-set (as one possibility at least)
  7. Re-serialize modified document to produce output
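
The seven steps above can be sketched roughly as follows.  This is only an
illustration under stated assumptions: the EncryptedData wrapper name, the
encrypt_child and toy_cipher helpers, and the sample document are all
hypothetical, and toy_cipher is a stand-in for "standard methods" that is
NOT secure.  It uses Python 3.8+ and only the standard library:

```python
import base64
import hashlib
import xml.etree.ElementTree as ET


def toy_cipher(data: bytes, key: bytes) -> bytes:
    """Stand-in for step 4: XOR with a SHA-256 keystream. NOT secure."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(x ^ y for x, y in zip(data, stream))


def encrypt_child(doc_text: str, tag: str, key: bytes) -> str:
    root = ET.fromstring(doc_text)                       # 1. parse
    parent = next(p for p in root.iter()                 # 2. select
                  if p.find(tag) is not None)
    target = parent.find(tag)
    tail, target.tail = target.tail, None
    plain = ET.canonicalize(                             # 3. serialize
        ET.tostring(target, encoding="unicode")).encode("utf-8")
    cipher = toy_cipher(plain, key)                      # 4. encrypt
    wrapper = ET.Element("EncryptedData")                # 5. wrap (hypothetical name)
    wrapper.text = base64.b64encode(cipher).decode("ascii")
    wrapper.tail = tail
    index = list(parent).index(target)                   # 6. replace in the tree
    parent.remove(target)
    parent.insert(index, wrapper)
    return ET.tostring(root, encoding="unicode")         # 7. re-serialize


out = encrypt_child(
    "<order id='7'><card>4111</card><note>hi</note></order>", "card", b"k")
print(out)
```

Note that step 7 re-serializes the whole document, so even the parts that
were never selected for encryption (here, the quote style of the id
attribute) may come out spelled differently from the input.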

I was thinking in terms of an alternative, based more on the original
XML document.  In this model the parsing serves as a guide to identify
substrings of the original document which are targets for encryption.
These are encrypted, the data is wrapped in XML format, and the plaintext
substring is replaced with the serialized form of the XML-wrapped
encrypted ciphertext.

This is perhaps functionally the same as the node-set based model,
except that it uses minimal canonicalization as defined by XML Signature,
or even no canonicalization at all.
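
A rough sketch of this substring-based alternative, assuming Python's
xml.parsers.expat and its CurrentByteIndex attribute to locate the target
in the original bytes.  The helper names, the EncryptedData wrapper, and
the XOR stand-in for a real cipher are all hypothetical, and the sketch
assumes the target element has an explicit end tag:

```python
import base64
from xml.parsers import expat


def element_span(data: bytes, name: str):
    """Return (start, end) byte offsets of the first element `name` to close."""
    parser = expat.ParserCreate()
    starts, found = [], []

    def on_start(tag, attrs):
        # Offset of the '<' that opens this start tag.
        starts.append(parser.CurrentByteIndex)

    def on_end(tag):
        begin = starts.pop()
        if tag == name and not found:
            # CurrentByteIndex is the '<' of the end tag; extend to its '>'.
            end = data.index(b">", parser.CurrentByteIndex) + 1
            found.append((begin, end))

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.Parse(data, True)
    if not found:
        raise ValueError("element not found: " + name)
    return found[0]


def replace_span(data: bytes, name: str, encrypt) -> bytes:
    """Encrypt one element's original bytes and splice a wrapper in place."""
    begin, end = element_span(data, name)
    payload = base64.b64encode(encrypt(data[begin:end]))
    wrapper = b"<EncryptedData>" + payload + b"</EncryptedData>"
    return data[:begin] + wrapper + data[end:]


doc = (b"<?xml version='1.0'?>\n"
       b"<order  id = '7'><card>4111</card><note>hi</note></order>")
# XOR with a constant is only a placeholder, not a cipher.
out = replace_span(doc, "card", lambda b: bytes(x ^ 0x5A for x in b))
print(out.decode("ascii"))
```

Everything outside the selected span, including the XML declaration and
the spacing and quote style of the id attribute, is copied through byte for
byte, and decryption yields the original substring exactly.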

Hal Finney
PGP Security

> From: "Hiroshi Maruyama" <MARUYAMA@jp.ibm.com>
> Date: Mon, 13 Nov 2000 09:25:31 +0900
>
> When encrypting a substructure of an XML document, we need to
> preserve the data model so that it will be decrypted into exactly
> the same substructure.  XML Canonicalization (or C14N) is one
> way to serialize an XML substructure without losing any information.
> As long as the data model (or information set) is preserved, any
> serialization method will do.  C14N satisfies this property and
> is implemented for XML Signature anyway, I think it is reasonable
> to reuse the C14N standard.
> By the way, I believe this discussion is exactly why I insist that
> the processing model of XML Encryption should be defined using
> the XML InfoSet (or equivalent data model).  It may free us from
> confusing questions such as character encoding, default
> attribute values, external entities, data types, and so on.
>
> Hiroshi

Received on Tuesday, 14 November 2000 00:34:25 UTC