W3C home > Mailing lists > Public > w3c-ietf-xmldsig@w3.org > April to June 2000

No Character Normalization?

From: Kevin Regan <kevinr@valicert.com>
Date: Fri, 23 Jun 2000 13:18:00 -0700 (PDT)
To: jboyer@PureEdge.com
Cc: w3c-ietf-xmldsig@w3.org, kevinr@valicert.com
Message-id: <Pine.SOL.4.21.0006231312470.10341-100000@bugs.valicert.com>


Hi,

Let me preface my comments by saying that I do not consider myself an
expert in either XML or XML Signature/C14N.  However, I would like to
comment on the lack of character normalization in both specifications.
Please read this as a plea for clarification and personal edification
rather than a disparagement of the specifications.

Reading through the C14N spec, it states:

---------------------------------------------------------------

A.1 No Character Model Normalization

The Unicode standard [Unicode] allows multiple different representations
of certain "precomposed characters" (a simple example is "ç"). Thus two XML
documents with content that is equivalent for the purposes of most
applications may contain differing character sequences. The W3C has
recommended a normalized representation [CharModel]. Prior drafts of
Canonical XML used this normalized form. However, most XML 1.0 processors
do not perform the this normalization. Furthermore, applications that must
solve this problem typically perform the character model normalization as
character content is created, which would obviate the need for character
model normalization during canonicalization. Therefore, character model
normalization has been moved out of scope for Canonical XML.

----------------------------------------------------------------
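The point about multiple representations is easy to see concretely. This
sketch (mine, not from the spec) uses Python's unicodedata module to show
that the precomposed and decomposed forms of the same visible character
are distinct character (and UTF-8 byte) sequences until NFC is applied:

```python
import unicodedata

# Two Unicode representations of the same visible character "ç":
composed = "\u00e7"      # U+00E7 LATIN SMALL LETTER C WITH CEDILLA
decomposed = "c\u0327"   # U+0063 "c" + U+0327 COMBINING CEDILLA

# They are distinct character sequences with distinct UTF-8 encodings...
print(composed == decomposed)        # False
print(composed.encode("utf-8"))      # b'\xc3\xa7'
print(decomposed.encode("utf-8"))    # b'c\xcc\xa7'

# ...but Normalization Form C maps both to the same sequence:
print(unicodedata.normalize("NFC", composed) ==
      unicodedata.normalize("NFC", decomposed))  # True
```

Any digest computed over the raw bytes will therefore differ for the two
forms, even though most applications would treat the content as identical.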

In addition, the XML Signature spec states:

----------------------------------------------------------------

7.0 XML Canonicalization and Syntax Constraint Considerations

* * *

Any canonicalization algorithm should yield output in a specific fixed
coded character set. For both the minimal canonicalization defined in this
specification, the W3C Canonical XML [XML-C14N], and the 2000 Canonical
XML [XML-C14N-a], that coded character set is UTF-8. * * * Neither the
minimal canonicalization nor the 2000 Canonical XML [XML-C14N-a]
algorithms provide character normalization. We RECOMMEND that signature
applications produce XML content in Normalized Form C [NFC] and check that
any XML being consumed is in that form as well (if not, signatures may
consequently fail to validate).

-----------------------------------------------------------------

It seems that the responsibility for creating canonicalizable or signable
documents is being pushed to the application creating the XML documents to
be signed (as well as the application producing the XML Signature document
itself).  However, won't it most likely be the case that producers of XML
documents will not have nearly the resources or technical know-how to
reasonably perform this character normalization?  Will the producers of
XML documents even know that their work will be signed at some future
date?  In addition, doesn't this preclude the signing of XML documents
that may have already been created in something other than the "Normalized
Form C" format?  Wouldn't it make more sense to put the burden of
normalization on the application processing the XML document and producing
the signature? It is this application that will be most knowledgeable
about the need for character normalization and about the way in which
character normalization can be implemented.
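To sketch what I mean by putting the burden on the signing application
(the function name and use of SHA-1 here are my own illustration, not
anything specified by either document): the signer would normalize to NFC
before hashing, so that equivalent inputs produce equal digests:

```python
import hashlib
import unicodedata

def digest(xml_text: str, normalize: bool = True) -> str:
    """Illustrative only: optionally NFC-normalize, then hash UTF-8 bytes."""
    if normalize:
        xml_text = unicodedata.normalize("NFC", xml_text)
    return hashlib.sha1(xml_text.encode("utf-8")).hexdigest()

doc_a = "<name>fran\u00e7ois</name>"   # precomposed ç
doc_b = "<name>franc\u0327ois</name>"  # "c" + combining cedilla

# Without normalization the digests (and thus signatures) differ:
print(digest(doc_a, normalize=False) == digest(doc_b, normalize=False))  # False

# With NFC applied by the signing application, they agree:
print(digest(doc_a) == digest(doc_b))  # True
```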

The goal of the XML C14N spec seems to be to avoid the additional work
(which, admittedly, is not trivial) of performing the character
normalization step, pushing this on to the application that actually uses
C14N.  However, it is the XML Signature "application" that C14N is most
meant to support. Therefore, it seems that the character normalization
must either be called for in C14N or in the XML Signature specification
itself.

Currently, the XML Signature spec recommends that an application produce
its output in Normalized Form C and treat input that is not in that form
as a failure condition.  Is this really less work than simply converting
all documents being processed into the normalized form before computing
the signature?  Wouldn't conversion allow us to eliminate a failure case
(and the added complexity imposed on the producers of XML documents)?
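The two behaviours I am contrasting can be sketched in a few lines
(assuming Python's unicodedata module; the variable names are mine):

```python
import unicodedata

incoming = "<name>franc\u0327ois</name>"  # arrives in decomposed form

# The spec's RECOMMENDED behaviour: check, and treat non-NFC input
# as a failure condition.
if not unicodedata.is_normalized("NFC", incoming):
    print("reject: input is not in Normalization Form C")

# The alternative argued for here: just convert before signing,
# so the failure case never arises.
canonical_input = unicodedata.normalize("NFC", incoming)
```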

One final question.  Is it possible for the processing of an XML document
to change the character format?  If so, wouldn't this add to the failure
case mentioned in the previous paragraph?
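One way ordinary processing can surface non-normalized text (again my own
illustration, not taken from either spec): character references let an
author introduce a non-NFC sequence that any conforming parser will
happily expand, so a downstream signer receives content that fails the
recommended NFC check:

```python
import unicodedata
import xml.etree.ElementTree as ET

# Both documents are well-formed XML for the "same" name, but the
# expanded character data differs: one is NFC, the other is not.
doc_nfc = ET.fromstring("<name>&#xE7;</name>")    # precomposed ç
doc_nfd = ET.fromstring("<name>c&#x327;</name>")  # c + combining cedilla

print(doc_nfc.text == doc_nfd.text)                    # False
print(unicodedata.is_normalized("NFC", doc_nfd.text))  # False
```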

It seems that the door is being opened for a major incompatibility and the
inability to sign a large number of pre-existing and future XML documents
(that will be created without any regard given to character
normalization).

Sincerely,
Kevin Regan

kevinr@valicert.com
Received on Friday, 23 June 2000 16:17:52 GMT

This archive was generated by hypermail 2.2.0 + w3c-0.29 : Thursday, 13 January 2005 12:10:09 GMT