"Chunk of XML" - Canonicalization and equality from Tim Berners-Lee on 2004-01-12 (www-tag@w3.org from January 2004)

From: Tim Berners-Lee <timbl@w3.org>
Date: Mon, 12 Jan 2004 17:10:50 -0500
To: 'www-tag@w3.org' <www-tag@w3.org>
Message-Id: <2E1CE585-454C-11D8-B55E-000A9580D8C0@w3.org>

Recently there seems to be a common thread around processing what I 
will call a "chunk of XML".

Cases I am aware of:

- XML itself uses it for an external entity
- XML Schema has the "deep equality" issue as to when any two chunks 
are "equal".
- RDF has an "XML Literal" data type which it handles transparently.  It 
needs a notion of when two chunks are the same.
- XML-DSig signs, and therefore ensures the integrity of, a chunk of XML.

You can add your own example to this list.

The problem is that when different parts of a complex system have 
different notions of what a chunk of XML is, then the system built as a 
whole may break.

(For example, suppose a Java object is serialized as XML, and the 
result is put into a database and then exported as an RDF data value, 
signed, shipped across an insecure channel, the signature checked, 
and the result parsed as an infoset from which a new Java object is 
built.  The signature does not cover the XML base which applied to the 
chunk, and this was tampered with in transit.  The result is that the 
Java object has been tampered with even though the signature matched.)
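
A rough sketch of that failure, under loud assumptions: an HMAC stands 
in for a real XML-DSig signature, Python's canonicalize() (C14N 2.0, 
which postdates this message) stands in for the DSig canonicalization, 
and the key, element names and URIs are all made up for illustration. 
The point is only that the signed bytes never mention the base URI.

```python
import hashlib
import hmac
from urllib.parse import urljoin
from xml.etree.ElementTree import canonicalize  # Python 3.8+

KEY = b"demo-key"  # hypothetical shared secret, stand-in for a signing key


def sign(chunk: str) -> str:
    """MAC over the canonical form of the chunk alone, without its context."""
    canonical = canonicalize(chunk).encode("utf-8")
    return hmac.new(KEY, canonical, hashlib.sha256).hexdigest()


def verify(chunk: str, signature: str) -> bool:
    """Recompute the MAC and compare in constant time."""
    return hmac.compare_digest(sign(chunk), signature)


chunk = '<ref href="policy.xml"/>'  # hypothetical chunk with a relative URI
signature = sign(chunk)

# The base URI lives outside the chunk, so the signature does not cover it.
base_when_signed = "http://example.org/trusted/"
base_on_arrival = "http://attacker.example/forged/"  # altered in transit

signature_ok = verify(chunk, signature)  # the chunk itself is untouched
meant = urljoin(base_when_signed, "policy.xml")
got = urljoin(base_on_arrival, "policy.xml")
```

Here signature_ok is true while meant and got differ, so the receiver 
builds its object from a different resource than the signer intended.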

The XML architecture has tended to be built according to a motto that 
all kinds of things are possible, and the application has to be able to 
choose the features it needs.  This is fine when there is simply the 
XML toolset and a single "application".  However, real life is more 
complicated, and things are connected together in all kinds of ways.  I 
think the XML design needs to be more constraining: to offer a 
consistent idea of what a chunk of XML is across all the designs, so 
that the value of that chunk can be preserved as invariant across a 
complex system.  Digital Signature and RDF transport are just 
intermediate parts of the design which need to be transparent.

This requires a notion of equality, and a related canonical 
serialization.
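
As a minimal sketch of how the two relate, equality can be defined as 
"the canonical serializations are identical".  This assumes Python 3.8 
or later, whose standard library canonicalize() implements C14N 2.0 
(one possible choice of canonical form, not the one any group above has 
agreed on); chunks_equal is a name invented here.

```python
from xml.etree.ElementTree import canonicalize  # Python 3.8+


def chunks_equal(a: str, b: str) -> bool:
    """Treat two chunks as equal when their canonical forms match."""
    return canonicalize(a) == canonicalize(b)


# Attribute order and empty-element syntax differ, the content does not.
same = chunks_equal('<item  b="2" a="1"/>', '<item a="1" b="2"></item>')

# An attribute value differs, so the chunks are not equal.
different = chunks_equal('<item a="1" b="2"/>', '<item a="1" b="3"/>')
```

Any component that compares, stores or signs a chunk would then have to 
use the same canonical form for the value to stay invariant end to end.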

Among the components of the problem, the ways serializations vary are:

- The underlying data model:  SOAP, WSDL and XInclude use the 
Infoset;  DSig used the XPath 1.0 data model and XQuery, I understand, 
uses the XPath 2.0 data model.  No group seems interested in invariance 
in the serialized XML, but only in the parsed XML of some form.
- Whether xml:lang is included.  (Not done in the DSig 
canonicalizations;  Internationalization would like to see it in.)
- Whether xml:base is included.  (xml:base did not exist when DSig did 
their canonicalizations.)
- Whether extraneous namespace settings not obviously used are 
included.  (This is the difference between XML DSig's exclusive and 
inclusive canonicalization.)
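
To make the xml:lang item concrete, here is a small sketch using 
Python's ElementTree.  The document, element names and the 
effective_lang helper are invented for illustration.  It shows a chunk 
losing its inherited language when it is serialized without its 
context, which is the sort of information a canonical form either 
carries along or drops.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_lang(root, target):
    """Return the nearest xml:lang on target or its ancestors, else None."""
    def walk(node, inherited):
        lang = node.get(XML_LANG, inherited)
        if node is target:
            return lang, True
        for child in node:
            result, found = walk(child, lang)
            if found:
                return result, True
        return None, False

    return walk(root, None)[0]


doc = ET.fromstring('<doc xml:lang="fr"><title>chat</title></doc>')
title = doc.find("title")

in_context = effective_lang(doc, title)  # language inherited from <doc>

# Serialize the chunk on its own, then parse it back with no context.
standalone = ET.tostring(title, encoding="unicode")
reparsed = ET.fromstring(standalone)
out_of_context = effective_lang(reparsed, reparsed)
```

In context the title is French; once extracted, the serialized chunk 
carries no language at all.  The same shape of problem applies to 
xml:base and to in-scope namespace declarations.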

This may be relevant to the upcoming rechartering of the XML Core group.
I feel that the XML development community has to take on this 
responsibility: asking each group which has a concept of a chunk of XML 
to define its own canonicalization will, I think, lead to a broken 
overall architecture.

I have dumped what I know of this issue; apologies for the lack of 
pointers.  Others may be able to fill in pointers to the discussions of 
this in the various groups, and give more examples.


Tim BL
Received on Monday, 12 January 2004 17:10:54 UTC