- From: Tim Berners-Lee <timbl@w3.org>
- Date: Mon, 12 Jan 2004 17:10:50 -0500
- To: 'www-tag@w3.org' <www-tag@w3.org>
Recently there seems to be a common thread around processing what I will call a "chunk of XML". Cases I am aware of:

- XML itself uses it for an external entity.
- XML Schema has the "deep equality" issue as to when any two chunks are "equal".
- RDF has an "XML Literal" data type which it handles transparently. It needs a notion of when two chunks are the same.
- XML-DSig signs, and therefore ensures the integrity of, a chunk of XML.

You can add your own example to this list.

The problem is that when different parts of a complex system have different notions of what a chunk of XML is, then the system built as a whole may break. (For example, suppose a Java object is serialized as XML, and the result is put into a database and then exported as an RDF data value, signed, shipped across an insecure channel, the signature checked, and the result parsed as an infoset from which a new Java object is built. The signature does not sign the XML base which applied to the chunk, and this was tampered with in transit. The result is that the Java object has been tampered with even though the signature matched.)

The XML architecture has tended to be built according to a motto that all kinds of things are possible, and the application has to be able to choose the features it needs. This is fine when there are simply the XML toolset and a single "application". However, real life is more complicated, and things are connected together in all kinds of ways.

I think the XML design needs to be more constraining: to offer a consistent idea of what a chunk of XML is across all the designs, so that the value of that chunk can be preserved as invariant across a complex system. Digital Signature and RDF transport are just intermediate parts of the design which need to be transparent. This requires a notion of equality, and a related canonical serialization.
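To make the equality point and the xml:base example concrete, here is a minimal sketch in Python. It assumes Python 3.8 or later, whose standard library provides xml.etree.ElementTree.canonicalize (Canonical XML 2.0). The helper names, the envelope/chunk/link element names and the example URIs are all hypothetical, and a SHA-256 digest of the canonical form stands in for a real XML-DSig signature. It also assumes the simplified model in which the chunk is canonicalized on its own, without the context it inherited.

```python
import hashlib
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

XML_NS = "http://www.w3.org/XML/1998/namespace"


def canonical(chunk: str) -> str:
    """Canonical serialization (C14N 2.0) of a serialized chunk."""
    return ET.canonicalize(chunk)


def chunks_equal(a: str, b: str) -> bool:
    """Two chunks are 'the same' if their canonical forms match."""
    return canonical(a) == canonical(b)


def chunk_digest(envelope_xml: str) -> str:
    """Digest of the <chunk> child alone (stand-in for a signature).

    Assumption: the chunk is serialized without its inherited context,
    so an xml:base set on the envelope is not covered by the digest.
    """
    chunk = ET.fromstring(envelope_xml).find("chunk")
    serialized = ET.tostring(chunk, encoding="unicode")
    return hashlib.sha256(canonical(serialized).encode("utf-8")).hexdigest()


def resolved_href(envelope_xml: str) -> str:
    """What the receiver actually dereferences after applying xml:base."""
    root = ET.fromstring(envelope_xml)
    base = root.get(f"{{{XML_NS}}}base", "")
    link = root.find("./chunk/link")
    return urljoin(base, link.get("href"))


# Hypothetical envelopes: identical chunk, different xml:base in the context.
GOOD = (
    '<envelope xml:base="http://example.org/good/">'
    '<chunk><link href="data.xml"/></chunk></envelope>'
)
TAMPERED = (
    '<envelope xml:base="http://evil.example/">'
    '<chunk><link href="data.xml"/></chunk></envelope>'
)

if __name__ == "__main__":
    # Attribute order, quoting and empty-element syntax do not matter.
    print(chunks_equal('<a x="1" y="2"/>', "<a  y='2' x='1'></a>"))
    # The digest of the chunk matches, yet its meaning has changed.
    print(chunk_digest(GOOD) == chunk_digest(TAMPERED))
    print(resolved_href(GOOD), resolved_href(TAMPERED))
```

The digests agree while the resolved references differ, which is the failure described above: the integrity check passes on the chunk, but the value the next component reconstructs depends on context that the check never saw.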
Among the components of the problem, the ways serializations vary are:

- The underlying data model: SOAP, WSDL and XML-Include use the Infoset; DSig uses XPath 1.0; and XQuery, I understand, uses the XPath 2.0 data model. No group seems interested in invariance in the serialized XML, but only in the parsed XML of some form.
- Whether xml:lang is included. (Not done in the DSig canonicalizations; Internationalization would like to see it included.)
- Whether xml:base is included. (xml:base did not exist when DSig did their canonicalizations.)
- Whether extraneous namespace declarations not obviously used are included. (This is the difference between XML DSig's exclusive and inclusive canonicalization.)

This may be relevant to the upcoming rechartering of the XML Core group. I feel that the XML development community has to take on this responsibility: asking each group which has a concept of a chunk of XML to define its own canonicalization will, I think, lead to a broken overall architecture.

I have dumped what I know of this issue; apologies for the lack of pointers. Others may be able to fill in pointers to the discussions of this in the various groups, and give more examples.

Tim BL
Received on Monday, 12 January 2004 17:10:54 UTC