xmldsig Infoset Comments from Joseph M. Reagle Jr. on 2001-02-22 (w3c-ietf-xmldsig@w3.org from January to March 2001)

From: Joseph M. Reagle Jr. <reagle@w3.org>
Date: Thu, 22 Feb 2001 15:12:21 -0500
To: www-xml-infoset-comments@w3.org
Cc: Paul Grosso <pgrosso@arbortext.com>, "IETF/W3C XML-DSig WG" <w3c-ietf-xmldsig@w3.org>
Message-Id: <4.3.2.7.2.20010222150450.02b61dd8@rpcp.mit.edu>
Comments on the XML Information Set [Infoset].
Reviewer: Joseph Reagle.

While Canonical XML [C14N] is not based on the XML Information Set 
specification, future canonicalization (C14N) algorithms might be. 
Consequently, these comments try to identify limitations of the present C14N 
design in the context of the latest Infoset draft. As a point of history, 
Canonical XML was originally based on a selection and serialization of 
Information Set items. When Canonical XML was transferred to the XML 
Signature WG, it was changed so as to be based on XPath because XPath 
provided useful features related to document subsets (serializing portions 
of an XML document) and was already a W3C Recommendation.

These brief comments make no editorial or substantive suggestions. Instead, 
they point out similarities and divergences that would matter should a 
canonical form ever be based on [Infoset], and ask a few questions.

[Infoset]  http://www.w3.org/TR/2001/WD-xml-infoset-20010202
[C14N] http://www.w3.org/TR/2001/PR-xml-c14n-20010119

__

The Canonical XML specification identifies three limitations:

 >The difficulties arise due to the loss of the following information not 
available in the data model:
 >1. base URI, especially in content derived from the replacement text
 >  of external general parsed entity references
 >2. notations and external unparsed entity references
 >3. attribute types in the document type declaration
 >http://www.w3.org/TR/2001/PR-xml-c14n-20010119#Limitations

While some of these issues arise because information was "not available in 
the [XPath] data model", they also arise because this information is not 
represented in a standalone well-formed XML document, and producing such a 
document was one of the goals of Canonical XML. Where applications are 
concerned with this information, the Canonical XML specification 
(non-normatively) mentions how it might be reintegrated into the canonical 
form if that form will undergo subsequent processing. Typically it will not; 
the canonical form is merely used in the hash computation during signature 
creation. Having this information available via the Infoset makes the task 
much easier for an application that wishes to do this.

1. Canonical XML requires that the base URI be explicitly declared, to 
mitigate problems of resolving relative URIs when the URI context is not 
known or when external entities with relative URIs are incorporated into the 
document. This information is explicitly part of the Information Set data 
model.

However, if I have an Infoset- and XML Base-compliant parser that resolves an 
external entity without explicit base URI declarations, would the [base URI] 
Infoset properties still be available in the "importing document" via the 
heuristics [a] for finding the base URI that applied when the "imported 
document" was parsed?

[a]  http://www.w3.org/TR/2000/PR-xmlbase-20001220/#rfc2396
 >4. Resolving Relative URIs
 >4.1. Relation to RFC 2396
 >RFC 2396 [IETF RFC 2396] provides for base URI information to be
 >embedded within a document. The rules for determining the base
 >URI can be summarized as follows (highest priority to lowest):



2. "The loss of external unparsed entity references and the notations that 
bind them to applications means that canonical forms cannot properly 
distinguish among XML documents that incorporate unparsed data via this 
mechanism." [C14N]

C14N preserves only the name of an external unparsed entity reference, out of 
the complete set available under the Infoset {name, system identifier, public 
identifier, and notation}. An application with all of this information can 
more easily generate an additional signature reference over the actual 
external entity if a user wishes to protect the integrity of that information.
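
To make the difference concrete, here is a minimal Python sketch, assuming 
the standard library's expat binding and a document whose declarations sit in 
the internal DTD subset. The function name unparsed_entities and the sample 
document are hypothetical, chosen only for illustration. It collects, per 
unparsed entity, the properties listed above that an application would need 
in order to build an additional signature reference.

```python
import xml.parsers.expat


def unparsed_entities(xml_text):
    """Map each unparsed entity name to its declared identifiers and notation."""
    found = {}

    def on_unparsed(name, base, system_id, public_id, notation_name):
        # expat reports None for a public identifier that was not declared.
        found[name] = {
            "system identifier": system_id,
            "public identifier": public_id,
            "notation": notation_name,
        }

    parser = xml.parsers.expat.ParserCreate()
    parser.UnparsedEntityDeclHandler = on_unparsed
    parser.Parse(xml_text, True)
    return found


# Hypothetical sample: one unparsed entity bound to a notation.
SAMPLE = """<!DOCTYPE doc [
<!NOTATION gif SYSTEM "image/gif">
<!ENTITY logo SYSTEM "logo.gif" NDATA gif>
<!ATTLIST doc img ENTITY #IMPLIED>
]>
<doc img="logo"/>"""

if __name__ == "__main__":
    print(unparsed_entities(SAMPLE))
```

Only the attribute value "logo" would survive in the canonical form; the 
system identifier and notation collected here are what the application would 
use to locate the external resource it wants to sign.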

3. "the loss of attribute types can affect the canonical form in different 
ways depending on the type. Attributes of type ID cease to be ID attributes. 
Hence, any XPath expressions that refer to the canonical form using the id() 
function cease to operate... Applications can avoid the difficulties of this 
case by ensuring that an appropriate document type declaration is prepended 
prior to using the canonical form in further XML processing." [C14N]

This information is preserved in the Information Set. However, as stated 
earlier, its loss in the canonical form results from the choice to create a 
standalone XML serialization. The data would be lost under this constraint 
even if C14N were based on the Infoset, though it might then be easier to 
prepend the appropriate DTD as recommended above.

__

 >http://www.w3.org/TR/xml-infoset/#intro
 >Furthermore, this specification does not define an information set for 
documents which use relative URI references in namespace declarations. This 
is in accordance with the decision of the W3C XML Plenary Interest Group 
described in [Relative Namespace URI References]. Thus the value of a 
[namespace name] property is always an absolute URI with an optional 
fragment identifier.

This corresponds to the Canonical XML design:

 >http://www.w3.org/TR/2001/PR-xml-c14n-20010119#DataModel
 >Note: This specification supports the recent XML plenary decision to 
deprecate relative namespace URIs as follows: implementations of XML 
canonicalization MUST report an operation failure on documents containing 
relative namespace URIs. XML canonicalization MUST NOT be implemented with 
an XML parser that converts relative URIs to absolute URIs.
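
As an illustration of the "report an operation failure" requirement, the 
following Python sketch scans a document's namespace declarations before any 
canonicalization is attempted. It is only a sketch under stated assumptions: 
the function names are hypothetical, and "relative" is approximated as "has 
no URI scheme", which is my simplification rather than wording from either 
specification.

```python
import io
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit


def relative_namespace_uris(xml_bytes):
    """Return the declared namespace URIs that have no scheme component."""
    relative = []
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start-ns",))
    for _event, (_prefix, uri) in events:
        # xmlns="" undeclares the default namespace; it is not a relative URI.
        if uri and not urlsplit(uri).scheme:
            relative.append(uri)
    return relative


def check_before_c14n(xml_bytes):
    """Raise ValueError if the document declares a relative namespace URI."""
    bad = relative_namespace_uris(xml_bytes)
    if bad:
        raise ValueError("relative namespace URI(s): " + ", ".join(bad))
```

Note that the check inspects the URI strings exactly as declared and never 
resolves them against a base, in keeping with the rule that relative URIs 
must not be converted to absolute ones.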

__

 >http://www.w3.org/TR/xml-infoset/#intro
 >Entities
 >An information set describes its XML document with entity references 
already expanded, that is, represented by the information items 
corresponding to their replacement text. However, there are various 
circumstances in which a processor may not perform this expansion.

Canonical XML purposefully fails when external parsed entities cannot be 
resolved.

__

 >http://www.w3.org/TR/xml-infoset/#intro
 >Base URIs
 >Several information items have a [base URI] property. This is computed 
according to [XML Base]. Note that retrieval of a resource may involve 
redirection at the parser level (for example, in an entity resolver) or 
below; in this case the base URI is the final URI used to retrieve the 
resource after all redirection.

Canonical XML can also work with explicit declarations of the base URI; see 
issue 1 above.
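
For readers unfamiliar with how explicit declarations feed a [base URI] 
property, here is a rough Python sketch of the element-level part of XML Base: 
each xml:base attribute is resolved against the base inherited from the 
parent, starting from the URI used to retrieve the document. The function 
name base_uris is hypothetical, and the sketch ignores external entities 
(which carry their own base URI) and parser-level redirection.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# ElementTree exposes xml:base under the predefined XML namespace.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"


def base_uris(element, inherited_base):
    """Yield (element, base URI) pairs for element and all its descendants."""
    # An absent xml:base resolves as the empty string, which leaves the
    # inherited base unchanged.
    base = urljoin(inherited_base, element.get(XML_BASE, ""))
    yield element, base
    for child in element:
        yield from base_uris(child, base)


if __name__ == "__main__":
    # Hypothetical document and retrieval URI.
    root = ET.fromstring(
        '<doc xml:base="http://example.org/today/">'
        '<p xml:base="hot/"><a/></p></doc>'
    )
    for el, base in base_uris(root, "http://example.org/index.xml"):
        print(el.tag, base)
```

With explicit xml:base declarations like these, the computed value no longer 
depends on where the document happened to be retrieved from, which is the 
property Canonical XML relies on.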

__

The Infoset specification introduces the concept of a "Synthetic Infoset" 
that is not the result of parsing an XML document; instead, it is an Infoset 
of partial XML that might result from use of an API or DOM. This seems to 
correspond to the "document subset" of Canonical XML.


__
Joseph Reagle Jr.                 http://www.w3.org/People/Reagle/
W3C Policy Analyst                mailto:reagle@w3.org
IETF/W3C XML-Signature Co-Chair   http://www.w3.org/Signature
W3C XML Encryption Chair          http://www.w3.org/Encryption/2001/
Received on Thursday, 22 February 2001 16:06:58 UTC