Draft Comments on InfoSet

John, here's my draft, feel free to tweak/add as appropriate, I have one explicit question for you at the end.

WG:Other comments/corrections before we send this to the Core WG?

Comments on XML Infoset [1].

While Canonical XML [2] is not based on the XML Information Set specification, future canonicalization (C14N) algorithms might be. Consequently, these comments try to identify limitations of the present C14N design in the context of the latest Infoset draft. As a point of history, Canonical XML was originally based on a selection and serialization of Information Set items. When Canonical XML was transferred to the XML Signature WG, it was changed so as to be based on XPath because XPath provided useful features related to document subsets (serializing portions of an XML document) and was already a W3C Recommendation.

These brief comments do not contain any editorial or substantive comments or suggestions. Instead, it points out similarities and divergences should a canonical form ever be based on [1].

[Infoset]  http://www.w3.org/TR/2001/WD-xml-infoset-20010202
[C14N] http://www.w3.org/TR/2001/PR-xml-c14n-20010119 


The Canonical XML specification identifies three limitations where:

>The difficulties arise due to the loss of the following information not available in the data model:
>1. base URI, especially in content derived from the replacement text 
>  of external general parsed entity references 
>2. notations and external unparsed entity references 
>3. attribute types in the document type declaration 

While some of these issues arise because information was "not available in the [XPath] data model" they also arise from the fact that this information is not represented in a standalone well-formed XML document which was one of the goals of Canonical XML. Where applications are concerned with this information, the Canonical XML specification (non-normatively) mentions how the information might be reintegrated into the canonical form if it will undergo subsequent processing -- but typically it is not, it is merely used in the hash computation of the signature creation. Having this information available via the Infoset makes the task of the application that wishes to do this much easier.

1. Canonical XML requires that Base URI be explicitly declared to mitigate problems of resolving relative URIs when the URI context is not known or when external entities with relative URI are incorporated into the document. This information is explicitly part of the InformationSet data model. 

**However, if I have a InfoSet and BaseURI compliant parser which resolves an external entity without explicit BaseURI declarations, would the BaseURI Infoset properties still be available via the heuristics of finding the BaseURI given in that specification:

>4. Resolving Relative URIs 
>4.1. Relation to RFC 2396 
>RFC 2396 [IETF RFC 2396] provides for base URI information to be embedded within a document. The rules for determining the base URI can be summarized as follows (highest priority to lowest):

2. "The loss of external unparsed entity references and the notations that bind them to applications means that canonical forms cannot properly distinguish among XML documents that incorporate unparsed data via this mechanism." [C14N]

C14N only preserves name of external unparsed entity references from the complete set available under the Infoset {name,  system identifier, public identifier, or notation}. An application with all of this information can more easily generate an additional signature reference over the actual external entity.

3. "the loss of attribute types can affect the canonical form in different ways depending on the type. Attributes of type ID cease to be ID attributes. Hence, any XPath expressions that refer to the canonical form using the id() function cease to operate... Applications can avoid the difficulties of this case by ensuring that an appropriate document type declaration is prepended prior to using the canonical form in further XML processing." [C14N] 

This information is preserved in the Information Set. However, as stated earlier, it's loss in canonical form is a result of the choice to create a standalone XML serialization, this data would be lost  under this constraint even if the C14N was based on Infoset though it might be easier to prepend the appropriate DTD as recommended above.


>Furthermore, this specification does not define an information set for documents which use relative URI references in namespace declarations. This is in accordance with the decision of the W3C XML Plenary Interest Group described in [Relative Namespace URI References]. Thus the value of a [namespace name] property is always an absolute URI with an optional fragment identifier. 

This corresponds with Canonical XML design:

>Note: This specification supports the recent XML plenary decision to deprecate relative namespace URIs as follows: implementations of XML canonicalization MUST report an operation failure on documents containing relative namespace URIs. XML canonicalization MUST NOT be implemented with an XML parser that converts relative URIs to absolute URIs.


>An information set describes its XML document with entity references already expanded, that is, represented by the information items corresponding to their replacement text. However, there are various circumstances in which a processor may not perform this expansion. 

Canonical XML purposefully fails when external parsed entities can not resolve.


>Base URIs
>Several information items have a [base URI] property. This is computed according to [XML Base]. Note that retrieval of a resource may involve redirection at the parser level (for example, in an entity resolver) or below; in this case the base URI is the final URI used to retrieve the resource after all redirection. 

Canonical XML can also work with explicit declarations of Base URI, see issue 1 above.


The Infoset specification introduces the concept of a "Synthetic Infoset" that is not a result of parsing an XML document; instead its a Infoset of partial XML that might result by use of an API or DOM. This corresponds to the "document subset" of Canonical XML.


Information Set provides for Entity and CDATA start and end mark information items. 

{John: could you take a stab at describing how this is useful in dealing with PIs in external entities?}

Joseph Reagle Jr.                 http://www.w3.org/People/Reagle/
W3C Policy Analyst                mailto:reagle@w3.org
IETF/W3C XML-Signature Co-Chair   http://www.w3.org/Signature
W3C XML Encryption Chair          http://www.w3.org/Encryption/2001/

Received on Thursday, 15 February 2001 16:40:33 UTC