RE: Draft Comments on InfoSet

Hi Joseph,

Comments in line surrounded by <john> </john>.

John Boyer
Senior Product Architect, Software Development
Internet Commerce System (ICS) Team
PureEdge Solutions Inc.
Trusted Digital Relationships
v: 250-708-8047 f: 250-708-8010
1-888-517-2675 http://www.PureEdge.com

-----Original Message-----
From: Joseph M. Reagle Jr. [mailto:reagle@w3.org]
Sent: Thursday, February 15, 2001 1:40 PM
To: John Boyer
Cc: IETF/W3C XML-DSig WG; pgrosso@arbortext.com
Subject: Draft Comments on InfoSet

John, here's my draft; feel free to tweak/add as appropriate. I have one explicit question for you at the end.

WG: Other comments/corrections before we send this to the Core WG?
______

Comments on XML Infoset [1].

While Canonical XML [2] is not based on the XML Information Set specification, future canonicalization (C14N) algorithms might be. Consequently, these comments try to identify limitations of the present C14N design in the context of the latest Infoset draft. As a point of history, Canonical XML was originally based on a selection and serialization of Information Set items. When Canonical XML was transferred to the XML Signature WG, it was changed so as to be based on XPath because XPath provided useful features related to document subsets (serializing portions of an XML document) and was already a W3C Recommendation.

<john>
(serializing portions of an XML document) => (which can be used to serialize portions of an XML document)
</john>

These brief comments do not contain any editorial or substantive suggestions. Instead, they point out similarities and divergences that would matter should a canonical form ever be based on [1].

<john>It would actually be better to base a new XPath on InfoSet, then define c14n in terms of the updated XPath, then add a new algorithm to DSig. Another possibility would be to insert a new version of XPointer, then define a new c14n based on XPointer. </john>

[Infoset] http://www.w3.org/TR/2001/WD-xml-infoset-20010202
[C14N] http://www.w3.org/TR/2001/PR-xml-c14n-20010119

The Canonical XML specification identifies three limitations:

>The difficulties arise due to the loss of the following information not available in the data model:
>1. base URI, especially in content derived from the replacement text
> of external general parsed entity references
>2. notations and external unparsed entity references
>3. attribute types in the document type declaration
>http://www.w3.org/TR/2001/PR-xml-c14n-20010119#Limitations

While some of these issues arise because information was "not available in the [XPath] data model", they also arise because this information is not represented in a standalone well-formed XML document, and producing such a document was one of the goals of Canonical XML. Where applications are concerned with this information, the Canonical XML specification (non-normatively) mentions how it might be reintegrated into the canonical form if that form will undergo subsequent processing. Typically it will not; the canonical form is merely used in the hash computation during signature creation. Having this information available via the Infoset makes the task much easier for an application that wishes to do this.

<john>My claim that we don't represent it because it isn't in the data model is based on the belief that we can represent anything in XML, esp. via the use of namespaces. In the case of base URI, please see the example at the bottom of this document, which is in answer to your explicit question. </john>

1. Canonical XML requires that the base URI be explicitly declared to mitigate problems of resolving relative URIs when the URI context is not known or when external entities with relative URIs are incorporated into the document. This information is explicitly part of the Information Set data model.

**However, if I have an InfoSet- and XML Base-compliant parser that resolves an external entity without explicit base URI declarations, would the [base URI] Infoset properties still be available via the heuristics for finding the base URI given in that specification?

>4. Resolving Relative URIs
>4.1. Relation to RFC 2396
>RFC 2396 [IETF RFC 2396] provides for base URI information to be embedded within a document. The rules for determining the base URI can be summarized as follows (highest priority to lowest):

>http://www.w3.org/TR/2000/PR-xmlbase-20001220/#rfc2396
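
To illustrate why the base URI matters in issue 1, here is a minimal sketch of how one relative reference resolves differently against an external entity's base URI than against the including document's. The URIs are hypothetical, and the resolution uses Python's general-purpose urljoin rather than any particular XML Base implementation.

```python
from urllib.parse import urljoin

# Hypothetical locations of the main document and of an external entity.
DOCUMENT_BASE = "http://example.org/docs/main.xml"
ENTITY_BASE = "http://example.org/entities/ext.xml"

# A relative reference that appears inside the external entity's text.
relative_ref = "img/logo.png"

# Resolved against the entity's own base URI (the intended meaning).
resolved_in_entity = urljoin(ENTITY_BASE, relative_ref)

# Resolved against the document's base URI (what happens if the entity's
# base URI is lost once its replacement text is included).
resolved_in_document = urljoin(DOCUMENT_BASE, relative_ref)

print(resolved_in_entity)
print(resolved_in_document)
```

The two results differ, which is the change in meaning that an explicit base URI declaration is meant to prevent.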

2. "The loss of external unparsed entity references and the notations that bind them to applications means that canonical forms cannot properly distinguish among XML documents that incorporate unparsed data via this mechanism." [C14N]

C14N preserves only the name of an external unparsed entity reference, out of the complete set available under the Infoset {name, system identifier, public identifier, notation}. An application with all of this information can more easily generate an additional signature reference over the actual external entity.

<john>Exactly. I just want to be able to retain enough info in the canonical form so that it is actually bound to the external unparsed entity. Signing a document + an external unparsed entity is not enough if I can change the external unparsed entity to which the document refers without breaking the signature.</john>
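
The concern above can be illustrated with a small sketch. It assumes a hypothetical digest helper, not anything defined by C14N or XML Signature. A digest over only the entity name is unchanged when the entity is retargeted, while a digest over all four Infoset properties changes.

```python
import hashlib


def digest_fields(*fields):
    """Hash a sequence of optional strings with length prefixes.

    Hypothetical helper for illustration; the length prefix keeps
    ("ab", "c") distinct from ("a", "bc").
    """
    h = hashlib.sha256()
    for field in fields:
        data = (field or "").encode("utf-8")
        h.update(len(data).to_bytes(4, "big"))
        h.update(data)
    return h.hexdigest()


# Two declarations of the same entity name pointing at different resources.
original = {"name": "logo", "system": "http://example.org/logo.gif",
            "public": None, "notation": "gif"}
retargeted = dict(original, system="http://example.org/other.gif")

# Only the name is bound: retargeting goes unnoticed.
name_only_same = digest_fields(original["name"]) == digest_fields(retargeted["name"])


def full(e):
    return digest_fields(e["name"], e["system"], e["public"], e["notation"])


# All four properties are bound: retargeting changes the digest.
full_same = full(original) == full(retargeted)

print(name_only_same, full_same)
```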

3. "the loss of attribute types can affect the canonical form in different ways depending on the type. Attributes of type ID cease to be ID attributes. Hence, any XPath expressions that refer to the canonical form using the id() function cease to operate... Applications can avoid the difficulties of this case by ensuring that an appropriate document type declaration is prepended prior to using the canonical form in further XML processing." [C14N]

This information is preserved in the Information Set. However, as stated earlier, its loss in the canonical form results from the choice to create a standalone XML serialization. The data would be lost under this constraint even if C14N were based on the Infoset, though it might be easier to prepend the appropriate DTD as recommended above.

<john>Actually, the information is at least recoverable based on what InfoSet provides. Under the properties of an attribute, one is its type and another is whether the attribute is defaulted. There appears to be enough information to create ATTLIST declarations that preserve all of the relevant information.</john>
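
A rough sketch of that idea follows. It assumes the attribute's type and a flag for whether it was specified in the document (as opposed to defaulted) are available. The function name is hypothetical. Because the enumerated values themselves are not assumed to be available, enumerated and NOTATION types are rejected rather than guessed, and #IMPLIED is an assumed fallback when the original default is unknown.

```python
from xml.sax.saxutils import quoteattr

# Attribute types that can be written back without extra information.
SIMPLE_TYPES = {"CDATA", "ID", "IDREF", "IDREFS", "ENTITY", "ENTITIES",
                "NMTOKEN", "NMTOKENS"}


def attlist_decl(element, attribute, attr_type, specified, value):
    """Build an ATTLIST declaration from Infoset-like attribute properties.

    Hypothetical helper: `specified` is False when the value came from a
    DTD default, in which case that value is emitted as the default.
    """
    if attr_type not in SIMPLE_TYPES:
        raise ValueError("cannot reconstruct declaration for " + attr_type)
    if specified:
        default = "#IMPLIED"  # assumption: the original default is unknown
    else:
        default = quoteattr(value)
    return "<!ATTLIST %s %s %s %s>" % (element, attribute, attr_type, default)


id_decl = attlist_decl("e", "id", "ID", True, "x1")
lang_decl = attlist_decl("e", "lang", "CDATA", False, "en")
print(id_decl)
print(lang_decl)
```

Prepending declarations like these in an internal subset would let id() keep working on the canonical form, at the cost of no longer producing a DTD-free document.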

>http://www.w3.org/TR/xml-infoset/#intro
>Furthermore, this specification does not define an information set for documents which use relative URI references in namespace declarations. This is in accordance with the decision of the W3C XML Plenary Interest Group described in [Relative Namespace URI References]. Thus the value of a [namespace name] property is always an absolute URI with an optional fragment identifier.

This corresponds with Canonical XML design:

>http://www.w3.org/TR/2001/PR-xml-c14n-20010119#DataModel
>Note: This specification supports the recent XML plenary decision to deprecate relative namespace URIs as follows: implementations of XML canonicalization MUST report an operation failure on documents containing relative namespace URIs. XML canonicalization MUST NOT be implemented with an XML parser that converts relative URIs to absolute URIs.
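
As an illustration of the failure condition quoted above, the following sketch collects namespace declarations whose URI has no scheme. It uses Python's expat binding and treats "no scheme" as "relative", which is a simplifying assumption; the function name is hypothetical.

```python
import xml.parsers.expat
from urllib.parse import urlsplit


def relative_namespace_uris(xml_text):
    """Return (prefix, uri) pairs for namespace declarations with relative URIs."""
    found = []

    def on_namespace(prefix, uri):
        # An empty or None uri is an undeclaration (xmlns=""), not a relative URI.
        if uri and not urlsplit(uri).scheme:
            found.append((prefix, uri))

    parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")
    parser.StartNamespaceDeclHandler = on_namespace
    parser.Parse(xml_text, True)
    return found


bad = relative_namespace_uris('<a xmlns:x="rel/ns"><x:b/></a>')
good = relative_namespace_uris('<a xmlns:x="http://example.org/ns"><x:b/></a>')
print(bad, good)
```

A canonicalizer following the quoted note would report an operation failure whenever the returned list is non-empty, rather than absolutizing the URI.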

>http://www.w3.org/TR/xml-infoset/#intro
>Entities
>An information set describes its XML document with entity references already expanded, that is, represented by the information items corresponding to their replacement text. However, there are various circumstances in which a processor may not perform this expansion.

Canonical XML purposefully fails when external parsed entities cannot be resolved.

<john>Personally, I'd prefer it if external entities were left unresolved in canonical forms provided that enough DTD information can be kept to preserve their meaning. The onus should be on signature applications to sign all of the related stuff by adding Reference elements to the Signature.</john>

>http://www.w3.org/TR/xml-infoset/#intro
>Base URIs
>Several information items have a [base URI] property. This is computed according to [XML Base]. Note that retrieval of a resource may involve redirection at the parser level (for example, in an entity resolver) or below; in this case the base URI is the final URI used to retrieve the resource after all redirection.

Canonical XML can also work with explicit declarations of Base URI, see issue 1 above.

The Infoset specification introduces the concept of a "Synthetic Infoset" that is not the result of parsing an XML document; instead it is an Infoset of partial XML that might result from the use of an API or the DOM. This corresponds to the "document subset" of Canonical XML.

<john>With respect to a well-formed XML document read in by an API, Synthetic InfoSet appears to be saying that you can create an information set that is a subset of the information set of the original document, possibly by specifying the subset with an XPointer expression.

If so, then HOORAY!
</john>
__

Information Set provides for Entity and CDATA start and end mark information items.

{John: could you take a stab at describing how this is useful in dealing with PIs in external entities?}

<john>
Yes. Suppose I have an external entity of the following form:

<?PI1?> <e> <?PI2?> </e> <?PI3?>

such that substitution into a document of the form

<doc> &extEnt; <?PI4?> </doc>

becomes

<doc> <?PI1?> <e> <?PI2?> </e> <?PI3?> <?PI4?> </doc>

I want to retain the original meaning of element 'e' as well as PI1, PI2 and PI3. One way to do this is to add an xml:base attribute that preserves the base URI of the external entity in the replacement text. The problem is that, since xml:base is an attribute, adding it to 'e' will cover PI2 but will not cover PI1 and PI3. Those therefore experience a change in meaning when substituted into the document if they use the base URI *and* if the document has a different base URI than the external entity (which is quite reasonable).

If I knew the start and end of the external entity substitution text, as well as its base URI, I could wrap the substituted text in an additional element that records the entity's base URI (for example, via an xml:base attribute).

The resulting XML may not be directly usable since the addition of such an element has made 'e' the grandchild of doc, where it used to be the child, so embedded self-referential XPaths may fail. However, this technique is useful for signatures since it generates an XML string that 1) is easily seen to mean the same thing as the source document because it is XML, and 2) differs depending on whether a PI such as PI3 or PI4 is in the external entity or in the main document.

</john>
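
The wrapping technique John describes might be sketched as follows. This is only an illustration: the wrapper element name ext:entity, its namespace URI, and the entity's URI are hypothetical stand-ins, and the expansion is done by plain string substitution rather than by a real entity-aware parser.

```python
from xml.dom import Node, minidom
from xml.sax.saxutils import quoteattr

XML_NS = "http://www.w3.org/XML/1998/namespace"
WRAPPER_NS = "http://example.org/2001/entity-wrapper"  # hypothetical namespace


def expand_with_wrapper(doc_text, entity_ref, replacement, entity_base):
    """Substitute replacement text for entity_ref inside a wrapper element
    that marks the entity's start and end and records its base URI."""
    wrapped = "<ext:entity xmlns:ext=%s xml:base=%s>%s</ext:entity>" % (
        quoteattr(WRAPPER_NS), quoteattr(entity_base), replacement)
    return doc_text.replace(entity_ref, wrapped)


doc_text = "<doc> &extEnt; <?PI4?> </doc>"
entity_text = "<?PI1?> <e> <?PI2?> </e> <?PI3?>"
expanded = expand_with_wrapper(
    doc_text, "&extEnt;", entity_text, "http://example.org/entities/ext.xml")

# Parse the result to confirm it is well-formed and inspect PI placement.
dom = minidom.parseString(expanded)
doc = dom.documentElement
wrapper = doc.getElementsByTagName("ext:entity")[0]

pis_in_entity = [n.target for n in wrapper.childNodes
                 if n.nodeType == Node.PROCESSING_INSTRUCTION_NODE]
pis_in_doc = [n.target for n in doc.childNodes
              if n.nodeType == Node.PROCESSING_INSTRUCTION_NODE]
print(pis_in_entity, pis_in_doc)
```

PI1 and PI3 end up as children of the wrapper, so they fall under its xml:base, while PI4 remains a direct child of doc. Element 'e' is now a grandchild of doc, which is the usability caveat noted above.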
__
Joseph Reagle Jr.                 http://www.w3.org/People/Reagle/
W3C Policy Analyst                mailto:reagle@w3.org
IETF/W3C XML-Signature Co-Chair   http://www.w3.org/Signature
W3C XML Encryption Chair          http://www.w3.org/Encryption/2001/