- From: Joseph M. Reagle Jr. <reagle@w3.org>
- Date: Thu, 22 Feb 2001 15:12:21 -0500
- To: www-xml-infoset-comments@w3.org
- Cc: Paul Grosso <pgrosso@arbortext.com>, "IETF/W3C XML-DSig WG" <w3c-ietf-xmldsig@w3.org>
Comments on XML Infoset [1].
Reviewer: Joseph Reagle.
While Canonical XML [2] is not based on the XML Information Set
specification, future canonicalization (C14N) algorithms might be.
Consequently, these comments try to identify limitations of the present C14N
design in the context of the latest Infoset draft. As a point of history,
Canonical XML was originally based on a selection and serialization of
Information Set items. When Canonical XML was transferred to the XML
Signature WG, it was changed so as to be based on XPath because XPath
provided useful features related to document subsets (serializing portions
of an XML document) and was already a W3C Recommendation.
These brief comments do not contain any editorial or substantive comments or
suggestions. Instead, it points out similarities and divergences should a
canonical form ever be based on [1], and asks a few questions.
[Infoset] http://www.w3.org/TR/2001/WD-xml-infoset-20010202
[C14N] http://www.w3.org/TR/2001/PR-xml-c14n-20010119
__
The Canonical XML specification identifies three limitations where:
>The difficulties arise due to the loss of the following information not
available in the data model:
>1. base URI, especially in content derived from the replacement text
> of external general parsed entity references
>2. notations and external unparsed entity references
>3. attribute types in the document type declaration
>http://www.w3.org/TR/2001/PR-xml-c14n-20010119#Limitations
While some of these issues arise because information was "not available in
the [XPath] data model" they also arise from the fact that this information
is not represented in a standalone well-formed XML document which was one of
the goals of Canonical XML. Where applications are concerned with this
information, the Canonical XML specification (non-normatively) mentions how
the information might be reintegrated into the canonical form if it will
undergo subsequent processing -- but typically it is not, it is merely used
in the hash computation of the signature creation. Having this information
available via the Infoset makes the task of the application that wishes to
do this much easier.
1. Canonical XML requires that Base URI be explicitly declared to mitigate
problems of resolving relative URIs when the URI context is not known or
when external entities with relative URI are incorporated into the document.
This information is explicitly part of the InformationSet data model.
However, if I have a InfoSet and BaseURI compliant parser which resolves an
external entity without explicit BaseURI declarations, would the BaseURI
Infoset properties still be available in the "importing document" via the
heuristics [a] of finding the BaseURI when the "imported document" was parsed?
[a] http://www.w3.org/TR/2000/PR-xmlbase-20001220/#rfc2396
>4. Resolving Relative URIs
>4.1. Relation to RFC 2396
>RFC 2396 [IETF RFC 2396] provides for base URI information to be
>embedded within a document. The rules for determining the base
>URI can be summarized as follows (highest priority to lowest):
2. "The loss of external unparsed entity references and the notations that
bind them to applications means that canonical forms cannot properly
distinguish among XML documents that incorporate unparsed data via this
mechanism." [C14N]
C14N only preserves the name of external unparsed entity references from the
complete set available under the Infoset {name, system identifier, public
identifier, or notation}. An application with all of this information can
more easily generate an additional signature reference over the actual
external entity if a user wishes to protect the integrity of that information.
3. "the loss of attribute types can affect the canonical form in different
ways depending on the type. Attributes of type ID cease to be ID attributes.
Hence, any XPath expressions that refer to the canonical form using the id()
function cease to operate... Applications can avoid the difficulties of this
case by ensuring that an appropriate document type declaration is prepended
prior to using the canonical form in further XML processing." [C14N]
This information is preserved in the Information Set. However, as stated
earlier, it's loss in canonical form is a result of the choice to create a
standalone XML serialization, this data would be lost under this constraint
even if the C14N was based on Infoset though it might be easier to prepend
the appropriate DTD as recommended above.
__
>http://www.w3.org/TR/xml-infoset/#intro
>Furthermore, this specification does not define an information set for
documents which use relative URI references in namespace declarations. This
is in accordance with the decision of the W3C XML Plenary Interest Group
described in [Relative Namespace URI References]. Thus the value of a
[namespace name] property is always an absolute URI with an optional
fragment identifier.
This corresponds with Canonical XML design:
>http://www.w3.org/TR/2001/PR-xml-c14n-20010119#DataModel
>Note: This specification supports the recent XML plenary decision to
deprecate relative namespace URIs as follows: implementations of XML
canonicalization MUST report an operation failure on documents containing
relative namespace URIs. XML canonicalization MUST NOT be implemented with
an XML parser that converts relative URIs to absolute URIs.
__
>http://www.w3.org/TR/xml-infoset/#intro
>Entities
>An information set describes its XML document with entity references
already expanded, that is, represented by the information items
corresponding to their replacement text. However, there are various
circumstances in which a processor may not perform this expansion.
Canonical XML purposefully fails when external parsed entities can not
resolve.
__
>http://www.w3.org/TR/xml-infoset/#intro
>Base URIs
>Several information items have a [base URI] property. This is computed
according to [XML Base]. Note that retrieval of a resource may involve
redirection at the parser level (for example, in an entity resolver) or
below; in this case the base URI is the final URI used to retrieve the
resource after all redirection.
Canonical XML can also work with explicit declarations of Base URI, see
issue 1 above.
__
The Infoset specification introduces the concept of a "Synthetic Infoset"
that is not a result of parsing an XML document; instead its a Infoset of
partial XML that might result by use of an API or DOM. This seems to
correspond to the "document subset" of Canonical XML.
__
Joseph Reagle Jr. http://www.w3.org/People/Reagle/
W3C Policy Analyst mailto:reagle@w3.org
IETF/W3C XML-Signature Co-Chair http://www.w3.org/Signature
W3C XML Encryption Chair http://www.w3.org/Encryption/2001/
Received on Thursday, 22 February 2001 16:06:58 UTC