Re: ISSUE-201: C14N 2.0 handling of DTD-related and Schema-related behaviors

We should discuss this on the call tomorrow. Thanks for the excellent outline of issues.

Would it make sense to explicitly tell c14n2 the type of schema used (DTD, XSD, RNG, etc.) as an input, or which schema validation, if any, has already been performed on the input?

It seems there are too many implicit assumptions/cases possible.

regards, Frederick


On May 10, 2010, at 1:47 PM, ext Scott Cantor wrote:

> This issue relates to how many and which options to provide for c14n 2.0 to
> support different modes of operation related to the behavior of validating
> XML processors used in conjunction with XML Signature 2.0.
> 
> If I got anything wrong here, particularly in the area of DTDs and what core
> XML says, please let me know. I'm working off of some knowledge but a lot of
> actual reading of the text.
> 
> Some Key Issues:
> 
> Is attribute normalization solely a property of validation or just built-in
> to XML?
> 
> C14N 2.0 needs to talk about how to turn an octet stream into an XML
> document in accordance with the options we support.
> 
> Can C14N 2.0 options add to the requirements of conformant XML processors
> (e.g. the DTD options)?
> 
> Should C14N 2.0 have fine grained options for DTD behaviors or just a single
> ignore flag?
> 
> Do we need additional support in XML Sig 2.0 for IDness? Related to ACTION
> 456.
> 
> Should we tackle schemas at all or continue to ignore them?
> 
> -- Scott
> 
> Validation in XML
> 
> "Validation" in XML itself is strictly confined to the processing of
> internal and external DTD information, enforcement of document constraints,
> defaulting of attribute values, identification of ID attributes, and
> replacement of entity references in a document. It does NOT include the
> processing of non-DTD grammars such as XSD or RNG schemas, which overlap
> with some of these features.
> 
> All of these behaviors can influence c14n and signature processing in
> various ways, with the exception of constraint checking. It's also the case
> that even non-validating XML processors are obligated (according to the XML
> spec anyway) to parse the internal DTD subset and perform entity
> replacement, add defaults, etc. They are not obligated to parse external
> subsets, and this can result in unexpanded entity references remaining in a
> document (and this is why the XML Infoset includes that notion).
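
To make the point above concrete, here is a minimal Python sketch. It assumes the standard-library ElementTree parser, which is backed by expat and does not validate; the document and its names (doc, greeting, lang) are made up for illustration. Even without validation, the internal subset is read, the internal entity is replaced, and the defaulted attribute shows up on the element.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the internal DTD subset declares one internal
# entity and one attribute default.
DOC = """<!DOCTYPE doc [
  <!ENTITY greeting "hello">
  <!ATTLIST doc lang CDATA "en">
]>
<doc>&greeting;</doc>"""

# ElementTree uses expat, a non-validating parser.
root = ET.fromstring(DOC)

# The entity reference has been replaced by its replacement text.
text = root.text

# The defaulted attribute is reported as if it had been written in the
# document; the resulting tree gives no hint that it came from the DTD.
lang = root.get("lang")

print(repr(text), repr(lang))
```

Any c14n implementation that is handed this tree rather than the octet stream cannot tell which of these values were literally present in the input.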
> 
> It's also true that most, perhaps all, XML parsers include a variety of
> options, mostly defined in informal or non-standard fashion, to disable some
> of these behaviors. (This includes behaviors that are, according to XML
> itself, required.)
> 
> Validation by Schema
> 
> XML itself does not define "validity" in terms of anything except for DTDs,
> but other specifications like XML Schema (XSD) have done so on their own.
> (Presumably this also holds for other schema languages like RELAX-NG?)
> 
> As with DTD processing, XSD can allow for default attributes to be declared.
> It does *not* support entity references. It does have the notion of ID
> attributes, but there has been controversy in the past as to whether it's
> "appropriate" to discuss ID-ness in the absence of a DTD, and this is one of
> the reasons xml:id was proposed.
> 
> In addition, XSD defines the notion of data type normalization, in which the
> lexical form of an attribute or element value is modified after parsing into
> a "canonical" normalized form. This is sometimes exposed as a separate
> property on a DOM node, but in the past has been implemented by actually
> changing the result in the DOM itself.
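
As a rough illustration of data type normalization, the Python sketch below maps lexical forms of xs:boolean and xs:integer to their canonical forms. The helper names are hypothetical and only these two types are covered; it is not a schema processor, just a way to see why writing the canonical form back into the tree changes the bytes that c14n would emit.

```python
import re

XML_WHITESPACE = " \t\r\n"

def canonical_boolean(lexical: str) -> str:
    """Canonical form of an xs:boolean lexical value ("true" or "false")."""
    value = lexical.strip(XML_WHITESPACE)  # whiteSpace facet is "collapse"
    if value in ("true", "1"):
        return "true"
    if value in ("false", "0"):
        return "false"
    raise ValueError(f"not an xs:boolean: {lexical!r}")

def canonical_integer(lexical: str) -> str:
    """Canonical form of an xs:integer: no '+' sign, no leading zeros."""
    value = lexical.strip(XML_WHITESPACE)
    if not re.fullmatch(r"[+-]?[0-9]+", value):
        raise ValueError(f"not an xs:integer: {lexical!r}")
    return str(int(value))

# Same values, different lexical forms: the canonical form is not what
# the signer serialized.
as_written = ["1", " +007 "]
as_normalized = [canonical_boolean("1"), canonical_integer(" +007 ")]
print(as_written, as_normalized)
```

If a schema-aware parser replaces the first list with the second inside the DOM, a digest computed over the canonicalized tree no longer matches one computed over the original octets.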
> 
> Existing C14N Behavior
> 
> The C14N 1.x suite of algorithms includes the following "assumed" processing
> steps that are related to the behaviors of validating XML processors and DTD
> content:
> 
> - parsed entity references are replaced (including external references)
> - default attributes are added to each element
> - attribute value normalization
> 
> (The last of these is implied by C14N 1.x to be related to "validating processor"
> behavior, but my reading of XML 1.0 suggests that it's performed by both
> validating and non-validating processors, so may be irrelevant to this
> topic. Or I could be misreading things.)
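
One way to probe this question is the Python sketch below. It assumes the standard-library ElementTree parser (expat, non-validating) and a made-up element r with attribute t. The same attribute is parsed twice: once with no DTD at all, and once with an internal subset that declares it as NMTOKENS.

```python
import xml.etree.ElementTree as ET

# The attribute value contains leading, trailing, and repeated spaces
# plus a literal newline.
ELEMENT = '<r t="  a\n  b  "/>'

NO_DTD = ELEMENT
WITH_DTD = "<!DOCTYPE r [<!ATTLIST r t NMTOKENS #IMPLIED>]>" + ELEMENT

# With no declaration the attribute is treated as CDATA: each white
# space character becomes a single space, and nothing is trimmed.
undeclared = ET.fromstring(NO_DTD).get("t")

# With a non-CDATA declared type the parser also trims the value and
# collapses runs of spaces.
declared = ET.fromstring(WITH_DTD).get("t")

print(repr(undeclared), repr(declared))
```

Under these assumptions, normalization happens in both cases without validation, but the result differs depending on whether the declaration was read, so skipping the DTD changes the outcome rather than switching normalization off.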
> 
> Note that these steps are discussed only in terms of c14n of an octet stream
> to produce a node set to be processed. In other words, C14N 1.x is a bit
> vague on the fact that if an existing node set is to be operated on, such
> steps as entity replacement and attribute defaults would presumably already
> have had to occur (and if they didn't occur, the c14n algorithm would have
> no way of knowing this). This exposes a subtle point; even with existing
> c14n, the actual output could depend on the parser settings used.
> 
> Finally, the C14N 1.x algorithms explicitly do NOT allow for non-DTD schemas to be
> incorporated into the octet-stream -> node set transformation, which is to
> say that none of the changes that validation by XSD might cause are supposed
> to be in scope for the result of c14n. This means for example that defaults
> defined in an internal DTD subset *are* injected, but defaults defined in a
> schema are not. This again shows how significant the difference between
> passing a node set or an octet stream to c14n would be.
> 
> Current Draft of C14N 2.0
> 
> The current draft proposal includes a pair of options that are related to
> this issue, ignoreDTD and expandEntities. The latter is defined as "if set
> to true ignore all entity declarations, and expand only the predefined
> entities (lt, gt, amp, apos, quot) and character references".
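
The behavior quoted above (expand only the predefined entities and character references, leave everything else alone) can be sketched in Python as follows. The function name is hypothetical and the helper works on raw character data rather than inside a real parser, so treat it as an illustration of the rule and not as the draft's algorithm.

```python
import re

# The five entities predefined by XML itself.
PREDEFINED = {"lt": "<", "gt": ">", "amp": "&", "apos": "'", "quot": '"'}

# Matches hexadecimal character references, decimal character
# references, and named entity references (ASCII names only, which is
# enough for a sketch).
_REFERENCE = re.compile(r"&(#x[0-9A-Fa-f]+|#[0-9]+|[A-Za-z_:][A-Za-z0-9_.:-]*);")

def expand_builtin_references(text: str) -> str:
    """Expand predefined entities and character references only."""
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name.startswith("#x"):
            return chr(int(name[2:], 16))
        if name.startswith("#"):
            return chr(int(name[1:]))
        # Any other entity is left unexpanded, as if its declaration
        # had never been read.
        return PREDEFINED.get(name, match.group(0))
    # A single pass, so text produced by one expansion is not rescanned.
    return _REFERENCE.sub(replace, text)

sample = "a &lt; b &#65;&#x42; &foo; &amp;lt;"
print(expand_builtin_references(sample))
```

Because the substitution is a single pass, "&amp;lt;" yields the literal text "&lt;" rather than "<", matching how an XML processor treats an escaped ampersand.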
> 
> The former presumably implies the latter and expands the steps to *skip* to
> include DTD-defined default attribute values. It also mentions skipping
> attribute value normalization; we should determine whether that in fact
> depends on DTD processing. (It may be true that the results of
> normalization would be different if the DTD is processed, but it doesn't
> seem to me that the DTD is required for normalization to happen.)
> 
> As with C14N 1.x, there are no features designed to work in conjunction with
> non-DTD schema languages.
> 
> Open Issues with C14N 2.0 Draft
> 
> The current draft does not (that I can find right now) make a distinction
> between processing an octet stream as input, and a set of "document
> subtrees" (i.e. DOM nodes). This is something the old specs do talk about,
> and I think we need to do that here as well. Essentially, we have to direct
> implementers as to what parser settings have to be used when processing a
> whole XML document as input, and the c14n parameters we define would
> probably correspond to those options.
> 
> In thinking about this issue, it seems to me that we also have to consider
> whether any such options we include correspond to behavior that an XML
> processor is actually expected to support. We can decide that "all parsers
> support things that the XML spec doesn't require", but that's probably
> something we should be explicit about and document in the spec.
> 
> Note also that ignoring the DTD could have effects on ID attributes and thus
> may argue for additional signature functionality related to identifying ID
> attributes absent a DTD.
> 
> Since DTDs are very different from schemas, and since entity expansion is
> generally a DTD-only feature, I disagree with trying to "unify" the
> treatment of DTDs and schemas. These should be separate problems, and even
> if there were options for both they would have separate ramifications
> anyway.
> 
> Since schemas are essentially just ignored now, the question to address
> there is whether they should be "un-ignored". The biggest argument IMHO for
> ignoring them is that unlike DTDs, XML documents have no normative means of
> binding themselves to a schema. There are hints and implications but nothing
> normative; in fact, there are some proposals for a PI to introduce such a
> concept.
> 
> In practice, schemas really introduce four components related to signatures
> and c14n:
> 
> - default values
> - data type normalization
> - IDness of attributes
> - QName content models
> 
> The first is pretty well known to be a bad idea. Simple fix: stop using
> them.
> 
> The second has been worked around pretty easily in newer code by avoiding
> direct mods of the DOM itself.
> 
> The third remains a big issue, but could be addressed with an explicit
> Signature syntax for identifying "intended" ID attributes within the scope
> of a reference. And this is also a problem with ignoring DTDs so probably
> needs to be looked at anyway.
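
A minimal Python sketch of what resolving a reference against explicitly listed ID attributes might look like. Everything here is hypothetical: the function, the idea of passing the attribute names as a plain list, and the sample Envelope document. No such syntax exists in the current drafts.

```python
import xml.etree.ElementTree as ET

def find_by_declared_id(root, id_value, id_attributes):
    """Return the elements whose value for one of the attributes named
    in id_attributes equals id_value. No DTD or schema is consulted."""
    matches = []
    for element in root.iter():
        for name in id_attributes:
            if element.get(name) == id_value:
                matches.append(element)
                break
    return matches

# Hypothetical document: only "Id" is meant to be an ID attribute, even
# though another attribute happens to carry the same value.
DOC = '<Envelope><Body Id="b1"/><Header ref="b1"/></Envelope>'
root = ET.fromstring(DOC)

targets = find_by_declared_id(root, "b1", ["Id"])
print([element.tag for element in targets])
```

A verifier would presumably have to reject the reference unless exactly one element matches, since nothing else enforces uniqueness once the DTD or schema is out of the picture.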
> 
> The fourth we already have some discussion around.
> 
> -- Scott
> 
> 
> 

Received on Monday, 17 May 2010 14:16:39 UTC