ISSUE-201: C14N 2.0 handling of DTD-related and Schema-related behaviors

This issue concerns how many and which options to provide in c14n 2.0 to
support different modes of operation tied to the behavior of validating XML
processors used in conjunction with XML Signature 2.0.

If I got anything wrong here, particularly in the area of DTDs and what core
XML says, please let me know. I'm working partly from prior knowledge, but
mostly from an actual reading of the text.

Some Key Issues:

Is attribute normalization solely a property of validation or just built-in
to XML?

C14N 2.0 needs to talk about how to turn an octet stream into an XML
document in accordance with the options we support.

Can C14N 2.0 options add to the requirements of conformant XML processors
(e.g. the DTD options)?

Should C14N 2.0 have fine-grained options for DTD behaviors or just a single
ignore flag?

Do we need additional support in XML Sig 2.0 for IDness? Related to ACTION
456.

Should we tackle schemas at all or continue to ignore them?

-- Scott

Validation in XML

"Validation" in XML itself is strictly confined to the processing of
internal and external DTD information, enforcement of document constraints,
defaulting of attribute values, identification of ID attributes, and
replacement of entity references in a document. It does NOT include the
processing of non-DTD grammars such as XSD or RNG schemas, which overlap
with some of these features.

All of these behaviors can influence c14n and signature processing in
various ways, with the exception of constraint checking. It's also the case
that even non-validating XML processors are obligated (according to the XML
spec anyway) to parse the internal DTD subset and perform entity
replacement, add defaults, etc. They are not obligated to parse external
subsets, and this can result in unexpanded entity references remaining in a
document (and this is why the XML Infoset includes that notion).

It's also true that most, perhaps all, XML parsers include a variety of
options, mostly defined in informal or non-standard fashion, to disable some
of these behaviors. (This includes behaviors that are, according to XML
itself, required.)

Validation by Schema

XML itself does not define "validity" in terms of anything except for DTDs,
but other specifications like XML Schema (XSD) have done so on their own.
(Presumably this also holds for other schema languages like RELAX-NG?)

As with DTD processing, XSD can allow for default attributes to be declared.
It does *not* support entity references. It does have the notion of ID
attributes, but there has been controversy in the past as to whether it's
"appropriate" to discuss ID-ness in the absence of a DTD, and this is one of
the reasons xml:id was proposed.

In addition, XSD defines the notion of data type normalization, in which the
lexical form of an attribute or element value is modified after parsing into
a "canonical" normalized form. This is sometimes exposed as a separate
property on a DOM node, but in the past has been implemented by actually
changing the result in the DOM itself.
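
For reference, turning that machinery on typically looks something like the
following sketch (JAXP, assuming an XSD-aware parser such as Xerces; whether
and how the DOM actually gets augmented varies by processor):

    import java.io.File;
    import javax.xml.XMLConstants;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.validation.Schema;
    import javax.xml.validation.SchemaFactory;

    public class SchemaAwareParsing {
        // Configure a parse that validates against an XSD while building the
        // DOM. Depending on the processor, schema-declared attribute defaults
        // may be injected into the tree, and xs:ID-typed attributes may become
        // visible to Document.getElementById().
        public static DocumentBuilderFactory withSchema(File xsd) throws Exception {
            Schema schema = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(xsd);
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            dbf.setSchema(schema);
            return dbf;
        }
    }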

Existing C14N Behavior

The C14N 1.x suite of algorithms includes the following "assumed" processing
steps, which are related to the behavior of validating XML processors and DTD
content:

- parsed entity references are replaced (including external references)
- default attributes are added to each element
- attribute values are normalized

(The last of these is implied by C14N 1.x to be related to "validating processor"
behavior, but my reading of XML 1.0 suggests that it's performed by both
validating and non-validating processors, so may be irrelevant to this
topic. Or I could be misreading things.)

Note that these steps are discussed only in terms of c14n of an octet stream
to produce a node set to be processed. In other words, C14N 1.x is a bit
vague on the fact that if an existing node set is to be operated on, such
steps as entity replacement and attribute defaults would presumably already
have had to occur (and if they didn't occur, the c14n algorithm would have
no way of knowing this). This exposes a subtle point; even with existing
c14n, the actual output could depend on the parser settings used.

Finally, C14N 1.x explicitly does NOT allow for non-DTD schemas to be
incorporated into the octet-stream -> node set transformation, which is to
say that none of the changes that validation by XSD might cause are supposed
to be in scope for the result of c14n. This means for example that defaults
defined in an internal DTD subset *are* injected, but defaults defined in a
schema are not. This again shows how significant the difference between
passing a node set or an octet stream to c14n would be.
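
To make the default-attribute point concrete, here's a small example of my
own (the behavior itself is straight out of the C14N 1.0 examples). Given
this input octet stream:

    <?xml version="1.0"?>
    <!DOCTYPE doc [
      <!ATTLIST e1 attr CDATA "default">
    ]>
    <doc><e1/></doc>

a C14N 1.x implementation is expected to produce something like:

    <doc><e1 attr="default"></e1></doc>

whereas an attribute defaulted by an XSD rather than by the internal subset
would not appear in the output.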

Current Draft of C14N 2.0

The current draft proposal includes a pair of options that are related to
this issue, ignoreDTD and expandEntities. The latter is defined as "if set
to true ignore all entity declarations, and expand only the predefined
entities (lt, gt, amp, apos, quot) and character references".

The former presumably implies the latter and expands the set of steps to
*skip* so that it also covers DTD-defined default attribute values. It also
mentions skipping attribute value normalization; we should determine whether
that in fact depends on DTD processing. (It may be true that the results of
normalization would be different if the DTD is processed, but it doesn't
seem to me that the DTD is required for normalization to happen.)

As with C14N 1.x, there are no features designed to work in conjunction with
non-DTD schema languages.

Open Issues with C14N 2.0 Draft

The current draft does not (that I can find right now) make a distinction
between processing an octet stream as input, and a set of "document
subtrees" (i.e. DOM nodes). This is something the old specs do talk about,
and I think we need to do that here as well. Essentially, we have to direct
implementers as to what parser settings have to be used when processing a
whole XML document as input, and the c14n parameters we define would
probably correspond to those options.
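
To illustrate, the spec text would essentially be telling implementers to do
something like the following (purely my own sketch of a possible mapping in
JAXP terms, not anything the draft says):

    import javax.xml.parsers.DocumentBuilderFactory;

    public class OctetStreamInput {
        // Hypothetical mapping from the draft's two flags to parser settings,
        // applicable only when the input is an octet stream rather than an
        // existing node set.
        public static DocumentBuilderFactory forParams(boolean ignoreDTD,
                                                       boolean expandEntities)
                throws Exception {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            dbf.setValidating(false);
            dbf.setExpandEntityReferences(expandEntities);
            if (ignoreDTD) {
                // Skipping the external subset is easy; truly ignoring the
                // internal subset (entities, defaults, IDness) is exactly the
                // kind of thing standard parser APIs don't promise (more on
                // that below).
                dbf.setFeature(
                    "http://apache.org/xml/features/nonvalidating/load-external-dtd",
                    false);
            }
            return dbf;
        }
    }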

In thinking about this issue, it seems to me that we also have to consider
whether any such options we include correspond to behavior that an XML
processor is actually expected to support. We can decide that "all parsers
support things that the XML spec doesn't require", but that's probably
something we should be explicit about and document in the spec.

Note also that ignoring the DTD could have effects on ID attributes and thus
may argue for additional signature functionality related to identifying ID
attributes absent a DTD.
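
For what it's worth, DOM Level 3 already lets an application assert IDness
after the fact, which is roughly the behavior a Signature-level mechanism
would need to standardize. A sketch (the element and attribute names here
are made up):

    import org.w3c.dom.Document;
    import org.w3c.dom.Element;

    public class IdWithoutDtd {
        // Mark a hypothetical "ID" attribute as being of type ID so that
        // same-document references can be resolved via getElementById() even
        // though no DTD or schema declared it as such.
        public static void assertIdness(Document doc) {
            Element target = (Element) doc.getElementsByTagName("Assertion").item(0);
            target.setIdAttribute("ID", true);
        }
    }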

Since DTDs are very different from schemas, and since entity expansion is
generally a DTD-only feature, I disagree with trying to "unify" the
treatment of DTDs and schemas. These should be separate problems, and even
if there were options for both they would have separate ramifications
anyway.

Since schemas are essentially just ignored now, the question to address
there is whether they should be "un-ignored". The biggest argument IMHO for
ignoring them is that unlike DTDs, XML documents have no normative means of
binding themselves to a schema. There are hints and implications but nothing
normative; in fact, there are some proposals for a PI to introduce such a
concept.
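
(The xsi:schemaLocation attribute is the usual example of such a hint; the
names below are made up:

    <po:order xmlns:po="urn:example:po"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:example:po order.xsd"/>

Per XSD it is only a hint that a processor is free to ignore, which is
exactly why it can't serve as a normative binding.)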

In practice, schemas really introduce four components related to signatures
and c14n:

- default values
- data type normalization
- IDness of attributes
- QName content models

The first is pretty well known to be a bad idea. Simple fix: stop using
them.

The second has been worked around pretty easily in newer code by avoiding
direct mods of the DOM itself.

The third remains a big issue, but could be addressed with an explicit
Signature syntax for identifying "intended" ID attributes within the scope
of a reference. And this is also a problem with ignoring DTDs so probably
needs to be looked at anyway.

The fourth we already have some discussion around.

-- Scott
