Notes on XML Information Set Working Draft of 20 December 1999

C. M. Sperberg-McQueen

28 January 2000

This document provides a personal technical review of the XML Information Set in the W3C Working Draft of 20 December 1999.

The comments in this review are the personal notes of the author, and do not necessarily reflect the views or policies of other persons, institutions, bodies, consortia, or funding agencies with which the author may be associated.

The review follows the structure of the Infoset draft. Large parts of it were composed in the course of reading, and so may not take into account information provided in later sections of the spec. I have left some such discussions intact in revision because they help identify points about which your readers may be confused.

1 Introduction

para 5, in the description of the maximal information set, for "all the core and all the peripheral items with all the peripheral properties" read "all the core and all the peripheral items with all the core and peripheral properties".

para 5, last sentence uses the term consistent, which is undefined. What constitutes consistency as between two information sets? This is an important enough concept to be defined fully. I believe what is meant by S1 is consistent with S2 is that they can be merged into a single information set M such that for each S in {S1, S2}:

Each information item in S is in M.
Each property in S is in M.
For each property P in S, the value of P in S is
- identical to the value of P in M, if the value is atomic
- a subset of the value of P in M, if the value is a set
- each item in the value of P in S also appears in the value of P in M, in the same order, if the value is a sequence

(Formally: S1 is consistent with S2 iff there exists some information set M such that S1 subsumes M and S2 subsumes M.) But it would be nice not to have to guess, if only because different people won't always make the same guess.
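The three rules listed above can be checked mechanically for a given candidate M. What follows is a minimal sketch in Python, under modeling assumptions that are mine and not the draft's: an information set is a dictionary from item identifiers to property dictionaries, set values are frozensets, sequence values are tuples, and anything else is atomic. The function names are hypothetical. The sketch only verifies a merged set supplied by the caller; it does not search for one.

```python
def is_subsequence(short, long):
    """True if every member of short appears in long, in the same relative order."""
    remaining = iter(long)
    return all(member in remaining for member in short)


def value_fits(value_in_s, value_in_m):
    """Apply the rule for atomic, set-valued, or sequence-valued properties."""
    if isinstance(value_in_s, (set, frozenset)):
        return isinstance(value_in_m, (set, frozenset)) and value_in_s <= value_in_m
    if isinstance(value_in_s, (list, tuple)):
        return isinstance(value_in_m, (list, tuple)) and is_subsequence(value_in_s, value_in_m)
    return value_in_s == value_in_m


def is_contained(s, m):
    """Each item and property of s is in m, with a compatible value."""
    for item_id, properties in s.items():
        if item_id not in m:
            return False
        for name, value in properties.items():
            if name not in m[item_id] or not value_fits(value, m[item_id][name]):
                return False
    return True


def witnesses_consistency(s1, s2, m):
    """m shows s1 and s2 to be consistent if both are contained in it."""
    return is_contained(s1, m) and is_contained(s2, m)


# Hypothetical information sets for illustration only.
s1 = {"doc": {"children": ("pi-alpha", "demo")}}
s2 = {"doc": {"children": ("demo", "pi-omega"), "entities": frozenset({"x"})}}
m = {"doc": {"children": ("pi-alpha", "demo", "pi-omega"),
             "entities": frozenset({"x", "y"})}}
reordered = {"doc": {"children": ("demo", "pi-alpha")}}

print(witnesses_consistency(s1, s2, m))   # True
print(is_contained(reordered, m))         # False: order differs from m
```

Even this small sketch had to choose how to treat repeated members of a sequence, which is the kind of guess the spec should not leave to readers.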

Paragraph 5 appears to imply that a processor which claims to conform to this spec must not supply information to downstream applications if that information is not defined as part of the maximal information set. If this implication is not intended (as may be inferred from the section on conformance), the paragraph needs to be revised, perhaps by changing the initial clause to read "For any given XML document, ~~there are~~ a number of ~~corresponding~~ information sets are defined by this specification: ...".

Since in practice specialized processors of various kinds (XML Schema processors or link processors, to name two) will need to provide information not included in the maximal info set defined here, it is highly desirable that the Infoset spec be supplemented by a specification describing explicitly how to define new packages or modules of information items which must, may, or should be provided to downstream apps by specialized processors of various types. Such an explicit description would allow the task of elaborating the info set for various specialized purposes to be distributed, and avoid a continuous stream of requests that the infoset be extended to handle this or that specialized form of information. My personal view is that such a description should be made part of the Infoset spec itself. Others have said they fear it would delay the Infoset spec, and that the meta-description should be in a separate document.

A first cut at a metadescription is given in the comments by the XML Schema WG. Something along those lines would, I think, help ensure that third parties can tell the difference between a definition of a new set of information items and a handsaw, and would thus be A Good Thing.

In para 6, for "In the case that an entity reference cannot be expanded," I think the spec should read "In the case that an entity reference cannot be or has not been expanded," since XML processors are allowed to expand or not expand entity references at will, and if the Infoset spec is to define the information set for XML 1.0, it should not attempt to change that rule.

The note after paragraph 7, which refers to RFC 2119, strikes me as needlessly opaque. Is there a serious objection to giving the definitions here, so that readers who don't carry the RFC library around with them can know how these terms are used in this spec? The only objection I can think of is that, if RFC 2119 were to be revised, you would want the new definitions automatically made part of this spec -- please say it ain't so.

2 Information Items

2.1 The Document Information Item

2.1.1 Document: Core Properties

S.v. children: is reference defined? The specification that various kinds of information are supplied in the form of references seems like inappropriate physicalization to me: is there any reason not to leave it open to implementors whether to supply the information by reference, by name, or by copy? At the very least, specify that by reference you do not mean to imply anything in particular about how the property is presented. (As far as I can tell, there is nothing in the infoset as currently defined which would require reference semantics instead of copy semantics, or which would allow any distinction between them to be made or illustrated. If there is some sense in which the difference between copies and references matters, the spec should identify it explicitly.)

The property children is defined so as to require that all conforming processors read external entities referenced from the DTD ("The list must contain ... a reference to one processing instruction information item for each processing instruction preceding the document element (either in the document entity or in a lower-level entity) or following the document element"); this seems incompatible with the rule in XML that non-validating processors are not required to read external entities. Does the WG really wish to require that all infoset-conforming processors read the entire DTD, or is there an error in the definition? In the section on conformance, you appear to backpedal and say processors don't have to read external entities -- which seems to be in direct contradiction to this definition.

The wording of the definition of children makes explicit provision for the appearance of other items in the sequence, which raises two questions in my mind:

Do other properties with sequence values have a similar openness? (E.g. may the entities property contain references to entity items for parsed entities, or is it restricted to items for unparsed entities?) If the definitions given are expected to be exhaustive (i.e. if I cannot add parsed entities to the entities property), that should be said clearly at least once and possibly more than once. If definitions are exhaustive, it's not completely clear to me how an information set which defines new kinds of information items can be integrated with this one. (Do I have to define a whole parallel set of children-augmented-with-foo-items properties?)
Does the openness defined for document children mean that other information items of any kind can appear in the sequence, or only other information items of the kinds specified in section 2.1.2?

The properties notations and entities are likewise defined so as to require that all conforming processors read external entities referenced from the DTD ("... one for each notation declaration in the DTD", "one for each unparsed entity declaration in the DTD"); this seems incompatible with the rule in XML that non-validating processors are not required to read external entities.

The definition of the entities property seems awkward: it could be read as requiring that an info-set-conforming processor supply an information item for each entity declaration which it does not parse, but not for those it does parse. Since the processor must parse the declaration in order to supply the information item, but need supply the information item only for declarations which are unparsed, may a malicious implementor (or a simple-minded theorem prover) conclude that the entities property may always be an empty set? Perhaps it should read "one for each unparsed-entity declaration in the DTD" or "one for each unparsed entity declared in the DTD".

The definitions of entities and notations should perhaps simply say "A set ..." rather than saying redundantly "An unordered set ...", which suggests that the authors believe that sets are usually ordered.

2.1.2 Document: Peripheral Properties

The definitions of children and of children - doctype appear to be mutually incompatible for any document which has processing instructions in the DTD. Consider the document

<?xml version="1.0"?>
<?alpha linkprocess strong; colors 16; security-level #undefined?>
<!DOCTYPE demo SYSTEM "file:///usr/local/lib/demo.dtd" [
<?lambda enforce="all" relax="idref idrefs" ?>
<!ENTITY % x.date 'gregorian | julian | mosaic |' >
<!NOTATION gregorian PUBLIC "-//ESCR//NOTATION Gregorian Calendar//LA" 
  "http://www.vatican.org/Calendar/" >
<!ENTITY % extensions SYSTEM "file:///usr/local/lib/x/ext.dtd">
]>
<?omega ha-ha?>
<demo/>

The children property should contain at least

PI: alpha linkprocess strong; colors 16; security-level #undefined
PI: lambda enforce="all" relax="idref idrefs"
PI: omega ha-ha

as well as any processing instructions occurring in the file ext.dtd. If a processor wishes to add an information item for the document type declaration, the definition of children - doctype specifies (1) that there must be exactly one reference to exactly one DTD information item, and (2) that its position in the list "must reflect its position in the original document". But where should the DTD info item go? In the original document, the document type declaration does not occur before the PI for alpha, nor after the PI for alpha and before the PI for lambda, nor after the PI for lambda and before that for omega, nor after the PI for omega. But there are no other positions available in the sequence. The document type declaration begins before, and ends after, the PI for lambda. It is not possible to insert a single information item for it into the sequence just given, and have the position of that information item reflect the location of the DTD in the instance.
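The argument can be checked by brute force. The sketch below is in Python and uses an abridged copy of the example (the NOTATION declaration and the extensions entity are omitted for brevity). It rests on one assumption of mine: that a sequence "reflects position in the original document" only if each item ends before the next one begins. The helper names are hypothetical.

```python
DOC = """<?xml version="1.0"?>
<?alpha linkprocess strong; colors 16; security-level #undefined?>
<!DOCTYPE demo SYSTEM "file:///usr/local/lib/demo.dtd" [
<?lambda enforce="all" relax="idref idrefs" ?>
<!ENTITY % x.date 'gregorian | julian | mosaic |' >
]>
<?omega ha-ha?>
<demo/>"""


def span(text, start_marker, end_marker):
    """Return (start, end) character offsets of the construct in text."""
    start = text.index(start_marker)
    end = text.index(end_marker, start) + len(end_marker)
    return (start, end)


# Source spans of the three processing instructions, in document order.
pis = [span(DOC, "<?alpha", "?>"),
       span(DOC, "<?lambda", "?>"),
       span(DOC, "<?omega", "?>")]

# Source span of the whole document type declaration.
dtd = span(DOC, "<!DOCTYPE", "]>")


def in_document_order(spans):
    """Assumed reading: every item must end before its successor begins."""
    return all(a[1] <= b[0] for a, b in zip(spans, spans[1:]))


# Try every slot in the children sequence for the single DTD item.
valid_positions = [i for i in range(len(pis) + 1)
                   if in_document_order(pis[:i] + [dtd] + pis[i:])]

print(valid_positions)  # [] : no slot works, because the DTD span encloses lambda
```

A looser reading of "reflect its position" (for example, ordering by start offset alone) would admit a slot, but the draft does not say which reading is meant.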

I think it would be preferable to require that a DTD information item be provided, if a DTD is provided, and to place the PI for lambda among its children. That would materially simplify any process of defining information sets for schema material in DTD notation. (As it is, any such information set would be required basically to abandon all the properties of the current document information item, and build a new set of properties from scratch, instead of building on the existing set of properties.) Other solutions might be preferred. But some change must be made: the current definitions make it impossible to construct a children property for the document which conforms to all the definitions.

2.2 Element Information Items

The definitions of the properties namespace URI and local name refer to something called the "element's name". What is that? Elements are objects which occur in documents; where I come from, names are strings or tokens used to identify things. Some elements in a document instance do have unique identifiers: the only plausible interpretation I can find for the term "element name" is "the value of the unique identifier supplied for an element, by means of an ID attribute". But does that mean that any element which doesn't have an ID attribute gets an empty value for the [namespace URI] and [local name] properties?

And where does the element's element-type name go?

Various informants tell me that by the misleading and ill-chosen term element name you wish to refer to what in SGML is called the generic identifier, and is often called the element type name.

Oh. Well, I have to say I hate this usage.

There is no need to invent a new term, and I urge you to eliminate this bad and misleading terminology from this spec before it spreads further than it already has. If you cannot bring yourselves to use either the term element-type name ("The Name in the start- and end-tags gives the element's type." -XML 1.0, sec. 3.1, after productions 40-41) or generic identifier, I'd rather that you use the term tag-name (glossed not as 'name of the tag' but as 'name which is required to appear within the tag') -- I don't much like it, either, but it's less actively confusing than element name.

The definition of property attributes seems to contradict both the XML spec and the "Namespaces in XML" spec, by stipulating that namespace declarations are not represented as attribute information items. Namespace declarations are clearly attribute value specifications within the meaning of the XML 1.0 spec (production 41), and the "Namespaces in XML" spec explicitly states that they are in fact attributes. Unless the charter of the Infoset WG includes a charge to revise the terminology of XML 1.0 and Namespaces, the infoset spec seems to be going out of bounds here.

It is useful, for practical purposes, to be able to distinguish namespace declarations from other attributes. But it would be good, I think, to accomplish this by some means other than a flat contradiction with the XML 1.0 and Namespaces specs.
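One way to keep the practical distinction without denying that namespace declarations are attributes would be to mark them with a property on the attribute item. The following Python sketch is purely illustrative: the class AttributeItem and its is_namespace_declaration property are hypothetical and appear nowhere in the draft.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributeItem:
    name: str
    value: str
    # Hypothetical property: true for xmlns and xmlns:* attributes.
    is_namespace_declaration: bool


def attribute_items(start_tag_attributes):
    """Build one item per attribute; namespace declarations are flagged, not dropped."""
    items = []
    for name, value in start_tag_attributes.items():
        is_declaration = name == "xmlns" or name.startswith("xmlns:")
        items.append(AttributeItem(name, value, is_declaration))
    return items


items = attribute_items({"xmlns:a": "http://aaa.aaa.com/a", "id": "x1"})
declarations = [item.name for item in items if item.is_namespace_declaration]
ordinary = [item.name for item in items if not item.is_namespace_declaration]

print(declarations)  # ['xmlns:a']
print(ordinary)      # ['id']
```

Downstream applications that want only the ordinary attributes can filter on the flag, and the terminology of XML 1.0 and Namespaces is left alone.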

The definition of property declared namespaces calls for namespace-declaration information items, "one for each of the namespaces declared in this element". I think what is meant is (or should be) "one for each of the namespaces declared in the start-tag of this element". Otherwise, the relevant property cannot be emitted until after the entire contents of the element are scanned. Consider the example

<!DOCTYPE foo [
<!ELEMENT foo ANY>
<!ATTLIST foo
          xmlns:a CDATA #IMPLIED
          xmlns:b CDATA "."
          xmlns:z CDATA "http://www.zzz.zzz.com/z">
<!ELEMENT bar ANY>
<!ATTLIST foo
          xmlns:c CDATA #IMPLIED >
]>
<foo xmlns:a="http://aaa.aaa.com/a"
     xmlns:b="http://bbb.bbb.com/b">
<c:bar xmlns:c="http://ccc.ccc.com/c"/>
</foo>

It is clear that in this example, the namespace with URI http://ccc.ccc.com/c is defined in the foo element. (The foo element is contiguous. The declaration of namespace http://ccc.ccc.com/c occurs after the beginning of the foo element, and before its ending. If that doesn't satisfy the definition of in, I don't know what does.)

In the same definition, the meaning of the phrase "non-#IMPLIED namespace declarations specified or defaulted for the element" is not clear to this reader. Is the namespace with URI http://www.zzz.zzz.com/z to be represented in the information item or not? For that matter, is the namespace with URI http://ccc.ccc.com/c to be represented or not? The attribute (sic) named xmlns:c is an #IMPLIED attribute, so there are no non-#IMPLIED namespace declarations attached to the bar element in the example. Would the definition produce the right results if it read as follows?

[declared namespaces] A set of references to namespace declaration information items, one for each namespace for which a URI is (a) provided in the start-tag of this element, or (b) provided in the DTD or schema for this element type. Namespace declarations for which no URI is provided in the document or in the DTD are not included in the property.
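To see what the proposed wording would yield for the foo element of the example, here is a small Python sketch. It is my own illustration, not anything in the draft: the function name is hypothetical, the DTD defaults are transcribed from the ATTLIST declarations for foo as they appear in the example, and #IMPLIED is modeled as None.

```python
def declared_namespaces(start_tag_attrs, dtd_defaults):
    """Map prefix -> URI for namespaces with a URI from the start-tag or the DTD.

    dtd_defaults maps attribute names to default values; None stands for #IMPLIED.
    """
    result = {}
    # DTD defaults first, so that values in the start-tag override them.
    for source in (dtd_defaults, start_tag_attrs):
        for name, uri in source.items():
            if uri is None:
                continue  # no URI provided: not included in the property
            if name == "xmlns":
                result[""] = uri
            elif name.startswith("xmlns:"):
                result[name[len("xmlns:"):]] = uri
    return result


foo_dtd = {
    "xmlns:a": None,                        # #IMPLIED
    "xmlns:b": ".",
    "xmlns:z": "http://www.zzz.zzz.com/z",
    "xmlns:c": None,                        # #IMPLIED
}
foo_tag = {
    "xmlns:a": "http://aaa.aaa.com/a",
    "xmlns:b": "http://bbb.bbb.com/b",
}

foo_ns = declared_namespaces(foo_tag, foo_dtd)
for prefix in sorted(foo_ns):
    print(prefix, foo_ns[prefix])
```

Under this reading the property for foo holds a, b, and z; c is absent because no URI is supplied for it either in the start-tag of foo or by a default.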

The definition of property children - entity markers seems to require that no entity be referred to twice (it's a set, not a sequence), and also that the pairs be ordered. It is hard to know how to represent the entity boundaries in the following example if the pairs must be ordered, because the pair for entity B can neither precede that for entity A, nor follow it.

<!DOCTYPE demo [
<!ELEMENT demo (#PCDATA)>
<!ENTITY B "happy">
<!ENTITY A "a very &B; birthday">
]>
<demo>I wish you &A;, and many happy returns.</demo>

Suggested alternate wording:

A sequence of references to entity-start and entity-end information items. Each entity-start information item must be paired with a single entity-end information item, and vice versa. These information items, if provided, must be added to the ordered list of children of the current element. The relative position of each marker information item in the list must reflect its position in the original document.
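Applied to the demo example, the suggested wording would produce a flat sequence in which the markers for B nest inside those for A. The Python sketch below is a simplified illustration under my own assumptions: only internal general entities are handled, the event names are hypothetical, and runs of characters are reported as one event rather than as individual character items.

```python
import re

# Entity declarations from the demo example.
ENTITIES = {"B": "happy", "A": "a very &B; birthday"}

ENTITY_REF = re.compile(r"&(\w+);")


def expand(text):
    """Return a flat list of events: character runs plus paired entity markers."""
    events = []
    position = 0
    for match in ENTITY_REF.finditer(text):
        if match.start() > position:
            events.append(("chars", text[position:match.start()]))
        name = match.group(1)
        events.append(("entity-start", name))
        events.extend(expand(ENTITIES[name]))   # nested references expand recursively
        events.append(("entity-end", name))
        position = match.end()
    if position < len(text):
        events.append(("chars", text[position:]))
    return events


children = expand("I wish you &A;, and many happy returns.")
markers = [event for event in children if event[0] != "chars"]

for event in children:
    print(event)
```

With separate start and end markers in a sequence, the nesting of B inside A needs no ordering of pairs, and an entity referred to twice simply contributes two pairs of markers.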

The definition of property in-scope namespaces refers to "the preceding set", but it seems unlikely that either the property base URI or the property children - CDATA markers is meant. Should the reference be to the declared namespaces property?

2.3 Attribute Information Items

Para 1,

There is one attribute information item for each attribute (specified or defaulted) for each element in the document instance. Namespace declarations are represented using namespace declaration information items, not attribute information items.

is self-contradictory. Unless this spec provides an alternative definition for the term attribute, most readers will and should assume that the definition in the XML 1.0 spec applies. But according to that definition (and according to the namespaces spec), namespace declarations are attributes.

I think it's a design error to omit attribute information items for attributes with implied values. What is the motivation for omitting them? I would have guessed that the motive was to avoid requiring non-validating processors to read the entire DTD, but the properties of the document item clearly require that the entire DTD be read, regardless. So I am puzzled.

The definition of property attribute type should be expanded in two ways. For schemas expressed in DTD notation, it would be useful to distinguish notation attributes from other enumerated types. For schemas expressed in other notations, constraining the legal set of values in this way is problematic, since no new language is likely to have the same set of attribute types that DTDs do. N.B. XML 1.0 does not and cannot prohibit the use of schema languages other than DTDs. It thus cannot be argued that provision for schemas in non-DTD notations is inconsistent with XML 1.0. If it is desired to restrict this Infoset spec to DTD-based schema information, that's OK but this restriction should be stated explicitly, as it goes beyond the restriction of the infoset to XML 1.0.

The property children - entity markers seems to suggest that entity boundaries can be unambiguously assigned positions in the normalized form of an attribute value. This is not obviously the case: where do the entity boundaries go in the following example, in which the normalized form of the attribute value for children is "bar foo"?

<!DOCTYPE tree [
<!ELEMENT tree (n*) >
<!ELEMENT n EMPTY>
<!ATTLIST n
          id ID #IMPLIED
          children IDREFS #IMPLIED >
<!ENTITY B " foo ">
<!ENTITY A " bar ">
]>
<tree>
<n id="root" children=" &A; &B; "/>
<n id="foo"/>
<n id="bar"/>
</tree>

I'm not sure what the best solution is, but the spec should either provide some rules for deciding who gets contested white space in cases like this one, or else provide a property to hold the unnormalized form of the attribute value.
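The contested white space is easy to see if the expansion and the normalization are carried out step by step. The Python sketch below is a simplification of the attribute-value normalization rules of XML 1.0 and rests on assumptions of mine: it handles only internal entities, ignores character references, and the function names are hypothetical.

```python
import re

# Entity declarations from the tree example.
ENTITIES = {"B": " foo ", "A": " bar "}


def expand(value):
    """Replace each internal entity reference by its replacement text."""
    return re.sub(r"&(\w+);", lambda match: ENTITIES[match.group(1)], value)


def normalize_tokenized(value):
    """Normalization for non-CDATA types: trim, and collapse runs of white space."""
    tokens = [token for token in re.split(r"[ \t\r\n]+", value) if token]
    return " ".join(tokens)


literal = " &A; &B; "
expanded = expand(literal)
normalized = normalize_tokenized(expanded)

print(repr(expanded))    # '  bar   foo  '
print(repr(normalized))  # 'bar foo'
```

Three spaces separate bar and foo after expansion: one from the end of A, one from the literal, and one from the start of B. Only one survives normalization, and nothing in the draft says which, so the end marker for A and the start marker for B could each fall on either side of it.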

The definition of children - entity markers is better here than in the section on element items.

2.4 Processing Instruction Information Items

In para 1, for "The XML declaration and text declarations for external parsed entities are not considered processing instructions" read "The XML declaration and text declarations for external parsed entities are not ~~considered~~ processing instructions", or "The XML declaration and text declarations for external parsed entities are not formally ~~considered~~ processing instructions".

I think it would be useful to provide peripheral information items for the XML and text declarations. Failing to provide a way to pass the XML version number will create, at the very least, a large interoperability problem on the day when multiple versions of XML and/or this infoset spec exist.

2.5 Reference to Skipped Entity Information Items

The text of para 1 appears to contradict para 6 of the introduction, which seems to imply that entity references are to be skipped only if they cannot be expanded. I like this version better. (Actually, as a matter of design I like the other version better, but non-expansion of entities was a conscious design decision in XML.)

2.6 Character Information Items

Is non-markup character defined? It should be.

2.8 The Document Type Declaration Information Item

This information item provides a remarkably slender foundation for further development by processors which wish to provide information about the DTD to downstream applications; it would be a better foundation if the children property were divided into (or supplemented by) two properties, one for the internal and one for the external subset, since so many DTD-oriented and DTD-aware applications will care about the differences between internal and external subsets in ways that will make the existing children property useless to them.

Note that the definition of the children property requires that comments and processing instructions within the DTD be duplicated (once here, once in the children property of the document item). This seems unnecessarily confusing. It would be better simply to require a DTD item if a DTD occurs in the instance, and put the entity and notation information where it belongs.

The generic identifier of the root element really ought to be a core property of the DTD item.

2.10 Notation Information Items

The property base URI seems to require that a system identifier be supplied; perhaps (a) add "if any" or "if a system identifier is supplied" and (b) make it peripheral, like most other occurrences of properties with this name.

2.11 Entity Start Information Items

In para 1, for "an peripheral part of the information set" read "a peripheral part ...".

2.12 Entity End Marker Information Items

In para 1, for "an peripheral part of the information set" read "a peripheral part ...".

2.13 CDATA Start Marker Information Items

In para 1, for "an peripheral part of the information set" read "a peripheral part ...".

2.14 CDATA End Marker Information Items

In para 1, for "an peripheral part of the information set" read "a peripheral part ...".

2.15 Namespace Declaration Information Items

Para 2 repeats information about what does and does not count from the discussion of element information items; up till now I had thought the spec was fairly consistent in placing such information in the definition of a property, rather than in the definition of the information item. On the other hand, this description is clearer than the one in section 2.2.

The definition of children here, unlike any elsewhere in the spec, talks about optional entity boundary markers. For editorial consistency, a property children - entity markers should be defined (as for other info items).

It's not clear to this reader what the upshot of the dancing around core and periphery in these definitions is. I think the spec is saying "either namespace URI or children or both must be provided" -- but the fact that you don't say this in so many words makes me doubt my understanding. What is going on here?
