Comments on XML Processor Profiles draft of 16 December 2011

This document contains comments on the 16 December 2011 editor's draft of XML processor profiles.

My compliments to the Working Group on the preparation of this draft. On the whole I found this draft clearer than the earlier draft I reviewed in April (email and HTML versions of my comments are in the mail archive); either my standards have decayed or it's more clearly written than the earlier draft.

Some comments do arise, which follow. First there are some issues which struck me as I read this draft; these are grouped into substantive and stylistic issues. There follows a short review of the status of my earlier comments, as far as I can tell. When references to locations in the text take the form n m or n m l, then n is a section number, m a paragraph number, and l a sentence number. Negative numbers are counted from the end of the container.

Substantive comments

I note with some disappointment that the current draft fails, like the earlier one, to take into account the property of respecting, or failing to respect, the implications of the standalone property; I think that's a lost opportunity.
I think the treatment of validation is much improved in this draft.

I continue to think, however, that validation plays a more important role in most practical characterizations of XML parsers than is reflected in the treatment of validation in this draft. (Search for “XML parser” and read people's characterizations of their own or others' parsers and I think you'll find that in about 90% of all instances, one of the first things mentioned will be whether it's a validating parser or not.) If W3C is going to issue a spec which appears to be trying to provide a single set of processor categories that most people and applications can use, then it says here that I think it's an error for its categories not to address validation more directly.

The current draft appears to define four categories of XML processing, none of which require validation. In reality it defines eight: two profiles forbidding validation and two profiles each of which can be coupled with a rule forbidding, a rule allowing, or a rule requiring validation. It defines more than eight if one takes schema languages other than DTDs into consideration.
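The count of eight can be checked with a short enumeration. This is only a sketch of my own reading, in Python: "External Declarations" and "Full" are the profile names used elsewhere in these comments, while the two names for the profiles that forbid validation are placeholders of mine, not names taken from the draft.

```python
from itertools import product

# Placeholder names (an assumption for illustration): the two profiles
# that forbid validation are not named in these comments.
NO_VALIDATION_PROFILES = ["validation-free profile A", "validation-free profile B"]

# Named in these comments as the profiles for which validation is a choice.
VALIDATION_CAPABLE_PROFILES = ["External Declarations", "Full"]

# The three rules a referring spec can couple with those two profiles.
VALIDATION_RULES = ["forbidden", "allowed", "required"]


def effective_categories():
    """Return the (profile, validation rule) pairs counted in the paragraph above."""
    fixed = [(profile, "forbidden") for profile in NO_VALIDATION_PROFILES]
    variable = list(product(VALIDATION_CAPABLE_PROFILES, VALIDATION_RULES))
    return fixed + variable


if __name__ == "__main__":
    for profile, rule in effective_categories():
        print(f"{profile}: DTD validation {rule}")
    print(len(effective_categories()), "categories in all")
```

Adding a second schema language would add a further factor to the product for the last two profiles, which is why the count grows beyond eight.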

It's probably a good idea to make it feasible for other specs appealing to this one to specify that validation is required, or allowed, or forbidden, and otherwise to require conformance to one of the named profiles. But it's a design flaw in this spec that writers of other specs are required to look so hard for instructions on how to do so. I predict that half of all members of other working groups that read this spec with a view toward referring normatively to it will come away from a first reading with the belief that this spec defines four profiles, all of them for non-validating processors, and that these profiles therefore cannot be required if validation is required, or perhaps even if it's allowed. (“See? It says right here: 'process the document as required of conformant non-validating processors ...'!” “Oh, OK, I guess we can't require validation.”)

If you insist on keeping validation semi-orthogonal to the four defined profiles, you might wish to add a non-normative appendix on “How to refer normatively to this specification” with instructions which say, in their bluntest form: (1) pick one or more profiles that all processors of your language must conform to; (2) specify whether processors of your language are allowed to conform to higher-level profiles or not; (3) if you are using the External Declarations or Full profile, specify whether processors of your language are (a) required, (b) allowed but not required, or (c) forbidden to perform DTD validation, and whether they should reject or continue to process invalid input.

Otherwise, it's quite likely that readers from other working groups, who have other things on their plate than thinking about the subtleties of defining XML processor profiles, will forget that they can and should say something about whether validation is allowed, required, or forbidden. They will end up with incomplete characterizations of their requirements. It can even happen to people who ARE thinking about the subtleties of processor profiles: in section 6 of this spec, for example, you offer a sample formulation of a requirement to conform to a particular profile, as (I think) a model for other specifications to use. But the example says nothing about whether validation is required or allowed or forbidden.

All that said, however, I reiterate that the treatment of validation is better in this draft than in the one I reviewed earlier.
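To make the three-step recipe suggested above for the proposed appendix a little more concrete, here is a rough sketch in Python of the decisions a referring spec would need to record, together with a check for the ones it has left unstated. The class and field names are hypothetical, invented for this illustration; nothing like them appears in the draft.

```python
from dataclasses import dataclass
from typing import Optional

VALIDATION_CHOICES = {"required", "allowed", "forbidden"}
INVALID_INPUT_CHOICES = {"reject", "continue"}
# Profiles for which step (3) applies, as named in the comments above.
VALIDATION_CAPABLE = {"External Declarations", "Full"}


@dataclass
class ProfileReference:
    """Hypothetical record of how another spec refers to the profiles."""
    required_profiles: list                  # step (1): profiles all processors must meet
    higher_profiles_allowed: bool            # step (2): may processors go further?
    dtd_validation: Optional[str] = None     # step (3): required / allowed / forbidden
    on_invalid_input: Optional[str] = None   # step (3): reject / continue

    def missing_decisions(self):
        """List the decisions the referring spec has not yet stated."""
        missing = []
        if not self.required_profiles:
            missing.append("no profile selected")
        if VALIDATION_CAPABLE.intersection(self.required_profiles):
            if self.dtd_validation not in VALIDATION_CHOICES:
                missing.append("DTD validation: required, allowed, or forbidden?")
            if (self.dtd_validation in {"required", "allowed"}
                    and self.on_invalid_input not in INVALID_INPUT_CHOICES):
                missing.append("invalid input: reject or continue?")
        return missing


if __name__ == "__main__":
    # A reference that names a profile and nothing else, like the sample in section 6.
    print(ProfileReference(["Full"], False).missing_decisions())
```

A reference that names only the Full profile is reported as leaving the validation question open, which is the gap described above in the section 6 example.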

Editorial comments

1 2 1. For “there are exists not only optionality in the XML Recommendation itself” perhaps read “there are not only optional features in the XML Recommendation”.
1 3 1. For “The Infoset” substitute a reference to the bibliographic item, or perhaps read “The Infoset specification [XML Information Set]”. It is not the infoset, or an infoset, but the Infoset specification that gave the community a (or: yet another) vocabulary for discussing the information passed by a parser to an application.
1 3 1. For “the items produced by a parser” perhaps read “the information passed by an XML parser to an application”. (The vocabulary defined by the Infoset spec does not apply to parsers in general but only to XML parsers. And the term “item” has a technical meaning in the context of parsing that is not at all what you mean here. And outside the context of parsing it doesn't mean anything in particular.)
1.1 1 2. You say

XML applications are often created by building on top of the [XML Information Set] vocabulary or XML data models such as [XML Path Language (XPath) Version 1.0] or [XQuery 1.0 and XPath 2.0 Data Model (XDM)], understood as the output of an XML processor.

The metaphor of building on top of a foundation seems to be dragging this sentence down; if you insist on keeping it, I think you would do well to speak here of the foundations provided by the [XML Information Set] vocabulary and the XPath data models, instead of eliding the foundation and pouring concrete directly on top of the specs. This reader found himself trapped in wet concrete at this point.

But I think you might do better to replace the sentence quoted with something like

XML applications are often defined in terms of operations on instances of XML data models such as [XML Path Language (XPath) Version 1.0] or [XQuery 1.0 and XPath 2.0 Data Model (XDM)], or on information identified by terms in the [XML Information Set] vocabulary.
1.1 1 -1. For “if the input document includes uses of XInclude, for instance.” read “if the input document includes uses of XInclude, for instance, the XML processor may or may not perform the indicated inclusions”. Or finish the sentence in another way. But finish the sentence.
2 1 Substantive. It might be helpful to add the observation that

Every conforming XML processor distinguishes, by definition, between XML and non-XML input.
2 3 -2. After “For example, a data model may expose element content as an array of strings”, perhaps add “and not as an array of characters”? I had to think for a minute or two to understand what you were driving at here.
2 3 -2. In “For example, a data model may expose element content as an array of strings”, you seem to be taking the position that a “data model” can “expose” things. I think (perhaps I'm wrong) that “expose” is more normally used of APIs, and that data models would more naturally be said to “define” content as an array of strings than to “expose” it.
2.1-2.4. The phrase “information corresponding to information items and properties [in a particular class]” is needlessly redundant. And it says the same thing twice. Unless, of course, it's not saying the same thing twice at all, but something subtly different which needs explanation. Why are you spitting in your readers' faces this way?

If you are going to follow the Infoset spec's terminology, I think you might do worse than follow that spec's usage. The Infoset spec speaks of information sets being “made available” (by parsers, to downstream applications), and also speaks of an information set as consisting of some number of information items. That suggests (or so it seems to me) that when a parser provides information to the application, the usage of the Infoset spec is that the parser is providing information items to the application. Providing information items — not “information corresponding to information items”.

If you mean what the Infoset spec means, would you not do better to follow its usage? They invented the term information item, after all.

If you mean something different, then (a) what do you mean? and (b) why have you not defined the terms you are using in a special sense different from that given by the Infoset spec?
2.3 list item 1 Typo. For “specifed” read “specified”. Ditto in 2.4 list item 1.
In 2.2 I found myself wondering whether conformant non-validating XML processors are required to perform ID type assignment for IDs declared in the internal DTD subset. It might be a convenience for readers who don't have the relevant specs in their favorites list if you included a note pointing out that they do, or do not, have that obligation.
4.2.3 1 -1. For

... may replace some (XInclude) Element Information Items ... with some amount of different information, corresponding to Element, Attribute, Character, Comment, Namespace and Processing Instruction Information Items

read

... may replace some (XInclude) Element Information Items ... with some number of different Element, Attribute, Character, Comment, Namespace and Processing Instruction Information Items.

If (as your usage seems to want to suggest) information items are not simply pieces of information provided by the parser to the application, then in this case your current formulation is substantively wrong: XInclude processing operates on infosets and produces infosets. But an infoset is, by definition, a set of information items. If the processor replaces information items with some kind of information which is not itself an information item but only corresponds to an information item, then its output cannot be an information set.

If, on the other hand, information items are simply pieces of information provided by a parser to the application, then the current wording seems to be seeking verbosity as its own reward.
3. I'm glad to see the spec making use of the terms “implementation-defined” and “implementation-dependent”, but it's a discourtesy to your reader not to define them in this specification.

You also don't actually use the term “implementation-dependent”, so you don't need to refer to it. If you do want to refer to it, and you want your usage to align with that of QT, you will want to replace “-determined” with “-dependent”.
2 (Important). It would be a lot easier to see the differences in item 1 of each profile if they didn't begin with the same eighty characters. How many people scanning these lists for a first orientation will even notice that beginning in character 83 or so, the first items of the lists diverge from each other?

Since the obligation to “Process the document as required of conformant non-validating XML processors” does not distinguish the profiles from each other, it could be dropped, or moved outside the lists of distinctive features, without loss. If you do insist on including it in each list, repetitive and uninformative though it is, then at least break the item in half. For example:
1. Process the document as required of conformant non-validating XML processors;
2. Refrain, in so doing, from reading any external markup declarations;
3. Maintain the base URI of each element in conformance with [XML Base];
  
  ...
Passim. You still have a stray capital E or two in uses of the word “element” that do not begin a sentence. If the phrase “element information item” were always capitalized consistently, this would be understandable, if unnecessarily ugly. But it's not capitalized consistently, so that excuse is not available to you.

Status of my earlier comments

I've reviewed my earlier comments and summarize here what I believe to be the state of play with respect to each of them.

1. Choice of facets for characterizing processors

Significantly improved. The introduction does a reasonable job of explaining the rationale for the features selected.

In 1.1 3, I think the explicit statement that the profiles don't address the preservation of invariants during modification or incremental construction is helpful. I think it would also be helpful to add in this section an explicit statement that the profiles don't address the choice of API, memory model, or the distinction between tree- and event-based interfaces.
2. Respect for the stand-alone declaration

Not addressed that I could see.
3. Validating processors

Partially addressed. My residual discomfort is discussed above.
4. Definitions of terms

Much improved. I notice that you have addressed this in large part by deleting all usage of several of the terms I suggested should be defined, rather than take the trouble to define them. But you seem to have replaced them with words that I found clearer; the only terms I'd still like to see defined here are “profile” and “processing” (specifically of declarations).
5. Are the profiles disjoint?

Resolved. Thank you.
6. Identification of xml:id attributes as IDs

Resolved. Thank you.
7. Processing of external declarations

Not resolved that I can see.
8. Providing information items

Resolved. Thank you.
9. Data models and information sets

Resolved. Thank you.
10. Rigidity

Resolved. Thank you.
11. Relation of profiles to current practice

Not addressed as far as I can see.
12. Implementability of the spec

Resolved. Thank you.
13. Conformance clause

Resolved. Thank you.
14. Documentation of implementation-defined features

Resolved. Thank you.
15. The information expressed in XML documents

Not resolved. The current wording does its best to suggest that no information not given a name by the Infoset spec can be provided to applications. Suggested rewording:

For the profile definitions above and the invariants below, we define a number of (overlapping) classes which categorize the information items and their properties defined in [XML Information Set].
16. The information classes

Resolved. Thank you.
17. Recursive XInclude processing

Resolved. Thank you.
18. Minor editorial points, typos, etc.

Most resolved, some not. For your convenience, I repeat here the ones for which I don't see a resolution and which have not already been reiterated above.
- In section 1, horizontal ellipses are used with whitespace between the full stops but without whitespace before or after the ellipsis.
  
  For “a software module. . .used”, read “a software module ... used” or optionally “a software module … used” (the latter using the standard hellip entity for character U+2026).
- In 1.1, the paragraph about base URI says the term is used “as it is defined in [RFC 3986]”. But RFC 3986 does not provide any definition properly so called for the term base URI. It specifies rules for establishing and using a base URI, but it does not “define” it.
  
  I think what is meant is that XPP assumes that the base URI is established and used as specified in RFC 3986. So perhaps read
  
  A base URI is an absolute URI against which relative URIs are applied; this specification assumes that base URIs are established and used as specified in [RFC 3986].
  
  But you should probably also decide whether XPP assumes it or requires it.

Once again, thank you for your work. Good luck with the document.

C. M. Sperberg-McQueen, Black Mesa Technologies LLC

5 January 2012
