Comments on XML Processor Profiles draft of 24 January 2013

This document reviews the current status of comments made by this reviewer on earlier drafts (12 April 2011 and 16 December 2011) of the XML processor profiles specification; it takes the 24 January 2013 editor's draft as its reference point. Specifically, for each item in the comments of 5 January 2012 (email), and for those items in the comments of 15 April 2011 (email) which had not yet been addressed in January 2012, this document records whether the draft of 24 January 2013 appears to have addressed the issue raised and, if so, whether in the view of the reviewer it has resolved it.

A few remarks on the new draft are also included, but this document does not reflect a complete review of the current draft.

1 New issue

The following issue appears to be independent of those raised in January 2012 and April 2011.

Sec 3, under "Unparsed Entity Reference Information Item", the draft says in a note: This type of information item will not occur at all if standalone="yes". There are two problems. First, this type of information item will not occur if standalone="yes" is correctly supplied; it may well occur in input in which standalone="yes" is erroneously specified. Second, the note is missing a final full stop.

2 Issues resolved

The following items raised in January 2012 appear to have been resolved in the draft of January 2013.

All the comments labeled editorial, except as noted below. Thank you.
Suggestion to add a remark to section 2.1.

The review of January 2012 suggested adding the following text to section 2.1:

Every conforming XML processor distinguishes, by definition, between XML and non-XML input.

This text has not been added, but it no longer seems necessary or particularly helpful to the reviewer, so this issue should count as resolved.

The following item raised in April 2011 appears to have been resolved in the draft of January 2013.

Issue regarding the alleged definition of base URI by RFC 3986. I think the new definition of this term works correctly.

3 Issues partially resolved

The following items raised in January 2012 and/or April 2011 appear not to have been resolved successfully in the draft of January 2013.

3.1 Validation

In both earlier reviews, this reviewer expressed a desire for a cleaner and more explicit treatment of validation. The review of January 2012 said (in part):

I continue to think, however, that validation plays a more important role in most practical characterizations of XML parsers than is reflected in the treatment of validation in this draft. (Search for “XML parser” and read people's characterizations of their own or others' parsers and I think you'll find that in about 90% of all instances, one of the first things mentioned will be whether it's a validating parser or not.) If W3C is going to issue a spec which appears to be trying to provide a single set of processor categories that most people and applications can use, then it says here that I think it's an error for its categories not to address validation more directly.

The current draft appears to define four categories of XML processing, none of which require validation. In reality it defines eight: two profiles forbidding validation and two profiles each of which can be coupled with a rule forbidding, a rule allowing, or a rule requiring validation. It defines more than eight if one takes schema languages other than DTDs into consideration.

The new non-normative section 7 is an improvement, in that it addresses the related issues more clearly and makes it less likely that other WGs seeking to refer to this specification for a normative statement of requirements for their XML input will botch the job.

It continues to trouble me, however, that a spec whose avowed aim is to allow other specifications to "establish precisely what input processing they require" (my emphasis) goes out of its way to make it onerous to specify validation. The only two profiles which can be referred to simply, without an additional clause specifying whether validation is forbidden, required, or allowed, are the two profiles which forbid validation. This is bad social engineering, unless your goal is to discourage users of XPP from using it to require validation of any kind. If the goal is to be neutral with respect to validation, the current draft fails to achieve it, by a wide margin.
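The count of eight can be checked with a few lines of arithmetic. What follows is only a sketch: the labels profile-A through profile-D are placeholders invented here, not the draft's profile names, and the function name is likewise hypothetical. The sketch assumes, as described above, two profiles that forbid validation outright and two that must be paired with an explicit validation rule.

```python
from itertools import product

# Placeholder labels (assumptions for illustration, not the draft's names).
NO_VALIDATION_PROFILES = ["profile-A", "profile-B"]   # validation forbidden by definition
OPEN_PROFILES = ["profile-C", "profile-D"]            # need an additional validation clause
VALIDATION_RULES = ["forbidden", "allowed", "required"]


def effective_categories():
    """List every (profile, validation rule) pair a referring spec could name."""
    # The first two profiles admit only one validation rule each.
    categories = [(profile, "forbidden") for profile in NO_VALIDATION_PROFILES]
    # The other two can each be coupled with any of the three rules.
    categories += list(product(OPEN_PROFILES, VALIDATION_RULES))
    return categories


if __name__ == "__main__":
    for profile, rule in effective_categories():
        print(f"{profile}: validation {rule}")
    print(len(effective_categories()), "categories in all")
```

Only the first two entries in the resulting list can be cited by profile name alone; the remaining six each need the extra clause.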

The new section 7 also makes clear that this specification, in its current form, does not succeed in addressing the requirement which drove the creation of the working group, even in an attenuated form. Those who attended the W3C's XML Processing Model workshop in 2001 will recall that some prominent attendees hoped for a specification which would nail down, once and for all, the sequence in which processes like validation, XInclude processing, etc., should be performed. The minutes of the closing session can usefully be consulted by those who have forgotten.

The working group has quite rightly declined to draft any spec of the kind initially envisaged, and I don't want to reopen that discussion. But the very least that this specification should do towards satisfying that initial requirement is provide a short name for one or more answers to the original question. This the current draft fails to do.

It is the failure of the WG to integrate validation adequately into the classification scheme — or perhaps it would be more accurate to say the WG's determined efforts to exclude validation from the processor profiles — which has rendered the spec unable even to put a name to an answer to the question raised at the workshop.

This issue must therefore still be regarded as not satisfactorily resolved.

3.3 Miscellaneous small points

Several small points have been addressed in part in the current draft, though they have not been resolved to this reviewer's complete satisfaction.

Thank you for suppressing the space between the full stops in your ellipses; it improves legibility somewhat.

I continue to think that most manuals of style prescribe white space before an ellipsis, and you have not made me think differently. Chicago (13), for example, says “ellipsis points … are usually separated from each other and from the text and any contiguous punctuation by 3-to-em spaces” — I think a blank character comes closer to this, in an average Web browser, than no space at all. And use of the public entity hellip (U+2026) will produce better spacing between the dots than three literal full stops.
Thank you for including definitions of the terms implementation-defined and implementation-dependent; these take the awkward form "The term implementation-defined indicates an aspect that may differ between implementations ...". Perhaps better "an aspect of processor behavior"?
In January 2012 this reviewer wrote:

In 2.2 I found myself wondering whether conformant non-validating XML processors are required to perform ID type assignment for IDs declared in the internal DTD subset. It might be a convenience for readers who don't have the relevant specs in their favorites list if you included a note pointing out that they do, or do not, have that obligation.

The current draft adds a note which helps address this question, but it's ill-drafted: it reads:

This profile, like the 2.1 The basic XML processor profile, reads only declarations in the internal subset, this means that types, such as ID, that appear in declarations in the internal subset will be processed while such declarations in the external subset will not.

First, oughtn't it to be processors, not profiles, which read things? Second, "types, such as ID" and "declarations" are not really parallel. I think what is meant is something like: "Processors conforming to this profile, like those conforming to 2.1 The basic XML processor profile, read only declarations in the internal subset of the DTD, not those in the external subset. In consequence, declarations specifying that attributes have type ID will be processed if they appear in the internal subset, but such declarations will not be processed if they appear in the external subset."
Thank you for making your capitalization of the terms element information item (etc.) consistent.

I do wish you had done so by moving in the direction of normal English usage. As an innovation, returning to seventeenth-century capitalization rules lacks charm.

Leaving normal English usage aside, I do not understand why you choose to deviate from the usage of the Infoset spec, which lowercases these terms except in section titles and other passages using title case.
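To make the point about ID type assignment from the internal subset concrete, here is a minimal sketch using Python's standard xml.parsers.expat module, a non-validating parser. It is an illustration only, not a statement of what the draft requires: the function name collect_ids and the sample document are assumptions introduced here. The sketch records which attributes the internal subset declares as type ID and then reports the ID values found on matching elements.

```python
from xml.parsers import expat

# Illustrative document: the ATTLIST declaration sits in the internal subset.
SAMPLE = (
    "<!DOCTYPE list [\n"
    "  <!ATTLIST item key ID #REQUIRED>\n"
    "]>\n"
    '<list><item key="a1"/><item key="b2"/><other key="c3"/></list>'
)


def collect_ids(xml_text):
    """Return a dict mapping each ID value to the name of the element bearing it."""
    id_attributes = set()   # (element name, attribute name) pairs declared as ID
    ids = {}

    def on_attlist(elname, attname, attr_type, default, required):
        # Called by expat for each attribute declaration it reads.
        if attr_type == "ID":
            id_attributes.add((elname, attname))

    def on_start(name, attrs):
        # attrs is a dict of attribute name -> value for this start-tag.
        for attname, value in attrs.items():
            if (name, attname) in id_attributes:
                ids[value] = name

    parser = expat.ParserCreate()
    parser.AttlistDeclHandler = on_attlist
    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)
    return ids


if __name__ == "__main__":
    print(collect_ids(SAMPLE))
```

In this sketch the key attribute on the other element is not reported, since no declaration gives it type ID. Had the same declaration appeared only in an external subset that the processor never reads, nothing at all would be reported, which is the consequence the suggested rewording tries to spell out.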

4 Issues not addressed

The following items raised in April 2011 and/or January 2012 appear not to have been addressed at all in the draft of January 2013.

4.1 Correctness of standalone processing

In April 2011, this reviewer wrote:

It would be helpful, I think, for the processor profiles to distinguish more carefully the different behaviors possible with regard to the stand-alone declaration in the input XML document.

All declarations are read and handled appropriately, so documents with standalone='no' are processed without information loss.

No external declarations are read if standalone='yes'; if standalone='no' then external declarations are read, so all documents are processed without information loss.

No external declarations are read; if standalone='yes', the document is processed without information loss, and if standalone='no', the processor signals an inability to process the document without the possibility of information loss.

No external declarations are read, so documents with standalone='yes' are processed without information loss, and information will typically be lost in the processing of documents with standalone='no'. (Since documents may have standalone='no' even if standalone='yes' would be permitted, there can be cases where no information is lost in practice.)

In particular, it would be helpful for users of XML and for writers of specifications for XML-based processing to distinguish the last case from the others, in order to exclude it.

Suggested fix: augment the basic profile to require either that external declarations be read when necessary or that the processor signal an inability to handle non-standalone documents properly. Optionally also keep the profile now called basic, giving it a new name (personally, I could go for “sub-optimal”, but some people might think that that name was ungenerous).

In January 2012 I wrote on the same topic:

I note with some disappointment that the current draft fails, like the earlier one, to take into account the property of respecting, or failing to respect, the implications of the standalone property; I think that's a lost opportunity.

I repeat these comments in full because I see no traces of any effort to resolve this issue: no changes in the specification, no discussion of the issue with the reviewer.
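For readers who want the four behaviors side by side, the following sketch models them as a small decision table. It is a hypothetical model written for this review, not anything defined by the draft or by the XML spec: the names Behavior, Outcome, and outcome are invented here, and the model assumes the standalone declaration in the input is correct (an erroneous standalone="yes" is a separate problem, noted in section 1). A missing declaration is treated as standalone='no'.

```python
from enum import Enum
from typing import NamedTuple


class Behavior(Enum):
    READ_ALL = 1          # all declarations are read
    READ_WHEN_NEEDED = 2  # external declarations read unless standalone='yes'
    REFUSE = 3            # never read; signal inability when standalone='no'
    IGNORE = 4            # never read; carry on regardless


class Outcome(NamedTuple):
    reads_external: bool        # does the processor read external declarations?
    signals_problem: bool       # does it tell the user it cannot do the job?
    silent_loss_possible: bool  # can information be lost without warning?


def outcome(behavior: Behavior, standalone: str) -> Outcome:
    """Describe what happens for a given behavior and standalone value."""
    is_standalone = standalone == "yes"   # anything else counts as 'no'
    if behavior is Behavior.READ_ALL:
        return Outcome(True, False, False)
    if behavior is Behavior.READ_WHEN_NEEDED:
        return Outcome(not is_standalone, False, False)
    if behavior is Behavior.REFUSE:
        return Outcome(False, not is_standalone, False)
    # Behavior.IGNORE: the only case in which loss can go unreported.
    return Outcome(False, False, not is_standalone)


if __name__ == "__main__":
    for behavior in Behavior:
        for value in ("yes", "no"):
            print(behavior.name, value, outcome(behavior, value))
```

In this model only the last behavior, combined with standalone='no', allows silent information loss; that is the case the April 2011 comment asked the profiles to make distinguishable so that it can be excluded.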

4.2 Hamming distance of item 1 in the profiles

The comment made in January 2012 about item 1 of the four profiles still stands.

(Important). It would be a lot easier to see the differences in item 1 of each profile if they didn't begin with the same eighty characters. How many people scanning these lists for a first orientation will even notice that beginning in character 83 or so, the first items of the lists diverge from each other?

Since the obligation to “Process the document as required of conformant non-validating XML processors” does not distinguish the profiles from each other, it could be dropped, or moved outside the lists of distinctive features, without loss. If you do insist on including it in each list, repetitive and uninformative though it is, then at least break the item in half. For example:
1. Process the document as required of conformant non-validating XML processors;
2. Refrain, in so doing, from reading any external markup declarations;
3. Maintain the base URI of each element in conformance with [XML Base];
  
  ...

If there is a goal whose achievement is aided by making the profiles harder to scan, I do not know what it is. If there is a problem with making it easier to see where the profiles are the same and where they differ, I do not know what it is. The WG's consistent policy of avoiding discussion with reviewers has not made it easier to understand the WG's position and has not managed to persuade me that the current wording is better than the alternative suggested in 2012.

4.3 Information corresponding to information items

The comments made in April 2011 and January 2012 still apply.

April 2011:

In 2, the clauses about faithful provision of the information in the document all take the form “Faithful provision of the information ... corresponding to information items and properties ...”.

Perhaps it would suffice to provide, or expose, the information items and properties specified.

If it is absolutely necessary to provide not the information items and properties themselves but instead information corresponding to (but, implicitly, not identical to?) the specified items and properties, then I think the spec has an obligation to explain clearly what the difference is, and why exposing the items and properties does not satisfy the requirements of the spec. In particular, you need to provide an answer to the reader who is asking “How can a piece of information correspond to an information item without being indistinguishable from it (qua information) and thus without being that information item?”

The editors might do well to review their dusty copies of Strunk and White's Elements of style, especially the maxim “Omit needless words”, and to revise accordingly. If they do, the individuals corresponding to their readers will feel an emotion corresponding to gratitude. (Or, at least, a diminished desire to seek out sharp objects and perform dangerous acts with them.)

January 2012:

2.1-2.4. The phrase “information corresponding to information items and properties [in a particular class]” is needlessly redundant. And it says the same thing twice. Unless, of course, it's not saying the same thing twice at all, but something subtly different which needs explanation. Why are you spitting in your readers' faces this way?

If you are going to follow the Infoset spec's terminology, I think you might do worse than follow that spec's usage. The Infoset spec speaks of information sets being “made available” (by parsers, to downstream applications), and also speaks of an information set as consisting of some number of information items. That suggests (or so it seems to me) that when a parser provides information to the application, the usage of the Infoset spec is that the parser is providing information items to the application. Providing information items — not “information corresponding to information items”.

If you mean what the Infoset spec means, would you not do better to follow its usage? They invented the term information item, after all.

If you mean something different, then (a) what do you mean? and (b) why have you not defined the terms you are using in a special sense different from that given by the Infoset spec?

In most other parts of the specification, the English relating to processors, information, and information items has become clearer and more natural. The locution "information corresponding to information items" persists only here. As far as this reviewer can tell, all the arguments brought forward in the earlier reviews continue to apply: the Infoset spec continues to say what it said when first published, and the unnecessary and pointless obfuscation in this phrase continues to make me want to poke my eyes out with a sharp stick. Is there any argument in favor of the current wording? Or is the WG still under the influence of those who prefer to speak as if the Infoset spec defined an API (and thus a particular format for information) instead of defining named packets of information, independent of format?

4.4 Definitions

The spec appears still to lack definitions for key terms, including most prominently profile and processing (specifically processing of declarations).

This means that the issues originally raised as issue 4 and issue 7 in the review of April 2011 remain unresolved. If the WG's belief is that the reference to the XML spec suffices as a gloss of the words "reading and processing all external markup declarations", then I regret to inform you that this reader does not find any useful distinction between reading and processing in that document. What in the world do you think these words mean?

4.5 Relation of profiles to current practice

The comment originally made in April 2011 continues to apply.

When profiles are defined for a new specification, they involve predictions about which kinds of variation in processor behavior are likely to be interesting and useful to developers and users. In the case of new specifications, there is no existing practice that could be appealed to as a justification of the classification or profiles, or to provide examples of software fitting one profile or another.

That is not the case here, and I think the specification should not progress until an empirical survey of existing processor characteristics is performed, as a simple way of field-testing the profiles defined here for applicability in the real world and of clarifying the intent of the profiles by providing examples, where applicable, of existing interfaces or processors that satisfy the profile.

In particular, I could have sworn (but I am too lazy to look it up now) that I had used some parser interfaces which did not provide access to namespace prefixes, and other interfaces which provided only inconvenient access to namespace names. Is a set of profiles which assumes that namespace name, local name, and prefix are always all three provided a good match for a world in which some parser implementors give their users a choice (prefix plus local name or namespace name plus local name)?

Note that actually classifying real parsers will require a crisp definition of what it means to make a particular information item available; that will be a good thing, although it is likely to involve some work.

Suggested fix: Identify ten or twenty existing XML processors with different behaviors (for purposes of this exercise, all conforming SAX processors may well turn out to be alike; ditto for conforming DOM parsers). Using the definitions given in XPP, identify which profile(s) each parser matches, if any. If there are significant numbers of parsers which match no profile, consider whether the profiles need to be revised to provide a better connection with existing practice. Use a non-normative document to provide examples of processors matching the different profiles.

I continue to believe that this specification should not advance until its use in classifying actual processors is tested as described above. And since revisions of the spec since April 2011 have identified an explicit goal of supporting other specifications, the field-testing described above should be supplemented by a similar field-test, showing that current specifications based on XML (W3C or other) could have used the classification offered in this specification to simplify their wording.

I think establishing the applicability of this specification's classification to existing processors and existing specs should be a pre-requisite for advancement beyond Candidate Recommendation. Please consider this a formal objection to any plan to do otherwise (to the extent that a member of the public has standing to raise formal objections).
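As one small data point for the kind of survey proposed above, consider the ElementTree interface in Python's standard library. It identifies each element by namespace name plus local name, written "{namespace-name}local-name", and does not keep the prefix on the element object. The sketch below is illustrative only: the helper split_expanded_name and the sample document are assumptions introduced here, and a single interface is no substitute for the survey itself.

```python
import xml.etree.ElementTree as ET

# Illustrative document using the prefix "p" for a made-up namespace name.
SAMPLE = '<p:note xmlns:p="urn:example:ns"><p:body/></p:note>'


def split_expanded_name(tag):
    """Split ElementTree's '{namespace-name}local-name' form into its two parts."""
    if tag.startswith("{"):
        namespace_name, local_name = tag[1:].split("}", 1)
        return namespace_name, local_name
    return None, tag   # element in no namespace


def element_names(xml_text):
    """Return (namespace name, local name) for each element in document order."""
    root = ET.fromstring(xml_text)
    return [split_expanded_name(element.tag) for element in root.iter()]


if __name__ == "__main__":
    for namespace_name, local_name in element_names(SAMPLE):
        print(namespace_name, local_name)
```

An interface of this kind supplies two of the three pieces of information (namespace name and local name) but not the third (prefix), which is the sort of mismatch the question about namespace information was meant to probe.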

4.6 The information expressed in XML documents

Issue 15 in the review of April 2011 remains unresolved and unaddressed.

Section 3 begins:
For the profile definitions above and the invariants below, we categorize the information expressed in XML documents into a number of (overlapping) classes.

This is incorrect. What is characterized in section 3 is not the information expressed in an XML document, but the particular subset of that information for which the Infoset spec defines names. The two are the same neither in theory nor in practice.

Suggested fix: Replace the sentence quoted with one that's not false. Perhaps “For the profile definitions above and the invariants below, we categorize the information identified and named in [XML Information Set] into a number of (overlapping) classes.”

The comments of January 2012 also continue to apply.

Not resolved. The current wording does its best to suggest that no information not given a name by the Infoset spec can be provided to applications. Suggested rewording:

For the profile definitions above and the invariants below, we define a number of (overlapping) classes which categorize the information items and their properties defined in [XML Information Set].

Comments on 24 January 2013 draft of XML processor profiles

C. M. Sperberg-McQueen, Black Mesa Technologies LLC

20 August 2013