comments on XPP editors' draft of 16 December 2011

[An HTML version of the comments below is attached, for
greater legibility; a non-HTML version is given below for
those who find it more convenient.]
Comments on 16 December 2011 draft of
XML processor profiles

C. M. Sperberg-McQueen, Black Mesa Technologies LLC

5 January 2012

This document contains comments on the 16 December 2011 editor's draft
of XML processor profiles.

My compliments to the Working Group on the preparation of this
draft. On the whole I found this draft clearer than the earlier draft
I reviewed in April (email and HTML versions of my comments are in the
mail archive); either my standards have decayed or it's more clearly
written than the earlier draft.

Some comments do arise, which follow. First there are some issues
which struck me as I read this draft; these are grouped into
substantive and stylistic issues. There follows a short review of the
status of my earlier comments, as far as I can tell. When references
to locations in the text take the form n m or n m l, then n is a
section number, m a paragraph number, and l a sentence
number. Negative numbers are counted from the end of the container.
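
For readers unused to this convention, here is a minimal sketch in
Python of how such a location resolves. The function name and the
sample sentences are invented for illustration; nothing here comes from
the draft itself.

```python
def pick(items, n):
    """Return the n-th item: counted from the start if n > 0, from the end if n < 0."""
    if n == 0:
        raise ValueError("locations are numbered from 1 or from -1, never 0")
    # Positive locations are 1-based; negative ones already match Python's indexing.
    return items[n - 1] if n > 0 else items[n]


# Invented sample paragraph, split into sentences.
sentences = ["First sentence.", "Second sentence.", "Third sentence."]

first = pick(sentences, 1)         # sentence 1 of the paragraph
last = pick(sentences, -1)         # last sentence of the paragraph
next_to_last = pick(sentences, -2) # second sentence from the end
```

So a reference such as 2 3 -2 points at the next-to-last sentence of
the third paragraph of section 2.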


Substantive comments

1 I note with some disappointment that the current draft fails, like
the earlier one, to take into account the property of respecting, or
failing to respect, the implications of the standalone property; I
think that's a lost opportunity.

2 I think the treatment of validation is much improved in this draft.

I continue to think, however, that validation plays a more important
role in most practical characterizations of XML parsers than is
reflected in the treatment of validation in this draft. (Search for
“XML parser” and read people's characterizations of their own or
others' parsers and I think you'll find that in about 90% of all
instances, one of the first things mentioned will be whether it's a
validating parser or not.) If W3C is going to issue a spec which
appears to be trying to provide a single set of processor categories
that most people and applications can use, then it says here that I
think it's an error for its categories not to address validation more
directly.

The current draft appears to define four categories of XML processing,
none of which require validation. In reality it defines eight: two
profiles forbidding validation and two profiles each of which can be
coupled with a rule forbidding, a rule allowing, or a rule requiring
validation. It defines more than eight if one takes schema languages
other than DTDs into consideration.
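
The arithmetic can be sketched as follows, in Python. This assumes that
the two profiles which can be coupled with a validation rule are
External Declarations and Full; the names of the other two profiles are
placeholders of mine, not names taken from the draft.

```python
from itertools import product

# Placeholder names (assumption): the two profiles that forbid validation.
NO_VALIDATION_PROFILES = ["non-validating profile A", "non-validating profile B"]
# Assumption: these are the two profiles that can be coupled with a validation rule.
DTD_READING_PROFILES = ["External Declarations", "Full"]
VALIDATION_RULES = ["forbidden", "allowed", "required"]

# Two profiles with validation always forbidden ...
categories = [(profile, "forbidden") for profile in NO_VALIDATION_PROFILES]
# ... plus two profiles times three validation rules.
categories += list(product(DTD_READING_PROFILES, VALIDATION_RULES))

for profile, rule in categories:
    print(f"{profile}: validation {rule}")
```

That is 2 + (2 x 3) = 8 distinct categories of processing, before any
schema language other than DTDs is considered.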

It's probably a good idea to make it feasible for other specs
appealing to this one to specify that validation is required, or
allowed, or forbidden, and otherwise to require conformance to one of
the named profiles. But it's a design flaw in this spec that writers
of other specs are required to look so hard for instructions on how to
do so. I predict that half of all members of other working groups
that read this spec with a view toward referring normatively to it
will come away from a first reading with the belief that this spec
defines four profiles, all of them for non-validating processors, and
so these profiles cannot be required if validation is required, or
perhaps even if it's allowed. (“See? It says right here: 'process the
document as required of conformant non-validating processors ...'!”
“Oh, OK, I guess we can't require validation.”)

If you insist on keeping validation semi-orthogonal to the four
defined profiles, you might wish to add a non-normative appendix on
“How to refer normatively to this specification” with instructions
which say, in their bluntest form: (1) pick one or more profiles that
all processors of your language must conform to; (2) specify whether
processors of your language are allowed to conform to higher-level
profiles or not; (3) if you are using the External Declarations or
Full profile, specify whether processors of your language are (a)
required, (b) allowed but not required, or (c) forbidden to perform
DTD validation, and whether they should reject or continue to process
invalid input.
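
As a rough illustration of the checklist, the three steps could be
modelled as below (Python). The class, field, and function names are
hypothetical, invented for this sketch, and the check that skips the
invalid-input question when validation is forbidden is my assumption,
not something the draft says.

```python
from dataclasses import dataclass
from typing import List, Optional

# Profiles for which step (3) applies.
DTD_READING_PROFILES = {"External Declarations", "Full"}


@dataclass
class ProfileReference:
    """What a referring spec says about processor profiles (hypothetical model)."""
    required_profiles: List[str]                    # step (1)
    higher_profiles_allowed: Optional[bool] = None  # step (2)
    dtd_validation: Optional[str] = None            # step (3): required / allowed / forbidden
    on_invalid_input: Optional[str] = None          # step (3): reject / continue


def missing_statements(ref: ProfileReference) -> List[str]:
    """List the decisions the referring spec has left unstated."""
    missing = []
    if not ref.required_profiles:
        missing.append("required profile(s)")
    if ref.higher_profiles_allowed is None:
        missing.append("whether higher-level profiles are allowed")
    if DTD_READING_PROFILES & set(ref.required_profiles):
        if ref.dtd_validation is None:
            missing.append("whether DTD validation is required, allowed, or forbidden")
        # Assumption: the question is moot when validation is forbidden.
        if ref.dtd_validation != "forbidden" and ref.on_invalid_input is None:
            missing.append("whether invalid input is rejected or processed")
    return missing


# A reference that names a profile and says nothing else.
incomplete = ProfileReference(required_profiles=["Full"])
print(missing_statements(incomplete))
```

A reference that names only a profile leaves three questions open,
which is the kind of incomplete characterization I am worried about.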

Otherwise, it's quite likely that readers from other working groups,
who have other things on their plate than thinking about the
subtleties of defining XML processor profiles, will forget that they
can and should say something about whether validation is allowed,
required, or forbidden. They will end up with incomplete
characterizations of their requirements. It can even happen to people
who ARE thinking about the subtleties of processor profiles: in
section 6 of this spec, for example, you offer a sample formulation of
a requirement to conform to a particular profile, as (I think) a model
for other specifications to use. But the example says nothing about
whether validation is required or allowed or forbidden.

All that said, however, I reiterate that the treatment of validation
is better in this draft than in the one I reviewed earlier.


Editorial comments

1 2 1. For “there are exists not only optionality in the XML
Recommendation itself” perhaps read “there are not only optional
features in the XML Recommendation”.

1 3 1. For “The Infoset” substitute a reference to the bibliographic
item, or perhaps read “The Infoset specification [XML Information
Set]”. It is not the infoset, or an infoset, but the Infoset
specification that gave the community a (or: yet another) vocabulary
for discussing the information passed by a parser to an application.

1 3 1. For “the items produced by a parser” perhaps read “the
information passed by an XML parser to an application”. (The
vocabulary defined by the Infoset spec does not apply to parsers in
general but only to XML parsers. And the term “item” has a technical
meaning in the context of parsing that is not at all what you mean
here. And outside the context of parsing it doesn't mean anything in
particular.)

1.1 1 2. You say

    XML applications are often created by building on top of the [XML
    Information Set] vocabulary or XML data models such as [XML Path
    Language (XPath) Version 1.0] or [XQuery 1.0 and XPath 2.0 Data
    Model (XDM)], understood as the output of an XML processor.

The metaphor of building on top of a foundation seems to be dragging
this sentence down; if you insist on keeping it, I think you would do
well to speak here of the foundations provided by the [XML Information
Set] vocabulary and the XPath data models, instead of eliding the
foundation and pouring concrete directly on top of the specs. This
reader found himself trapped in wet concrete at this point.

But I think you might do better to replace the sentence quoted with
something like

    XML applications are often defined in terms of operations on
    instances of XML data models such as [XML Path Language (XPath)
    Version 1.0] or [XQuery 1.0 and XPath 2.0 Data Model (XDM)], or on
    information identified by terms in the [XML Information Set]
    vocabulary.

1.1 1 -1. For “if the input document includes uses of XInclude, for
instance.” read “if the input document includes uses of XInclude, for
instance, the XML processor may or may not perform the indicated
inclusions”. Or finish the sentence in another way. But finish the
sentence.

2 1 Substantive. It might be helpful to add the observation that

    Every conforming XML processor distinguishes, by definition,
    between XML and non-XML input.

2 3 -2. After “For example, a data model may expose element content as
an array of strings”, perhaps add “and not as an array of characters”?
I had to think for a minute or two to understand what you were driving
at here.

2 3 -2. In “For example, a data model may expose element content as an
array of strings”, you seem to be taking the position that a “data
model” can “expose” things. I think (perhaps I'm wrong) that “expose”
is more normally used of APIs, and that data models would more
naturally be said to “define” content as an array of strings than to
“expose” it.

2.1-2.4. The phrase “information corresponding to information items
and properties [in a particular class]” is needlessly redundant. And
it says the same thing twice. Unless, of course, it's not saying the
same thing twice at all, but something subtly different which needs
explanation. Why are you spitting in your readers' faces this way?

If you are going to follow the Infoset spec's terminology, I think you
might do worse than follow that spec's usage. The Infoset spec speaks
of information sets being “made available” (by parsers, to downstream
applications), and also speaks of an information set as consisting of
some number of information items. That suggests (or so it seems to me)
that when a parser provides information to the application, the usage
of the Infoset spec is that the parser is providing information items
to the application. Providing information items — not “information
corresponding to information items”.

If you mean what the Infoset spec means, would you not do better to
follow its usage? They invented the term information item, after all.

If you mean something different, then (a) what do you mean? and (b)
why have you not defined the terms you are using in a special sense
different from that given by the Infoset spec?

2.3 list item 1 Typo. For “specifed” read “specified”. Ditto in 2.4
list item 1.

In 2.2 I found myself wondering whether conformant non-validating XML
processors are required to perform ID type assignment for IDs declared
in the internal DTD subset. It might be a convenience for readers who
don't have the relevant specs in their favorites list if you included
a note pointing out that they do, or do not, have that obligation.

4.2.3 1 -1. For

    ... may replace some (XInclude) Element Information Items ... with
    some amount of different information, corresponding to Element,
    Attribute, Character, Comment, Namespace and Processing
    Instruction Information Items

read

    ... may replace some (XInclude) Element Information Items ... with
    some number of different Element, Attribute, Character, Comment,
    Namespace and Processing Instruction Information Items.

If (as your usage seems to want to suggest) information items are not
simply pieces of information provided by the parser to the
application, then in this case your current formulation is
substantively wrong: XInclude processing operates on infosets and
produces infosets. But an infoset is, by definition, a set of
information items. If the processor replaces information items with
some kind of information which is not itself an information item but
only corresponds to an information item, then its output cannot be an
information set.

If, on the other hand, information items are simply pieces of
information provided by a parser to the application, then the current
wording seems to be seeking verbosity as its own reward.

3. I'm glad to see the spec making use of the terms
“implementation-defined” and “implementation-dependent”, but it's a
discourtesy to your reader not to define them in this specification.

You also don't actually use the term “implementation-dependent”, so you
don't need to refer to it. If you do want to refer to it, and you want
your usage to align with that of QT, you will want to replace
“-determined” with “-dependent”.

2 (Important). It would be a lot easier to see the differences in item
1 of each profile if they didn't begin with the same eighty
characters. How many people scanning these lists for a first
orientation will even notice that beginning in character 83 or so, the
first items of the lists diverge from each other?

Since the obligation to “Process the document as required of
conformant non-validating XML processors” does not distinguish the
profiles from each other, it could be dropped, or moved outside the
lists of distinctive features, without loss. If you do insist on
including it in each list, repetitive and uninformative though it is,
then at least break the item in half. For example:

  1 Process the document as required of conformant non-validating XML
    processors;

  2 Refrain, in so doing, from reading any external markup
    declarations;

  3 Maintain the base URI of each element in conformance with [XML
    Base];

  ...

Passim. You still have a stray capital E or two in uses of the word
“element” that do not begin a sentence. If the phrase “element
information item” were always capitalized consistently, this would be
understandable, if unnecessarily ugly. But it's not capitalized
consistently, so that excuse is not available to you.


Status of my earlier comments

I've reviewed my earlier comments and summarize here what I believe to
be the state of play with respect to each of them.

1. Choice of facets for characterizing processors

Significantly improved. The introduction does a reasonable job of
explaining the rationale for the features selected.

In 1.1 3, I think the explicit statement that the profiles don't
address the preservation of invariants during modification or
incremental construction is helpful. I think it would also be helpful
to add in this section an explicit statement that the profiles don't
address the choice of API, memory model, or the distinction between
tree- and event-based interfaces.


2. Respect for the stand-alone declaration

Not addressed that I could see.


3. Validating processors

Partially addressed. My residual discomfort is discussed above.


4. Definitions of terms

Much improved. I notice that you have addressed this in large part by
deleting all usage of several of the terms I suggested should be
defined, rather than take the trouble to define them. But you seem to
have replaced them with words that I found clearer; the only terms I'd
still like to see defined here are “profile” and “processing”
(specifically of declarations).


5. Are the profiles disjoint?

Resolved. Thank you.


6. Identification of xml:id attributes as IDs

Resolved. Thank you.


7. Processing of external declarations

Not resolved that I can see.


8. Providing information items

Resolved. Thank you.


9. Data models and information sets

Resolved. Thank you.


10. Rigidity

Resolved. Thank you.


11. Relation of profiles to current practice

Not addressed as far as I can see.


12. Implementability of the spec

Resolved. Thank you.


13. Conformance clause

Resolved. Thank you.


14. Documentation of implementation-defined features

Resolved. Thank you.


15. The information expressed in XML documents

Not resolved. The current wording does its best to suggest that no
information not given a name by the Infoset spec can be provided to
applications. Suggested rewording:

For the profile definitions above and the invariants below, we define
a number of (overlapping) classes which categorize the information
items and their properties defined in [XML Information Set].


16. The information classes

Resolved. Thank you.


17. Recursive XInclude processing

Resolved. Thank you.


18. Minor editorial points, typos, etc.

Most resolved, some not. For your convenience, I repeat here the ones
for which I don't see a resolution and which have not already been
reiterated above.

  - In section 1, horizontal ellipses are used with whitespace between
    the full stops but without whitespace before or after the
    ellipsis.

    For “a software module. . .used”, read “a software module
    ... used” or optionally “a software module … used” (the latter
    using the standard hellip entity for character U+2026).

  - In 1.1, the paragraph about base URI says the term is used “as it
    is defined in [RFC 3986]”. But RFC 3986 does not provide any
    definition properly so called for the term base URI. It specifies
    rules for establishing and using a base URI, but it does not
    “define” it.

    I think what is meant is that XPP assumes that the base URI is
    established and used as specified in RFC 3986. So perhaps read

    A base URI is an absolute URI against which relative URIs are
    applied; this specification assumes that base URIs are established
    and used as specified in [RFC 3986].

    But you should probably also decide whether XPP assumes it or
    requires it.
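
To illustrate what "established and used as specified in RFC 3986"
amounts to in practice, here is a small sketch using Python's standard
urllib.parse.urljoin, which resolves a relative reference against a
base URI. The example.org URIs are invented for illustration.

```python
from urllib.parse import urljoin

# Invented base URI: the absolute URI against which relative URIs are applied.
base = "http://example.org/docs/spec/index.xml"

# A relative reference is resolved against the base URI.
resolved = urljoin(base, "../images/fig1.png")
print(resolved)  # http://example.org/docs/images/fig1.png

# An absolute URI is unaffected by the base.
unchanged = urljoin(base, "http://example.org/other.xml")
print(unchanged)  # http://example.org/other.xml
```

The RFC tells you how to carry out this resolution; what it does not do
is supply a definition of the term that XPP could point to.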

Once again, thank you for your work. Good luck with the document.


-- 
****************************************************************
* C. M. Sperberg-McQueen, Black Mesa Technologies LLC
* http://www.blackmesatech.com 
* http://cmsmcq.com/mib                 
* http://balisage.net
****************************************************************

Received on Friday, 6 January 2012 03:09:47 UTC