Comments on XPP draft of 24 January 2013 (was: Re: XML processor profiles)

[Second transmission; with attachment]

On Jan 31, 2013, at 9:08 AM, Norman Walsh wrote:

> Hi Michael,
> 
> ...
> 
> I now believe we have addressed [your comments] to the best of our ability.
> 
> The latest draft is here:
> 
> http://www.w3.org/XML/XProc/docs/xml-proc-profiles.html
> 
> I hope you find this new draft entirely satisfactory. If you could
> take a look and let me know, I would certainly appreciate it.


Hi Norm,

I apologize for the length of time it has taken me to respond.  I
attach an HTML document summarizing the status of my earlier
comments as well as I could and identifying those which have been
resolved to my satisfaction and those which have not.

An ASCII version is appended for the use of those who prefer it.

Michael
Comments on 24 January 2013 draft of

XML processor profiles

C. M. Sperberg-McQueen, Black Mesa Technologies LLC

20 August 2013

1 New issue
2 Issues resolved
3 Issues partially resolved
   3.1 Validation
   3.2 Classification facets
   3.3 Miscellaneous small points
4 Issues not addressed
   4.1 Correctness of standalone processing
   4.2 Hamming distance of item 1 in the profiles
   4.3 Information corresponding to information items
   4.4 Definitions
   4.5 Relation of profiles to current practice
   4.6 The information expressed in XML documents

This document reviews the current status of comments made by this
reviewer on earlier drafts (12 April 2011 and 16 December 2011) of the
XML processor profiles specification; it takes the 24 January 2013
editor's draft as its reference point. Specifically, this document
records, for each item in the comments of 5 January 2012 (email), and
for those items in the comments of 15 April 2011 (email) which had not
yet been addressed in January 2012, whether the issue raised appears
to have been addressed or not, and if addressed whether it has been
resolved or not in the view of the reviewer, in the draft of 24
January 2013.

A few remarks on the new draft are also included, but this document
does not reflect a complete review of the current draft.


1 New issue

The following issue appears to be independent of those raised in
January 2012 and April 2011.

Sec 3, under "Unparsed Entity Reference Information Item", the draft
says in a note: This type of information item will not occur at all if
standalone="yes". There are two problems. First, this type of
information item will not occur if standalone="yes" is correctly
supplied; it may well occur in input in which standalone="yes" is
erroneously specified. Second, the note is missing a final full stop.


2 Issues resolved

The following items raised in January 2012 appear to have been
resolved in the draft of January 2013.

  - All the comments labeled editorial, except as noted below. Thank
    you.

  - Suggestion to add a remark to section 2.1.

    The review of January 2012 suggested adding the following text to
    section 2.1:

    Every conforming XML processor distinguishes, by definition, between XML and non-XML input.

    This text has not been added, but it no longer seems necessary or
    particularly helpful to the reviewer, so this issue should count
    as resolved.

The following item raised in April 2011 appears to have been
resolved in the draft of January 2013.

  - Issue regarding the alleged definition of base URI by RFC 3986. I
    think the new definition of this term works correctly.

3 Issues partially resolved

The following items raised in January 2012 and/or April 2011 appear
not to have been resolved successfully in the draft of January 2013.

3.1 Validation

In both earlier reviews, this reviewer expressed a desire for a
cleaner and more explicit treatment of validation. The review of
January 2012 said (in part):

    I continue to think, however, that validation plays a more important
    role in most practical characterizations of XML parsers than is
    reflected in the treatment of validation in this draft. (Search
    for “XML parser” and read people's characterizations of their own or
    others' parsers and I think you'll find that in about 90% of all
    instances, one of the first things mentioned will be whether it's a
    validating parser or not.) If W3C is going to issue a spec which
    appears to be trying to provide a single set of processor categories
    that most people and applications can use, then it says here that I
    think it's an error for its categories not to address validation more
    directly.

    The current draft appears to define four categories of XML processing,
    none of which require validation. In reality it defines eight: two
    profiles forbidding validation and two profiles each of which can be
    coupled with a rule forbidding, a rule allowing, or a rule
    requiring validation. It defines more than eight if one takes
    schema languages other than DTDs into consideration.
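
The count of eight in the passage quoted above can be checked by
enumeration. The following is a minimal sketch only: the profile names
are hypothetical placeholders, not the names used in the draft, and
schema languages other than DTDs are left out.

```python
from itertools import product

# Hypothetical placeholder names; the draft's own profile names are not assumed.
NO_VALIDATION_PROFILES = ["profile-A", "profile-B"]   # forbid validation outright
OPEN_PROFILES = ["profile-C", "profile-D"]            # need an added validation rule
VALIDATION_RULES = ["forbidden", "allowed", "required"]

# Two profiles stand alone; the other two each combine with three rules.
categories = [(profile, "forbidden") for profile in NO_VALIDATION_PROFILES]
categories += list(product(OPEN_PROFILES, VALIDATION_RULES))

for profile, rule in categories:
    print(f"{profile}: validation {rule}")
print(len(categories))  # 2 + 2 * 3 = 8
```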

The new non-normative section 7 is an improvement, in that it
addresses the related issues more clearly and makes it less likely
that other WGs seeking to refer to this specification for a normative
statement of requirements for their XML input will botch the job.

It continues to trouble me, however, that a spec whose avowed aim is
to allow other specifications to "establish precisely what input
processing they require" (my emphasis) goes out of its way to make it
onerous to specify validation. The only two profiles which can be
referred to simply, without an additional clause specifying whether
validation is forbidden, required, or allowed, are the two profiles
which forbid validation. This is bad social engineering, unless your
goal is to discourage users of XPP from using it to require validation
of any kind. If the goal is to be neutral with respect to validation,
the current draft fails to achieve it, by a wide margin.

The new section 7 also makes clear that this specification, in its
current form, does not succeed in addressing the requirement which
drove the creation of the working group, even in an attenuated form.
Those who attended the W3C's XML Processing Model workshop in 2001
will recall that some prominent attendees hoped for a specification
which would nail down, once and for all, the sequence in which
processes like validation, XInclude processing, etc., should be
performed. The minutes of the closing session can usefully be
consulted by those who have forgotten.

The working group has quite rightly declined to draft any spec of the
kind initially envisaged, and I don't want to reopen that discussion.
But the very least that this specification should do towards
satisfying that initial requirement is provide a short name for one or
more answers to the original question. This the current draft fails to
do.

It is the failure of the WG to integrate validation adequately into
the classification scheme — or perhaps it would be more accurate to
say the WG's determined efforts to exclude validation from the
processor profiles — which has rendered the spec unable even to put a
name to an answer to the question raised at the workshop.

This issue must therefore still be regarded as not satisfactorily
resolved.


3.2 Classification facets

The review of April 2011 asked for an explicit rationale for the
facets chosen as basis for classification, and for an explicit
acknowledgement that some possible facets are not used in the
classification offered.

    Suggested fix: explicitly acknowledge that XPP involves a choice among
    possible ways of characterizing processors; identify the processor
    properties used as the basis for the classification proposed and
    identify at least some potential properties which are not used in the
    classification. Explain the basis for the choice.

The review of January 2012 noted progress on the issue but noted a
remaining gap:

    Significantly improved. The introduction does a reasonable job of
    explaining the rationale for the features selected.

    In 1.1 3, I think the explicit statement that the profiles don't
    address the preservation of invariants during modification or
    incremental construction is helpful. I think it would also be helpful
    to add in this section an explicit statement that the profiles don't
    address the choice of API, memory model, or the distinction
    between tree- and event-based interfaces.

I continue to think that explicit exclusion of facets not used in the
classification is a useful way of documenting the design. (It would
also have the salutary effect of forcing the WG to ask itself if it
really wants to publish a classification for XML processors that fails
to address salient distinctions like that between tree- and
event-based interfaces. Whom are you trying to serve? No one I know
who is interested in characterizing XML processors, that's for sure.)
The working group may be able to persuade me otherwise, but only by
discussing the issue.


3.3 Miscellaneous small points

Several small points have been addressed in part in the current draft,
though they have not been resolved to this reviewer's complete
satisfaction.

- Thank you for suppressing the space between the full stops in your
ellipses; it improves legibility somewhat.

I continue to think that most manuals of style prescribe white space
before an ellipsis, and you have not made me think differently.
Chicago (13), for example, says “ellipsis points … are usually
separated from each other and from the text and any contiguous
punctuation by 3-to-em spaces” — I think a blank character comes
closer to this, in an average Web browser, than no space at all. And
use of the public entity hellip (U+2026) will produce better spacing
between the dots than three literal full stops.

- Thank you for including definitions of the terms
implementation-defined and implementation-dependent; these take the
awkward form "The term implementation-defined indicates an aspect that
may differ between implementations ...". Perhaps better "an aspect of
processor behavior"?

- In January 2012 this reviewer wrote:

    In 2.2 I found myself wondering whether conformant non-validating
    XML processors are required to perform ID type assignment for IDs
    declared in the internal DTD subset. It might be a convenience for
    readers who don't have the relevant specs in their favorites list if
    you included a note pointing out that they do, or do not, have that
    obligation.

The current draft adds a note which helps address this question, but
it's ill-drafted. It reads:

    This profile, like the 2.1 The basic XML processor profile, reads
    only declarations in the internal subset, this means that types, such
    as ID, that appear in declarations in the internal subset will be
    processed while such declarations in the external subset will not.

First, oughtn't it to be processors, not profiles, which read things?
Second, "types, such as ID" and "declarations" are not really parallel. I
think what is meant is something like: "Processors conforming to this
profile, like those conforming to 2.1 The basic XML processor profile,
read only declarations in the internal subset of the DTD, not those in
the external subset. In consequence, declarations specifying that
attributes have type ID will be processed if they appear in the
internal subset, but such declarations will not be processed if they
appear in the external subset."

- Thank you for making your capitalization of the terms element
information item (etc.) consistent.

I do wish you had done so by moving in the direction of normal English
usage. As an innovation, returning to seventeenth-century
capitalization rules lacks charm.

Leaving normal English usage aside, I do not understand why you choose
to deviate from the usage of the Infoset spec, which lowercases these
terms except in section titles and other passages using title case.
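
On the question of ID declarations in the internal subset raised
above, readers who want to see what one non-validating parser does can
try the sketch below. It uses the expat binding bundled with Python;
the sample document and the helper name attlist_declarations are made
up for illustration. It shows only that this particular parser reports
the declaration, and says nothing about what the XML spec requires.

```python
import xml.parsers.expat

# Illustrative document: the ID declaration sits in the internal DTD subset.
DOC = b"""<?xml version="1.0"?>
<!DOCTYPE r [
  <!ATTLIST r id ID #IMPLIED>
]>
<r id="x"/>"""

def attlist_declarations(data):
    """Collect (element, attribute, type) for each attribute declaration seen."""
    decls = []
    parser = xml.parsers.expat.ParserCreate()

    def on_attlist(elname, attname, att_type, default, required):
        decls.append((elname, attname, att_type))

    parser.AttlistDeclHandler = on_attlist
    parser.Parse(data, True)
    return decls

print(attlist_declarations(DOC))
```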


4 Issues not addressed

The following items raised in April 2011 and/or January 2012 appear
not to have been addressed at all in the draft of January 2013.


4.1 Correctness of standalone processing

In April 2011, this reviewer wrote:

    It would be helpful, I think, for the processor profiles to
    distinguish more carefully the different behaviors possible with
    regard to the stand-alone declaration in the input XML document.

      - All declarations are read and handled appropriately, so
        documents with standalone='no' are processed without
        information loss.

      - No external declarations are read if standalone='yes'; if
        standalone='no' then external declarations are read, so all
        documents are processed without information loss.

      - No external declarations are read; if standalone='yes', the
        document is processed without information loss, and if
        standalone='no', the processor signals an inability to process
        the document without the possibility of information loss.

      - No external declarations are read, so documents with
        standalone='yes' are processed without information loss, and
        information will typically be lost in the processing of
        documents with standalone='no'. (Since documents may have
        standalone='no' even if standalone='yes' would be permitted,
        there can be cases where no information is lost in practice.)

    In particular, it would be helpful for users of XML and for
    writers of specifications for XML-based processing to distinguish the
    last case from the others, in order to exclude it.

    Suggested fix: augment the basic profile to require either that
    external declarations be read when necessary or that the processor
    signal an inability to handle non-standalone documents properly.
    Optionally also keep the profile now called basic, giving it a new
    name (personally, I could go for “sub-optimal”, but some people might
    think that that name was ungenerous).

In January 2012 I wrote on the same topic:

    I note with some disappointment that the current draft fails, like
    the earlier one, to take into account the property of respecting, or
    failing to respect, the implications of the standalone property; I
    think that's a lost opportunity.

I repeat these comments in full because I see no traces of any effort
to resolve this issue: no changes in the specification, no discussion
of the issue with the reviewer.
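
The four behaviors listed in the April 2011 comment can be modelled as
follows. This is a sketch under stated assumptions, not a proposal for
the spec: the names Policy, Outcome, and outcome are invented here,
and a standalone='yes' declaration is assumed to be correct.

```python
from enum import Enum

class Policy(Enum):
    READ_ALL = 1          # external declarations are always read
    READ_IF_NEEDED = 2    # read only when standalone='no'
    REFUSE_IF_NEEDED = 3  # never read; signal when standalone='no'
    IGNORE = 4            # never read; carry on silently

class Outcome(Enum):
    COMPLETE = "processed without information loss"
    REFUSED = "processor signals that it cannot rule out information loss"
    POSSIBLY_LOSSY = "processed, but information may be lost"

def outcome(policy, standalone):
    """Map a processor policy and a standalone value ('yes'/'no') to a result."""
    if standalone == "yes":
        # Assumes the declaration is correct, so nothing external is needed.
        return Outcome.COMPLETE
    if policy in (Policy.READ_ALL, Policy.READ_IF_NEEDED):
        return Outcome.COMPLETE
    if policy is Policy.REFUSE_IF_NEEDED:
        return Outcome.REFUSED
    return Outcome.POSSIBLY_LOSSY

for policy in Policy:
    print(policy.name, "->", outcome(policy, "no").value)
```

Only the last policy can lose information without saying so, which is
the case the comment asks the profiles to single out.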


4.2 Hamming distance of item 1 in the profiles

The comment made in January 2012 about item 1 of the four profiles
still stands.

  - (Important). It would be a lot easier to see the differences in
    item 1 of each profile if they didn't begin with the same eighty
    characters. How many people scanning these lists for a first
    orientation will even notice that beginning in character 83 or so, the
    first items of the lists diverge from each other?

    Since the obligation to “Process the document as required of
    conformant non-validating XML processors” does not distinguish the
    profiles from each other, it could be dropped, or moved outside the
    lists of distinctive features, without loss. If you do insist on
    including it in each list, repetitive and uninformative though it is,
    then at least break the item in half. For example:

      - Process the document as required of conformant non-validating
        XML processors;

      - Refrain, in so doing, from reading any external markup
        declarations;
    
      - Maintain the base URI of each element in conformance with [XML
        Base];

    ...

If there is a goal whose achievement is aided by making the profiles
harder to scan, I do not know what it is. If there is a problem with
making it easier to see where the profiles are the same and where they
differ, I do not know what it is. The WG's consistent policy of
avoiding discussion with reviewers has not made it easier to
understand the WG's position and has not managed to persuade me that
the current wording is better than the alternative suggested in 2012.
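
The point about shared leading characters is easy to measure. In the
sketch below the shared opening is the phrase quoted above, but the
two continuations are hypothetical and are not the draft's wording;
the helper name divergence_index is likewise invented.

```python
import os.path

def divergence_index(items):
    """Return the 1-based character position at which the strings first differ.

    If all strings are identical, the result is one past their length.
    """
    return len(os.path.commonprefix(items)) + 1

SHARED = ("Process the document as required of conformant "
          "non-validating XML processors")

# Hypothetical continuations, for illustration only.
items = [
    SHARED + ", reading no external markup declarations",
    SHARED + ", reading all external markup declarations",
]

print(len(SHARED))
print(divergence_index(items))
```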


4.3 Information corresponding to information items

The comments made in April 2011 and January 2012 still apply.

April 2011:

    In 2, the clauses about faithful provision of the information in
    the document all take the form “Faithful provision of the information
    ... corresponding to information items and properties ...”.

    Perhaps it would suffice to provide, or expose, the information
    items and properties specified.

    If it is absolutely necessary to provide not the information items and
    properties themselves but instead information corresponding to (but,
    implicitly, not identical to?) the specified items and properties,
    then I think the spec has an obligation to explain clearly what the
    difference is, and why exposing the items and properties does not
    satisfy the requirements of the spec. In particular, you need to
    provide an answer to the reader who is asking “How can a piece of
    information correspond to an information item without being
    indistinguishable from it (qua information) and thus without being
    that information item?”

    The editors might do well to review their dusty copies of Strunk
    and White's Elements of style, especially the maxim “Omit needless
    words”, and to revise accordingly. If they do, the individuals
    corresponding to their readers will feel an emotion corresponding to
    gratitude. (Or, at least, a diminished desire to seek out sharp
    objects and perform dangerous acts with them.)

January 2012:
    
    2.1-2.4. The phrase “information corresponding to information
    items and properties [in a particular class]” is needlessly redundant.
    And it says the same thing twice. Unless, of course, it's not
    saying the same thing twice at all, but something subtly different
    which needs explanation. Why are you spitting in your readers'
    faces this way?

    If you are going to follow the Infoset spec's terminology, I think
    you might do worse than follow that spec's usage. The Infoset spec
    speaks of information sets being “made available” (by parsers, to
    downstream applications), and also speaks of an information set as
    consisting of some number of information items. That suggests (or so
    it seems to me) that when a parser provides information to the
    application, the usage of the Infoset spec is that the parser is
    providing information items to the application. Providing information
    items — not “information corresponding to information items”.

    If you mean what the Infoset spec means, would you not do better
    to follow its usage? They invented the term information item,
    after all.

    If you mean something different, then (a) what do you mean? and
    (b) why have you not defined the terms you are using in a special
    sense different from that given by the Infoset spec?

In most other parts of the specification, the English relating to
processors, information, and information items has become clearer and
more natural. The locution "information corresponding to information
items" persists only here. As far as this reviewer can tell, all the
arguments brought forward in the earlier reviews continue to apply:
the Infoset spec continues to say what it said when first published,
and the unnecessary and pointless obfuscation in this phrase continues
to make me want to poke my eyes out with a sharp stick. Is there any
argument in favor of the current wording? Or is the WG still under the
influence of those who prefer to speak as if the Infoset spec defined
an API (and thus a particular format for information) instead of
defining named packets of information, independent of format?
defining named packets of information, independent of format?


4.4 Definitions

The spec appears still to lack definitions for key terms, including
most prominently profile and processing (specifically processing of
declarations).

This means that the issues originally raised as issue 4 and issue 7 in
the review of April 2011 remain unresolved. If the WG's belief is that
the reference to the XML spec suffices as a gloss of the words
"reading and processing all external markup declarations", then I
regret to inform you that this reader does not find any useful
distinction between reading and processing in that document. What in
the world do you think these words mean?


4.5 Relation of profiles to current practice

The comment originally made in April 2011 continues to apply.

    When profiles are defined for a new specification, they involve
    predictions about which kinds of variation in processor behavior are
    likely to be interesting and useful to developers and users. In the
    case of new specifications, there is no existing practice that
    could be appealed to as a justification of the classification or
    profiles, or to provide examples of software fitting one profile
    or another.

    That is not the case here, and I think the specification should
    not progress until an empirical survey of existing processor
    characteristics is performed, as a simple way of field-testing the
    profiles defined here for applicability in the real world and of
    clarifying the intent of the profiles by providing examples, where
    applicable, of existing interfaces or processors that satisfy the
    profile.

    In particular, I could have sworn (but I am too lazy to look it up
    now) that I had used some parser interfaces which did not provide
    access to namespace prefixes, and other interfaces which provided only
    inconvenient access to namespace names. Is a set of profiles which
    assumes that namespace name, local name, and prefix are always all
    three provided a good match for a world in which some parser
    implementors give their users a choice (prefix plus local name or
    namespace name plus local name)?

    Note that actually classifying real parsers will require a crisp
    definition of what it means to make a particular information item
    available; that will be a good thing, although it is likely to involve
    some work.

    Suggested fix: Identify ten or twenty existing XML processors with
    different behaviors (for purposes of this exercise, all conforming SAX
    processors may well turn out to be alike; ditto for conforming DOM
    parsers). Using the definitions given in XPP, identify which
    profile(s) each parser matches, if any. If there are significant
    numbers of parsers which match no profile, consider whether the
    profiles need to be revised to provide a better connection with
    existing practice. Use a non-normative document to provide examples of
    processors matching the different profiles.

I continue to believe that this specification should not advance until
its use in classifying actual processors is tested as described above.
And since revisions of the spec since April 2011 have identified an
explicit goal of supporting other specifications, the field-testing
described above should be supplemented by a similar field-test, showing
that current specifications based on XML (W3C or other) could have
used the classification offered in this specification to simplify
their wording.
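
As a single data point, and in no way a substitute for the survey
requested above, the sketch below shows one widely available interface
that hands back namespace name plus local name for a parsed element
and does not report the namespace declaration as an attribute: the
ElementTree module in the Python standard library. The sample document
and the helper split_tag are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

DOC = '<p:doc xmlns:p="urn:example:ns"><p:item/></p:doc>'

def split_tag(tag):
    """Split an ElementTree tag of the form '{namespace}local' into its parts."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return None, tag

root = ET.fromstring(DOC)
print(root.tag)             # namespace name and local name, no prefix
print(split_tag(root.tag))
print(root.attrib)          # the xmlns:p declaration is not listed here
```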

I think establishing the applicability of this specification's
classification to existing processors and existing specs should be a
pre-requisite for advancement beyond Candidate Recommendation. Please
consider this a formal objection to any plan to do otherwise (to the
extent that a member of the public has standing to raise formal
objections).


4.6 The information expressed in XML documents

Issue 15 in the review of April 2011 remains unresolved and
unaddressed.

    Section 3 begins

        For the profile definitions above and the invariants below, we
        categorize the information expressed in XML documents into a
        number of (overlapping) classes.

    This is incorrect. What is characterized in section 3 is not
    the information expressed in an XML document, but the
    particular subset of that information for which the Infoset
    spec defines names. The two are the same neither in theory nor
    in practice.

    Suggested fix: Replace the sentence quoted with one that's not
    false. Perhaps “For the profile definitions above and the
    invariants below, we categorize the information identified and
    named in [XML Information Set] into a number of (overlapping)
    classes.”

The comments of January 2012 also continue to apply.

    Not resolved. The current wording does its best to suggest that no
    information not given a name by the Infoset spec can be provided
    to applications. Suggested rewording:

        For the profile definitions above and the invariants below, we
        define a number of (overlapping) classes which categorize the
        information items and their properties defined in [XML
        Information Set].


-- 
****************************************************************
* C. M. Sperberg-McQueen, Black Mesa Technologies LLC
* http://www.blackmesatech.com 
* http://cmsmcq.com/mib                 
* http://balisage.net
****************************************************************

Received on Tuesday, 20 August 2013 20:27:09 UTC