Re: review of conformance section and conformance language from C. M. Sperberg-McQueen on 2021-06-09 (public-ixml@w3.org from June 2021)

From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
Date: Wed, 9 Jun 2021 11:40:31 -0600
To: Tom Hillman <tom@expertml.com>
Cc: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, public-ixml@w3.org, Steven Pemberton <steven.pemberton@cwi.nl>
Message-Id: <0BF034BD-A517-46E1-B4D3-FC57319C956B@blackmesatech.com>
On 9 Jun 2021, at 7:54 AM, Tom Hillman <tom@expertml.com> wrote:

> … Forgive my lack of standards experience, but can I just check my understanding, given the following paragraph from Michael's email:

>> I took the liberty of looking at the use of the verbs 'must', 'may', and 'shall' in the rest of the spec as well. It may feel too constraining to eliminate all uses of "must", "may", and "should" that do not relate to conformance, but over the years I have gradually come to the conclusion that not doing so is worse.

> Is there a particular meaning to 'should' and 'shall'?  Or are these terms just to be avoided in favour of 'must' for mandatory conformance requirements, and 'may' for optional ones?

In IETF, W3C, and ISO specs "should" indicates that what is described is not an absolute requirement, but that implementations are expected to behave as described unless there is a really good reason not to do so.  Since it is recognized that there may be such a really good reason, something expressed with "should" is not an absolute requirement of conformance.  At some level (is it required? is it allowed?) it is thus the equivalent of "may", but it signals the expectation that whatever is described will normally be the way the thing behaves.

In the words of the ISO drafting rules directive [1], it is used "to express recommendations"; synonyms offered for use when needed are "it is recommended that" and "ought to".

[1] https://www.iso.org/sites/directives/current/part2/index.xhtml

I checked for "shall" because in ISO specs, "shall" is used to express requirements; if something does not behave as described, it does not conform to the standard.  W3C and IETF specs normally use "must" in this sense, not "shall", presumably because "shall" sounded too legalistic and bureaucratic.  The result is that it requires some stylistic ingenuity in W3C and IETF specs to distinguish clearly between statements of what is required for conformance to the spec and what will always be the case due to external constraints unrelated to the spec.  I looked for "shall", but in fact the ixml spec text does not use that word. 

>> A conforming parser must not accept non-conforming grammars.

> I am a little troubled by this: it seems to imply that a parser ought to validate the grammar, and may not accept grammars other than iXML grammars.

Indeed it does imply that -- or rather, it is trying to say that directly and explicitly.  That was the decision I thought we made during the call.

When a conforming processor is presented with a non-conforming grammar, we can either require it to report an error, or specify that its behavior in that situation is undefined, meaning implementations are not required to detect or report errors in the grammar.  Or we can require certain kinds of error recovery (as in XSLT 1.0 and de facto in some versions of HTML), and take one or the other view (fail, undefined) if those error recovery steps don't work.

Specifying that behavior is undefined is a short road to incompatible implementations, as I think can be observed in the SQL space and in Web browsers.  Requiring that errors in the input be reported is a good way of helping users learn the rules, and a good way of improving interoperability.

If a processor P accepts not only ixml grammars, but also IETF ABNF grammars, then when it's handling an ABNF grammar the ixml spec can say either of two things:

  (a) P is behaving as a conforming ixml processor.
  
  (b) P is not behaving as a conforming ixml processor.

(There may be other things we can say, but at the moment these are the only two that seem useful.)

If we choose (a), and the user then tries the same grammar with processor Q and it fails, who gets to explain to the user that two conforming ixml processors are not guaranteed to accept the same grammars?

If we choose (b), it is not a statement that P is behaving badly, only a statement that the ixml spec does not provide rules for handling ABNF grammars.  If I make Aparecium handle other grammar notations, the documentation will note that when doing so, the user and I are outside the rules of the ixml spec; a user who wants a strictly conforming ixml implementation can set the --ixml-strict flag.

> I would prefer something to the effect of "A conforming parser may reject non-conforming grammars."  This puts the onus for writing a conforming grammar on the grammar author, which I think is correct.

This is a design question which we should probably discuss, though there may be some resistance from those who, like me, think we did discuss it and reached a decision.  But if you are surprised by what's in the document, then we apparently reached only the illusion of consensus, not actual consensus, and we have no choice but to discuss it again.

>> We discussed this a bit; I have now done a little homework on the Unicode Consortium web site looking for the correct terminology. "Code point" denotes any number between 0 and x10FFFF inclusive, so my earlier idea that we could make "within the code-point range" exclude surrogates is a non-starter.

> Again I feel we should be consistent: either say caveat emptor, and let the author take care of what is produced, or enforce it.

I think I disagree; this is such an easy catch for software and such an error-prone subject for humans that I think it makes no sense to put the onus on the human.

But I am assuming that we expect the input to be a stream of UCS (Unicode / ISO 10646) characters.

If we expect ixml to be used to parse arbitrary binary data, or data in arbitrary character sets, or data which confusingly mixes character sets (so some octets with the high-order bit set are part of UTF sequences and some are native Windows characters ...), then we should perhaps not restrict the values at all, although we will have to define what it means when the user is parsing a stream of octets and the grammar says to expect #1A5, which cannot be the value of a single octet.  Does it match nothing?  Does it match the sequence #01, #A5?  If it's a little-endian stream should it match #A5, #01?
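
To make the question concrete, here is a minimal Python sketch of the octet sequences a processor might be asked to match for #1A5.  The function name and dictionary keys are illustrative only, and the UTF-8 reading is an added assumption about what a user might expect; none of this is proposed behaviour for the spec.

```python
def candidate_octet_sequences(cp: int) -> dict:
    """List possible octet-level readings of a hex-encoded value such as #1A5.

    Hypothetical helper for illustration; not part of any ixml processor.
    """
    # Smallest number of octets that can hold the value (at least one).
    width = max(1, (cp.bit_length() + 7) // 8)
    return {
        # A single octet can only hold values up to #FF.
        "single_octet": bytes([cp]) if cp <= 0xFF else None,
        # Most significant octet first: #01, #A5 for #1A5.
        "big_endian": cp.to_bytes(width, "big"),
        # Least significant octet first: #A5, #01 for #1A5.
        "little_endian": cp.to_bytes(width, "little"),
        # Assumption: the stream is UTF-8 and the value is a code point.
        # (Raises UnicodeEncodeError for surrogates.)
        "utf8": chr(cp).encode("utf-8"),
    }


readings = candidate_octet_sequences(0x1A5)
for name, octets in readings.items():
    print(name, octets)
```

The readings disagree with one another, which is the point: if octet streams are in scope, the spec would have to pick one of them or say that the value matches nothing.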

> I think that the Unicode specification calls non-surrogates 'scalar values' (§3.9 of http://www.unicode.org/versions/Unicode13.0.0/ch03.pdf) but this does not seem like a term which will be familiar to many readers of the spec.

I believe that's true.  But I also believe that if the input is to be Unicode, it makes sense to forbid noncharacter code points as well as surrogates.

> I think I agree with Steven here; the onus should be on the writer of the grammar in that a conforming grammar must not produce invalid XML when parsed - in fact, I think that the conformance of grammars ought to explicitly state that before the current list of conformance requirements (which, after all, are following from that principle).

I think that you and Steven are both using "valid" where I would expect "well formed".  Please don't do that.  It confuses me a lot.

If Steven said the onus should be on the grammar writer, then I missed that entirely; my reading of his remarks has been that on most questions he leans the other way.  But in weighing options on the scale of helpful-or-maybe-paternalistic vs unconstrained-but-maybe-booby-trapped, everyone is entitled to lean now one way and now the other.

> I would suggest that the sub-requirement of avoiding surrogate code points could then be expressed similarly to: "The number represented in a hex encoding of a character must correspond to a unicode code point that can be represented as a well-formed XML character or entity".

That may require more grasp of the details of the XML spec than it's reasonable to expect from the average reader.  (I'd have to look at the XML spec to see whether the phrase "well-formed XML character or entity" made sense; off hand, I don't think I know exactly what it means.)  I think it's simpler to say "no surrogates, no noncharacter code points."
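
As a rough illustration of how little software needs in order to enforce "no surrogates, no noncharacter code points", here is a minimal Python sketch.  The function name is hypothetical, and the ranges are those of the Unicode Standard (surrogates D800 to DFFF; noncharacters FDD0 to FDEF plus the last two code points of every plane).  It does not attempt to check what XML itself allows as a character.

```python
def is_permitted_code_point(cp: int) -> bool:
    """True if cp is a code point that is neither a surrogate nor a noncharacter.

    Hypothetical helper for illustration; the name is not from the ixml spec.
    """
    if not 0 <= cp <= 0x10FFFF:
        return False  # outside the Unicode code-point range altogether
    if 0xD800 <= cp <= 0xDFFF:
        return False  # surrogate code point
    if 0xFDD0 <= cp <= 0xFDEF:
        return False  # the contiguous block of 32 noncharacters
    if (cp & 0xFFFE) == 0xFFFE:
        return False  # U+nFFFE and U+nFFFF, the last two code points of each plane
    return True


for value in (0x41, 0xD800, 0xFDD0, 0xFFFE, 0x10FFFD, 0x10FFFF):
    print(f"#{value:X}", is_permitted_code_point(value))
```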

>> A processor conforms to this specification if it accepts grammars in ixml {or XML?} form

> I feel we should allow the XML representation of a grammar as well/instead of the non-XML representation.  Perhaps something like: "A conforming parser must accept ixml grammars in either ixml or XML representations, or both."

Good catch.

Design question, to be discussed, I think.  I think it makes sense to allow conforming processors to accept grammars in XML.  I'm not sure whether it makes sense to allow them to require XML grammars.

>> • fail for whatever reason (e.g. because available resource limits were exceeded). {so a processor that always fails is conformant?}

If failing for lack of resources or other external causes is non-conforming, there will be no conforming processors and conformance is not a helpful concept.

If conforming processors are allowed to fail for lack of resources or other external causes, then yes, a processor that always fails may be conformant.  The way I've heard people say it, "Some things have to be left to the marketplace."

> Do we need to enumerate a list of possible fatal errors (and their codes)?

That might feel a bit more bureaucratic than the rest of the spec.  But I don't think I would object.

>> Known parsing algorithms of this class include Earley, Unger, CYK, GLR, and GLL. {Should these be nonnormative references?}

> I don't think we need this line to be a part of the spec.

I think I'm agnostic.  I do like the explicit acknowledgement that the Earley algorithm is not required.

>> • If more than one parse tree describes the input [...] the resulting parse must be marked as ambiguous by including the attribute ixml:state="ambiguous" on the document element of the serialisation.

> I'm sure you all already know this, but I would like users to have the option to suppress the ambiguous flag (or at least the option of an option) - not least because any grammar that allows for whitespace is likely to be ambiguous.  The ability to offer that option should not be required, but should be allowed.

I think that's a comment on the version of the spec on which I was commenting, and not on the changes I was proposing, in which the quoted sentence ends "unless the parser offers a user option to suppress this attribute and the user has activated that option."
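
A minimal Python sketch of the rule as amended, for illustration only.  The function name, the parse_count parameter, and the suppress option are hypothetical, and the namespace URI bound to the ixml prefix is an assumption that should be checked against the spec.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the ixml prefix; verify against the spec.
IXML_NS = "http://invisiblexml.org/NS"
ET.register_namespace("ixml", IXML_NS)


def mark_ambiguity(root: ET.Element, parse_count: int, suppress: bool = False) -> ET.Element:
    """Add ixml:state="ambiguous" to the document element when required.

    parse_count is the number of parse trees found for the input; suppress
    stands for a user option that the user has explicitly activated.
    """
    if parse_count > 1 and not suppress:
        root.set(f"{{{IXML_NS}}}state", "ambiguous")
    return root


doc = mark_ambiguity(ET.Element("doc"), parse_count=2)
print(ET.tostring(doc, encoding="unicode"))
```

With suppress=True, or with exactly one parse tree, the document element is left unmarked.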

>>  • If the root node in the grammar is marked as an attribute, processors must ignore that marking when serialising the rule as the root.

> Could we also make it acceptable to fail with an error?

I think that needs discussion.  I'm agnostic.  As a design decision, I think this ties in with other issues which require us to decide how much we want to protect users from themselves.

Michael


********************************************
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
cmsmcq@blackmesatech.com
http://www.blackmesatech.com
********************************************
Received on Wednesday, 9 June 2021 17:41:58 UTC