Re: review of conformance section and conformance language from Tom Hillman on 2021-06-09 (public-ixml@w3.org from June 2021)

From: Tom Hillman <tom@expertml.com>
Date: Wed, 9 Jun 2021 14:54:40 +0100
To: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, public-ixml@w3.org, Steven Pemberton <steven.pemberton@cwi.nl>
Message-ID: <f21be5ec-174e-4c77-83a4-e95d62c77efa@Spark>
Indeed, thanks, Michael.

Forgive my lack of standards experience, but can I just check my understanding, given the following paragraph from Michael's email:
> I took the liberty of looking at the use of the verbs 'must', 'may', and 'shall' in the rest of the spec as well. It may feel too constraining to eliminate all uses of "must", "may", and "should" that do not relate to conformance, but over the years I have gradually come to the conclusion that not doing so is worse.

Is there a particular meaning to 'should' and 'shall'? Or are these terms simply to be avoided in favour of 'must' for mandatory conformance requirements, and 'may' for optional ones?
> A conforming parser must not accept non-conforming grammars.

I am a little troubled by this: it seems to imply that a parser ought to validate the grammar, and may not accept grammars other than iXML grammars.  I would prefer something to the effect of "A conforming parser may reject non-conforming grammars."  This puts the onus for writing a conforming grammar on the grammar author, which I think is correct.
> > We discussed this a bit; I have now done a little homework on the Unicode Consortium web site looking for the correct terminology. "Code point" denotes any number between 0 and x10FFFF inclusive, so my earlier idea that we could make "within the code-point range" exclude surrogates is a non-starter.
> Again I feel we should be consistent: either say caveat emptor, and let the author take care of what is produced, or enforce it.

I think that the Unicode specification calls non-surrogates 'scalar values' (§3.9 of http://www.unicode.org/versions/Unicode13.0.0/ch03.pdf) but this does not seem like a term which will be familiar to many readers of the spec.

I think I agree with Steven here: the onus should be on the writer of the grammar, in that a conforming grammar must not produce invalid XML when parsed. In fact, I think the conformance requirements for grammars ought to state that explicitly before the current list of requirements (which, after all, follow from that principle).

I would suggest that the sub-requirement of avoiding surrogate code points could then be expressed along these lines: "The number represented in a hex encoding of a character must correspond to a Unicode code point that can be represented as a well-formed XML character or entity".
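
As a rough illustration of what that wording would mean in practice, here is a minimal sketch in Python that tests a hex-encoded character against the Char production of XML 1.0. The function names and the leading "#" handling are my own assumptions for illustration, not anything taken from the spec draft.

```python
def is_xml_char(cp: int) -> bool:
    """True if the code point matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # everything below the surrogate block
        or 0xE000 <= cp <= 0xFFFD      # above the surrogates, minus FFFE/FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


def check_encoded(hex_digits: str) -> bool:
    """Check a hex encoding such as '#1F600' (the '#' prefix is assumed, and optional here)."""
    return is_xml_char(int(hex_digits.lstrip("#"), 16))
```

With this test, "#41" and "#A" pass, while "#D800" (a surrogate), "#FFFE", "#0" and "#110000" (outside the code-point range) are rejected. Note that XML 1.0 is a little looser than Michael's proposed text: apart from FFFE and FFFF, it does not exclude noncharacters.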
> A processor conforms to this specification if it accepts grammars in ixml {or XML?} form

I feel we should allow the XML representation of a grammar as well as, or instead of, the non-XML representation.  Perhaps something like: "A conforming parser must accept ixml grammars in either ixml or XML representations, or both."
>  • fail for whatever reason (e.g. because available resource limits were exceeded). {so a processor that always fails is conformant?}

Do we need to enumerate a list of possible fatal errors (and their codes)?
> Known parsing algorithms of this class include Earley, Unger, CYK, GLR, and GLL. {Should these be nonnormative references?}

I don't think we need this line to be a part of the spec.
>  • If more than one parse tree describes the input [...] the resulting parse must be marked as ambiguous by including the attribute ixml:state="ambiguous" on the document element of the serialisation.

I'm sure you all already know this, but I would like users to have the option to suppress the ambiguous flag (or at least the option of an option) - not least because any grammar that allows for whitespace is likely to be ambiguous.  The ability to offer that option should not be required, but should be allowed.
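
To make the kind of option I mean concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the serialise function, the suppress_ambiguity parameter and the parse_count argument are invented for illustration, and the namespace URI bound to the ixml prefix is an assumption on my part.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the ixml: prefix; adjust if the spec says otherwise.
IXML_NS = "http://invisiblexml.org/NS"
ET.register_namespace("ixml", IXML_NS)


def serialise(root: ET.Element, parse_count: int,
              suppress_ambiguity: bool = False) -> str:
    """Serialise a parse result, flagging ambiguity unless the user opted out."""
    if parse_count > 1 and not suppress_ambiguity:
        # Default behaviour: mark the document element as ambiguous.
        root.set(f"{{{IXML_NS}}}state", "ambiguous")
    return ET.tostring(root, encoding="unicode")
```

By default a result with more than one parse carries ixml:state="ambiguous" on the document element; a processor that chooses to offer the option would simply omit the attribute when suppress_ambiguity is set.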
>  • If the root node in the grammar is marked as an attribute, processors must ignore that marking when serialising the rule as the root.

Could we also make it acceptable to fail with an error?

Thanks again,
Tom

_________________
Tomos Hillman
eXpertML Ltd
+44 7793 242058
On 9 Jun 2021, 13:57 +0100, Steven Pemberton <steven.pemberton@cwi.nl>, wrote:
> Thank you for this Michael.
>
> I adopted all of this (up to the discussion section), with some slight changes to the wording. In particular I changed all "parsers" to "processors". You will want to rereview I expect.
>
> The result is visible at the usual location
>    https://homepages.cwi.nl/~steven/ixml/ixml-specification.html
>
> You will note some comments visible in yellow in the conformance section for discussion.
>
> > On some items, I think we need discussion.
> >
> > Q1. In "Parsing", the text currently says:
> >
> > The root symbol of the grammar is the name of the first rule in the grammar. If it is marked as hidden, all of its productions must produce exactly one non-hidden nonterminal and no non-hidden terminals before or after that nonterminal (in order to match the XML requirement of a single-rooted document).
> >
> > I am not sure what this is saying; I suspect it's one or more of the following.
> >
> > (a) If the root symbol of the grammar is hidden, the output of the parse is going to be a well formed XML document (with a single root) only if whatever rules match the input produce one element that encloses all others. So watch your step!
> >
> > (b) The parser is responsible for checking that the grammar will always produce a single-rooted XML document with a single outermost element, and flagging an error in the grammar if this is not guaranteed.
>
> It is indeed meant to require that a conformant grammar produce a conformant XML serialization. This is thus allowed:
>
>      -input: -~[]*, -"Last-modified: ", date, -~["0"-"9"], -~[]*.
>      date: y, -"-", m, -"-", d.
>      @y: n.
>      @m: n.
>      @d: n.
>      n: ["0"-"9"]+.
>
> so that any input that contains the string Last-modified: followed by a date is acceptable, and will produce a serialization like
>
>    <date y="2021" m="06" d="09"/>
>
> (and if the input contains more than one matching Last-modified:
>
>     <date ixml:state="ambiguous" y="2021" m="06" d="09"/>
>
> )
>
> But this wouldn't be allowed:
>
>     -input: -~[]*, -"Last-modified: ", date, -"T", time, -~["0"-"9"], -~[]*.
>
> because the serialization wouldn't have a root element.
>
> My feeling is that if we are going to require that serialised rule-names match XML names, then we should require the output to be valid XML. Or vice-versa.
>
> > Q2. In "The grammar" / "Terminals", for
> >
> > The number must be within the Unicode code-point range.
> >
> > perhaps read
> >
> > The number must be within the Unicode code-point range and should normally identify a code point of type Graphic or Private Use (informally: assigned Unicode characters, or code points in private-use areas). If necessary, encoded characters may identify code points of type Format, Control, or Reserved (i.e. unassigned). Encoded characters must not identify code points of type Surrogate or Noncharacter, which do not represent characters.
> >
> > We discussed this a bit; I have now done a little homework on the Unicode Consortium web site looking for the correct terminology. "Code point" denotes any number between 0 and x10FFFF inclusive, so my earlier idea that we could make "within the code-point range" exclude surrogates is a non-starter.
>
> Again I feel we should be consistent: either say caveat emptor, and let the author take care of what is produced, or enforce it.
>
> > Q3. In "Conformance" there is some redundancy and some difference between the first and last items in the bulleted list:
> >
> > - All rule names that are serialised must match the requirements for an XML name.
> > ...
> > - All nonterminal names which are marked to be serialised must match the requirements of an XML name.
>
> Agree, and I marked this out with a comment.
>
> Also:
>
>
> • For every nonterminal name occurring on the right-hand side of a rule, exactly one rule defining that name must exist in the grammar.
> • The grammar must not contain more than one rule defining any given name.
>
> If we dropped the second rule, you would be allowed to have more than one rule with the same name as long as it wasn't used.
>
> Steven
>
Received on Wednesday, 9 June 2021 13:56:11 UTC