Re: review of conformance section and conformance language from Steven Pemberton on 2021-06-09 (public-ixml@w3.org from June 2021)

From: Steven Pemberton <steven.pemberton@cwi.nl>
Date: Wed, 09 Jun 2021 12:54:27 +0000
To: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, public-ixml@w3.org
Cc: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>
Message-Id: <1623240355321.147839777.2770418418@cwi.nl>

Thank you for this, Michael.


I adopted all of this (up to the discussion section), with some slight 
changes to the wording. In particular I changed all "parsers" to 
"processors". You will want to re-review, I expect.


The result is visible at the usual location
    https://homepages.cwi.nl/~steven/ixml/ixml-specification.html


You will note some comments visible in yellow in the conformance section 
for discussion.

 > On some items, I think we need discussion.
 >
 > Q1. In "Parsing", the text currently says:
 >
 > The root symbol of the grammar is the name of the first rule in the
 > grammar. If it is marked as hidden, all of its productions must produce
 > exactly one non-hidden nonterminal and no non-hidden terminals before or
 > after that nonterminal (in order to match the XML requirement of a
 > single-rooted document).
 >
 > I am not sure what this is saying; I suspect it's one or more of the
 > following.
 >
 > (a) If the root symbol of the grammar is hidden, the output of the parse
 > is going to be a well formed XML document (with a single root) only if
 > whatever rules match the input produce one element that encloses all
 > others. So watch your step!
 >
 > (b) The parser is responsible for checking that the grammar will always
 > produce a single-rooted XML document with a single outermost element, and
 > flagging an error in the grammar if this is not guaranteed.


It is indeed meant to require that a conformant grammar produce a 
conformant XML serialization. This is thus allowed:


      -input: -~[]*, -"Last-modified: ", date, -~["0"-"9"], -~[]*.
      date: y, -"-", m, -"-", d.
      @y: n.
      @m: n.
      @d: n.
      n: ["0"-"9"]+.


so that any input that contains the string Last-modified: followed by a 
date is acceptable, and will produce a serialization like


    <date y="2021" m="06" d="09"/>


(and, if the input contains more than one matching Last-modified: string, like


     <date ixml:state="ambiguous" y="2021" m="06" d="09"/>


)


But this wouldn't be allowed:


     -input: -~[]*, -"Last-modified: ", date, -"T", time, -~["0"-"9"], -~[]*.


because the serialization wouldn't have a root element.
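
As a rough illustration of the check described in (b), here is a minimal Python sketch. The data model is an assumption made only for this example: each alternative of the root rule is a list of (kind, name, mark) triples, with mark "-" meaning hidden. It looks only at the marks in the root rule itself; it does not follow hidden nonterminals into their own rules and does not treat attribute marks specially, both of which a real processor would have to consider.

```python
# Hypothetical model: an alternative is a list of (kind, name, mark) triples.
# kind is "terminal" or "nonterminal"; mark "-" means hidden, "" means visible.

def root_alternative_ok(alternative):
    """True if a hidden root's alternative yields exactly one visible
    nonterminal and no visible terminals (shallow check only)."""
    visible_nonterminals = 0
    for kind, _name, mark in alternative:
        if mark == "-":
            continue  # hidden items contribute nothing at the top level
        if kind == "terminal":
            return False  # visible text outside the single root element
        visible_nonterminals += 1
    return visible_nonterminals == 1


# The allowed rule from above: only "date" is visible.
allowed = [
    ("terminal", "~[]*", "-"),
    ("terminal", '"Last-modified: "', "-"),
    ("nonterminal", "date", ""),
    ("terminal", '~["0"-"9"]', "-"),
    ("terminal", "~[]*", "-"),
]

# The disallowed rule: both "date" and "time" are visible.
rejected = [
    ("terminal", "~[]*", "-"),
    ("terminal", '"Last-modified: "', "-"),
    ("nonterminal", "date", ""),
    ("terminal", '"T"', "-"),
    ("nonterminal", "time", ""),
    ("terminal", '~["0"-"9"]', "-"),
    ("terminal", "~[]*", "-"),
]

print(root_alternative_ok(allowed))
print(root_alternative_ok(rejected))
```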


My feeling is that if we are going to require that serialised rule-names 
match XML names, then we should require the output to be well-formed XML. Or 
vice-versa.

 > Q2. In "The grammar" / "Terminals", for
 >
 > The number must be within the Unicode code-point range.
 >
 > perhaps read
 >
 > The number must be within the Unicode code-point range and should
 > normally identify a code point of type Graphic or Private Use (informally:
 > assigned Unicode characters, or code points in private-use areas). If
 > necessary, encoded characters may identify code points of type Format,
 > Control, or Reserved (i.e. unassigned). Encoded characters must not
 > identify code points of type Surrogate or Noncharacter, which do not
 > represent characters.
 >
 > We discussed this a bit; I have now done a little homework on the
 > Unicode Consortium web site looking for the correct terminology. "Code
 > point" denotes any number between 0 and x10FFFF inclusive, so my earlier
 > idea that we could make "within the code-point range" exclude surrogates is
 > a non-starter.


Again I feel we should be consistent: either say caveat emptor, and let the 
author take care of what is produced, or enforce it.
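
If the "enforce it" route were taken, the "must not" part of the proposed wording could be checked roughly as below. This is only a sketch in Python, not anything the specification prescribes; the function names are made up for the example. It covers just the Surrogate and Noncharacter exclusions, since the "should normally" part would need the Unicode character database.

```python
def is_surrogate(cp):
    # Surrogate code points: U+D800..U+DFFF
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp):
    # Noncharacters: U+FDD0..U+FDEF, plus the last two code points
    # of every plane (U+nFFFE and U+nFFFF).
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def encoded_character_allowed(cp):
    """True if cp is a code point that is neither Surrogate nor Noncharacter."""
    if not 0 <= cp <= 0x10FFFF:
        return False  # outside the code-point range altogether
    return not is_surrogate(cp) and not is_noncharacter(cp)


for cp in (0x41, 0xD800, 0xFFFE, 0x10FFFD, 0x110000):
    print(hex(cp), encoded_character_allowed(cp))
```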

 > Q3. In "Conformance" there is some redundancy and some difference
 > between the first and last items in the bulleted list:
 >
 > - All rule names that are serialised must match the requirements for an
 > XML name.
 > ...
 > - All nonterminal names which are marked to be serialised must match the
 > requirements of an XML name.


Agree, and I marked this out with a comment.
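
For reference, whichever wording survives, the test itself is small. Below is a sketch in Python of matching a string against the Name production of XML 1.0 (Fifth Edition); the helper name is_xml_name is an invention for this example. Note that the production allows colons, so a processor that cares about namespaces might want something stricter.

```python
import re

# NameStartChar ranges from the XML 1.0 (Fifth Edition) Name production.
_NAME_START = (
    ":A-Z_a-z"
    "\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    "\u0370-\u037D\u037F-\u1FFF\u200C-\u200D"
    "\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD"
    "\U00010000-\U000EFFFF"
)
# NameChar adds "-", ".", digits, U+00B7 and two combining/connector ranges.
_NAME_CHAR = _NAME_START + r"\-.0-9" + "\u00B7\u0300-\u036F\u203F-\u2040"

XML_NAME = re.compile(f"[{_NAME_START}][{_NAME_CHAR}]*")


def is_xml_name(s):
    """True if the whole string matches the XML Name production."""
    return XML_NAME.fullmatch(s) is not None


for candidate in ("date", "Last-modified", "2021", "a b"):
    print(repr(candidate), is_xml_name(candidate))
```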


Also: 


- For every nonterminal name occurring on the right-hand side of a rule, 
exactly one rule defining that name must exist in the grammar.
- The grammar must not contain more than one rule defining any given name. 

If we dropped the second rule, you would be allowed to have more than one 
rule with the same name as long as it wasn't used.
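
To make the difference between the two requirements concrete, here is a small Python sketch. The representation is assumed for the example only: a grammar is a list of (name, referenced_names) pairs, one per rule, and the rule name "spare" is invented. The first check alone accepts a grammar with an unused duplicate rule; the second one catches it.

```python
from collections import Counter


def check_references(rules):
    """First requirement: every referenced name has exactly one defining rule.
    Returns the sorted list of referenced names that violate this."""
    counts = Counter(name for name, _refs in rules)
    referenced = {ref for _name, refs in rules for ref in refs}
    return sorted(ref for ref in referenced if counts[ref] != 1)


def check_duplicates(rules):
    """Second requirement: no name is defined by more than one rule.
    Returns the sorted list of names defined more than once."""
    counts = Counter(name for name, _refs in rules)
    return sorted(name for name, n in counts.items() if n > 1)


# "spare" is defined twice but never used on any right-hand side.
rules = [
    ("input", ["date"]),
    ("date", ["y", "m", "d"]),
    ("y", ["n"]),
    ("m", ["n"]),
    ("d", ["n"]),
    ("n", []),
    ("spare", []),
    ("spare", []),
]

print(check_references(rules))
print(check_duplicates(rules))
```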

Steven
Received on Wednesday, 9 June 2021 12:55:26 UTC