Re: review of conformance section and conformance language from C. M. Sperberg-McQueen on 2021-06-09 (public-ixml@w3.org from June 2021)

From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
Date: Wed, 9 Jun 2021 11:08:46 -0600
To: Steven Pemberton <steven.pemberton@cwi.nl>
Cc: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, public-ixml@w3.org
Message-Id: <415DE157-E407-4827-886A-EE24AADC2904@blackmesatech.com>
 

> On 9 Jun 2021, at 6:54 AM, Steven Pemberton <steven.pemberton@cwi.nl> wrote:
> 
> Thank you for this Michael.
> 
> I adopted all of this (up to the discussion section), with some slight changes to the wording. In particular I changed all "parsers" to "processors". You will want to rereview I expect.

I see changes 1-4, and the first change in 8, but not 5-7, the second part of 8, or 9-10.  Either you missed them or maybe they need discussion.

5. In “The grammar” / “Terminals”, for

     It represents a single character and must be matched exactly in the input. 

 read 

     It represents a single character and matches that character in the input. 

 (Rationale: "must" indicates a conformance requirement, but what is being noted here is a fact about the meaning of quoted strings in the grammar, not a conformance requirement on grammars.) 

 6. In "Parsing", for 

     Grammars must be processed by an algorithm that accepts and parses any context-free grammar, and produces at least one parse of any input that conforms to the grammar starting at the root symbol. 

 read 

     Processors must accept and parse any conforming grammar and produce at least one parse of any input that conforms to the grammar starting at the root symbol. 

7. In the same section, for

    If more than one parse results, one is chosen; it is not defined how this choice is made, but the resulting parse must be marked as ambiguous by including the attribute ixml:state="ambiguous" on the document element of the serialisation.

read

    If more than one parse results, one is chosen; it is not defined how this choice is made, but (except as specified below) the resulting parse must be marked as ambiguous by including the attribute ixml:state="ambiguous" on the document element of the serialisation.  Parsers may provide a user option to suppress the ixml:state attribute; if the user selects that option, the attribute must not be included in the output.

8.  In "Conformance", for the list item

    The number represented in a hex encoding of a character must be within the Unicode character range. (This entails that the hex value must not be that of a surrogate code point.)

read

    The number represented in a hex encoding of a character must be within the Unicode character range and must not denote a Noncharacter or Surrogate code point.

9. In "Conformance" / "Conformance of parsers", in the first paragraph, which now reads

    A parse conforms to this specification if it accepts grammars in ixml form and uses those grammars to parse input and produce XML documents representing parse trees as specified elsewhere in this specification.

change "parse" to "parser" and add at the end:

    A conforming parser must not accept non-conforming grammars.

[Here the first change was made, but the sentence about non-conforming grammars was not added.  In the meantime, I think it’s clear that we need to discuss whether to add that sentence.]

10. In "Conformance" / "Conformance of parsers", after

    ... the resulting parse must be marked as ambiguous by including the attribute ixml:state="ambiguous" on the document element of the serialisation ...

add

    , unless the parser offers a user option to suppress this attribute and the user has activated that option.
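
To make the wording in 7 and 10 concrete, here is a minimal Python sketch of what I take the rule to mean. Everything named here is an assumption for illustration: the function choose_parse, its suppress_state parameter, the choice of "first parse wins" (the spec text leaves the choice undefined), and the namespace URI bound to the ixml prefix.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the ixml prefix; not stated in this thread.
IXML_NS = "http://invisiblexml.org/NS"


def choose_parse(parses, suppress_state=False):
    """Pick one parse and mark it as ambiguous when there were several.

    `parses` is a non-empty list of ElementTree elements, each the
    document element of one serialisation. How the choice is made is
    not defined; taking the first one is just this sketch's choice.
    """
    chosen = parses[0]
    if len(parses) > 1 and not suppress_state:
        # The attribute goes on the document element of the serialisation.
        chosen.set(f"{{{IXML_NS}}}state", "ambiguous")
    return chosen
```

With suppress_state=True the attribute is never written, which is the "must not be included in the output" half of the proposed wording.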



> 
> The result is visible at the usual location
>    https://homepages.cwi.nl/~steven/ixml/ixml-specification.html
> 
> You will note some comments visible in yellow in the conformance section for discussion.
> 
> > On some items, I think we need discussion.
> > 
> > Q1. In "Parsing", the text currently says:
> > 
> > The root symbol of the grammar is the name of the first rule in the grammar. If it is marked as hidden, all of its productions must produce exactly one non-hidden nonterminal and no non-hidden terminals before or after that nonterminal (in order to match the XML requirement of a single-rooted document).
> > 
> > I am not sure what this is saying; I suspect it's one or more of the following.
> > 
> > (a) If the root symbol of the grammar is hidden, the output of the parse is going to be a well formed XML document (with a single root) only if whatever rules match the input produce one element that encloses all others. So watch your step!
> > 
> > (b) The parser is responsible for checking that the grammar will always produce a single-rooted XML document with a single outermost element, and flagging an error in the grammar if this is not guaranteed.
> 
> It is indeed meant to require that a conformant grammar produce a conformant XML serialization. This is thus allowed:
> 
>      -input: -~[]*, -"Last-modified: ", date, -~["0"-"9"], -~[]*.
>      date: y, -"-", m, -"-", d.
>      @y: n.
>      @m: n.
>      @d: n.
>      n: ["0"-"9"]+.
> 
> so that any input that contains the string Last-modified: followed by a date is acceptable, and will produce a serialization like
> 
>    <date y="2021" m="06" d="09"/>
> 
> (and if the input contains more than one matching Last-modified:
> 
>     <date ixml:state="ambiguous" y="2021" m="06" d="09"/>
> 
> )
> 
> But this wouldn't be allowed:
> 
>     -input: -~[]*, -"Last-modified: ", date, -"T", time, -~["0"-"9"], -~[]*.
> 
> because the serialization wouldn't have a root element.
> 
> My feeling is that if we are going to require that serialised rule-names match XML names, then we should require the output to be valid XML. Or vice-versa.

Two questions:

1 Is it clear that it is a decidable question (preferably easily decidable) whether a given nonterminal is guaranteed to produce exactly one element, with no preceding or following data characters?

Maybe it’s easier than it looks.  Any rule’s RHS is a regular expression, so it necessarily has one of the forms

    E = F* (no: can generate 0 elements)
    E = F+ (no: if F can generate 0 elements, so can E; if F can generate elements, E can generate many)
    E = F? (no: can generate 0 elements)
    E = F | G (answer: true iff F and G each produce exactly one element and no text)
    E = F, G (answer: true iff either F produces nothing and G produces exactly one element and no text, or vice versa)
    E is a nonterminal marked ^: yes
    E is a nonterminal marked @: yes (the mark @ is ignored if E produces the root element)
    E is a nonterminal N marked -: (answer: true iff the RHS of the rule for N produces exactly one element and no text)
    E is a terminal marked ^: no (produces text and no element)
    E is a terminal marked -: no (produces no text and no element)

This looks like a fairly straightforward recursive algorithm guaranteed to terminate.  So I think the answer to my question is yes, it’s decidable, and yes it’s easy, or at least not hard.
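
The case analysis above can be sketched in Python roughly as follows. This is an illustration under several assumptions of mine, not a worked-out implementation: the class names (Terminal, Nonterminal, Repeat, Alt, Seq) are hypothetical; the mark on a nonterminal reference is taken to be already resolved against the mark on its rule; sequences and alternatives are generalised from two parts to any number; the helper produces_nothing spells out "produces nothing" as "no element and no text"; and the `visiting` guard, which answers "not guaranteed" when a hidden nonterminal reaches itself, is an addition to keep the recursion finite.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union


@dataclass(frozen=True)
class Terminal:
    mark: str  # "^" (serialised as text) or "-" (hidden)


@dataclass(frozen=True)
class Nonterminal:
    name: str
    mark: str  # "^" element, "@" attribute, "-" hidden


@dataclass(frozen=True)
class Repeat:
    op: str  # "*", "+" or "?"
    body: "Expr"


@dataclass(frozen=True)
class Alt:
    parts: Tuple["Expr", ...]


@dataclass(frozen=True)
class Seq:
    parts: Tuple["Expr", ...]


Expr = Union[Terminal, Nonterminal, Repeat, Alt, Seq]
Grammar = Dict[str, Expr]  # rule name -> right-hand side


def produces_nothing(e: Expr, g: Grammar, visiting=frozenset()) -> bool:
    """True if e is guaranteed to produce no element and no text."""
    if isinstance(e, Terminal):
        return e.mark == "-"
    if isinstance(e, Nonterminal):
        if e.mark != "-" or e.name in visiting:
            return False  # serialised, or recursive: not guaranteed
        return produces_nothing(g[e.name], g, visiting | {e.name})
    if isinstance(e, Repeat):
        return produces_nothing(e.body, g, visiting)
    return all(produces_nothing(p, g, visiting) for p in e.parts)


def produces_one_element(e: Expr, g: Grammar, visiting=frozenset()) -> bool:
    """True if e is guaranteed to produce exactly one element and no text."""
    if isinstance(e, Terminal) or isinstance(e, Repeat):
        return False  # text or nothing; or zero/many elements
    if isinstance(e, Nonterminal):
        if e.mark in ("^", "@"):
            return True  # "@" is ignored for the root element
        if e.name in visiting:
            return False  # recursive hidden nonterminal: not guaranteed
        return produces_one_element(g[e.name], g, visiting | {e.name})
    if isinstance(e, Alt):
        return all(produces_one_element(p, g, visiting) for p in e.parts)
    # Seq: exactly one part yields the element, every other part yields nothing.
    flags = [produces_one_element(p, g, visiting) for p in e.parts]
    return flags.count(True) == 1 and all(
        produces_nothing(p, g, visiting)
        for p, is_one in zip(e.parts, flags) if not is_one
    )


# Steven's two examples, with every hidden terminal reduced to Terminal("-").
skip = Repeat("*", Terminal("-"))
date = Nonterminal("date", "^")
time = Nonterminal("time", "^")
ok = {"input": Seq((skip, Terminal("-"), date, Terminal("-"), skip))}
bad = {"input": Seq((skip, Terminal("-"), date, Terminal("-"), time,
                     Terminal("-"), skip))}
root = Nonterminal("input", "-")
```

On this modelling, produces_one_element(root, ok) is true and produces_one_element(root, bad) is false, which matches Steven's reading of which grammar is allowed.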

2 Do we not want to support the case of producing a well-balanced output, which could be used as an external entity?

At the moment, I am leaning towards “no, we don’t want to support that”, with the rationale that we are trying to keep things simple, easy to understand, and easy to use for users.  That means eliminating as many tricky cases as we can.  If I discover that I (or any imaginary users of my function library) need to relax this rule, I expect that I can implement a run-time option to turn off this and similar checks, to be called perhaps --no-hand-holding or --ygybyf (your gun, your bullet, your foot).  (Also on the list of things normally forbidden that would be allowed when the user specifies the --ygybyf option:  the limit of one rule per nonterminal; the requirement that nonterminals be defined; the requirement that all input be consumed; the requirement that hex values obey whatever constraints we end up putting on them, …)  When run with the --ygybyf option, Aparecium will not be a conforming ixml processor, because it will accept grammars which are meaningful but non-conforming.

But I am interested in what others think.

> 
> > Q2. In "The grammar" / "Terminals", for 
> > 
> > The number must be within the Unicode code-point range. 
> > 
> > perhaps read 
> > 
> > The number must be within the Unicode code-point range and should normally identify a code point of type Graphic or Private Use (informally: assigned Unicode characters, or code points in private-use areas). If necessary, encoded characters may identify code points of type Format, Control, or Reserved (i.e. unassigned). Encoded characters must not identify code points of type Surrogate or Noncharacter, which do not represent characters. 
> > 
> > We discussed this a bit; I have now done a little homework on the Unicode Consortium web site looking for the correct terminology. "Code point" denotes any number between 0 and x10FFFF inclusive, so my earlier idea that we could make "within the code-point range" exclude surrogates is a non-starter.
> 
> Again I feel we should be consistent: either say caveat emptor, and let the author take care of what is produced, or enforce it.

I agree that consistency is a good goal.  I’ve wavered back and forth, and now lean (as described above) towards including checks and leaving caveat emptor to a run-time option.  (Maybe --caveat-emptor would be a good name for the option.  Or --no-seatbelts.)
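
For the "enforce it" variant, the check on hex-encoded characters is small. A minimal Python sketch, assuming the stricter wording (in range, and neither Surrogate nor Noncharacter) is what we adopt; the function name is_permitted_code_point is hypothetical, and it expects just the hex digits, without the leading #.

```python
def is_permitted_code_point(hex_digits: str) -> bool:
    """Check a hex-encoded character against the proposed constraints."""
    value = int(hex_digits, 16)
    # Code points run from 0 to 0x10FFFF inclusive.
    if value < 0 or value > 0x10FFFF:
        return False
    # Surrogate code points: U+D800 to U+DFFF.
    if 0xD800 <= value <= 0xDFFF:
        return False
    # Noncharacters: U+FDD0 to U+FDEF ...
    if 0xFDD0 <= value <= 0xFDEF:
        return False
    # ... and the last two code points of every plane (U+nFFFE, U+nFFFF).
    if (value & 0xFFFF) in (0xFFFE, 0xFFFF):
        return False
    return True
```

Code points of type Format, Control, or Reserved pass this check, in line with the "if necessary ... may" sentence in the proposed text; only the two "must not" categories are rejected.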

> 
> > Q3. In "Conformance" there is some redundancy and some difference between the first and last items in the bulleted list:
> > 
> > - All rule names that are serialised must match the requirements for an XML name.
> > ...
> > - All nonterminal names which are marked to be serialised must match the requirements of an XML name.
> 
> Agree, and I marked this out with a comment.

OK.

> 
> Also: 
> 
>  • For every nonterminal name occurring on the right-hand side of a rule, exactly one rule defining that name must exist in the grammar.
>  • The grammar must not contain more than one rule defining any given name. 
> If we dropped the second rule, you would be allowed to have more than one rule with the same name as long as it wasn't used.

Ah, right.  Good catch.
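
To convince myself, a minimal Python sketch of the two checks side by side. The representation is an assumption for illustration only: a grammar is a list of (name, names-used-on-the-right-hand-side) pairs, and the function names are hypothetical.

```python
from collections import Counter


def used_without_exactly_one_rule(rules):
    """First bullet: names used on a right-hand side need exactly one rule."""
    counts = Counter(name for name, _ in rules)
    used = {ref for _, refs in rules for ref in refs}
    # Counter returns 0 for names that have no rule at all.
    return sorted(n for n in used if counts[n] != 1)


def defined_more_than_once(rules):
    """Second bullet: no name may have more than one defining rule."""
    counts = Counter(name for name, _ in rules)
    return sorted(n for n, c in counts.items() if c > 1)


# "spare" is defined twice but never used on any right-hand side.
rules = [
    ("input", ["date"]),
    ("date", []),
    ("spare", []),
    ("spare", []),
]
```

Here used_without_exactly_one_rule(rules) finds nothing to complain about, while defined_more_than_once(rules) reports "spare"; so the second bullet is not redundant.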

Michael

********************************************
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
cmsmcq@blackmesatech.com
http://www.blackmesatech.com
********************************************
Received on Wednesday, 9 June 2021 17:09:29 UTC