review of conformance section and conformance language

In this morning's meeting I took an action to review the revised conformance clause in the spec.  In short, I think it looks pretty good.

I took the liberty of looking at the use of the verbs 'must', 'may', 'should', and 'shall' in the rest of the spec as well.  It may feel too constraining to eliminate all uses of these verbs that do not relate to conformance, but over the years I have gradually come to the conclusion that not doing so is worse.

This mail has two parts:  (1) changes I recommend which I don't think need discussion, and (2) questions we need to discuss.


Here is a list of changes I recommend:

1.  In "How it works" / "Terminals", for

    A terminal may not be marked as an attribute.

read

    A terminal must not be marked as an attribute.

(Rationale: "may not" is syntactically ambiguous and can be interpreted either as synonymous with "must not" and forbidding what is described or as synonymous with "may" and denying that what is described is a requirement of conformance.)


2. In "Conformance" / "Conformance of grammars", for

    ... the requirements that go beyond what is expressed in the grammar itself may be summarized as follows. 

read

    ... the requirements that go beyond what is expressed in the grammar itself can be summarized as follows. 

(Rationale:  we just said a few paragraphs back that "may" is used to describe optional features.)


3. In "The grammar" / "Terminals", for

    A quoted string must be exactly matched in the input.

read

    A quoted string matches only an occurrence of the exact same string in the input.

(Rationale:  "must" indicates a conformance requirement, but conformance does not apply to input documents.)


4. In the same section, for

    {all characters, quotes must be doubled}

read (both times)

    {all characters, quotes are doubled}

(Rationale:  "must" indicates a conformance requirement, but what is being noted here is a fact about how literals will be parsed by a conforming parser, not a conformance requirement on grammars.)


5.  In the same section, for

    It represents a single character and must be matched exactly in the input.

read

    It represents a single character and matches that character in the input.

(Rationale:  "must" indicates a conformance requirement, but what is being noted here is a fact about the meaning of quoted strings in the grammar, not a conformance requirement on grammars.)


6.  In "Parsing", for

    Grammars must be processed by an algorithm that accepts and parses any context-free grammar, and produces at least one parse of any input that conforms to the grammar starting at the root symbol.

read

    Parsers must accept and parse any conforming grammar and produce at least one parse of any input that conforms to the grammar starting at the root symbol.

Perhaps add

    Any algorithm which handles any context-free grammar may be used.

but I think that strictly speaking that follows from the fact that we don't prescribe any algorithm.

(Rationale:  this 'must' expresses a conformance requirement on parsers, not grammars; make parsers the subject of the verb.)


7. In the same section, for

    If more than one parse results, one is chosen; it is not defined how this choice is made, but the resulting parse must be marked as ambiguous by including the attribute ixml:state="ambiguous" on the document element of the serialisation.

read

    If more than one parse results, one is chosen; it is not defined how this choice is made, but (except as specified below) the resulting parse must be marked as ambiguous by including the attribute ixml:state="ambiguous" on the document element of the serialisation.  Parsers may provide a user option to suppress the ixml:state attribute; if the user selects that option, the attribute must not be included in the output.


8.  In "Conformance", for the list item

    The number represented in a hex encoding of a character must be within the Unicode character range. (This entails that the hex value must not be that of a surrogate code point.)

read

    The number represented in a hex encoding of a character must be within the Unicode character range and must not denote a Noncharacter or Surrogate code point.


9. In "Conformance" / "Conformance of parsers", in the first paragraph, which now reads

    A parse conforms to this specification if it accepts grammars in ixml form and uses those grammars to parse input and produce XML documents representing parse trees as specified elsewhere in this specification.

change "parse" to "parser" and add at the end:

    A conforming parser must not accept non-conforming grammars.


10. In "Conformance" / "Conformance of parsers", after

    ... the resulting parse must be marked as ambiguous by including the attribute ixml:state="ambiguous" on the document element of the serialisation ...

add

    , unless the parser offers a user option to suppress this attribute and the user has activated that option.
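
To make the behaviour proposed in items 7 and 10 concrete, here is a minimal Python sketch.  It is only an illustration, not anything the spec prescribes:  the function name mark_ambiguity, the parse_count argument, and the suppress_state flag are all hypothetical, and the namespace URI bound to the ixml prefix is my assumption.

```python
import xml.etree.ElementTree as ET

# Assumption: the namespace URI bound to the "ixml" prefix.
IXML_NS = "http://invisiblexml.org/NS"
ET.register_namespace("ixml", IXML_NS)


def mark_ambiguity(root, parse_count, suppress_state=False):
    """Add ixml:state="ambiguous" to the document element when needed.

    root           -- document element of the serialisation
    parse_count    -- number of parses found (hypothetical input)
    suppress_state -- hypothetical user option; if set, the attribute
                      must not appear in the output
    """
    if parse_count > 1 and not suppress_state:
        root.set("{%s}state" % IXML_NS, "ambiguous")
    return root


if __name__ == "__main__":
    doc = mark_ambiguity(ET.Element("expr"), parse_count=2)
    print(ET.tostring(doc, encoding="unicode"))
```

With the option off, an ambiguous parse carries the attribute; with the option on, it never does, whatever the number of parses.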







On some items, I think we need discussion.

Q1.  In "Parsing", the text currently says:

    The root symbol of the grammar is the name of the first rule in the grammar.  If it is marked as hidden, all of its productions must produce exactly one non-hidden nonterminal and no non-hidden terminals before or after that nonterminal (in order to match the XML requirement of a single-rooted document).

I am not sure what this is saying; I suspect it's one or more of the following.

(a) If the root symbol of the grammar is hidden, the output of the parse is going to be a well formed XML document (with a single root) only if whatever rules match the input produce one element that encloses all others.  So watch your step!

(b) The parser is responsible for checking that the grammar will always produce a single-rooted XML document with a single outermost element, and flagging an error in the grammar if this is not guaranteed.

I think point (a) is worth making, but (b) worries me, because I am not certain I know how to check the property.  Interpretation (b) also seems at odds with the fact that elsewhere ixml does not attempt to guarantee that the results will be legal XML:  we don't, for example, require that nonterminal names be legal XML names, or that nonterminal names which are not legal XML names be marked with - to ensure they are never serialized.  Also, I think I can imagine cases where what I want is output that would be a legal external entity in XML, so the requirement is that it be well balanced but not that it have a single outermost element.

So I think I would like to propose for discussion a change of the wording above to:

    The root symbol of the grammar is the name of the first rule in the grammar.  Note that if it is marked as hidden, the output of the parser will be a single-rooted XML document only if, for the given input, the parse tree includes some non-hidden nonterminal which contains all the other non-hidden terminals and nonterminals.
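
The condition in the proposed wording is easy to check for one given parse tree, even though I do not know how to check it for a grammar.  A minimal Python sketch, assuming a toy parse-tree model of my own invention (the classes Terminal and Nonterminal and both function names are hypothetical, and attributes are ignored for simplicity):

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Terminal:
    text: str
    hidden: bool = False


@dataclass
class Nonterminal:
    name: str
    hidden: bool = False
    children: List[Union["Nonterminal", Terminal]] = field(default_factory=list)


def top_level_items(node):
    """Return the nodes that would be serialised at the level of `node`.

    A hidden nonterminal contributes its children's items in its place;
    a hidden terminal contributes nothing.
    """
    if isinstance(node, Terminal):
        return [] if node.hidden else [node]
    if not node.hidden:
        return [node]
    items = []
    for child in node.children:
        items.extend(top_level_items(child))
    return items


def is_single_rooted(root):
    """True if the output has exactly one outermost element and no
    text beside it."""
    items = top_level_items(root)
    return len(items) == 1 and isinstance(items[0], Nonterminal)
```

This checks a single parse tree after the fact, which is interpretation (a); it says nothing about whether every input accepted by the grammar yields such a tree, which is what interpretation (b) would require.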

Q2.  In "The grammar" / "Terminals", for 

    The number must be within the Unicode code-point range. 

perhaps read 

    The number must be within the Unicode code-point range and should normally identify a code point of type Graphic or Private Use (informally:  assigned Unicode characters, or code points in private-use areas).  If necessary, encoded characters may identify code points of type Format, Control, or Reserved (i.e. unassigned).  Encoded characters must not identify code points of type Surrogate or Noncharacter, which do not represent characters. 

We discussed this a bit; I have now done a little homework on the Unicode Consortium web site looking for the correct terminology.  "Code point" denotes any number between 0 and 0x10FFFF inclusive, so my earlier idea that we could make "within the code-point range" exclude surrogates is a non-starter.
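
For what it's worth, the classification the proposed wording relies on can be sketched in a few lines of Python.  This is a rough illustration under my own assumptions:  the function names are hypothetical, the mapping from general categories to code point types follows my reading of the Unicode terminology, and the answer for unassigned code points depends on the Unicode version shipped with the Python in use.

```python
import unicodedata

MAX_CODE_POINT = 0x10FFFF


def is_noncharacter(cp):
    """U+FDD0..U+FDEF, plus the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def classify(cp):
    """Return the code point type of the integer cp as a lowercase string."""
    if not 0 <= cp <= MAX_CODE_POINT:
        return "out of range"
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    if is_noncharacter(cp):
        return "noncharacter"
    category = unicodedata.category(chr(cp))
    if category == "Co":
        return "private use"
    if category == "Cc":
        return "control"
    if category in ("Cf", "Zl", "Zp"):
        return "format"
    if category == "Cn":
        return "reserved"
    return "graphic"


def verdict(cp):
    """Map a code point to the conformance verb in the proposed wording."""
    kind = classify(cp)
    if kind in ("out of range", "surrogate", "noncharacter"):
        return "must not"
    if kind in ("graphic", "private use"):
        return "should"
    return "may"
```

The surrogate and noncharacter checks come before the category lookup so that the "must not" cases do not depend on the character database at all.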


Q3.  In "Conformance" there is some redundancy and some difference between the first and last items in the bulleted list:

    - All rule names that are serialised must match the requirements for an XML name.
    ...
    - All nonterminal names which are marked to be serialised must match the requirements of an XML name.

We should choose one of these and delete the other.  If we choose the final one, then the normative prose elsewhere in the spec should be tightened; currently it says only that "it is the grammar author's responsibility to ensure that all serialised names match the requirements for an XML name" (serialised, not serialisable), and "It is an error if the name of a node to be output does not match the requirements of an XML name."
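
The difference between the two readings matters for what can be checked when.  The final bullet (names marked to be serialised) can be checked against the grammar alone, along the lines of the Python sketch below.  The function names and the (name, hidden) pair representation are hypothetical; the character ranges are those of the Name production in XML 1.0 (Fifth Edition), as far as I can reproduce them.

```python
import re

# NameStartChar and NameChar ranges from the XML 1.0 (Fifth Edition)
# Name production, written as the inside of a regex character class.
NAME_START = (
    ":A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"
    "\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
NAME_CHAR = NAME_START + "\\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"
XML_NAME = re.compile("[" + NAME_START + "][" + NAME_CHAR + "]*")


def is_xml_name(name):
    """True if the whole string matches the XML Name production."""
    return XML_NAME.fullmatch(name) is not None


def unserialisable_names(rules):
    """Return names that are marked to be serialised but are not XML names.

    rules -- iterable of (name, hidden) pairs, one per nonterminal;
             hidden names are never serialised and so are not checked.
    """
    return [name for name, hidden in rules if not hidden and not is_xml_name(name)]
```

The first bullet (names that actually are serialised) can only be checked parse by parse, since whether a given name reaches the output depends on the input.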



-Michael


********************************************
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
cmsmcq@blackmesatech.com
http://www.blackmesatech.com
********************************************

Received on Tuesday, 8 June 2021 18:18:15 UTC