- From: Steven Pemberton <steven.pemberton@cwi.nl>
- Date: Wed, 09 Jun 2021 12:54:27 +0000
- To: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, public-ixml@w3.org
- Cc: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>
- Message-Id: <1623240355321.147839777.2770418418@cwi.nl>
Thank you for this Michael.

I adopted all of this (up to the discussion section), with some slight changes to the wording. In particular I changed all "parsers" to "processors". You will want to rereview I expect.

The result is visible at the usual location
https://homepages.cwi.nl/~steven/ixml/ixml-specification.html

You will note some comments visible in yellow in the conformance section for discussion.

> On some items, I think we need discussion.
>
> Q1. In "Parsing", the text currently says:
>
>     The root symbol of the grammar is the name of the first rule in the grammar. If it is marked as hidden, all of its productions must produce exactly one non-hidden nonterminal and no non-hidden terminals before or after that nonterminal (in order to match the XML requirement of a single-rooted document).
>
> I am not sure what this is saying; I suspect it's one or more of the following.
>
> (a) If the root symbol of the grammar is hidden, the output of the parse is going to be a well formed XML document (with a single root) only if whatever rules match the input produce one element that encloses all others. So watch your step!
>
> (b) The parser is responsible for checking that the grammar will always produce a single-rooted XML document with a single outermost element, and flagging an error in the grammar if this is not guaranteed.

It is indeed meant to require that a conformant grammar produce a conformant XML serialization.

This is thus allowed:

    -input: -~[]*, -"Last-modified: ", date, -~["0"-"9"], -~[]*.
    date: y, -"-", m, -"-", d.
    @y: n.
    @m: n.
    @d: n.
    n: ["0"-"9"]+.

so that any input that contains the string Last-modified: followed by a date is acceptable, and will produce a serialization like

    <date y="2021" m="06" d="09"/>

(and if the input contains more than one matching Last-modified:

    <date ixml:state="ambiguous" y="2021" m="06" d="09"/>
)

But this wouldn't be allowed:

    -input: -~[]*, -"Last-modified: ", date, -"T", time, -~["0"-"9"], -~[]*.

because the serialization wouldn't have a root element.

My feeling is that if we are going to require that serialised rule-names match XML names, then we should require the output to be valid XML. Or vice-versa.

> Q2. In "The grammar" / "Terminals", for
>
>     The number must be within the Unicode code-point range.
>
> perhaps read
>
>     The number must be within the Unicode code-point range and should normally identify a code point of type Graphic or Private Use (informally: assigned Unicode characters, or code points in private-use areas). If necessary, encoded characters may identify code points of type Format, Control, or Reserved (i.e. unassigned). Encoded characters must not identify code points of type Surrogate or Noncharacter, which do not represent characters.
>
> We discussed this a bit; I have now done a little homework on the Unicode Consortium web site looking for the correct terminology. "Code point" denotes any number between 0 and x10FFFF inclusive, so my earlier idea that we could make "within the code-point range" exclude surrogates is a non-starter.

Again I feel we should be consistent: either say caveat emptor, and let the author take care of what is produced, or enforce it.

> Q3. In "Conformance" there is some redundancy and some difference between the first and last items in the bulleted list:
>
> - All rule names that are serialised must match the requirements for an XML name.
> ...
> - All nonterminal names which are marked to be serialised must match the requirements of an XML name.

Agree, and I marked this out with a comment.

Also:

    For every nonterminal name occurring on the right-hand side of a rule, exactly one rule defining that name must exist in the grammar.
    The grammar must not contain more than one rule defining any given name.

If we dropped the second rule, you would be allowed to have more than one rule with the same name as long as it wasn't used.

Steven
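
To illustrate the point about the two Conformance sentences on rule definitions, here is a minimal Python sketch. It is not taken from the specification or from any ixml processor: the grammar representation (a list of rule names paired with the nonterminals each rule references) and the names check_rules and enforce_unique are hypothetical, chosen only for this example.

```python
from collections import Counter

def check_rules(rules, enforce_unique=True):
    """Check rule definitions in a simplified grammar model (hypothetical).

    rules: list of (rule_name, referenced_nonterminals) pairs, in grammar order.
    Returns a list of error messages; an empty list means both checks passed.
    """
    defined = Counter(name for name, _ in rules)
    used = {ref for _, refs in rules for ref in refs}
    errors = []
    # First sentence: every name used on a right-hand side must have
    # exactly one defining rule.
    for name in sorted(used):
        if defined[name] != 1:
            errors.append(f"{name}: used, but defined {defined[name]} times")
    # Second sentence: no name may be defined more than once. Names already
    # reported by the first check are skipped to avoid duplicate messages.
    if enforce_unique:
        for name in sorted(defined):
            if defined[name] > 1 and name not in used:
                errors.append(f"{name}: defined {defined[name]} times")
    return errors

# The date grammar from this message, reduced to names and references.
rules = [
    ("input", ["date"]),
    ("date", ["y", "m", "d"]),
    ("y", ["n"]),
    ("m", ["n"]),
    ("d", ["n"]),
    ("n", []),
]
# The same grammar plus a rule that is defined twice but never used.
with_unused_duplicate = rules + [("spare", []), ("spare", [])]

print(check_rules(rules))
print(check_rules(with_unused_duplicate))
print(check_rules(with_unused_duplicate, enforce_unique=False))
```

With enforce_unique=False the duplicated but unused rule passes unreported, which is the case that dropping the second sentence would allow.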
Received on Wednesday, 9 June 2021 12:55:26 UTC