- From: Steven Pemberton <steven.pemberton@cwi.nl>
- Date: Wed, 09 Jun 2021 12:54:27 +0000
- To: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, public-ixml@w3.org
- Cc: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>
- Message-Id: <1623240355321.147839777.2770418418@cwi.nl>
Thank you for this, Michael.
I adopted all of this (up to the discussion section), with some slight
changes to the wording. In particular, I changed all "parsers" to
"processors". I expect you will want to re-review.
The result is visible at the usual location:
https://homepages.cwi.nl/~steven/ixml/ixml-specification.html
You will see some comments for discussion, highlighted in yellow, in the
conformance section.
> On some items, I think we need discussion.
>
> Q1. In "Parsing", the text currently says:
>
> The root symbol of the grammar is the name of the first rule in the
> grammar. If it is marked as hidden, all of its productions must produce
> exactly one non-hidden nonterminal and no non-hidden terminals before or
> after that nonterminal (in order to match the XML requirement of a
> single-rooted document).
>
> I am not sure what this is saying; I suspect it's one or more of the
> following.
>
> (a) If the root symbol of the grammar is hidden, the output of the parse
> is going to be a well formed XML document (with a single root) only if
> whatever rules match the input produce one element that encloses all
> others. So watch your step!
>
> (b) The parser is responsible for checking that the grammar will always
> produce a single-rooted XML document with a single outermost element, and
> flagging an error in the grammar if this is not guaranteed.
It is indeed meant to require that a conformant grammar produce a
conformant XML serialization. This is thus allowed:
-input: -~[]*, -"Last-modified: ", date, -~["0"-"9"], -~[]*.
date: y, -"-", m, -"-", d.
@y: n.
@m: n.
@d: n.
n: ["0"-"9"]+.
so that any input that contains the string Last-modified: followed by a
date is acceptable, and will produce a serialization like
<date y="2021" m="06" d="09"/>
(and, if the input contains more than one matching Last-modified:, one like
<date ixml:state="ambiguous" y="2021" m="06" d="09"/>
).
But this wouldn't be allowed:
-input: -~[]*, -"Last-modified: ", date, -"T", time, -~["0"-"9"], -~[]*.
because the serialization would contain two top-level elements (date and
time) and so wouldn't have a single root element.
My feeling is that if we are going to require that serialised rule names
match XML names, then we should also require the output to be well-formed
XML. Or vice versa.
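To make the requirement on a hidden root concrete, here is a rough sketch in Python of the check as I read it. It is not taken from any implementation: the representation of an alternative as a list of (hidden, kind) pairs and the name hidden_root_ok are invented for illustration. It also ignores repetitions and ignores what the referenced rules themselves serialise to (attributes, hidden rules), which a real processor would have to take into account.

```python
# Hypothetical, simplified model: each alternative of the root rule is a list
# of (hidden, kind) pairs, where kind is "nonterminal" or "terminal".
def hidden_root_ok(alternatives):
    """Return True if every alternative contains exactly one non-hidden
    nonterminal and no non-hidden terminals."""
    for alt in alternatives:
        visible = [kind for hidden, kind in alt if not hidden]
        if visible != ["nonterminal"]:
            return False
    return True


# -input: -~[]*, -"Last-modified: ", date, -~["0"-"9"], -~[]*.
allowed = [[(True, "terminal"), (True, "terminal"), (False, "nonterminal"),
            (True, "terminal"), (True, "terminal")]]

# -input: -~[]*, -"Last-modified: ", date, -"T", time, -~["0"-"9"], -~[]*.
rejected = [[(True, "terminal"), (True, "terminal"), (False, "nonterminal"),
             (True, "terminal"), (False, "nonterminal"),
             (True, "terminal"), (True, "terminal")]]

print(hidden_root_ok(allowed))
print(hidden_root_ok(rejected))
```

The first grammar passes because date is the only visible item; the second fails because date and time are both visible.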
> Q2. In "The grammar" / "Terminals", for
>
> The number must be within the Unicode code-point range.
>
> perhaps read
>
> The number must be within the Unicode code-point range and should
> normally identify a code point of type Graphic or Private Use (informally:
> assigned Unicode characters, or code points in private-use areas). If
> necessary, encoded characters may identify code points of type Format,
> Control, or Reserved (i.e. unassigned). Encoded characters must not
> identify code points of type Surrogate or Noncharacter, which do not
> represent characters.
>
> We discussed this a bit; I have now done a little homework on the
> Unicode Consortium web site looking for the correct terminology. "Code
> point" denotes any number between 0 and x10FFFF inclusive, so my earlier
> idea that we could make "within the code-point range" exclude surrogates is
> a non-starter.
Again, I feel we should be consistent: either say caveat emptor, and let the
author take care of what is produced, or enforce it.
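If we went for enforcement, the "must not" part of the proposed wording is cheap to check. A minimal sketch in Python, with invented function names, using the fixed Unicode ranges for surrogates (U+D800 to U+DFFF) and noncharacters (U+FDD0 to U+FDEF, plus the last two code points of every plane):

```python
def is_surrogate(cp):
    """Surrogate code points: U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp):
    """Noncharacters: U+FDD0..U+FDEF and U+nFFFE / U+nFFFF in every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def encoded_char_allowed(cp):
    """True if cp is in the code-point range and is neither a surrogate
    nor a noncharacter (the 'must not' cases of the proposed text)."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    return not (is_surrogate(cp) or is_noncharacter(cp))


for cp in (0x41, 0xE000, 0xD800, 0xFFFE, 0x110000):
    print(hex(cp), encoded_char_allowed(cp))
```

The "should normally" part (Graphic or Private Use) is not checked here, since it depends on the Unicode version in use.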
> Q3. In "Conformance" there is some redundancy and some difference
> between the first and last items in the bulleted list:
>
> - All rule names that are serialised must match the requirements for an
> XML name.
> ...
> - All nonterminal names which are marked to be serialised must match the
> requirements of an XML name.
Agreed, and I marked this out with a comment.
Also, these two items overlap:
- For every nonterminal name occurring on the right-hand side of a rule,
exactly one rule defining that name must exist in the grammar.
- The grammar must not contain more than one rule defining any given name.
The second is not entirely redundant, though: if we dropped it, you would be
allowed to have more than one rule with the same name as long as that name
wasn't used.
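The difference between the two items can be seen in a small sketch. This is only an illustration in Python: the representation of a grammar as (name, referenced names) pairs and the function check_definitions are made up for the purpose, and the require_unique flag stands in for keeping or dropping the second item.

```python
from collections import Counter


def check_definitions(rules, require_unique=True):
    """rules: list of (rule_name, [names used on the right-hand side]).
    Returns a sorted list of error messages."""
    defined = Counter(name for name, _ in rules)
    errors = set()
    # First item: every name used on a right-hand side has exactly one rule.
    for name, refs in rules:
        for ref in refs:
            if defined[ref] != 1:
                errors.add(f"{ref}: used in {name}, defined {defined[ref]} times")
    # Second item: no name has more than one rule, whether used or not.
    if require_unique:
        for name, count in defined.items():
            if count > 1:
                errors.add(f"{name}: defined {count} times")
    return sorted(errors)


# 'c' is defined twice but never used on any right-hand side.
grammar = [("a", ["b"]), ("b", []), ("c", []), ("c", [])]

print(check_definitions(grammar, require_unique=False))
print(check_definitions(grammar, require_unique=True))
```

With only the first item in force the duplicate definition of c goes unreported; with both, it is flagged.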
Steven
Received on Wednesday, 9 June 2021 12:55:26 UTC