Re: What about this grammar? from Bethan Tovey-Walsh on 2022-09-12 (public-ixml@w3.org from September 2022)

From: Bethan Tovey-Walsh <accounts@bethan.wales>
Date: Mon, 12 Sep 2022 15:36:52 +0100
To: Steven Pemberton <steven.pemberton@cwi.nl>
Cc: Graydon Saunders <graydonish@gmail.com>, Norm Tovey-Walsh <norm@saxonica.com>, public-ixml@w3.org
Message-Id: <50699423-151B-4394-A795-A10CA5E39DD7@bethan.wales>
I don’t think this clearly prohibits a grammar containing a literal control character, though. That would be a dynamic error, not a static one, so (as you said previously, Steven) if the grammar is never serialized, there’s no error. A parser that doesn’t serialize the input grammar to XML never falls foul of the rule “Any serialization of a parse tree produced from the grammar must be well-formed XML.”
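
To make the static/dynamic distinction concrete, here is a minimal Python sketch. It is not taken from the spec or from any ixml implementation: the function names are hypothetical, the character test assumes the XML 1.0 Char production, and Python's standard ElementTree parser stands in for "is this serialization well-formed?".

```python
import xml.etree.ElementTree as ET


def is_xml10_char(ch):
    """True if ch matches the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def non_xml_chars(grammar_text):
    """Static check: (offset, code point) for each character XML 1.0 cannot carry."""
    return [
        (i, f"U+{ord(c):04X}")
        for i, c in enumerate(grammar_text)
        if not is_xml10_char(c)
    ]


def is_well_formed(xml_text):
    """Dynamic check: only fails once somebody actually serializes and parses."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


encoded = "match: -#19, 'a'."     # control character written as #19
literal = "match: -'\x19', 'a'."  # the same character typed literally

print(non_xml_chars(encoded))  # []
print(non_xml_chars(literal))  # [(9, 'U+0019')]

print(is_well_formed('<literal tmark="-" hex="19"/>'))  # True
print(is_well_formed('<literal string="\x19"/>'))       # False
```

The first pair of checks looks only at the grammar text, so it can be run without ever serializing anything. The second pair only reports a problem at the point where an XML serialization of the grammar exists, which is the case the current wording leaves uncovered for parsers that never produce one.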


> On 12 Sep 2022, at 15:14, Steven Pemberton <steven.pemberton@cwi.nl> wrote:
> 
> I think the spec already covers this. It says:
> Any serialization of a parse tree produced from the grammar must be well-formed XML.
> XML has some fairly arbitrary restrictions that this covers, but I would object to moving those restrictions up into the ixml language, simply because XML serialization is not the only thing that you can do with an ixml-parsed document.
> 
> We had long discussions about adding the restrictions to ixml, and I consequently rewrote parts of the spec many times, until we came up with that brilliant formulation above. I think we're covered.
> 
> Steven
> 
> On Monday 12 September 2022 15:41:36 (+02:00), Bethan Tovey-Walsh wrote:
> 
> > I propose that we make an amendment to the spec along these lines:
> > 
> > - An iXML grammar must be capable of being serialized to XML when parsed using the iXML specification grammar.
> > 
> > This would mean that a grammar with a literal U+0019 control character in it would be non-conforming, because that character cannot be represented literally in XML. But a grammar using a hex-encoded U+0019 character (i.e. #19) would be fine, because the XML serialization would be well-formed:
> > 
> > match: -#19, 'a'.
> > 
> > <rule name="match">
> > <alt>
> > <literal tmark="-" hex="19"/>
> > <literal string="a"/>
> > </alt>
> > </rule>
> > 
> > I think it would also be a good idea to add some wording spelling out the implications, such as:
> > 
> > - In an ixml grammar, characters that are not legal in XML must be represented as encoded characters, and must be excluded from the output by being marked with a “-”.
> > 
> > I’m not making a pull request for any of this, since I’m not yet clear on what we’re doing towards v-next.
> > 
> > All best,
> > 
> > BTW
> > 
> > > On 12 Sep 2022, at 12:33, Graydon <graydonish@gmail.com> wrote:
> > > 
> > > On Mon, Sep 12, 2022 at 09:55:07AM +0100, Norm Tovey-Walsh scripsit:
> > > [snip]
> > >> The discussion here is about U+0013 in a UTF-8 (or US ASCII similarly
> > >> encoded) document. Which I admit, I did not make clear.
> > > 
> > > I am easily befuddled!
> > > 
> > > I think there are maybe three questions --
> > > 
> > > 1. does the source document fed to an ixml parser have any constraints
> > > on contents beyond all being in some encoding known to the parser?
> > > 
> > > 2. is the ixml grammar document a representation of XML, using the same
> > > rules as an XML document with respect to what code points are
> > > permissible in the document?
> > > 
> > > 3. if the ixml grammar document is NOT a representation of XML, are
> > > there restrictions on the contents?
> > > 
> > > I think the answers are appropriately "no", "yes", and "not relevant due
> > > to 2 being yes".
> > > 
> > > If 3 requires an answer, I get stuck on "the parsed result is XML so we
> > > need mapping rules for what happens when a not-XML character gets used
> > > where it would become an element name" and so on. That seems like a hard
> > > problem, and I don't know of any compelling reason to try to solve it.
> > > 
> > > If it's just "you can have anything as a terminal symbol in your ixml
> > > grammar", there's still the issue of "and you just created a text node
> > > with that non-XML character in it". Your original example is OK because
> > > it drops U+0013; it wouldn't be if it put that character into a text
> > > node. General case rules for what to do in that case also seem hard.
> > > 
> > > All of which makes me think I'm missing something. Why would you want
> > > to allow arbitrary literal code points in the ixml grammar?
> > > 
> > > -- 
> > > Graydon Saunders | graydonish@gmail.com
> > > Þæs oferéode, ðisses swá mæg.
> > > -- Deor ("That passed, so may this.")
> > > 
> > 
> > 
> >
Received on Monday, 12 September 2022 14:37:09 UTC