Re: What about this grammar?

On Mon, Sep 12, 2022 at 09:55:07AM +0100, Norm Tovey-Walsh scripsit:
[snip]
> The discussion here is about U+0013 in an UTF-8 (or US ASCII similarly
> encoded) document. Which I admit, I did not make clear.

I am easily befuddled!

I think there are maybe three questions --

1. does the source document fed to an ixml parser have any constraints
on contents beyond all being in some encoding known to the parser?

2. is the ixml grammar document a representation of XML, using the same
rules as an XML document with respect to what code points are
permissible in the document?

3. if the ixml grammar document is NOT a representation of XML, are
there restrictions on the contents?

I think the answers are, respectively, "no", "yes", and "not relevant due
to 2 being yes".

If 3 requires an answer, I get stuck on "the parsed result is XML so we
need mapping rules for what happens when a not-XML character gets used
where it would become an element name" and so on. That seems like a hard
problem, and I don't know of any compelling reason to try to solve it.

If it's just "you can have anything as a terminal symbol in your ixml
grammar", there's still the issue of "and you just created a text node
with that non-XML character in it".  Your original example is OK because
it drops U+0013; it wouldn't be if it put that character into a text
node.  General case rules for what to do in that case also seem hard.
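
To make the text-node concern concrete, here is a minimal sketch of the
check an implementation would need before serializing, assuming XML 1.0
rules (the Char production). The function names are made up for
illustration and are not taken from any ixml implementation.

```python
# Sketch only: the XML 1.0 "Char" production as inclusive code point ranges.
# Assumption: XML 1.0 rules apply; XML 1.1 treats some controls differently.
XML10_CHAR_RANGES = (
    (0x9, 0x9),           # tab
    (0xA, 0xA),           # line feed
    (0xD, 0xD),           # carriage return
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
)


def is_xml10_char(code_point: int) -> bool:
    """True if the code point may appear literally in an XML 1.0 document."""
    return any(lo <= code_point <= hi for lo, hi in XML10_CHAR_RANGES)


def first_non_xml_char(text: str):
    """Return (index, code point) of the first character that XML 1.0
    does not allow, or None if the whole string is serializable."""
    for index, ch in enumerate(text):
        if not is_xml10_char(ord(ch)):
            return index, ord(ch)
    return None
```

Under these assumptions, first_non_xml_char applied to a string holding
U+0013 reports its position, which is the point where a processor would
have to drop the character, escape it somehow, or raise an error.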

All of which makes me think I'm missing something.  Why would you want
to allow arbitrary literal code points in the ixml grammar?

-- 
Graydon Saunders  | graydonish@gmail.com
Þæs oferéode, ðisses swá mæg.
-- Deor  ("That passed, so may this.")

Received on Monday, 12 September 2022 11:33:45 UTC