Re: What about this grammar?

On Sun, Sep 11, 2022 at 10:45:12AM +0100, Norm Tovey-Walsh scripsit:
> > I'm not seeing much upside to allowing literal control characters not
> > permitted in XML in the grammar via some additional notational
> > mechanism.
> 
> I wasn’t proposing an additional notational mechanism. I can literally
> type a U+0013 character into a string in my editor. I can save that ixml
> file. (I used ^S because I was sure that a literal Control-S character
> wouldn’t survive email transmission; also because my editor renders a
> literal Control-S as a single character marked by two glyphs, ^ followed
> by S.)

I think -- very possibly wrongly! -- that ^S, ^M, etc. are a notational
convention to represent characters with no associated glyph.

> Anyway. I can create an iXML file that has a literal U+0013 in it.

I would not dare argue.

> If that’s forbidden, that’s fine. If it’s allowed but not required, I
> think that introduces an interoperability issue. If it’s required,
> that’s kind of a challenge because my parser builds its grammar from the
> XML representation, so it has no way to get from ixml text to parser
> without XML in the middle. (I can work around this problem with some
> clever escaping, but I’m not going to bother if it’s forbidden :-) )

This reminds me of when there were issues with the Microsoft "high
ascii" characters, where (for example) cp1252 0097 was the em-dash
character but the code point wasn't legal in XML until the fifth
edition.

As I recall, pre-fifth-edition, I could have an 0097 code point character
in something that looked like an XML file, but it wouldn't parse.  I
think this is legitimately the same case; if you've got U+0013 as a code
point in ixml, it shouldn't parse.
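
For concreteness, here is a minimal sketch of the check I have in mind,
assuming the XML 1.0 Char production is the test for which code points
may appear. The function names and the sample grammar string are made up
for illustration; nothing here comes from the ixml spec or from any
existing ixml implementation.

```python
def is_xml10_char(cp):
    """True if code point cp matches the XML 1.0 Char production:
    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    """
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def non_xml_chars(text):
    """Return (offset, 'U+XXXX') for every character in text that the
    XML 1.0 Char production does not allow."""
    return [
        (offset, f"U+{ord(ch):04X}")
        for offset, ch in enumerate(text)
        if not is_xml10_char(ord(ch))
    ]


# Hypothetical ixml fragment with a literal Control-S inside a string.
grammar = 'ctrl: "\x13".'
print(non_xml_chars(grammar))  # [(7, 'U+0013')]
```

A processor taking the "it shouldn't parse" position would reject the
grammar as soon as that list is non-empty.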

-- 
Graydon Saunders  | graydonish@gmail.com
Þæs oferéode, ðisses swá mæg.
-- Deor  ("That passed, so may this.")

Received on Sunday, 11 September 2022 18:47:51 UTC