Re: What about this grammar? from Norm Tovey-Walsh on 2022-09-13 (public-ixml@w3.org from September 2022)

From: Norm Tovey-Walsh <norm@saxonica.com>
Date: Tue, 13 Sep 2022 09:05:44 +0100
To: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>
Cc: graydonish@gmail.com, public-ixml@w3.org
Message-ID: <m2o7vjmzi1.fsf@saxonica.com>

"C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com> writes:
> Norm Tovey-Walsh <norm@saxonica.com> writes:
>
>>> I'm not seeing much upside to allowing literal control characters not
>>> permitted in XML in the grammar via some additional notational
>>> mechanism.
>
> Like Graydon, I see no upside to this.

As I said before, I wasn’t proposing any new notation. I was just using
a notation that would survive email transmission.

> The discrepancies that already
> exist between ixml and XML (e.g. in the definition of identifiers) don't
> make ixml a better or more attractive language; they only set a trap for
> users.

Arguably, the trap already exists and it has nothing to do with iXML. I
was very disappointed when I investigated this to discover that Xerces[*]
still uses the fourth edition rules for names. You might think

  <Ͱ>Heta</Ͱ>

is a perfectly reasonable XML document. And XML 5e would support you.
But 4e would not and neither would Xerces. :-( [Expletive deleted.]

Anyway, none of this is really relevant to the question at hand.

> In real life, every person I know who has dealt seriously with
> character-set and character-encoding issues would write the ixml grammar
> in question with #13, not with a literal control-S, even if they did

It’s #19, not #13. Use of #13 was a typo or a thinko on my part.

> not plan to transmit it over the network.  So far the only grammars I
> have seen that exercise this interoperability problem are in the test
> suite, and at least half of those grammars were written by me, so I
> don't think many real users will be affected.

For what it’s worth, *I* wrote it with a literal character and not
encoded as #19. I was thinking about Graydon’s problem of ambiguous
“delete” and “insert” words. I thought,

* “This would be easy if they were marked in some way with a character
  that wasn’t a word character.”
* “Hmm, the definition of word here is pretty broad.”
* “I’m going to exclude it, so it can be anything.” I thought of
  Control-S (a mnemonic for “start of word”, though now that I think
  about it there’s probably already a control character that means
  that).
* And I banged in -, ', Ctrl-Q, Ctrl-S, '. (A sequence of keystrokes
  that Emacs users will recognize as a way to insert a literal Control S
  into a file.)

And it fell over. I found a different character before I thought of
encoding it as #19, though having realized I could have encoded it that
way, it is obvious that that is what I *should* have done.

But at the time, my next thoughts were that I *could* change my parser
so that it would accept #19 literally in a grammar. After a few minutes
of investigation, I concluded that I was not required to do so, and
consequently that doing so would introduce an incompatibility between
processors.

No one, AFAICT, has suggested that we *should* allow literal #19
characters in the file, so really we’re just talking about how to make
it explicit that they’re forbidden. Stephen has proposed that this is
already the case, but I don’t think that’s clear enough.
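
To make "forbidden" concrete, here is a minimal sketch of the kind of check
a processor could apply to a grammar before parsing it. It assumes the rule
is simply "every character must match the Char production of XML 1.0"; the
function names and the sample grammar line are hypothetical and not taken
from any implementation or from the spec.

```python
def is_xml10_char(cp: int) -> bool:
    """True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def non_xml_chars(text: str) -> list[tuple[int, int]]:
    """Return (offset, code point) for each character XML 1.0 disallows."""
    return [
        (offset, ord(ch))
        for offset, ch in enumerate(text)
        if not is_xml10_char(ord(ch))
    ]


# Hypothetical grammar line containing a literal Control-S, which is
# U+0013 (decimal 19).
grammar = "word: -'\x13', [L]+.\n"

for offset, cp in non_xml_chars(grammar):
    print(f"offset {offset}: U+{cp:04X} is not an XML 1.0 character")
```

Whether a processor should reject such a grammar outright or merely warn is
exactly the question under discussion; the sketch only locates the
characters.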

> I suppose, in the end, my position is:
>
>   - The ways in which ixml deviates from XML as regards allowable
>     characters and allowable names are design errors.
>
>   - I would be happy to vote for a proposal to repair those design
>     errors in the obvious ways.

I bristle at this a bit because I think Xerces is wrong. If Xerces
accepted 5e rules for names, there would be, IIRC, three characters
allowed by iXML in names that are not allowed in XML (masculine and
feminine ordinal symbols and the micro sign).[**] I find the short,
simple form of the rule in iXML sufficient justification (on aesthetic
grounds if nothing else) for allowing this small discrepancy.
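
A minimal sketch of that comparison, for name start characters only. The
XML side uses the NameStartChar ranges from XML 1.0 Fifth Edition. The iXML
side assumes that a name may start with "_" or any character in a Unicode
letter category (L*), which is how I read the rule; treat that, and the
function names, as assumptions. The Unicode categories come from whatever
version of the Unicode database ships with the Python in use.

```python
import unicodedata

# XML 1.0 (Fifth Edition) NameStartChar, as inclusive code point ranges.
XML5E_NAME_START = [
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]


def is_xml5e_name_start(ch: str) -> bool:
    """True if ch may start a name under the XML 1.0 5e rules."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in XML5E_NAME_START)


def is_ixml_name_start(ch: str) -> bool:
    """Assumed iXML rule: underscore or any Unicode letter (category L*)."""
    return ch == "_" or unicodedata.category(ch).startswith("L")


# Heta, feminine ordinal, masculine ordinal, micro sign.
for ch in "\u0370\u00aa\u00ba\u00b5":
    print(
        f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
        f"XML 5e={is_xml5e_name_start(ch)}, iXML={is_ixml_name_start(ch)}"
    )
```

The sketch checks only the four characters mentioned in this message; it
does not try to confirm that the list of discrepancies is complete.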

If we accept that the world is forever stuck with the 4e rules, then I
agree, we should restrict what iXML allows. But that ship has probably
sailed.

>   - What I think is the obvious solution is to say explicitly in the
>     spec that in input grammars and input strings conforming processors
>     are required to accept any characters that would be legal in XML
>     1.0, and in input grammars they are required to accept any
>     nonterminals which are XML names, and to add that conforming
>     processors MAY accept other characters in input and MAY accept
>     nonterminals which are not XML names.

I don’t think that goes far enough. I don’t think non-XML characters
should be allowed in iXML grammars at all.

                                        Be seeing you,
                                          norm

[*] I know there are other parsers, but I spend most of my life in the
Java ecosystem where Xerces is the overwhelmingly common choice for
parsing XML unless there is a *compelling* reason to choose some other
parser.
[**] I suppose I should craft a pull request to fix Xerces and see what
happens.

--
Norm Tovey-Walsh
Saxonica
Received on Tuesday, 13 September 2022 08:48:25 UTC