Re: What about this grammar?

Norm Tovey-Walsh <norm@saxonica.com> writes:

> "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com> writes:
>> Norm Tovey-Walsh <norm@saxonica.com> writes:
>>
>>>> I'm not seeing much upside to allowing literal control characters not
>>>> permitted in XML in the grammar via some additional notational
>>>> mechanism.
>>
>> Like Graydon, I see no upside to this.
>
> As I said before, I wasn’t proposing any new notation. I was just using
> a notation that would survive email transmission.

Sorry; I may have mis-parsed Graydon's remark.

The sentiment I thought I was agreeing with was that I don't see much
upside in allowing literal control characters in ixml grammars and that
there is no need to go beyond the notations (hex literals) already
provided for control characters.

>> In real life, every person I know who has dealt seriously with
>> character-set and character-encoding issues would write the ixml grammar
>> in question with #13, not with a literal control-S, even if they did
>
> It’s #19, not #13. Use of #13 was a typo or a thinko on my part.

? Control-S is hex 13, decimal 19.  I think you were right the first
time.  Side issue.
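
For anyone who wants to check the arithmetic, here is a minimal sketch in
Python. The helper name ixml_hex_literal is made up for illustration; it
only assumes that ixml writes an encoded character as "#" followed by the
code point in hexadecimal.

```python
# Control-S is the C0 control character U+0013.
CONTROL_S = "\x13"


def ixml_hex_literal(ch: str) -> str:
    """Hypothetical helper: format one character as '#' plus its hex code point."""
    return f"#{ord(ch):X}"


# Decimal 19 and hex 13 are the same code point.
assert ord(CONTROL_S) == 19
assert int("13", 16) == 19
assert ixml_hex_literal(CONTROL_S) == "#13"

# Read as hex, "19" is decimal 25, i.e. U+0019, a different control character.
assert int("19", 16) == 25
assert chr(int("19", 16)) != CONTROL_S
```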

>> not plan to transmit it over the network.  So far the only grammars I
>> have seen that exercise this interoperability problem are in the test
>> suite, and at least half of those grammars were written by me, so I
>> don't think many real users will be affected.
>
> For what it’s worth, *I* wrote it with a literal character and not
> encoded as #19. I was thinking about Graydon’s problem of ambiguous
> “delete” and “insert” words. I thought,
>
> * “This would be easy if they were marked in some way with a character
>   that wasn’t a word character.”
> * “Hmm, the definition of word here is pretty broad.”
> * “I’m going to exclude it, so it can be anything.” I thought of
>   Control-S (a mnemonic for “start of word”, though now that I think
>   about it there’s probably already a control character that means
>   that).
> * And I banged in -, ', Ctrl-Q, Ctrl-S, '. (A sequence of keystrokes
>   that Emacs users will recognize as a way to insert a literal Control S
>   into a file.)

> And it fell over. I found a different character before I thought of
> encoding it as #19, though having realized I could have encoded it that
> way, it is obvious that that is what I *should* have done.

OK.  I stand corrected!

> No one, AFAICT, has suggested that we *should* allow literal #19
> characters in the file, so really we’re just talking about how to make
> it explicit that they’re forbidden. Stephen has proposed that this is
> already the case; I don’t think that’s clear enough.

I think the legality of literal control characters is implicit in the
principle that input can be Unicode (without qualification that I can
see), though that principle itself appears to be only implicit:  if
there is an explicit statement about the acceptable input characters, I
am missing it.  The rule for the meaning of hex-encoded literals
specifies how their hex numbers are interpreted, but that doesn't
actually imply logically that for every legal hex value there is some
legal input that would match it.  (It would be eccentric were that not
the case, but specifications have been eccentric before now, so I don't
think that's a compelling argument.)

>> I suppose, in the end, my position is:
>>
>>   - The ways in which ixml deviates from XML as regards allowable
>>     characters and allowable names are design errors.
>>
>>   - I would be happy to vote for a proposal to repair those design
>>     errors in the obvious ways.
>
> I bristle at this a bit because I think Xerces is wrong. If Xerces
> accepted 5e rules for names, there would be, IIRC, three characters
> allowed by iXML in names that are not allowed in XML (masculine and
> feminine ordinal symbols and the micro sign).[**] I find the short,
> simple form of the rule in iXML sufficient justification (on aesthetic
> grounds if nothing else) for allowing this small discrepancy.

You may be right.

> If we accept that the world is forever stuck with the 4e rules, then I
> agree, we should restrict what iXML allows. But that ship has probably
> sailed.

No, for better or worse I think 5e should be our point of reference.
The failure of some infrastructure to support it does suggest that the
problem is more complicated than my mail painted it.
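
To make the three-character discrepancy concrete, here is a rough Python
sketch. It encodes the NameStartChar and NameChar ranges of XML 1.0 5e as
I read them, and it assumes (my reading, not a quotation from the ixml
spec) that ixml admits any character in Unicode general category L in a
name. The function name is made up for illustration.

```python
import unicodedata

# NameStartChar ranges from XML 1.0 Fifth Edition (inclusive code points).
NAME_START = [
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]
# Additional ranges allowed by NameChar: "-", ".", digits, middle dot, etc.
NAME_EXTRA = [
    (0x2D, 0x2E), (0x30, 0x39), (0xB7, 0xB7),
    (0x300, 0x36F), (0x203F, 0x2040),
]


def is_xml5e_name_char(ch: str) -> bool:
    """True if ch may appear somewhere in an XML 1.0 5e Name."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in NAME_START + NAME_EXTRA)


# Feminine ordinal, micro sign, masculine ordinal.
for ch in "\u00aa\u00b5\u00ba":
    is_letter = unicodedata.category(ch).startswith("L")
    print(f"U+{ord(ch):04X} letter={is_letter} xml5e={is_xml5e_name_char(ch)}")
```

Each of the three is a Unicode letter yet falls outside the 5e name
ranges, which matches the discrepancy described above.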

>>   - What I think is the obvious solution is to say explicitly in the
>>     spec that in input grammars and input strings conforming processors
>>     are required to accept any characters that would be legal in XML
>>     1.0, and in input grammars they are required to accept any
>>     nonterminals which are XML names, and to add that conforming
>>     processors MAY accept other characters in input and MAY accept
>>     nonterminals which are not XML names.
>
> I don’t think that goes far enough. I don’t think non-XML characters
> should be allowed in iXML grammars at all.

I could live with that (but I will be surprised if the CG as a whole
can).
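
For reference, the check implied by "characters that would be legal in
XML 1.0" is small. A minimal Python sketch follows, using the Char
production from XML 1.0; the function names are made up for illustration,
and how a processor should report the problem is left open.

```python
def is_xml10_char(ch: str) -> bool:
    """True if ch matches the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def non_xml_chars(text: str) -> list[tuple[int, str]]:
    """Return (offset, '#hex') for each character XML 1.0 does not allow."""
    return [
        (i, f"#{ord(c):X}")
        for i, c in enumerate(text)
        if not is_xml10_char(c)
    ]


# A grammar fragment containing a literal Control-S inside a string.
fragment = "-'\x13'"
print(non_xml_chars(fragment))
```

A processor taking the stricter position would reject the grammar when
the list is non-empty; one taking the more permissive position could
treat such characters as an optional extension.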


> [**] I suppose I should craft a pull request to fix Xerces and see what
> happens.

That would be a very public-spirited thing to do.  Much more
constructive than just making fun of Xerces for being out of date, which
was my best effort at an idea for what to do.

Michael

-- 
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com

Received on Tuesday, 13 September 2022 16:58:35 UTC