- From: Norm Tovey-Walsh <norm@saxonica.com>
- Date: Tue, 13 Sep 2022 09:05:44 +0100
- To: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>
- Cc: graydonish@gmail.com, public-ixml@w3.org
- Message-ID: <m2o7vjmzi1.fsf@saxonica.com>
"C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com> writes: > Norm Tovey-Walsh <norm@saxonica.com> writes: > >> [[PGP Signed Part:Undecided]] > >>> I'm not seeing much upside to allowing literal control characters not >>> permitted in XML in the grammar via some additional notational >>> mechanism. > > Like Graydon, I see no upside to this. As I said before, I wasn’t proposing any new notation. I was just using a notation that would survive email transmission. > The discrepancies that already > exist between ixml and XML (e.g. in the definition of identifiers) don't > make ixml a better or more attractive language; they only set a trap for > users. Arguably, the trap already exists and it has nothing to do with iXML. I was very disappointed when I investigated this to discover that Xerces[*] still uses the fourth edition rules for names. You might think <Ͱ>Heta</Ͱ> is a perfectly reasonable XML document. And XML 5e would support you. But 4e would not and neither would Xerces. :-( [Expletive deleted.] Anyway, none of this is really relevant to the question at hand. > In real life, every person I know who has dealt seriously with > character-set and character-encoding issues would write the ixml grammar > in question with #13, not with a literal control-S , even if they did It’s #19, not #13. Use of #13 was a typo or a thinko on my part. > not plan to transmit it over the network. So far the only grammars I > have seen that exercise this interoperability problem are in the test > suite, and at least half of those grammars were written by me, so I > don't think many real users will be affected. For what it’s worth, *I* wrote it with a literal character and not encoded as #19. I was thinking about Graydon’s problem of ambiguous “delete” and “insert” words. 
I thought,

* “This would be easy if they were marked in some way with a character
  that wasn’t a word character.”
* “Hmm, the definition of word here is pretty broad.”
* “I’m going to exclude it, so it can be anything.” I thought of
  Control-S (a mnemonic for “start of word”, though now that I think
  about it there’s probably already a control character that means
  that).
* And I banged in -, ', Ctrl-Q, Ctrl-S, '. (A sequence of keystrokes
  that Emacs users will recognize as a way to insert a literal
  Control S into a file.)

And it fell over.

I found a different character before I thought of encoding it as #19,
though having realized I could have encoded it that way, it is obvious
that that is what I *should* have done.

But at the time, my next thoughts were that I *could* change my parser
so that it would accept #19 literally in a grammar. After a few minutes
of investigation, I concluded that I was not required to do so, and
consequently that doing so would introduce an incompatibility between
processors.

No one, AFAICT, has suggested that we *should* allow literal #19
characters in the file, so really we’re just talking about how to make
it explicit that they’re forbidden. Stephen has proposed that this is
already the case; I don’t think that’s clear enough.

> I suppose, in the end, my position is:
>
> - The ways in which ixml deviates from XML as regards allowable
>   characters and allowable names are design errors.
>
> - I would be happy to vote for a proposal to repair those design
>   errors in the obvious ways.

I bristle at this a bit because I think Xerces is wrong. If Xerces
accepted 5e rules for names, there would be, IIRC, three characters
allowed by iXML in names that are not allowed in XML (masculine and
feminine ordinal symbols and the micro sign).[**] I find the short,
simple form of the rule in iXML sufficient justification (on aesthetic
grounds if nothing else) for allowing this small discrepancy.
If we accept that the world is forever stuck with the 4e rules, then I
agree, we should restrict what iXML allows. But that ship has probably
sailed.

> - What I think is the obvious solution is to say explicitly in the
>   spec that in input grammars and input strings conforming processors
>   are required to accept any characters that would be legal in XML
>   1.0, and in input grammars they are required to accept any
>   nonterminals which are XML names, and to add that conforming
>   processors MAY accept other characters in input and MAY accept
>   nonterminals which are not XML names.

I don’t think that goes far enough. I don’t think non-XML characters
should be allowed in iXML grammars at all.

                                        Be seeing you,
                                          norm

[*] I know there are other parsers, but I spend most of my life in the
Java ecosystem where Xerces is the overwhelmingly common choice for
parsing XML unless there is a *compelling* reason to choose some other
parser.

[**] I suppose I should craft a pull request to fix Xerces and see what
happens.

--
Norm Tovey-Walsh
Saxonica
Received on Tuesday, 13 September 2022 08:48:25 UTC