- From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
- Date: Wed, 14 Apr 2021 10:27:42 -0600
- To: public-ixml@w3.org
- Cc: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>
Sorry, forgot one thing.

> On 13,Apr2021, at 6:41 PM, C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com> wrote:
>
> ...
> Full disclosure: On three points, the text below about parsers goes beyond what is in the draft of 2021-04-06.
>
> - The penultimate item in the list talks about how the XML documents are to be returned. It is intended to encourage the kind of command-line interface that would allow an ixml parser to be used as a stage in a shell pipeline, but also to allow other interfaces (including -- full disclosure -- the interface I have specified for Aparecium, which is intended to be called from XSLT or XQuery and to produce an XDM instance that the rest of the program can then process).
>
> - The final item is about parsing the whole of the input against the grammar, or against some portion of the input. It is intended to address the topic we spent our meeting on today. It tries to require that the behavior Steven described be (a) available, and (b) the default, when the behavior is meaningful. For the case of streams of indeterminate length, however, it does not say anything about maximal consumption of the input or about greedy parsing, so it does not go as far as some of us suggested. Steven's use case involves files, if I understood him correctly, and I understood him to mean "normal" files, not special cases like processes or infinite streams. The text suggested is intended to say that a conforming parser may offer to work on infinite streams, but to say nothing more on that topic, on the principle of Least said, soonest mended.
>
> - The item beginning "If more than one parse tree describes the input" proposes a change. The current spec says that if there is more than one tree, the parser must return one. The wording below weakens this requirement to say that parsers may return one, may return more than one, must be capable of returning just one, and that returning just one tree should be the default.
> We should discuss this to make sure people are happy with it.

Make that four points, the fourth about grammars, not about parsers.

The current draft says in the body that hex values must be “within the Unicode code-point range” and in the conformance section that they must be “within the Unicode range”. I believe that what is intended is that the hex value should be the code point of a character, or a code point which may be assigned to a character in some future version of Unicode / ISO 10646. It should not be a code point that Unicode defines as not now a character and not ever to be a character.

I don’t know how best to formulate that rule; the best I could come up with was to add the sentence

    (This entails that the hex value must not be that of a surrogate code point.)

This is not really satisfactory, for a couple of reasons. First, it attempts unconvincingly to persuade the incautious reader that “Unicode range” is a well-defined concept meaning what we want it to mean. That’s not true, so if we want a normative rule that says hex values in the surrogate range, like #D843, must not be used, we should say it directly instead of pretending it’s entailed by something else that’s not crisply defined.

And if we want to restrict hex values to code points actually or potentially assigned to characters, then U+FFFE and U+FFFF should probably also be mentioned. (I don’t know why they are ruled out, I only know that the FAQ at [1] says they are.)

[1] http://www.unicode.org/faq/basic_q.html

And make it five: the conformance section should say explicitly not just that conforming ixml grammars should be accepted and interpreted as defined, but also EITHER that input that is not a conforming grammar must not be accepted as a grammar, OR that when the grammar does not conform, then the behavior of a conforming parser is undefined. (I have seen it both ways.
The SQL guys all say that SQL takes the latter course (undefined, not forbidden) in order to allow vendor extensions, which are necessary to explore new requirements. XML, of course, takes the former course: Draconian error handling, recovery not allowed. XSLT takes a middle course: for certain easily predictable errors, there is a prescribed recovery behavior, so implementations can either error out or recover, and if they choose to recover they will all do the same thing. ISO Pascal, if I remember correctly, allows syntactic extensions but requires that conforming compilers have a way of being called in which they will reject all non-standard syntax. I am currently not sure what I think ixml should do.)

Michael

********************************************
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
cmsmcq@blackmesatech.com
http://www.blackmesatech.com
********************************************
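
A note on the fourth point: one way to state the hex-value rule directly, rather than by appeal to “Unicode range”, can be sketched in Python as below. This is an illustration only, not text from the draft. The function names are hypothetical, and the `#` prefix simply follows the #D843 example in the message. The noncharacter set used here (U+FDD0 through U+FDEF, plus the last two code points of every plane) goes further than the U+FFFE and U+FFFF named in the message, so that check is made optional.

```python
import string

MAX_CODE_POINT = 0x10FFFF  # highest code point in the Unicode code space


def is_surrogate(cp: int) -> bool:
    """True for code points in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def hex_value_allowed(literal: str, reject_noncharacters: bool = True) -> bool:
    """Check a hex literal such as '#D843' (hypothetical helper).

    Assumption: a literal is '#' followed by one or more hex digits.
    """
    digits = literal[1:]
    if not literal.startswith("#") or not digits:
        raise ValueError(f"not a hex literal: {literal!r}")
    if any(c not in string.hexdigits for c in digits):
        raise ValueError(f"not a hex literal: {literal!r}")

    cp = int(digits, 16)
    if cp > MAX_CODE_POINT:
        return False  # outside the code space altogether
    if is_surrogate(cp):
        return False  # never a character, now or in the future
    if reject_noncharacters and is_noncharacter(cp):
        return False  # covers U+FFFE and U+FFFF, among others
    return True


if __name__ == "__main__":
    for lit in ("#41", "#D843", "#FFFE", "#110000"):
        print(lit, hex_value_allowed(lit))
```

Whether noncharacters should be rejected at all is one of the open questions in the message; the `reject_noncharacters` flag only makes the two readings easy to compare.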
Received on Wednesday, 14 April 2021 16:28:02 UTC