- From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
- Date: Mon, 14 Feb 2022 12:47:00 -0700
- To: Steven Pemberton <steven.pemberton@cwi.nl>
- Cc: Norm Tovey-Walsh <norm@saxonica.com>, public-ixml@w3.org
Steven Pemberton writes: > The text in the spec is > "If more than one parse tree describes the input, the processor must > serialize one of them. It is not defined how this choice is made, but > the resulting parse should be marked as ambiguous" I note that in the section on Parsing, the wording is slightly different: If more than one parse results, one is chosen; it is not defined how this choice is made, but the resulting parse should be marked as ambiguous by including the attribute ixml:state="ambiguous" on the document element of the serialisation. Each wording has a little wiggle room: * The passage from the "Parsing" section talks about parses that "result" from the processor's activity. That seems to leave open the possibility that a conforming processor might find one tree, not find a second tree because it doesn't look for one, and report the one it found. * It also seems to leave open the possibility that two processors might rewrite the grammar into BNF in different ways, with one processor's choices resulting in multiple parse trees and the other processor's choices not resulting in multiple parse trees. (Example: S := 'a'*; 'b'*. can be rewritten in BNF in at least one way that makes the empty string ambiguous, and in at least one way that makes it unambiguous. Since I do not wish to prescribe that grammars must be rewritten to BNF before parsing, and do not wish to prescribe a particular rewriting, leaving things open for inter-processor variation seems attractive to me. * The passage from the conformance section appears to rule out processor-dependent detection of the ambiguity: it doesn't talk about multiple parse trees resulting from the processor's attempts to parse the input, but in purely declarative terms about the existence of multiple parse trees describing the input. But the only occurences of "must" in that list item are that processors must serialize one of the trees, and that the default mode of operation must be to produce one, not more than one, tree, if any parse trees exist. The provision of the ambiguity flag is a "should", not a "must". If the wiggle room identified is not intended, we may need to adjust the wording to eliminate it. If the wiggle room is intended, we may wish to adjust the wording to make it clearer what will and what won't vary among conforming processors. Another source of wiggle room easily spotted in this connection is that the term "parse tree" is not defined. At least three meanings may be imagined: * The XML representation specified by the ixml grammar used. (Some passages in the spec use the term in this way.) * A parse tree whose interior nodes are all labeled with nonterminals in the ixml grammar. * A parse tree whose interior nodes are all labeled either with nonterminals in the ixml grammar or with nonterminals in a grammar derived from it (e.g. by the rewritings described in the "Hints for Implementers" section). In discussions of ambiguity, I think some CG members have appealed implicitly to a fourth meaning: * A parse tree whose interior nodes are all labeled either with nonterminals in the ixml grammar, or with nonterminals in a grammar derived from it (e.g. by the rewritings described in the "Hints for Implementers" section), or with sub-expressions from right-hand sides in the grammar (e.g. the choice "a"* or the choice "b"* -- any alt in a set of alts, ...) This has been accompanied by a certain amount of hand-waving, but I think it could be made more precise by formulating a set of rewrite rules which would map from the ixml grammar supplied by the user to a (BNF) grammar in which there are nonterminals for all the potential interior nodes in all potential parse trees. I thus think it's probably best treated as an instance of the third bullet identified above. Perhaps it would be worth re-writing the second paragraph of the section headed "Parsing" to read Processors must accept and parse any conforming grammar, and produce at least one parse of any input that matches the grammar starting at the root symbol. If more than one parse results, one is chosen; it is not defined how this choice is made, but the resulting parse should be marked as ambiguous by including the attribute ixml:state="ambiguous" on the document element of the serialisation. The ixml namespace URI is "http://invisiblexml.org/NS". Note that if some nonterminals are marked "-", different parse trees generated by the underlying grammar may produce the same XML representation. The details of ambiguity detection and marking are implementation-dependent. The first two paragraphs here are unchanged from the existing text' the changes here are (a) the introduction of a paragraph break at "If more" and (b) the additional material beginning at "Note that". Michael > Steven > > On Monday 14 February 2022 16:11:29 (+01:00), Steven Pemberton wrote: > >> Sure it is. Ambiguity is in the input, not in the output. It's why we go to such lengths about spaces in the ixml grammar. >> >> I call it "harmless" ambiguity, but spotting if a parse is ambiguous is easy (it's just a tree walk); spotting whether it is harmless is not easy: you have to compare the generated output from all the possible ambiguous parse trees, and there may be a lot of them. >> >> Ambiguous grammars also slow parse time down, sometimes by a lot. >> >> Steven >> >> On Sunday 13 February 2022 11:33:56 (+01:00), Norm Tovey-Walsh wrote: >> >> > Hello world, >> > >> > Consider this grammar: >> > >> > list: word + -',' . >> > word: c, v, c ; c, v, v . >> > -c: ["bcdfghjklmnpqrstvwxyz"] . >> > -v: ["aeiouy" ]. >> > >> > and this input: “hey,bee”. >> > >> > By one reckoning, that’s ambiguous: “hey” can be either cvc or cvv >> > because I’ve identified “y” as both a consonant and a vowel. >> > >> > But “c” and “v” are both elided from the output, so the generated XML is >> > identical for both parses. By that reckoning, it isn’t ambiguous. >> > >> > I expect our intent is that it *isn’t* ambiguous…but I thought I’d check. >> > >> > Be seeing you, >> > norm >> > >> > -- >> > Norm Tovey-Walsh >> > Saxonica >> > -- C. M. Sperberg-McQueen Black Mesa Technologies LLC http://blackmesatech.com
Received on Monday, 14 February 2022 19:47:19 UTC