- From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
- Date: Sun, 13 Feb 2022 19:34:51 -0700
- To: Bethan Tovey-Walsh <accounts@bethan.wales>
- Cc: Norm Tovey-Walsh <norm@saxonica.com>, public-ixml@w3.org
Bethan Tovey-Walsh writes: > I would tentatively suggest we consider saying that implementations > must report that there is more than one possible valid vxml > output. That wouldn’t require committing to a single definition of > ambiguity, and would (I suspect) be the most useful information for > users. I don’t imagine a user being particularly interested in the > intermediate (and potentially ambiguous) states their grammar goes > through in order for my implementation to parse an input. I do imagine > their being interested in knowing that the vxml output they’ve > received is one amongst a larger number of possible outputs. I think that amounts to saying that what must be reported is that there is more than one XML document (wave hands here in the direction of the canonicalization spec) that reflects the input string and the grammar and markings in the input grammar. I agree that that is likely to be the easiest to explain. I see three complications that make me reluctant to adopt this approach in the spec right now. 1 One motive for reporting ambiguity is that it tends to slow down an Earley parser. It does so whether the ambiguity results in a different XML output or not, so reporting ambiguity only when it affects the XML may be suboptimal. 2 Another motive is that ambiguity may be a symptom of something wrong in the grammar. That too is a question of the raw parse tree (what you would get if you had no marks in the grammar), not the XML output. On the other hand: I can alway inform the user of the ambiguity anyway, using a different signal. And even if I don't, the user who wants to know whether a sentence is unambiguous against the grammar without marks can simply remove all the marks from their grammar. So neither 1 nor 2 seems conclusive. 3 I conjecture that if a processor creates a data structure representing a shared parse forest, or a parse forest grammar, then it will be possible to tell by inspection of the data structure / parse-forest grammar whether there is more than one XML representation for the output or not. If that's so, then any implementation which builds such a data structure (like nineml) or constructs a parse-forest grammar describing the result (as Aparecium will do, someday, if the gods are willin' and the creek don't rise) will have no problem with an 'ambiguity' rule focus on the abstract syntax tree instead of on the parse tree. (I put scare quotes around it because when applied only to ASTs it is no longer ambiguity as I normally understand the term.) But what I have is a conjecture, not a proof. And nothing in the spec currently requires implementations to create such data structures. For other ways of implementing ixml, I don't think we have even a conjecture that the problem can be solved without materializing all of the possible parses of the input as XML documents and comparing them for equality, until it's known whether there is more than one XML representation. If there are 48 parses, we don't have to compare all 48 trees pairwise: we only have to compare the first two, to see if they are different. If they are, then we know the sentence has multiple XML representations. If they aren't, then we need to compare either the first or the second (it doesn't matter which, since they are the same structurally) to the third, and so on. As soon as we find a tree that differs from the first tree, we are done. As soon as we reach the end of the set of trees, we are done. The problem is, there is no guarantee that either of those events will ever happen. Those not deeply interested in this question may stop reading now; I have made my point. What follows is a technical excursus on an example. - Michael Excursus: detection that an ambiguous sentence has a unique XML representation Consider the grammar S = A. A = A; B. B = '◎'. I believe this grammar defines a language with one sentence: {'◎'}. For that sentence, there are a countably infinite number of parse trees, with one A, with two, with three, and so on. One for every positive integer. If we replace the second rule with -A = A; B. then I believe that each of those parse trees reduces to the same XML, namely <S><B>◎</B></S>. I hope it's clear that they do. What I do not believe is that it will be equally clear to a routine in a programming language tasked with taking the set of Earley items generated during recognition of the sentence and generating parse trees. In fact, thinking about this concrete example, I'm not entirely certain I know how to make a processor see that all the XML representation are the same by inspection of the parse forest grammar. That grammar might look like this: S·0·1 = A·0·1. -A·0·1 = A·0·1; B·0·1. B·0·1 = '◎'. In a parse forest grammar, each nonterminal has three parts: the first part of the symbol is a nonterminal symbol from the original grammar, the second is a starting position in the input, and the third is an ending position in the input. The utility of the labels is easier to see in examples where the parse forest grammar is not exactly the same as the original grammar but with funny affixes. The existence of two right-hand sides for A·0·1 tells us the sentence is ambiguous. The fact that there is a cycle from A·0·1 to A·0·1 tells us the sentence is infinitely ambiguous. The conjecture I mentioned earlier says we (and more importantly a program) can also see from this grammar that there is only one XML representation of the sentence; when I try to say how that works, however, I am not confident I know how to write the test. Here's an attempt; let's see if it works for this case. 1 Take the closure of the parent/child relation among nodes in the parse trees generated the first and the second right-hand sides of the ambiguous nonterminal, preserving the effective markings (thus treating N and -N and @N as three different symbols). In the example that gives us: (first alt): { -A·0·1, B·0·1, substring($input, 1, 1) } (second alt): { B·0·1, substring($input, 1, 1) } 2 From each set, remove all symbols marked hidden. In the example that gives us: (first alt): { B·0·1, substring($input, 1, 1) } (second alt): { B·0·1, substring($input, 1, 1) } 3 If the two sets are the same, the XML representations will be the same. Informally: the nonterminals and terminals not marked hidden are those which are going to show up in the XML representation. If they are the same, the XML representation will be the same. Unfortunately, that last sentence is not true. Here is a counterexample. S = A; B. -A = C. -B = D. C = D. D = C; 'x'. In the parse forest grammar, every nonterminal will have ·0·1 appended to its name; since they are all the same in this example, they convey no information and I'll spare myself the effort. S has two right-hand sides, each of which generates the same set of non-hidden nodes: { C, D, substring($input, 1, 1) }. But <S><C><D>x</D></C></S> and <S><D><C>x</C></D></S> aren't the same, even though they have the same three node labels. In this case, I think we can see that the grammar generates at least two different trees because the two right-hand sides of D generate two different sets of non-hidden nodes. A third example: S = A; B. -A = C, D. -B = D, C. C = . D = . Here only S is ambiguous, and its two RHS generate the same set of labeled nonterminals but not the same XML tree. There may be a way to prove the conjecture, but so far what I've learned this evening is that the method I had in mind can suffice to prove that there are at least two different XML trees, but not, in general, to prove that there is only one. It might work on some subset of parse-forest grammars. Summing up: I can see the appeal of defining the ambiguity-symbol as applying when there is more than one possible XML output for a sentence. But I don't want the ixml spec to require that processors be able to tell for certain whether a given sentence will generate more than one different XML representation, because I do not know how to do it for all cases. Michael > But there’s a good chance I’m overlooking something obvious which will make this suggestion unworkable. > > > >> On 13 Feb 2022, at 23:03, C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com> wrote: >> >> >> Norm Tovey-Walsh writes: >> >>> Consider this grammar: >> >>> list: word + -',' . >>> word: c, v, c ; c, v, v . >>> -c: ["bcdfghjklmnpqrstvwxyz"] . >>> -v: ["aeiouy" ]. >> >>> and this input: “hey,bee”. >> >>> By one reckoning, that’s ambiguous: “hey” can be either cvc or cvv >>> because I’ve identified “y” as both a consonant and a vowel. >> >>> But “c” and “v” are both elided from the output, so the generated XML >>> is identical for both parses. By that reckoning, it isn’t ambiguous. >> >> Nice example. >> >>> I expect our intent is that it *isn’t* ambiguous…but I thought I’d >>> check. >> >> For what it's worth ... >> >> If we as a group have an intent here, I don't know what it is. Issue 26 >> asks how we wish to define "ambiguity", but the upshot of the discussion >> starting from your message of 5 January [1] seems to me to have been >> only that some members of the CG do not wish to define it; without a >> definition of what counts as ambiguity, I don't think we can have a >> coherent intent. >> >> If I had to predict what the CG will end up doing, my money would be on >> leaving the effective definition of ambiguity implementation-defined and >> specifying that processors MAY report ambiguity, rather than MUST, or >> possibly specifying that IF processors detect ambiguity they MUST report >> it (modulo user option to suppress the ambiguity flag in the output), >> but not requiring that they detect ambiguity whenever it exists. (I >> predict there will be CG members who would like to require the detection >> of ambiguity, but without a crisp definition of ambiguity that >> requirement lacks teeth.) >> >> One difficulty is that we have already seen that different >> implementations of ixml use different underlying parsing methods, even >> when the implementors all say they are using Earley parsing. But: >> >> - Implementations that parse using the ixml grammar directly and those >> which translate the ixml grammar to BNF for parsing are working with >> different grammars; their raw parse trees will differ and ambiguity in >> one doesn't always mean ambiguity in another. >> >> - Implementations which translate the ixml grammar to BNF will not >> necessarily use the same translations -- the obvious requirement is >> that the grammar be equivalent and allow the construction of the XML >> abstract syntax tree, and there is more than one BNF that meets those >> criteria. >> >> - I think that some ways of recording parsing results may make it easy >> to see whether there is more than one XML AST for a given sentence, >> but I'm not sure that's true for every possible approach. >> >> We don't want to constrain the internals of any implementation, and we >> want interoperability, and we want ambiguity to be flagged. I don't >> think all of those three can be combined in their pure form; we are >> going to have to weaken one or more of them. >> >> My two cents. >> >> Michael >> >> -- >> C. M. Sperberg-McQueen >> Black Mesa Technologies LLC >> http://blackmesatech.com >> >> [1] https://lists.w3.org/Archives/Public/public-ixml/2022Jan/0030.html >> -- C. M. Sperberg-McQueen Black Mesa Technologies LLC http://blackmesatech.com
Received on Monday, 14 February 2022 02:35:17 UTC