- From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
- Date: Sun, 12 Jun 2022 13:07:07 -0600
- To: public-ixml@w3.org
This is not a spec-related issue any more, since the spec is explicit that conforming processors are allowed to vary in their flagging of ambiguity -- but because I am interested in parsing I continue wishing to understand the views of other group members on the topic, whether they are illustrated concretely by the behavior of a processor or just intuitions. Consider the following grammars and input strings: {G1} S = 'a'*; 'b'*. {input: empty string} G1 follows the pattern of the test case that led us to realize that different processors produce different results in cases of ambiguity. Accepting the empty string can be justified by appeal to either the left or the right choice, so it feels inherently ambiguous to some. On the other hand, there is only one parse tree, namely <S/>. {G2} S = A; B. A = 'a'*. B = 'b'*. {input: empty string} {G3} S = A; B. A = ; 'a', A. B = ; 'b', B. {input: empty string} G2 gives names to the left and right choices in the rule for S. G3 goes further an rewrites the grammar as BNF; it is the BNF to which an implementation might choose to reduce G1 or G2. For both of these grammars, there are two parse trees for the empty string: <S><A/></S> and <S><B/></S>. So the sentence is clearly ambiguous. {G4} S = A; B. -A = 'a'*. -B = 'b'*. {input: empty string} G4 highlights a thorny issue. The base grammar, ignoring the marks on the symbols, is the same as G2. But while there are two 'raw' parse trees (if I can use that term for a parse tree with a node for every nonterminal regardless of its marking), there is only one result tree, because the 'A' and 'B' nonterminals are marked hidden. Should that count as ambiguous because there are multiple (raw) parse trees, or as unambiguous because there is only one XML result tree? As a user, I lean toward the latter (easier to understand), as an implementor toward the former (easier to implement). For what it's worth, our earlier discussion of this topic led me to believe that the core of our difficulty is that formal definitions of ambiguity refer to grammars in a particular form (which I am somewhat loosely calling 'BNF'), and are not directly applicable to grammars in EBNF. The spec clearly allows implementations to parse the input by rewriting the grammar from EBNF to BNF, but (correctly, I think) refrains from prescribing a particular translation to BNF. Different ways of reducing a grammar to an equivalent BNF may differ in whether the sentence is ambiguous or not. G5, for example, is a BNF equivalent to G1 for which the empty string is not ambiguous: {G5} S = nil; As; Bs. -nil=. -As='a'; As, 'a'. -Bs='b'; Bs, 'b'. The fact that ambiguity is not crisply defined for EBNF certainly complicates the matter. But there appears to be more to it. Consider the next two grammars. {G6} name = ['a'-'z']; ['a'-'z']. {input: 'p'} {G7} name = A; B. A = ['a'-'z']. B = ['a'-'z']. {input: 'p'} G6 and G7 seem to illustrate the same points as G1 and G2, but they are already in BNF form, so whatever problem they raise does not seem to have anything to do with the definition of ambiguity for EBNF. If I understand the definitions in my formal-languages books correctly, G6 and G7 define the same language, and the sentence 'p' is ambiguous against grammar G7 but not against grammar G6. - It's ambiguous against G7 because there are two parse trees, namely <name><A>p</A></name> and <name><B>p</B></name>. - It's not ambiguous against G6 because formally a grammar has a set of productions of the form A --> alpha, where A is a nonterminal symbol and alpha is a string of symbols from (N union T)*. If we take ['a'-'z'] as a terminal symbol (which ixml tells us to do), then the left and right sides of the choice do not represent two different production rules but only one. So from a formal point of view, G6 is just a slightly eccentric way of writing G8: {G8} name = ['a'-'z']. {input: 'p'} I would rather not require that processors be required to detect the identity of the two sides of the choice in G6. I would also not like to forbid them to do so. {G9} name = A; A. A = ['a'-'z']. {input: 'p'} G9 illustrates the same principle on the level of nonterminals; formally, there is only one right-hand side for 'name', and the fact that it is written twice makes no more difference than writing the set of numbers from one to three as {1, 2, 3, 3} with four not three expressions within the braces. There is only one parse tree here (<name><A>p</A></name>) regardless of which branch of the choice one takes. {G10} name = ['a'-'z']+; ['a'-'z'; 'A' - 'Z']+. {input: 'p'} {G11} name = A; B. A = ['a'-'z']+. B = ['a'-'z'; 'A' - 'Z']+. {input: 'p'} Grammars G10 and G11 illustrate the proposition that it's not *just* a question of identifying identical right-hand sides: 'p' matches both left and right sides of the choice, but they are not identical. G10 also illustrates the fact that for formal language theory, ['a'-'z'] is not in fact a terminal symbol but just a shorthand for a twenty-six way choice. Formally, 'name' is defined here as name = 'a'. name = 'b'. ... name = 'z'. name = 'A'. ... name = 'Z'. So the input 'p' is not ambiguous against G10, from a formal point of view. The reason I wrote this note was to ask a question of those whose intuition is that the inputs specified should be regarded as ambiguous against G1, G6, and G10. Consider G12: {G12} name = ['a'-'z'; 'a'-'z'; 'A' - 'Z']+. {input: 'p'} Here there is only one terminal (as ixml defines the term), not two. Is your intuition that the input 'p' is ambiguous against grammar G12, or unambiguous? Michael -- C. M. Sperberg-McQueen Black Mesa Technologies LLC http://blackmesatech.com
Received on Sunday, 12 June 2022 19:07:26 UTC