Re: Is this ambiguous?

Steven Pemberton writes:

> The text in the spec is


> "If more than one parse tree describes the input, the processor must
> serialize one of them. It is not defined how this choice is made, but
> the resulting parse should be marked as ambiguous"

I note that in the section on Parsing, the wording is slightly
different:

    If more than one parse results, one is chosen; it is not defined how
    this choice is made, but the resulting parse should be marked as
    ambiguous by including the attribute ixml:state="ambiguous" on the
    document element of the serialisation.

Each wording has a little wiggle room:

  * The passage from the "Parsing" section talks about parses that
    "result" from the processor's activity. That seems to leave open the
    possibility that a conforming processor might find one tree, not
    find a second tree because it doesn't look for one, and report the
    one it found.

  * It also seems to leave open the possibility that two processors
    might rewrite the grammar into BNF in different ways, with one
    processor's choices resulting in multiple parse trees and the other
    processor's choices not resulting in multiple parse trees.
    (Example: S := 'a'*; 'b'*. can be rewritten in BNF in at least one
    way that makes the empty string ambiguous, and in at least one way
    that makes it unambiguous.

    Since I do not wish to prescribe that grammars must be rewritten to
    BNF before parsing, and do not wish to prescribe a particular
    rewriting, leaving things open for inter-processor variation seems
    attractive to me.

  * The passage from the conformance section appears to rule out
    processor-dependent detection of the ambiguity:  it doesn't talk
    about multiple parse trees resulting from the processor's attempts
    to parse the input, but in purely declarative terms about the
    existence of multiple parse trees describing the input.

    But the only occurences of "must" in that list item are that
    processors must serialize one of the trees, and that the default
    mode of operation must be to produce one, not more than one, tree,
    if any parse trees exist.  The provision of the ambiguity flag is a
    "should", not a "must".

If the wiggle room identified is not intended, we may need to adjust the
wording to eliminate it.  If the wiggle room is intended, we may wish to
adjust the wording to make it clearer what will and what won't vary
among conforming processors.

Another source of wiggle room easily spotted in this connection is that
the term "parse tree" is not defined.

At least three meanings may be imagined:

    * The XML representation specified by the ixml grammar used.  (Some
      passages in the spec use the term in this way.)

    * A parse tree whose interior nodes are all labeled with
      nonterminals in the ixml grammar.

    * A parse tree whose interior nodes are all labeled either with
      nonterminals in the ixml grammar or with nonterminals in a grammar
      derived from it (e.g. by the rewritings described in the "Hints
      for Implementers" section).

In discussions of ambiguity, I think some CG members have appealed
implicitly to a fourth meaning:

    * A parse tree whose interior nodes are all labeled either with
      nonterminals in the ixml grammar, or with nonterminals in a
      grammar derived from it (e.g. by the rewritings described in the
      "Hints for Implementers" section), or with sub-expressions from
      right-hand sides in the grammar (e.g. the choice "a"* or the
      choice "b"* -- any alt in a set of alts, ...)

This has been accompanied by a certain amount of hand-waving, but I
think it could be made more precise by formulating a set of rewrite
rules which would map from the ixml grammar supplied by the user to a
(BNF) grammar in which there are nonterminals for all the potential
interior nodes in all potential parse trees.  I thus think it's probably
best treated as an instance of the third bullet identified above.

Perhaps it would be worth re-writing the second paragraph of the section
headed "Parsing" to read

    Processors must accept and parse any conforming grammar, and produce
    at least one parse of any input that matches the grammar starting at
    the root symbol.

    If more than one parse results, one is chosen; it is not defined how
    this choice is made, but the resulting parse should be marked as
    ambiguous by including the attribute ixml:state="ambiguous" on the
    document element of the serialisation. The ixml namespace URI is
    "http://invisiblexml.org/NS".

    Note that if some nonterminals are marked "-", different parse trees
    generated by the underlying grammar may produce the same XML
    representation.  The details of ambiguity detection and marking are
    implementation-dependent.

The first two paragraphs here are unchanged from the existing text' the
changes here are (a) the introduction of a paragraph break at "If more"
and (b) the additional material beginning at "Note that".

Michael


> Steven
>
> On Monday 14 February 2022 16:11:29 (+01:00), Steven Pemberton wrote:
>
>> Sure it is. Ambiguity is in the input, not in the output. It's why we go to such lengths about spaces in the ixml grammar.
>>
>> I call it "harmless" ambiguity, but spotting if a parse is ambiguous is easy (it's just a tree walk); spotting whether it is harmless is not easy: you have to compare the generated output from all the possible ambiguous parse trees, and there may be a lot of them.
>>
>> Ambiguous grammars also slow parse time down, sometimes by a lot.
>>
>> Steven
>>
>> On Sunday 13 February 2022 11:33:56 (+01:00), Norm Tovey-Walsh wrote:
>>
>> > Hello world,
>> >
>> > Consider this grammar:
>> >
>> > list: word + -',' .
>> > word: c, v, c ; c, v, v .
>> > -c: ["bcdfghjklmnpqrstvwxyz"] .
>> > -v: ["aeiouy" ].
>> >
>> > and this input: “hey,bee”.
>> >
>> > By one reckoning, that’s ambiguous: “hey” can be either cvc or cvv
>> > because I’ve identified “y” as both a consonant and a vowel.
>> >
>> > But “c” and “v” are both elided from the output, so the generated XML is
>> > identical for both parses. By that reckoning, it isn’t ambiguous.
>> >
>> > I expect our intent is that it *isn’t* ambiguous…but I thought I’d check.
>> >
>> > Be seeing you,
>> > norm
>> >
>> > --
>> > Norm Tovey-Walsh
>> > Saxonica
>> >


-- 
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com

Received on Monday, 14 February 2022 19:47:19 UTC