- From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
- Date: Mon, 14 Feb 2022 12:47:00 -0700
- To: Steven Pemberton <steven.pemberton@cwi.nl>
- Cc: Norm Tovey-Walsh <norm@saxonica.com>, public-ixml@w3.org
Steven Pemberton writes:
> The text in the spec is
> "If more than one parse tree describes the input, the processor must
> serialize one of them. It is not defined how this choice is made, but
> the resulting parse should be marked as ambiguous"
I note that in the section on Parsing, the wording is slightly
different:
If more than one parse results, one is chosen; it is not defined how
this choice is made, but the resulting parse should be marked as
ambiguous by including the attribute ixml:state="ambiguous" on the
document element of the serialisation.
Each wording has a little wiggle room:
* The passage from the "Parsing" section talks about parses that
"result" from the processor's activity. That seems to leave open the
possibility that a conforming processor might find one tree, not
find a second tree because it doesn't look for one, and report the
one it found.
* It also seems to leave open the possibility that two processors
might rewrite the grammar into BNF in different ways, with one
processor's choices resulting in multiple parse trees and the other
processor's choices not resulting in multiple parse trees.
(Example: S := 'a'*; 'b'*. can be rewritten in BNF in at least one
way that makes the empty string ambiguous, and in at least one way
that makes it unambiguous.
Since I do not wish to prescribe that grammars must be rewritten to
BNF before parsing, and do not wish to prescribe a particular
rewriting, leaving things open for inter-processor variation seems
attractive to me.
* The passage from the conformance section appears to rule out
processor-dependent detection of the ambiguity: it doesn't talk
about multiple parse trees resulting from the processor's attempts
to parse the input, but in purely declarative terms about the
existence of multiple parse trees describing the input.
But the only occurences of "must" in that list item are that
processors must serialize one of the trees, and that the default
mode of operation must be to produce one, not more than one, tree,
if any parse trees exist. The provision of the ambiguity flag is a
"should", not a "must".
If the wiggle room identified is not intended, we may need to adjust the
wording to eliminate it. If the wiggle room is intended, we may wish to
adjust the wording to make it clearer what will and what won't vary
among conforming processors.
Another source of wiggle room easily spotted in this connection is that
the term "parse tree" is not defined.
At least three meanings may be imagined:
* The XML representation specified by the ixml grammar used. (Some
passages in the spec use the term in this way.)
* A parse tree whose interior nodes are all labeled with
nonterminals in the ixml grammar.
* A parse tree whose interior nodes are all labeled either with
nonterminals in the ixml grammar or with nonterminals in a grammar
derived from it (e.g. by the rewritings described in the "Hints
for Implementers" section).
In discussions of ambiguity, I think some CG members have appealed
implicitly to a fourth meaning:
* A parse tree whose interior nodes are all labeled either with
nonterminals in the ixml grammar, or with nonterminals in a
grammar derived from it (e.g. by the rewritings described in the
"Hints for Implementers" section), or with sub-expressions from
right-hand sides in the grammar (e.g. the choice "a"* or the
choice "b"* -- any alt in a set of alts, ...)
This has been accompanied by a certain amount of hand-waving, but I
think it could be made more precise by formulating a set of rewrite
rules which would map from the ixml grammar supplied by the user to a
(BNF) grammar in which there are nonterminals for all the potential
interior nodes in all potential parse trees. I thus think it's probably
best treated as an instance of the third bullet identified above.
Perhaps it would be worth re-writing the second paragraph of the section
headed "Parsing" to read
Processors must accept and parse any conforming grammar, and produce
at least one parse of any input that matches the grammar starting at
the root symbol.
If more than one parse results, one is chosen; it is not defined how
this choice is made, but the resulting parse should be marked as
ambiguous by including the attribute ixml:state="ambiguous" on the
document element of the serialisation. The ixml namespace URI is
"http://invisiblexml.org/NS".
Note that if some nonterminals are marked "-", different parse trees
generated by the underlying grammar may produce the same XML
representation. The details of ambiguity detection and marking are
implementation-dependent.
The first two paragraphs here are unchanged from the existing text' the
changes here are (a) the introduction of a paragraph break at "If more"
and (b) the additional material beginning at "Note that".
Michael
> Steven
>
> On Monday 14 February 2022 16:11:29 (+01:00), Steven Pemberton wrote:
>
>> Sure it is. Ambiguity is in the input, not in the output. It's why we go to such lengths about spaces in the ixml grammar.
>>
>> I call it "harmless" ambiguity, but spotting if a parse is ambiguous is easy (it's just a tree walk); spotting whether it is harmless is not easy: you have to compare the generated output from all the possible ambiguous parse trees, and there may be a lot of them.
>>
>> Ambiguous grammars also slow parse time down, sometimes by a lot.
>>
>> Steven
>>
>> On Sunday 13 February 2022 11:33:56 (+01:00), Norm Tovey-Walsh wrote:
>>
>> > Hello world,
>> >
>> > Consider this grammar:
>> >
>> > list: word + -',' .
>> > word: c, v, c ; c, v, v .
>> > -c: ["bcdfghjklmnpqrstvwxyz"] .
>> > -v: ["aeiouy" ].
>> >
>> > and this input: “hey,bee”.
>> >
>> > By one reckoning, that’s ambiguous: “hey” can be either cvc or cvv
>> > because I’ve identified “y” as both a consonant and a vowel.
>> >
>> > But “c” and “v” are both elided from the output, so the generated XML is
>> > identical for both parses. By that reckoning, it isn’t ambiguous.
>> >
>> > I expect our intent is that it *isn’t* ambiguous…but I thought I’d check.
>> >
>> > Be seeing you,
>> > norm
>> >
>> > --
>> > Norm Tovey-Walsh
>> > Saxonica
>> >
--
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com
Received on Monday, 14 February 2022 19:47:19 UTC