- From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
- Date: Thu, 6 Jan 2022 17:03:04 -0700
- To: Steven Pemberton <steven.pemberton@cwi.nl>
- Cc: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, Tomos Hillman <yamahito@gmail.com>, Norm Tovey-Walsh <norm@saxonica.com>, ixml <public-ixml@w3.org>
> On 6,Jan2022, at 3:03 PM, Steven Pemberton <steven.pemberton@cwi.nl> wrote:
>
> I honestly think you are overthinking this Michael.

My apologies. To be blunt, I have the impression that the rest of the group is either underthinking it or not thinking about it at all.

> All your long treatise goes to show is that there are different theories.

There are different intuitions about how to apply concepts like derivation and parse in an EBNF context; I would not call them theories; that gives entirely the wrong impression of their status.

If you learned nothing more from my mail than that, then I am very sorry to have wasted your time.

> Algorithms based on those theories will therefore produce different results.

Even if what we are looking at were different theories, it would not follow that algorithms will produce different results, any more than different correct parsing algorithms produce different parse trees for the same sentence and the same grammar.

> But we are talking about a tiny corner of any language, and in all cases the serialisation will be the same. We don't even require that a parser discover all possible parses, as long as it finds one, in which case it would never report ambiguity.
>
> So I still stand by the current wording:
>
> > > * It must find at least one parse of any input that matches the grammar
> > > * if it finds more than one parse, it must report that fact.

Is that the current wording? I cannot find it in the spec. What I find in the spec is rather different:

> If more than one parse results, one is chosen; it is not defined how this choice is made, but the resulting parse must be marked as ambiguous by including the attribute ixml:state="ambiguous" on the document element of the serialisation.

and

> If more than one parse tree describes the input, the processor must serialize one of them. It is not defined how this choice is made, but the resulting parse should be marked as ambiguous by including on the document element of the serialisation the attribute ixml:state="ambiguous", unless the user has activated an option to suppress this attribute.

There are, I think, a few issues here.

First, the two quotations disagree over whether a processor MUST or SHOULD report ambiguity.

Second, your paraphrase and the two quotations from the spec offer three different descriptions of when a report of ambiguity is in order:

- if the processor finds more than one parse?
- if more than one parse tree describes the input?
- if more than one parse "results"? (Is that the same as the processor finding more than one? Or the same as more than one existing?)

The difference is important; consider a backtracking parser that has found one parse tree. If it is operating in strict conformance mode, can it return that parse tree? Which of the following describes the situation?

- It has not found more than one parse, so it need not (and should not) report that the sentence is ambiguous?
- It does not know whether more than one parse tree describes the input, so it does not currently know whether the sentence is ambiguous or not. To be sure, it should look to see whether it can find a second parse tree. If it does, then the sentence is ambiguous and the parse tree returned should be so marked.

Finally, both in the current spec and in your paraphrase, the terms “parse” and “parse tree” are undefined, and so there is really no way to be sure what counts, for purposes of the spec, as “more than one parse” or “more than one parse tree”.

The references to “parse trees” elsewhere in the spec suggest that the term refers to the XML structure output by a conforming ixml processor:

> A grammar is used to describe the input format. An input is parsed using this grammar, and the resulting parse tree is serialised as XML.

Processors must

> parse the input using the grammar specified, and produce an XML document representing a parse tree for the input

If the parse trees referred to in the rules relating to ixml:state are the XML documents to be returned, then there is no question about it: the empty string has only one XML representation in our example, and the correct tree to return is

    <S/>

not

    <S ixml:state="ambiguous"/>

But I believe that several people have expressed an unwillingness to apply the “more than one” to the XML form being produced, and there are good reasons to be cautious.

I don’t see a compelling reason for us not to have a clear story here, even though it may require that we think hard for a bit.

Michael
Received on Friday, 7 January 2022 00:03:23 UTC