Re: [public-ixml] <none> from C. M. Sperberg-McQueen on 2021-04-13 (public-ixml@w3.org from April 2021)

From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
Date: Tue, 13 Apr 2021 10:21:00 -0600
To: Steven Pemberton <steven.pemberton@cwi.nl>
Cc: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, public-ixml@w3.org
Message-Id: <6468CFFA-FEDF-42DB-A41C-F364D95EEC01@blackmesatech.com>
I don’t think I’m comparing anything at all, let alone comparing different things.  Perhaps we are talking past each other.

Given an input of length N, and a grammar G, it is obviously a useful thing to be able to ask that an ixml parser see whether the string starting at character 1 and ending at character N — i.e., in XPath notation, substring($input, 1, $N) — is a sentence in the language defined by G, and if so to return an XML document representing a suitable parse tree for that string against that grammar.

It seems to me that it can also be a useful thing to be able to ask an ixml parser what substrings beginning at character 1 of the input are sentences in L(G), and what their parse trees might be.  
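
To make the two interfaces concrete, here is a minimal Python sketch for the toy grammar string: "a"*. quoted further down in this thread. It is not an ixml implementation; the function names prefix_lengths and matches_whole are invented for illustration, and the grammar is hard-coded rather than read from an ixml source.

```python
def prefix_lengths(text: str) -> list[int]:
    """Second interface: every n such that text[:n] is a sentence of  string: "a"*."""
    lengths = [0]  # the empty prefix is always a sentence of "a"*
    for n, ch in enumerate(text, start=1):
        if ch != "a":
            break  # no longer prefix can match once a non-"a" is seen
        lengths.append(n)
    return lengths


def matches_whole(text: str) -> bool:
    """First interface: succeed only if the entire input is consumed."""
    return len(text) in prefix_lengths(text)


print(prefix_lengths("aaa"))  # [0, 1, 2, 3]
print(matches_whole("aaa"))   # True
print(matches_whole("aab"))   # False
```

The whole-input question is just the special case of the prefix question in which the prefix length equals the input length, which is why offering both need not be a burden on an implementation.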

If there is a requirement in the spec that the input be completely consumed, the second interface seems to be defined a priori as non-conforming.  I don’t see any reason to make it non-conforming and I would prefer that we not do so.  

Also, if there is a requirement in the spec that the input be completely consumed, it seems to be a consequence that an ixml parser cannot be used to parse inputs of indeterminate length, such as input on a stream that does not have an announced length and may never end (for some suitably lax definition of ‘never’).
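
As a rough illustration of the streaming case, the following Python sketch reports matching prefixes lazily, so that it can be run against an input that never ends. It again hard-codes the toy grammar string: "a"*. and the name stream_prefix_lengths is hypothetical; a real ixml processor would of course need a general parsing algorithm here.

```python
from itertools import islice, repeat
from typing import Iterable, Iterator


def stream_prefix_lengths(chars: Iterable[str]) -> Iterator[int]:
    """Lazily yield each n such that the first n characters form a sentence of "a"*."""
    yield 0  # the empty prefix matches before anything is read
    for n, ch in enumerate(chars, start=1):
        if ch != "a":
            return  # no further prefix can match
        yield n


endless = repeat("a")  # a stream with no announced length and no end
print(list(islice(stream_prefix_lengths(endless), 5)))  # [0, 1, 2, 3, 4]
```

A caller that insisted on complete consumption could never get an answer from the endless stream, whereas the prefix-reporting form produces useful results as soon as they are available.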

I wonder if words like the following would work:

    In the normal case, the input will have a determinate length,
    either known in advance or signaled by some end-of-stream signal.

    In that case, the default behavior of an ixml parser shall be to
    parse the input as a whole against the grammar, and return a parse
    or a failure document as described elsewhere.

    Parsers may also support the case of input of indeterminate
    length, by parsing successive prefixes of the input.

    Parsers may also offer, at user option, to parse prefixes of the
    input even if the input has determinate length.

This wording is rough: I have not consulted the current spec to see what words it uses for what is described here as ‘returning a parse’ or ‘returning a failure document’. The key ideas the words just given attempt to get across are (1) that the spec requires a conforming parser, given a conventional file as input, to parse the entire input successfully or else return a failure signal of some kind, unless it is instructed otherwise through some non-default option, and (2) that the spec neither requires nor prohibits other interfaces being offered.
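
One shape such an interface might take is sketched below in Python, again for the toy grammar string: "a"*. only. Everything here is an assumption made for illustration: the function name parse, the prefixes option, and in particular the `<failure/>` element, which is a placeholder and not whatever failure document the spec ends up defining.

```python
def parse(text: str, prefixes: bool = False) -> list[str]:
    """Parse text against the toy grammar  string: "a"*.

    Default (prefixes=False): all-or-nothing, returning exactly one
    document, either a parse or a placeholder failure document.
    prefixes=True (non-default option): one parse per matching prefix.
    """
    # Length of the leading run of "a" characters.
    run = 0
    while run < len(text) and text[run] == "a":
        run += 1

    def tree(n: int) -> str:
        # Serialized parse tree for a prefix of length n.
        return "<string/>" if n == 0 else f"<string>{'a' * n}</string>"

    if prefixes:
        return [tree(n) for n in range(run + 1)]
    if run == len(text):
        return [tree(run)]
    return ["<failure/>"]  # placeholder, not a spec-defined failure document


print(parse("aaa"))                 # ['<string>aaa</string>']
print(parse("aab"))                 # ['<failure/>']
print(parse("aab", prefixes=True))  # ['<string/>', '<string>a</string>', '<string>aa</string>']
```

The default call never returns a partial parse, which addresses the worry about getting `<string/>` for the input aaa, while the optional mode remains available without being required of every processor.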

In particular, the spec should neither require nor forbid support for ‘infinite streams’ of input, and it should neither require nor forbid support for the kind of non-deterministic parse behavior exhibited by Prolog parsers.  

I have the sense that the group as a whole leans towards the view that the spec should require conforming parsers to support the case of determinate-length input and the case of parsing all of the input.  I don’t see any compelling argument against those requirements; the only worry I have is that we might end up requiring support for those cases by prohibiting support for others.

Michael


> On 13,Apr2021, at 7:54 AM, Steven Pemberton <steven.pemberton@cwi.nl> wrote:
> 
> I really think you are comparing different things.
> An empty file is a syntactically correct C program. If I fed a non-empty C file into ixml, and it said it was correct, and the serialisation was <cprogram/> I would be extremely unhappy.
> 
> Similarly if it consisted of two functions, and I only got the serialisation of the first.
> 
> I can think of no sensible use-case where I would want ixml to stop at an initial conforming string, and not consume the whole of the input, except in the case that the remaining input was non-conforming, in which case I would want an error message.
> 
> Steven
> 
> On Monday 12 April 2021 19:23:51 (+02:00), C. M. Sperberg-McQueen wrote:
> 
>> 
>> 
>>> On 12,Apr2021, at 3:25 AM, Steven Pemberton <steven.pemberton@cwi.nl> wrote:
>>> 
>>> The new draft spec asks the question:
>>> Should there be a requirement that the input is completely consumed?
>>> 
>>> I think the answer is yes.
>>> 
>>> Otherwise, if a grammar was:
>>> 
>>> string: "a"*.
>>> 
>>> and the input was
>>> 
>>> aaa
>>> 
>>> it would be permissible to produce as result:
>>> 
>>> <string/>
>> 
>> I would have no objection to setting the expectation that any ixml processor should be able to work this way, or requiring that any ixml processor be able to work this way, but I would not like to require that they work only in this way. And if that requirement is imposed, I guess Aparecium is likely to be non-conforming in that regard.
>> 
>> This may only be a question of what one is familiar with. Further details below.
>> 
>> Since that is one of the behaviors I’m used to from Prolog definite-clause grammars, that does not seem strange to me. If I write this grammar in Prolog, as
>> 
>> string --> star("a").
>> 
>> /* Generic EBNF routine */
>> star(N) --> []; N, star(N).
>> 
>> then asking Prolog to parse the string “aaa” and show me what’s left produces the following interaction, showing that the grammar can consume 0, 1, 2, or 3 characters.
>> 
>> ?- phrase(string, "aaa", Remainder).
>> Remainder = [a, a, a] ;
>> Remainder = [a, a] ;
>> Remainder = [a] ;
>> Remainder = [] ;
>> false.
>> 
>> (Digression, since I don’t suppose everyone who reads this regularly uses Prolog. The string “Remainder = [a, a, a]” and similar strings on other lines are from the Prolog system; semicolons at the end of the line are the user asking for alternative solutions. The final ‘false’ indicates that other than the four solutions listed, each of which makes the goal expression true, there are no further solutions.)
>> 
>> So it feels completely sensible to me for an ixml parser to have a mode of operation that essentially means “Does this grammar match any prefix(es) of this input string?”
>> 
>> I don’t object to having a mode of operation that means “Does this grammar match this input string in its entirety?” That is probably also the more usual way to invoke parsers in Prolog: phrase/2 does not provide an output parameter for the remainder and succeeds only if there is no remainder (that is, the remainder is the empty list of characters). I.e., phrase(Nonterminal, Input) is defined as equivalent to phrase(Nonterminal, Input, []).
>> 
>> ?- phrase(string, "aaa").
>> true ;
>> false.
>> 
>> (The true; false response signals that there is one way to parse this string, but not two.)
>> 
>> It is a little uncanny that this should come up today, since I think it was only yesterday or the day before that I tentatively decided to make Aparecium offer an invocation option that would report which prefixes of the input the grammar could consume. At the moment it’s just pie in the sky, since I have other tasks to do before I can implement it.
>> 
>> On request, my plan is to make Aparecium return a parse-forest grammar; allowing that grammar to reflect parses that consume only part of the input is an easy change.
>> 
>> In all-or-nothing mode, the parse forest grammar to be returned in this case is simple:
>> 
>> Goal: string_1_3.
>> string_1_3: "a", "a", "a".
>> 
>> In show-all-prefixes mode, the parse forest grammar will be slightly longer:
>> 
>> Goal: string_1_3; string_1_2; string_1_1; string_1_0.
>> string_1_3: "a", "a", "a".
>> string_1_2: "a", "a".
>> string_1_1: "a".
>> string_1_0: .
>> 
>> In the parse-forest grammar, each nonterminal from the original grammar is extended with two affixes, one showing the offset at which the nonterminal in question starts (here always 1) and the second the length of the substring. For input $i, a nonterminal n_x_y matches the string substring($i, $x, $y).
>> 
>> Of course, Aparecium will also have invocation options that provide the relevant parse trees as well.
>> 
>> Michael
>> 
> 

********************************************
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
cmsmcq@blackmesatech.com
http://www.blackmesatech.com
********************************************
Received on Tuesday, 13 April 2021 16:21:23 UTC