Re: [public-ixml] <none>

From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
Date: Mon, 12 Apr 2021 11:23:51 -0600
Cc: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, public-ixml@w3.org
Message-Id: <EBE95F53-8E35-4E53-AEC5-425DA9CB9E69@blackmesatech.com>
To: Steven Pemberton <steven.pemberton@cwi.nl>


> On 12,Apr2021, at 3:25 AM, Steven Pemberton <steven.pemberton@cwi.nl> wrote:
> 
> The new draft spec asks the question:
>   Should there be a requirement that the input is completely consumed?
> 
> I think the answer is yes.
> 
> Otherwise, if a grammar was:
> 
>   string: "a"*.
> 
> and the input was
> 
>   aaa
> 
> it would be permissible to produce as result:
> 
>   <string/>

I would have no objection to setting the expectation that any ixml processor should be able to work this way, or requiring that any ixml processor be able to work this way, but I would not like to require that they work only in this way.  And if that requirement is imposed, I guess Aparecium is likely to be non-conforming in that regard.

This may only be a question of what one is familiar with.  Further details below.

Since that is one of the behaviors I’m used to from Prolog definite-clause grammars, that does not seem strange to me.  If I write this grammar in Prolog, as

    string --> star("a").

    /* Generic EBNF routine */
    star(N) --> []; N, star(N).

then asking Prolog to parse the string “aaa” and show me what’s left produces the following interaction, showing that the grammar can consume 0, 1, 2, or 3 characters.

    ?- phrase(string, "aaa", Remainder).
    Remainder = [a, a, a] ;
    Remainder = [a, a] ;
    Remainder = [a] ;
    Remainder = [] ;
    false.

(Digression, since I don’t suppose everyone who reads this regularly uses Prolog.  The string “Remainder = [a,a,a]” and similar strings on other lines are from the Prolog system; semicolons at the end of the line are the user asking for alternative solutions.  The final ‘false’ indicates that other than the four solutions listed, each of which makes the goal expression true, there are no further solutions.)

So it feels completely sensible to me for an ixml parser to have a mode of operation that essentially means “Does this grammar match any prefix(es) of this input string?”

I don’t object to having a mode of operation that means “Does this grammar match this input string in its entirety?”  That is probably also the more usual way to invoke parsers in Prolog:  phrase/2 does not provide an output parameter for the remainder and succeeds only if there is no remainder (or, the remainder is the empty list of characters).  I.e. phrase(Nonterminal, Input) is defined as equal to phrase(Nonterminal, Input, []).

    ?- phrase(string, "aaa").
    true ;
    false.

(The true; false response signals that there is one way to parse this string, but not two.)
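
For readers who do not use Prolog, the two modes of operation can be sketched roughly in Python.  This is only an illustration of the idea for the grammar string: "a"*. and not how Aparecium or any Prolog system works internally; the function names (star_remainders, matches_prefix, matches_entirely) are invented for this sketch.

```python
def star_remainders(token, text):
    """Yield what is left of text after matching token zero or more times.

    Follows the shape of star(N) --> []; N, star(N): the empty match is
    offered first, so the shortest consumption comes out first.
    """
    yield text  # zero repetitions: nothing consumed
    # The check on token guards against looping forever on an empty token.
    if token and text.startswith(token):
        # One repetition, then star again on the rest of the input.
        yield from star_remainders(token, text[len(token):])


def matches_prefix(token, text):
    """Prefix mode: list every remainder the grammar can leave behind."""
    return list(star_remainders(token, text))


def matches_entirely(token, text):
    """All-or-nothing mode: succeed only if some parse leaves no remainder."""
    return "" in star_remainders(token, text)


print(matches_prefix("a", "aaa"))    # ['aaa', 'aa', 'a', '']
print(matches_entirely("a", "aaa"))  # True
print(matches_entirely("a", "aab"))  # False
```

As with phrase/2 and phrase/3, the all-or-nothing check is just the prefix check with the remainder required to be empty.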

It is a little uncanny that this should come up today, since I think it was only yesterday or the day before that I tentatively decided to make Aparecium offer an invocation option that would report which prefixes of the input the grammar could consume.  At the moment it’s just pie in the sky, since I have other tasks to do before I can implement it.

On request, my plan is to make Aparecium return a parse-forest grammar; allowing that grammar to reflect parses that consume only part of the input is an easy change.  

In all-or-nothing mode, the parse-forest grammar to be returned in this case is simple:

    Goal: string_1_3.
    string_1_3: "a", "a", "a".

In show-all-prefixes mode, the parse-forest grammar will be slightly longer:

    Goal: string_1_3; string_1_2; string_1_1; string_1_0.
    string_1_3: "a", "a", "a".
    string_1_2: "a", "a".
    string_1_1: "a".
    string_1_0: .

In the parse-forest grammar, each nonterminal from the original grammar is extended with two affixes, one showing the offset at which the nonterminal in question starts (here always 1) and the second the length of the substring.  For input $i, a nonterminal n_x_y matches the string substring($i, $x, $y).
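
The naming scheme can be illustrated with a small Python sketch that builds the rules shown above for the grammar string: "a"*.  It is hard-wired to that one grammar and is not Aparecium code; forest_grammar and xpath_substring are hypothetical names, and the all_prefixes flag merely stands in for the two invocation modes.

```python
def xpath_substring(text, start, length):
    """Like XPath substring(): start is 1-based, length counts characters."""
    return text[start - 1:start - 1 + length]


def forest_grammar(text, all_prefixes=False):
    """Return parse-forest rules for  string: "a"*.  applied to text.

    Each nonterminal is named string_<offset>_<length>; the offset is
    always 1 here because string can only start at the first character.
    """
    # Number of leading "a" characters: the longest prefix string can consume.
    longest = len(text) - len(text.lstrip("a"))
    if all_prefixes:
        lengths = list(range(longest, -1, -1))
    else:
        # All-or-nothing: only a parse covering the whole input counts.
        lengths = [longest] if longest == len(text) else []
    if not lengths:
        return []  # no parse at all
    names = [f"string_1_{n}" for n in lengths]
    rules = ["Goal: " + "; ".join(names) + "."]
    for name, n in zip(names, lengths):
        body = ", ".join(['"a"'] * n)
        rules.append(f"{name}: {body}.")
    return rules


print("\n".join(forest_grammar("aaa")))
print("\n".join(forest_grammar("aaa", all_prefixes=True)))
```

For every rule string_1_n produced this way, xpath_substring(text, 1, n) is the part of the input that the nonterminal matches.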

Of course, Aparecium will also have invocation options that provide the relevant parse trees as well.

Michael



********************************************
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
cmsmcq@blackmesatech.com
http://www.blackmesatech.com
********************************************
Received on Monday, 12 April 2021 17:24:13 UTC
