Re: Does ixml have to match the whole input? from John Lumley on 2021-12-31 (public-ixml@w3.org from December 2021)

From: John Lumley <john@saxonica.com>
Date: Fri, 31 Dec 2021 14:20:13 +0000
To: Steven Pemberton <steven.pemberton@cwi.nl>
Cc: Norm Tovey-Walsh <norm@saxonica.com>, ixml <public-ixml@w3.org>
Message-Id: <5E6438F0-37A4-4CFB-B975-4D916896F043@saxonica.com>
I haven’t the bandwidth to discuss at length at present (by a pool in Tenerife ;-), but in a sense ‘whitespace’ isn’t just another set of characters: these are characters that ‘do not make visible marks’ and that, in most human languages, denote separation of tokens in linear variable-length encodings.
Apart from the notion of a line-feed/carriage-return, the distinction between one and several consecutive whitespace characters is sort of immaterial unless the grammar is especially particular. So, for practical uses we might need to consider semi-special treatment of whitespace, especially in start/end/tokenisation situations, perhaps by some predefined (pragmaed?) constructs enabling such a ‘tokenisation’ regime?

(Dives into pool to avoid the resulting roasting…)

Sent from my iPad

> On 31 Dec 2021, at 13:43, Steven Pemberton <steven.pemberton@cwi.nl> wrote:
> 
> On Friday 31 December 2021 11:16:49 (+01:00), Norm Tovey-Walsh wrote:
> 
> > Hello,
> > 
> > I feel like I saw mention of this recently, but can’t now put my hands
> > on the message where I saw it. Apologies for my failure to get this
> > message into the correct thread.
> > 
> > Consider this test from Steven:
> > 
> > a: "a", spaces, b.
> > b: spaces, "b".
> > spaces: " "*.
> > 
> > And the sample input file for that test:
> > 
> > a   b
> > 
> > For clarity:
> > 
> > $ od -a tests/ambig3.inp
> > 0000000 a sp sp sp b nl
> > 0000006
> > 
> > I assert that the input does not match the grammar because there’s no
> > parse that allows the trailing newline character.
> 
> Correct. The section on conformance contains this constraint:
> 
>  In the normal case, when the input has a determinate length (either known in advance or signaled by some end-of-stream signal), the processor must by default parse the input in its entirety against the grammar and return either a parse tree or a failure document. Processors may provide user options for other behaviors (such as parsing the largest, or smallest, prefix of the input that is described by the grammar). Processors may also support invocation with input streams of indeterminate length.
> 
> This was what I was referring to in my recent mail ('Change in live version of ixml processor' https://lists.w3.org/Archives/Public/public-ixml/2021Dec/0097):
> 
> This is a possible future discussion point:
> 
>  If a parse succeeds without using all the available input, should that be reported as a parse error, or as an ixml:state="incomplete" (or something similar)?
> 
> meaning that a parse had been found for the root symbol, but there were trailing characters after the parse.
> 
> But that mail was also pointing out that my processor used to do the wrong thing, and that I have now fixed it. Some of the tests need to be updated accordingly, including the one mentioned above. (And I will be uploading the correct version today; in fact I did it after writing that sentence).
> 
> > We could say that it matches, with a trailing newline left over, but I’d
> > rather not. If we do, it’ll just introduce more variation in what the
> > processor has to consume and produce. If trailing whitespace is allowed,
> > why not leading whitespace? Why not both? Exactly one, or arbitrary
> > amounts? What if I want a grammar that *does* match leading and/or
> > trailing whitespace, etc. etc. etc.
> 
> It is important to note that "whitespace" is not a processing concept in ixml parsing. There are only characters. How those characters are interpreted is up to the ixml author.
> But it is easy to add an optional trailing line feed to a grammar:
> 
>    root: ...stuff..., -lf?.
>    lf: -#a.
> 
> > The grammar could be updated to accept trailing newlines, or the user
> > could strip them off before attempting to parse. Either of those seems
> > preferable to saying that arbitrary left over characters at the ends are
> > ok.
> > 
> > With respect to the test suite, I’d be happy to say that all inputs
> > should have either all or exactly one trailing newline stripped off
> > before attempting to parse. Or not. A decent editor should allow you to
> > control whether or not a trailing newline occurs; it’s just a little
> > tedious to manage the distinction.
> 
> The tests should be correct wrt the spec. No trailing extra characters unless deliberate.
> 
> Steven
> 
> > 
> > Be seeing you,
> > norm
> > 
> > --
> > Norm Tovey-Walsh
> > Saxonica
> >
Received on Friday, 31 December 2021 14:20:35 UTC
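
As a rough illustration of the distinction discussed in this thread, the Python sketch below contrasts whole-input matching with prefix matching on the sample input (three spaces and a trailing newline, as in the od dump). It is not an ixml processor: the regular expressions are hypothetical stand-ins for the test grammar, and the names GRAMMAR, GRAMMAR_LF, parse_whole and parse_prefix are invented for this example. A regex also hides the fact that the grammar is ambiguous about how the spaces are divided between the rules.

```python
import re

# Hypothetical regex stand-in for the test grammar:
#   a: "a", spaces, b.
#   b: spaces, "b".
#   spaces: " "*.
GRAMMAR = re.compile(r"a *b")

# The same language plus an optional trailing line feed (#a),
# along the lines of the "-lf?" suggestion in the thread.
GRAMMAR_LF = re.compile(r"a *b\n?")


def parse_whole(pattern, text):
    """Default behaviour: succeed only if the entire input is matched."""
    return pattern.fullmatch(text) is not None


def parse_prefix(pattern, text):
    """Optional behaviour: match a prefix and report the unconsumed rest."""
    m = pattern.match(text)
    if m is None:
        return None
    return text[:m.end()], text[m.end():]


sample = "a   b\n"

print(parse_whole(GRAMMAR, sample))     # whole-input match against the original grammar
print(parse_prefix(GRAMMAR, sample))    # prefix match, with the leftover characters
print(parse_whole(GRAMMAR_LF, sample))  # whole-input match once the newline is allowed
```

Under these assumptions the first call fails because the newline is left over, the second reports the newline as unconsumed input, and the third succeeds because the grammar itself accounts for the trailing character.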