Re: predicateObjectList rule requires lookahead from Gregg Kellogg on 2011-12-16 (public-rdf-comments@w3.org from December 2011)

From: Gregg Kellogg <gregg@kellogg-assoc.com>
Date: Thu, 15 Dec 2011 22:13:40 -0500
To: David Robillard <d@drobilla.net>
CC: "public-rdf-comments@w3.org" <public-rdf-comments@w3.org>, Gavin Carothers <gavin@carothers.name>
Message-ID: <9D57562D-0F40-4BA7-9DC0-527C0F78CA4D@kellogg-assoc.com>

On Dec 15, 2011, at 11:14 AM, David Robillard wrote:

> On Thu, 2011-12-15 at 13:20 -0500, Gregg Kellogg wrote:
>> On Dec 15, 2011, at 9:51 AM, "David Robillard" <d@drobilla.net> wrote:
>> 
>>> On Thu, 2011-12-15 at 10:08 -0500, Gregg Kellogg wrote:
>>>> I believe that grammar rule [7] predicateObjectList [1] is not LL(1) and requires look ahead to know what branch to go into. For example:
>>> 
>>> Turtle has never been LL(1).
>>> 
>>> You need readahead for BooleanLiteral, since "true" or "false" could
>>> also be the start of a PrefixedName.
>> 
>> Using white space to separate tokens where necessary has always been part of Turtle. Assuming this, Turtle (and SPARQL) is LL(1).
> 
> I suppose you mean the parser must read a token at a time, and after
> reading an entire token can decide what rule applies.  Fair enough, my
> implementation needing readahead in this case does not imply Turtle is
> not theoretically LL(1), my mistake.
> 
> (Forgive my ignorance of common assumption/convention when using parser
> generators, I am assuming my feedback from having hand-written a
> parser that very explicitly and directly maps to the grammar may be
> valuable)
> 
> My issues admittedly stem from having originally implemented an earlier
> version of the spec that, among other things, did not separate terminals
> from non-terminal rules, and did not define what a "token" is at all.  I
> guess only terminal rules define tokens and do *not* implicitly have
> inserted whitespace (whereas non-terminal rules are combinations of
> tokens which are inherently separated by whitespace).  I do not see this
> defined in any document cited by the spec.

I believe the EBNF grammar defines productions and tokens separately. By convention, tokens are written in upper case, but a token can also be any rule that appears after the @terminals keyword. Whitespace is defined by @pass, I believe (see http://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/turtle.bnf).

Section 4.1 indicates that whitespace is used to separate two tokens that would otherwise be confused, which is rather vague but common. This does allow for whitespace within tokens and IRI_REFs, but that is not actually used in the grammar.

> Should it be precisely defined what constitutes whitespace between
> tokens? There are many more Unicode whitespace characters than the ws
> rule in the spec.

@pass provides a definition:

@pass ::= [ \t\r\n]+ 
 | "#" [^\r\n]*

There's also a "ws" reference within the text of the spec, but this doesn't resolve.
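
As a rough illustration of how a tokenizer might apply that rule between tokens, here is a minimal Ruby sketch. The constant PASS and the helper skip_pass are hypothetical names chosen for this example; only the pattern itself comes from the @pass definition above.

```ruby
require 'strscan'

# Hypothetical constant mirroring the @pass production quoted above:
# a run of space/tab/CR/LF, or a "#" comment running to the end of the line.
PASS = /[ \t\r\n]+|#[^\r\n]*/

# Hypothetical helper: advance the scanner past any whitespace and comments.
# Each successful skip consumes at least one character, so the loop terminates.
def skip_pass(scanner)
  nil while scanner.skip(PASS)
  scanner
end

scanner = StringScanner.new("  # a comment\n\t<http://example.org/a>")
skip_pass(scanner)
rest = scanner.rest  # whatever remains starts at the next token
```

Because the skip only runs between tokens, a "#" inside an IRI is never mistaken for the start of a comment.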

>> My parser [1] is LL(1).
> 
> How do you deal with quotes in long string literals without readahead?

Basically, I use a streaming tokenizer where the tokens are identified using regular expressions. The one for long string literals looks like the following:

    STRING_LITERAL_LONG2 = /"""(?:(?:"|"")?(?:[^"\\]|#{ECHAR}|#{UCHAR}))*"""/m    # [90s]

Apart from requesting the next token, the only time the parser interacts with the tokenizer is during error recovery, where it just seeks forward until it finds a valid token based on the set of regular expressions identifying the terminal productions.
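
To make the idea concrete, here is a minimal sketch of such a regex-driven tokenizer. It is not the actual parser code. The ECHAR and UCHAR definitions, the simplified IRI_REF pattern, the tiny TERMINALS table, and the tokenize method are all assumptions made for this example; only the STRING_LITERAL_LONG2 pattern is taken from the message.

```ruby
require 'strscan'

# Assumed escape definitions (not given in the message).
UCHAR = /\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}/
ECHAR = /\\[tbnrf\\"']/

# Pattern from the message: one or two quotes may appear inside the string
# as long as a non-quote character (or an escape) follows them.
STRING_LITERAL_LONG2 = /"""(?:(?:"|"")?(?:[^"\\]|#{ECHAR}|#{UCHAR}))*"""/m

# Hypothetical, deliberately tiny terminal table, tried in order.
TERMINALS = [
  [:STRING_LITERAL_LONG2, STRING_LITERAL_LONG2],
  [:IRI_REF, /<[^<>\s]*>/],   # simplified for the sketch
  [:PUNCT, /[.;,]/]
]

# Whitespace and comments skipped between tokens (the @pass rule).
PASS = /[ \t\r\n]+|#[^\r\n]*/

# Hypothetical method: returns an Enumerator that produces [type, text]
# pairs on demand, so the parser only ever sees whole tokens.
def tokenize(input)
  Enumerator.new do |out|
    scanner = StringScanner.new(input)
    until scanner.eos?
      next if scanner.skip(PASS)
      matched = nil
      TERMINALS.each do |type, pattern|
        text = scanner.scan(pattern)
        if text
          matched = [type, text]
          break
        end
      end
      raise ArgumentError, "no terminal matches at offset #{scanner.pos}" unless matched
      out << matched
    end
  end
end

input = %q{<http://example.org/s> <http://example.org/p> """He said "hi" and left""" .}
tokens = tokenize(input).to_a
```

Any scanning ahead needed to find the end of a long string happens inside the regular expression match, so the grammar-level parser still decides on one token at a time.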

Gregg

> -dr
> 
>

Received on Friday, 16 December 2011 03:14:38 UTC