Re: predicateObjectList rule requires lookahead from Gavin Carothers on 2011-12-18 (public-rdf-comments@w3.org from December 2011)

From: Gavin Carothers <gavin@carothers.name>
Date: Sun, 18 Dec 2011 15:21:29 -0800
To: Gregg Kellogg <gregg@kellogg-assoc.com>
Cc: David Robillard <d@drobilla.net>, "public-rdf-comments@w3.org" <public-rdf-comments@w3.org>
Message-ID: <CAPqY83x5bdtnBjcnU9uwia8fRrGL-L_bydj3w+6kVNkYJ8vzuQ@mail.gmail.com>

On Thu, Dec 15, 2011 at 7:13 PM, Gregg Kellogg <gregg@kellogg-assoc.com> wrote:
> On Dec 15, 2011, at 11:14 AM, David Robillard wrote:
>
>> On Thu, 2011-12-15 at 13:20 -0500, Gregg Kellogg wrote:
>>> On Dec 15, 2011, at 9:51 AM, "David Robillard" <d@drobilla.net> wrote:
>>>
>>>> On Thu, 2011-12-15 at 10:08 -0500, Gregg Kellogg wrote:
>>>>> I believe that grammar rule [7] predicateObjectList [1] is not LL(1) and requires look ahead to know what branch to go into. For example:
>>>>
>>>> Turtle has never been LL(1).
>>>>
>>>> You need readahead for BooleanLiteral, since "true" or "false" could
>>>> also be the start of a PrefixedName.
>>>
>>> Using white space to separate tokens where necessary has always been part of Turtle. Assuming this, Turtle (and SPARQL) is LL(1).
>>
>> I suppose you mean the parser must read a token at a time, and after
>> reading an entire token can decide what rule applies.  Fair enough, my
>> implementation needing readahead in this case does not imply Turtle is
>> not theoretically LL(1), my mistake.
>>
>> (Forgive my ignorance of common assumption/convention when using parser
>> generators, I am assuming my feedback from having hand-written a
>> parser that very explicitly and directly maps to the grammar may be
>> valuable)
>>
>> My issues admittedly stem from having originally implemented an earlier
>> version of the spec that, among other things, did not separate terminals
>> from non-terminal rules, and did not define what a "token" is at all.  I
>> guess only terminal rules define tokens and do *not* implicitly have
>> inserted whitespace (whereas non-terminal rules are combinations of
>> tokens which are inherently separated by whitespace).  I do not see this
>> defined in any document cited by the spec.
>
> I believe the EBNF grammar defines productions and tokens separately; by convention, tokens are in CAPITAL CASE, but they can also be anything after the @terminals keyword. Whitespace is handled by @pass, I believe. (see http://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/turtle.bnf)
>
> 4.1 indicates that whitespace is used to separate two tokens that would otherwise be confused, which is rather vague, but common. This does allow for whitespace within tokens and IRI_REFs, but is not actually used in the grammar.
>
>> Should it be precisely defined what constitutes whitespace between
>> tokens? There are many more Unicode whitespace characters than the ws
>> rule in the spec.
>
> @pass provides a definition:
>
> @pass ::= [ \t\r\n]+
>  | "#" [^\r\n]*
>
> There's also a "ws" reference within the text of the spec, but this doesn't resolve.

WS is specifically for two terminals that allow for ONLY whitespace inside them:

[92s] NIL ::= "(" (WS)* ")"

[93s] WS ::= " "
 | "\t"
 | "\r"
 | "\n"

[94s] ANON ::= "[" (WS)* "]"
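
As a rough illustration of what these three productions accept, here is a minimal Ruby sketch. It is not taken from any implementation mentioned in this thread; the constant names NIL_RE and ANON_RE and the two helper methods are hypothetical. Because only WS is allowed between the brackets, a comment inside "( )" or "[ ]" would not match these terminals.

```ruby
# Hypothetical sketch: WS mirrors production [93s]; all names are illustrative.
WS      = /[ \t\r\n]/
NIL_RE  = /\((?:#{WS})*\)/   # [92s] NIL  ::= "(" (WS)* ")"
ANON_RE = /\[(?:#{WS})*\]/   # [94s] ANON ::= "[" (WS)* "]"

# True when the whole string is a single NIL token.
def nil_token?(text)
  /\A#{NIL_RE}\z/.match?(text)
end

# True when the whole string is a single ANON token.
def anon_token?(text)
  /\A#{ANON_RE}\z/.match?(text)
end
```

With this sketch, nil_token?("( \n )") holds, while nil_token?("( # note\n)") does not, since "#" is not one of the four WS characters.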

>
>>> My parser [1] is LL(1).
>>
>> How do you deal with quotes in long string literals without readahead?
>
> Basically, I use a streaming tokenizer where the tokens are identified using regular expressions. Mine looks like the following:
>
>    STRING_LITERAL_LONG2 = /"""(?:(?:"|"")?(?:[^"\\]|#{ECHAR}|#{UCHAR}))*"""/m    # [90s]
>
> The only time the parser interacts with the tokenizer is during error recovery, where it just seeks forward until it finds a valid token based on the set of regular expressions identifying the terminal productions.
>
> Gregg
>
>> -dr
>>
>>
>
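
To make the long-string case above concrete, here is a minimal Ruby sketch of how a regular expression of that shape can recognize embedded quotes without the parser doing any readahead of its own. The STRING_LITERAL_LONG2 pattern is copied from the quoted message; the ECHAR and UCHAR definitions are my assumptions (they are not shown in this thread), and scan_long_string is a hypothetical helper, not part of Gregg's parser.

```ruby
require 'strscan'

# Assumed definitions; the thread does not show ECHAR or UCHAR.
ECHAR = /\\[tbnrf\\"']/
UCHAR = /\\u\h{4}|\\U\h{8}/

# Pattern as quoted above: one or two quotes may precede any non-quote
# character or escape, so only three quotes in a row close the literal.
STRING_LITERAL_LONG2 = /"""(?:(?:"|"")?(?:[^"\\]|#{ECHAR}|#{UCHAR}))*"""/m

# Hypothetical helper: returns the long string token at the start of
# the input, or nil if the input does not begin with a complete one.
def scan_long_string(input)
  StringScanner.new(input).scan(STRING_LITERAL_LONG2)
end
```

Any backtracking happens inside the regular expression engine while it matches one token, so the grammar-level parser still only ever sees whole tokens.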

Received on Sunday, 18 December 2011 23:21:57 UTC