Re: predicateObjectList rule requires lookahead from Andy Seaborne on 2011-12-15 (public-rdf-comments@w3.org from December 2011)

From: Andy Seaborne <andy.seaborne@epimorphics.com>
Date: Thu, 15 Dec 2011 18:44:23 +0000
To: public-rdf-comments@w3.org
Message-ID: <4EEA4007.90903@epimorphics.com>

On 15/12/11 18:20, Gregg Kellogg wrote:
> On Dec 15, 2011, at 9:51 AM, "David Robillard"<d@drobilla.net>  wrote:
>
>> On Thu, 2011-12-15 at 10:08 -0500, Gregg Kellogg wrote:
>>> I believe that grammar rule [7] predicateObjectList [1] is not LL(1) and requires look ahead to know what branch to go into. For example:
>>
>> Turtle has never been LL(1).
>>
>> You need readahead for BooleanLiteral, since "true" or "false" could
>> also be the start of a PrefixedName.

Prefix names vs true/false is (usually) done in tokenizering, not the 
grammar.  Different technology.  lex / yacc.

It might be better to have

<TRUE> := "true"

but, conventionally, strings constants are assumed to be tokens, the 
things in <...>.

By the time it gets to the "grammar" it's a token that's the <true>, 
<false>,  or <PNAME_NS> or <PNAME_LN>.

And it's why 123 is a number and not three numbers.  The tokenizer is 
greedy.

>
> Using white space to separate tokens where necessary has always been part of Turtle. Assuming this, Turtle (and SPARQL) is LL(1).
>
> My parser [1] is LL(1).

Ditto (javacc, no lookahead tricks), and one also a recursive decent parser.

There is a LL(1) grammar for the Turtle language; the spec gives a 
grammar for the language but it's not the only possible grammar for the 
language.

>
> Gregg
>
> 1: http://github.com/rdf-turtle

http://github.com/gkellogg/rdf-turtle

>
>> This is the worst case, 6 character readahead.

Yes, but usually it's a token/grammar split and that's the tokenizer 
building up it's state before returning a classified tokens.  Turtle is 
quite simple as a set of tokens.

You may be able to tokenize on single characters, and build a grammar 
for the language based on that, but madness may be the result.  It 
certainly isn't the grammar for the language in the spec.

	Andy

>>
>> Similarly,
>>
>> [9] verb ::= predicate | 'a'
>>
>> Requires a 2 character readahead (to check if the 'a' is followed by
>> whitespace since 'a' can start a predicate.
>>
>> In general, qualified names and keywords are ambiguous while parsing.
>> IMO either qualified names should have had quoting ("[foo:bar]",
>> perhaps), or the special keywords ("a", "true", "false") should have had
>> a unique prefix character, which would solve this problem and make the
>> grammar extensible, perhaps even 'dynamically' via a @keyword directive.
>> It's too late for that now, however.
>>
>> I also had to use it in my parser to correctly handle quote characters
>> in long string literals, since you can read up to 2 of them and have it
>> not terminate the string, i.e. every time you encounter a quote you must
>> read ahead 3 characters to determine if this is the end of the string
>> literal.  I don't see how this could have been avoided, other than
>> simply making single quote strings be long literals, but this would have
>> meant quotes would always need escaping in a string literal.
>>
>> -dr
>>
>>
>

Received on Thursday, 15 December 2011 18:44:56 UTC