- From: Gregg Kellogg <gregg@kellogg-assoc.com>
- Date: Thu, 15 Dec 2011 22:13:40 -0500
- To: David Robillard <d@drobilla.net>
- CC: "public-rdf-comments@w3.org" <public-rdf-comments@w3.org>, Gavin Carothers <gavin@carothers.name>
On Dec 15, 2011, at 11:14 AM, David Robillard wrote:
> On Thu, 2011-12-15 at 13:20 -0500, Gregg Kellogg wrote:
>> On Dec 15, 2011, at 9:51 AM, "David Robillard" <d@drobilla.net> wrote:
>>
>>> On Thu, 2011-12-15 at 10:08 -0500, Gregg Kellogg wrote:
>>>> I believe that grammar rule [7] predicateObjectList [1] is not LL(1) and requires look ahead to know what branch to go into. For example:
>>>
>>> Turtle has never been LL(1).
>>>
>>> You need readahead for BooleanLiteral, since "true" or "false" could
>>> also be the start of a PrefixedName.
>>
>> Using white space to separate tokens where necessary has always been part of Turtle. Assuming this, Turtle (and SPARQL) is LL(1).
>
> I suppose you mean the parser must read a token at a time, and after
> reading an entire token can decide what rule applies. Fair enough, my
> implementation needing readahead in this case does not imply Turtle is
> not theoretically LL(1), my mistake.
>
> (Forgive my ignorance of common assumption/convention when using parser
> generators, I am assuming my feedback from having hand-written a
> parser that very explicitly and directly maps to the grammar may be
> valuable)
>
> My issues admittedly stem from having originally implemented an earlier
> version of the spec that, among other things, did not separate terminals
> from non-terminal rules, and did not define what a "token" is at all. I
> guess only terminal rules define tokens and do *not* implicitly have
> inserted whitespace (whereas non-terminal rules are combinations of
> tokens which are inherently separated by whitespace). I do not see this
> defined in any document cited by the spec.
I believe the EBNF grammar defines productions and tokens separately. By convention, tokens are in CAPITAL CASE, but they can also be anything after the @terminals keyword. Whitespace is handled by @pass, I believe (see http://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/turtle.bnf).
Section 4.1 indicates that whitespace is used to separate two tokens that would otherwise be confused, which is rather vague, but common. This does allow for whitespace within tokens and IRI_REFs, but is not actually used in the grammar.
> Should it be precisely defined what constitutes whitespace between
> tokens? There are many more unicode whitespace characters than the ws
> rule in the spec.
@pass provides a definition:
    @pass ::= [ \t\r\n]+
            | "#" [^\r\n]*
There's also a "ws" reference within the text of the spec, but this doesn't resolve.
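As a rough illustration of what the @pass rule above covers, here is a minimal Ruby sketch using the standard StringScanner. The names PASS and skip_pass are hypothetical and not taken from any existing parser. It skips only the four ASCII whitespace characters and "#" comments, so other Unicode whitespace (U+00A0, for example) is left in place.

```ruby
require 'strscan'

# Hypothetical constant mirroring the @pass rule quoted above:
# a run of space, tab, CR or LF, or a '#' comment up to end of line.
PASS = /[ \t\r\n]+|#[^\r\n]*/

# Advance the scanner past any sequence of whitespace and comments.
# Both alternatives consume at least one character, so the loop terminates.
def skip_pass(scanner)
  nil while scanner.skip(PASS)
  scanner
end

s = StringScanner.new("  # a comment\n\t<http://example/s>")
skip_pass(s)
puts s.rest
```

Whether a tokenizer should also treat other Unicode whitespace as a separator is exactly the open question raised above; this sketch follows the rule as written.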
>> My parser [1] is LL(1).
>
> How do you deal with quotes in long string literals without readahead?
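Basically, I use a streaming tokenizer where the tokens are identified using regular expressions, so a long string literal is recognized as a single token before the parser ever sees it. My terminal for it looks like the following: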
STRING_LITERAL_LONG2 = /"""(?:(?:"|"")?(?:[^"\\]|#{ECHAR}|#{UCHAR}))*"""/m # [90s]
The only time the parser interacts with the tokenizer is during error recovery, where it just seeks forward until it finds a valid token based on the set of regular expressions identifying the terminal productions.
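To make the token-at-a-time idea concrete, here is a minimal Ruby sketch of a regex-driven tokenizer. It is not the actual parser's code: ECHAR, UCHAR, IRI_REF, PNAME_LN, BOOLEAN, PUNCT, TERMINALS and each_token are all simplified, hypothetical definitions, and choosing the longest match is an assumption made for illustration. The recovery branch simply drops one character and retries, which is one crude way to seek forward to the next valid token.

```ruby
require 'strscan'

# Assumed, simplified terminals (hypothetical; not the spec's full rules).
ECHAR = /\\[tbnrf\\"']/
UCHAR = /\\u\h{4}|\\U\h{8}/
STRING_LITERAL_LONG2 = /"""(?:(?:"|"")?(?:[^"\\]|#{ECHAR}|#{UCHAR}))*"""/m
IRI_REF  = /<[^<>\s]*>/
PNAME_LN = /[A-Za-z][\w-]*:[\w-]+/
BOOLEAN  = /true|false/
PUNCT    = /[.;,]/
PASS     = /[ \t\r\n]+|#[^\r\n]*/

TERMINALS = {
  STRING_LITERAL_LONG2: STRING_LITERAL_LONG2,
  IRI_REF: IRI_REF,
  PNAME_LN: PNAME_LN,
  BOOLEAN: BOOLEAN,
  PUNCT: PUNCT
}.freeze

# Yield [type, text] one token at a time, picking the longest match
# at the current position. The parser only ever sees whole tokens.
def each_token(input)
  return enum_for(:each_token, input) unless block_given?

  scanner = StringScanner.new(input)
  until scanner.eos?
    next if scanner.skip(PASS)

    # match? reports the match length without advancing the scanner.
    type, _len = TERMINALS
                 .map { |name, re| [name, scanner.match?(re)] }
                 .reject { |_, n| n.nil? }
                 .max_by { |_, n| n }

    if type
      yield type, scanner.scan(TERMINALS[type])
    else
      # Crude recovery: discard one character and try again.
      scanner.getch
    end
  end
end

input = '<http://example/s> ex:flag true ; ex:note """say "hi" now""" .'
each_token(input) { |type, text| puts "#{type}: #{text}" }
```

With longest match, "true" followed by whitespace comes out as a boolean while "true:b" comes out as a prefixed name, and the embedded quotes in the long string never reach the parser as separate tokens.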
Gregg
> -dr
>
>
Received on Friday, 16 December 2011 03:14:38 UTC