- From: Gavin Carothers <gavin@carothers.name>
- Date: Sun, 18 Dec 2011 15:21:29 -0800
- To: Gregg Kellogg <gregg@kellogg-assoc.com>
- Cc: David Robillard <d@drobilla.net>, "public-rdf-comments@w3.org" <public-rdf-comments@w3.org>
On Thu, Dec 15, 2011 at 7:13 PM, Gregg Kellogg <gregg@kellogg-assoc.com> wrote:
> On Dec 15, 2011, at 11:14 AM, David Robillard wrote:
>
>> On Thu, 2011-12-15 at 13:20 -0500, Gregg Kellogg wrote:
>>> On Dec 15, 2011, at 9:51 AM, "David Robillard" <d@drobilla.net> wrote:
>>>
>>>> On Thu, 2011-12-15 at 10:08 -0500, Gregg Kellogg wrote:
>>>>> I believe that grammar rule [7] predicateObjectList [1] is not LL(1) and requires look ahead to know what branch to go into. For example:
>>>>
>>>> Turtle has never been LL(1).
>>>>
>>>> You need readahead for BooleanLiteral, since "true" or "false" could
>>>> also be the start of a PrefixedName.
>>>
>>> Using white space to separate tokens where necessary has always been part of Turtle. Assuming this, Turtle (and SPARQL) is LL(1).
>>
>> I suppose you mean the parser must read a token at a time, and after
>> reading an entire token can decide what rule applies. Fair enough, my
>> implementation needing readahead in this case does not imply Turtle is
>> not theoretically LL(1), my mistake.
>>
>> (Forgive my ignorance of common assumption/convention when using parser
>> generators, I am assuming my feedback from having written hand-written a
>> parser that very explicitly and directly maps to the grammar may be
>> valuable)
>>
>> My issues admittedly stem from having originally implemented an earlier
>> version of the spec that, among other things, did not separate terminals
>> from non-terminal rules, and did not define what a "token" is at all. I
>> guess only terminal rules define tokens and do *not* implicitly have
>> inserted whitespace (whereas non-terminal rules are combinations of
>> tokens which are inherently separated by whitespace). I do not see this
>> defined in any document cited by the spec.
>
> I believe the EBNF grammar defines productions and tokens separately, by convention, tokens are in CAPITAL CASE, but they can also be anything after the @terminals keyword. Whitespace by @pass, I believe.
(see http://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/turtle.bnf)
>
> 4.1 indicates that whitespace is used to separate two tokens that would otherwise be confused, which is rather vague, but common. This does allow for whitespace within tokens and IRI_REFs, but is not actually used in the grammar.
>
>> Should it be precisely defined what constitues whitespace between
>> tokens? There are many more unicode whitespace characters than the ws
>> rule in the spec.
>
> @pass provides a definition:
>
> @pass ::= [ \t\r\n]+
>         | "#" [^\r\n]*
>
> There's also a "ws" reference within the text of the spec, but this doesn't resolve.

WS is specifically for two terminals that allow for ONLY whitespace inside them:

[92s] NIL ::= "(" (WS)* ")"
[93s] WS ::= " " | "\t" | "\r" | "\n"
[94s] ANON ::= "[" (WS)* "]"

>
>>> My parser [1] is LL(1).
>>
>> How do you deal with quotes in long string literals without readahead?
>
> Basically, I use a streaming tokenizer where the tokens are identified using regular expressions. Mine looks like the following:
>
> STRING_LITERAL_LONG2 = /"""(?:(?:"|"")?(?:[^"\\]|#{ECHAR}|#{UCHAR}))*"""/m # [90s]
>
> The only time the parser interacts with the tokenizer is during error recovery, where it just seeks forward until it finds a valid token based on the set of regular expressions identifying the terminal productions.
>
> Gregg
>
>> -dr
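To make the token-at-a-time point concrete, here is a minimal Ruby sketch of a regex-driven tokenizer along the lines Gregg describes. It is not his implementation. Only the STRING_LITERAL_LONG2 pattern and the @pass, NIL, WS and ANON rules come from this thread; the HEX, UCHAR and ECHAR definitions are my assumptions modelled on the Turtle grammar, and the names PASS, TERMINALS and tokenize are hypothetical.

```ruby
require 'strscan'

# Assumed helper terminals (not shown in the thread), modelled on the Turtle grammar.
HEX   = /[0-9A-Fa-f]/
UCHAR = /\\u#{HEX}{4}|\\U#{HEX}{8}/
ECHAR = /\\[tbnrf\\"']/

# Quoted from the thread: production [90s].
STRING_LITERAL_LONG2 = /"""(?:(?:"|"")?(?:[^"\\]|#{ECHAR}|#{UCHAR}))*"""/m

# [93s] WS, used only inside NIL and ANON.
WS = /[ \t\r\n]/

# @pass: whitespace and comments that may appear between tokens.
PASS = /(?:[ \t\r\n]+|\#[^\r\n]*)+/

# A deliberately tiny terminal table; a real tokenizer would list every terminal.
TERMINALS = {
  STRING_LITERAL_LONG2: STRING_LITERAL_LONG2,
  NIL:  /\(#{WS}*\)/,   # [92s]
  ANON: /\[#{WS}*\]/    # [94s]
}.freeze

# Hypothetical helper: returns [type, text] pairs, one whole token at a time.
def tokenize(input)
  scanner = StringScanner.new(input)
  tokens = []
  until scanner.eos?
    next if scanner.skip(PASS)            # drop inter-token whitespace and comments
    type, = TERMINALS.find { |_, re| scanner.match?(re) }
    raise ArgumentError, "no terminal matches at offset #{scanner.pos}" unless type
    tokens << [type, scanner.scan(TERMINALS[type])]
  end
  tokens
end

p tokenize(%q{( ) [] """say "hi" now"""})
```

In this sketch the embedded quotes of the long string are consumed inside the terminal's regular expression, so a parser sitting on top of tokenize only ever sees complete tokens, and whitespace inside "( )" belongs to NIL rather than to @pass.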
Received on Sunday, 18 December 2011 23:21:57 UTC