- From: Gregg Kellogg <gregg@kellogg-assoc.com>
- Date: Thu, 15 Dec 2011 22:13:40 -0500
- To: David Robillard <d@drobilla.net>
- CC: "public-rdf-comments@w3.org" <public-rdf-comments@w3.org>, Gavin Carothers <gavin@carothers.name>
On Dec 15, 2011, at 11:14 AM, David Robillard wrote:

> On Thu, 2011-12-15 at 13:20 -0500, Gregg Kellogg wrote:
>> On Dec 15, 2011, at 9:51 AM, "David Robillard" <d@drobilla.net> wrote:
>>
>>> On Thu, 2011-12-15 at 10:08 -0500, Gregg Kellogg wrote:
>>>> I believe that grammar rule [7] predicateObjectList [1] is not LL(1)
>>>> and requires look ahead to know what branch to go into. For example:
>>>
>>> Turtle has never been LL(1).
>>>
>>> You need readahead for BooleanLiteral, since "true" or "false" could
>>> also be the start of a PrefixedName.
>>
>> Using white space to separate tokens where necessary has always been
>> part of Turtle. Assuming this, Turtle (and SPARQL) is LL(1).
>
> I suppose you mean the parser must read a token at a time, and after
> reading an entire token can decide what rule applies. Fair enough, my
> implementation needing readahead in this case does not imply Turtle is
> not theoretically LL(1), my mistake.
>
> (Forgive my ignorance of common assumption/convention when using parser
> generators, I am assuming my feedback from having hand-written a
> parser that very explicitly and directly maps to the grammar may be
> valuable)
>
> My issues admittedly stem from having originally implemented an earlier
> version of the spec that, among other things, did not separate terminals
> from non-terminal rules, and did not define what a "token" is at all. I
> guess only terminal rules define tokens and do *not* implicitly have
> inserted whitespace (whereas non-terminal rules are combinations of
> tokens which are inherently separated by whitespace). I do not see this
> defined in any document cited by the spec.

I believe the EBNF grammar defines productions and tokens separately. By convention, tokens are in CAPITAL CASE, but they can also be anything after the @terminals keyword. Whitespace is handled by @pass, I believe (see http://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/turtle.bnf).

Section 4.1 indicates that whitespace is used to separate two tokens that would otherwise be confused, which is rather vague, but common. This does allow for whitespace within tokens and IRI_REFs, but is not actually used in the grammar.

> Should it be precisely defined what constitutes whitespace between
> tokens? There are many more unicode whitespace characters than the ws
> rule in the spec.

@pass provides a definition:

    @pass ::= [ \t\r\n]+ | "#" [^\r\n]*

There's also a "ws" reference within the text of the spec, but this doesn't resolve.

>> My parser [1] is LL(1).
>
> How do you deal with quotes in long string literals without readahead?

Basically, I use a streaming tokenizer where the tokens are identified using regular expressions. Mine looks like the following:

    STRING_LITERAL_LONG2 = /"""(?:(?:"|"")?(?:[^"\\]|#{ECHAR}|#{UCHAR}))*"""/m # [90s]

The only time the parser interacts with the tokenizer is during error recovery, where it just seeks forward until it finds a valid token based on the set of regular expressions identifying the terminal productions.

Gregg

> -dr
>
>
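To make the tokenizer approach described above concrete, here is a minimal sketch in Ruby of a regular-expression tokenizer that skips @pass input and returns whole tokens. It is not the parser referred to in the message. The ECHAR and UCHAR definitions, the TERMINALS table, the simplified PNAME and PUNCT patterns, and the tokenize helper are all assumptions made for illustration; only STRING_LITERAL_LONG2 and the @pass rule are taken from the text.

```ruby
require 'strscan'

# Assumed escape terminals, modelled on the Turtle grammar (not quoted in the message).
UCHAR = /\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}/
ECHAR = /\\[tbnrf\\"']/

# The long-string pattern as quoted in the message.
STRING_LITERAL_LONG2 = /"""(?:(?:"|"")?(?:[^"\\]|#{ECHAR}|#{UCHAR}))*"""/m

# Skipped input, following the @pass rule: whitespace or a "#" comment.
PASS = /[ \t\r\n]+|#[^\r\n]*/

# Hypothetical, heavily simplified terminal table. PNAME is tried before
# BOOLEAN so that "true:foo" is read as one prefixed name, not a boolean.
TERMINALS = [
  [:STRING_LITERAL_LONG2, STRING_LITERAL_LONG2],
  [:PNAME, /(?:[A-Za-z][\w-]*)?:[\w-]*/],
  [:BOOLEAN, /true|false/],
  [:PUNCT, /[.;,]/]
].freeze

# Returns an array of [terminal_name, text] pairs for the whole input.
def tokenize(input)
  scanner = StringScanner.new(input)
  tokens = []
  until scanner.eos?
    next if scanner.skip(PASS)
    # The first pattern that matches at the current position wins.
    name, _pattern = TERMINALS.find { |_, pattern| scanner.scan(pattern) }
    raise ArgumentError, "no terminal matches at offset #{scanner.pos}" unless name
    tokens << [name, scanner.matched]
  end
  tokens
end

tokens = tokenize('ex:s ex:p """say "hi" now""" ; ex:q true . # done')
tokens.each { |name, text| puts "#{name}: #{text}" }
```

Because each regular expression consumes a complete token, a parser built on such a tokenizer only needs to inspect the next token to choose a production. The embedded quotes in the long string are resolved inside the regular expression rather than by parser lookahead.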
Received on Friday, 16 December 2011 03:14:38 UTC