Re: SPARQL and Turtle Prefix Placement

* Andy Seaborne <andy.seaborne@epimorphics.com> [2012-06-15 21:32+0100]
> 
> 
> On 15/06/12 21:21, Eric Prud'hommeaux wrote:
> >* Andy Seaborne<andy.seaborne@epimorphics.com>  [2012-06-15 20:13+0100]
> >>I prefer Gavin's approach.
> >>
> >>No BASE PREFIX; Put '@base' and '@prefix' in the directives.
> >>
> >>http://lists.w3.org/Archives/Public/public-rdf-wg/2012May/0353.html
> >
> >Given the string [[
> >   @base<foo>  .
> >   <s>  <p>  "o"@base .
> >]]
> >
> >, I think we all agree that @base gets used two ways, once as a keyword and once as a language tag. Given a grammar like
> >[[
> >   Doc ::= (Base | Triple)*
> >   Base ::= "@base" IRIREF "."
> >   Triple ::= IRIREF IRIREF RDFLiteral "."
> >   RDFLiteral ::= STRING LANGTAG?
> >   IRIREF ::= "<" [^>]+ ">"
> >   STRING ::= '"' [^"]+ '"'
> >   LANGTAG ::= '@' [a-zA-Z]+ /* also matches "@base" */
> >]]
> >, what terminal was matched when the lexer consumed the first "@base"? How about the second?
> 
> Tokenizers may take different approaches but things like first
> mentioned or most specific both cover this case.
> 
> >The Turtle spec currently doesn't say how to process terminals, or even that there is such a thing. SPARQL's rules 3 and 4 are relevant here:
> >
> >   3 When tokenizing the input and choosing grammar rules, the longest match is chosen.
> 
> They are the same length :-)
> 
> >   4 The SPARQL grammar is LL(1) when the rules with uppercased names are used as terminals.
> >
> >The only machine-readable way I can think to codify that "@base" after a STRING is a LANGTAG involves copying those from SPARQL and adding another "3.5 If multiple terminals of equal length match, the one earliest in the grammar is chosen". Then we simply need BASE before LANGTAG. This may seem needlessly arcane, but I suspect that many Turtle parsers will not do the right thing without this construct. (We could tell them in English, but then they have to figure out how to implement it, and if they're not using a recursive descent parser generator, they'll have to follow the recipe above anyways.)
> 
> As does using an inplace literal "@base" because grammar rules are
> before tokens.

If that (unstated) rule is followed, "@base" will always match the inplace literal and never produce a LANGTAG.
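
A minimal sketch of that behaviour, assuming a context-free tokenizer that applies longest match and breaks ties in favour of the terminal listed first. The terminal table, the DOT terminal and the tokenize() function are hypothetical names for illustration; only the patterns come from the toy grammar quoted above.

```python
import re

# Terminals in priority order. The inplace literal "@base" is listed
# before LANGTAG, mirroring "grammar rules are before tokens".
# DOT is an assumed name for the "." literal in the toy grammar.
TERMINALS = [
    ("BASE", re.compile(r"@base")),
    ("IRIREF", re.compile(r"<[^>]+>")),
    ("STRING", re.compile(r'"[^"]+"')),
    ("LANGTAG", re.compile(r"@[a-zA-Z]+")),
    ("DOT", re.compile(r"\.")),
]


def tokenize(text):
    """Context-free lexer: longest match wins, ties go to the earlier terminal."""
    pos, tokens = 0, []
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best = None
        for name, pattern in TERMINALS:
            m = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so equal-length matches keep the earlier terminal.
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise SyntaxError(f"no terminal matches at offset {pos}")
        tokens.append((best[0], best[1].group()))
        pos = best[1].end()
    return tokens


doc = '@base<foo> .\n<s> <p> "o"@base .'
print([name for name, _ in tokenize(doc)])
```

Both occurrences of "@base" come out as BASE here, so a Triple production that expects LANGTAG after the STRING would fail on the second line.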


> I don't see how RC affects the situation except that they are the
> easiest to make context sensitive and inline constants "just work".

In RC, you can follow more than one path at once. In yacc-style parsers, a given string will be lexed as exactly one token, every time, without regard to context. "@base" will always be the implicit terminal, or it will always be a LANGTAG, but not an implicit terminal once and then a LANGTAG.
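
For contrast, a rough sketch of a hand-written parser for the same toy grammar in which each production asks for the terminal it expects at the current position, rather than receiving one pre-assigned token. This is only one way to get context sensitivity (it does not explore several paths at once), and the Parser class, its method names and the DOT terminal are assumptions made for illustration.

```python
import re

BASE = re.compile(r"@base")
IRIREF = re.compile(r"<[^>]+>")
STRING = re.compile(r'"[^"]+"')
LANGTAG = re.compile(r"@[a-zA-Z]+")
DOT = re.compile(r"\.")  # assumed name for the "." literal


class Parser:
    """Parses Doc ::= (Base | Triple)*, recording which terminal matched."""

    def __init__(self, text):
        self.text, self.pos, self.events = text, 0, []

    def skip_ws(self):
        while self.pos < len(self.text) and self.text[self.pos].isspace():
            self.pos += 1

    def accept(self, name, pattern):
        # Try exactly the terminal the current production wants.
        self.skip_ws()
        m = pattern.match(self.text, self.pos)
        if m is None:
            return None
        self.pos = m.end()
        self.events.append((name, m.group()))
        return m.group()

    def expect(self, name, pattern):
        value = self.accept(name, pattern)
        if value is None:
            raise SyntaxError(f"expected {name} at offset {self.pos}")
        return value

    def doc(self):
        self.skip_ws()
        while self.pos < len(self.text):
            if self.accept("BASE", BASE):            # Base ::= "@base" IRIREF "."
                self.expect("IRIREF", IRIREF)
            else:                                    # Triple ::= IRIREF IRIREF RDFLiteral "."
                self.expect("IRIREF", IRIREF)
                self.expect("IRIREF", IRIREF)
                self.expect("STRING", STRING)
                self.accept("LANGTAG", LANGTAG)      # optional; "@base" is fine here
            self.expect("DOT", DOT)
            self.skip_ws()
        return self.events


print(Parser('@base<foo> .\n<s> <p> "o"@base .').doc())
```

Here the first "@base" is recorded as BASE and the second as LANGTAG, because the decision is made by the production being parsed and not by a lexer that has already committed to one token.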


> >>(and it works in parser generators I have used)
> >
> >Which ones were those? How do they resolve the conflict? Do they detect intersecting lexical tokens and generate both tokens? (This wouldn't be hard in a recursive descent parser, but yacc-like LALR(1) or LL(1) parsers don't do it.)
> >
> 
> This is about tokens.  Not LALR(1) / LL(1) isms.
> 
> For example, javacc (which is both parser generator and lexer
> generator) uses an inline literal string as a specific thing to test
> for.
> 
>  Andy
> 
> 
> 
> >
> >> Andy
> >>
> >>On 15/06/12 19:56, Eric Prud'hommeaux wrote:
> >>>* Gavin Carothers<gavin@carothers.name>   [2012-06-15 10:44-0700]
> >>>>On Fri, Jun 15, 2012 at 9:48 AM, Eric Prud'hommeaux<eric@w3.org>   wrote:
> >>>>>+[20]   LANGTAG         ::=     BASE | PREFIX | '@' [a-zA-Z]+ ('-' [a-zA-Z0-9]+)*
> >>>>
> >>>>
> >>>>No, reverting back to the PREFIX BASE terminals is not acceptable.
> >>>>This was already the subject of review by Andy and Peter.
> >>>>
> >>>>Please see thread
> >>>>http://lists.w3.org/Archives/Public/public-rdf-wg/2012May/0347.html
> >>>>for discussion on the change from PREFIX BASE to a simpler LANGTAG.
> >>>
> >>>But that thread didn't terminate in consensus.
> >>>Andy's point
> >>>[[
> >>>     (to the casual reader : BASE is '@base' and PREFIX is '@prefix'
> >>>
> >>>     Which is ambiguous - as it says:
> >>>
> >>>     LANGTAG ::= ('@base' | '@prefix' | '@' ([a-zA-Z])+ ('-' ([a-zA-Z0-9])+)
> >>>
> >>>     so the string "@base" matches two ways.
> >>>
> >>>     But even if sorted out ... it means a tokenizer may well generate the
> >>>     token LANGTAG ... and then:
> >>>
> >>>     [5]  base  ::=  BASE IRIREF
> >>>
> >>>     does not match as the token is LANGTAG, not BASE.  Oops.
> >>>]]
> >>>
> >>>is addressed by moving the "BASE | PREFIX | " from LANGTAG to RDFLiteral:
> >>>
> >>>   RDFLiteral ::= String (BASE | PREFIX | LANGTAG | '^^' iri)?
> >>>
> >>>Turtle doesn't talk about parsing rules (perhaps it should); SPARQL's note 3 says [[
> >>>When tokenizing the input and choosing grammar rules, the longest match is chosen.
> >>>]] —<http://www.w3.org/2009/sparql/docs/query-1.1/rq25.xml#sparqlGrammar>
> >>>
> >>>This doesn't establish a relative order between terminals implied by ""'d strings in the productions vs. explicit terminals like "LANGTAG ::= '@' [a-zA-Z]+ ('-' [a-zA-Z0-9]+)*". After failing a few tests, people would likely add an order to make "@base" and "@prefix" parse as implicit terminals and never parse them as language tags. We can be much more explicit if we use the above production for RDFLiteral. An aesthetic option would be to break it up for semantic clarity:
> >>>
> >>>   RDFLiteral  ::= String (LanguageTag | '^^' iri)?
> >>>   LanguageTag ::= BASE | PREFIX | LANGTAG
> >>>
> >>>I've committed that for everyone's viewing pleasure.
> >>>
> >>>I also found some errors in STRING_LITERAL ("s vs. 's reversed, so 's not allowed within "" string). I'm now validating with this text (note the long quotes):
> >>>[[
> >>>[]<p>   <o1>, "o2", [<p2>   _:o3 ] ;
> >>>    <p3>   (<o4>   "o5"@base "o5"@prefix _:o6 [<p4>   <o8>   ] ),<o9>   .
> >>>[<p5>   """o10
> >>>""line"" '''2'''""", '''o11
> >>>''line'' """3"""'''^^<integer>   ;
> >>>   <p6>   12, +12, -12,                   # [+-]? [0-9]+
> >>>        13.0, +13.0, -13.0,             # [+-]? [0-9]* '.' [0-9]+ with *=2
> >>>        .0, +.0, -.0,                   # [+-]? [0-9]* '.' [0-9]+ with *=0
> >>>        14.E0, +14.E0, -14.E0,          # [+-]? [0-9]+ '.' [0-9]* EXPONENT with *=0
> >>>        14.0E0, +14.0E0,                # [+-]? [0-9]+ '.' [0-9]* EXPONENT with *=1
> >>>        .14E2, +.14E2, -.14E2, -14.0E0, # [+-]? '.' [0-9]+ EXPONENT
> >>>        1.4E1, +1.4E1, -1.4E1,          # [+-]? [0-9]+ EXPONENT)
> >>>        14e0, 14e+0, 14e-0              # [eE] [+-]? [0-9]+
> >>>].
> >>>]]
> >>>
> >>>
> >>>>Also please make sure updates to the grammar are also checked into the
> >>>>http://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/turtle.bnf not
> >>>>only the HTML.
> >>>
> >>>will do.
> >

-- 
-ericP

Received on Friday, 15 June 2012 21:11:16 UTC