Re: SPARQL and Turtle Prefix Placement from Eric Prud'hommeaux on 2012-06-15 (public-rdf-wg@w3.org from June 2012)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Fri, 15 Jun 2012 16:21:47 -0400
To: Andy Seaborne <andy.seaborne@epimorphics.com>
Cc: Gavin Carothers <gavin@carothers.name>, public-rdf-wg@w3.org
Message-ID: <20120615202146.GB27073@w3.org>
* Andy Seaborne <andy.seaborne@epimorphics.com> [2012-06-15 20:13+0100]
> I prefer Gavin's approach.
> 
> No BASE PREFIX; Put '@base' and '@prefix' in the directives.
> 
> http://lists.w3.org/Archives/Public/public-rdf-wg/2012May/0353.html

Given the string [[
  @base <foo> .
  <s> <p> "o"@base .
]]

, I think we all agree that @base gets used two ways, once as a keyword and once as a language tag. Given a grammar like
[[
  Doc ::= (Base | Triple)*
  Base ::= "@base" IRIREF "."
  Triple ::= IRIREF IRIREF RDFLiteral "."
  RDFLiteral ::= STRING LANGTAG?
  IRIREF ::= "<" [^>]+ ">"
  STRING ::= '"' [^"]+ '"'
  LANGTAG ::= '@' [a-zA-Z]+ /* also matches "@base" */
]]
, what terminal was matched when the lexer consumed the first "@base"? How about the second?

The Turtle spec currently doesn't say how to process terminals, or even that there is such a thing. SPARQL's rules 3 and 4 are relevant here:

  3 When tokenizing the input and choosing grammar rules, the longest match is chosen.
  4 The SPARQL grammar is LL(1) when the rules with uppercased names are used as terminals.

The only machine-readable way I can think of to codify that "@base" after a STRING is a LANGTAG involves copying those from SPARQL and adding another "3.5 If multiple terminals of equal length match, the one earliest in the grammar is chosen". Then we simply need BASE before LANGTAG. This may seem needlessly arcane, but I suspect that many Turtle parsers will not do the right thing without this construct. (We could tell them in English, but then they have to figure out how to implement it, and if they're not using a recursive descent parser generator, they'll have to follow the recipe above anyway.)
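To make the recipe concrete, here is a minimal sketch in Python of a lexer for the toy grammar above that applies longest match first and then breaks ties by grammar order. The names TERMINALS, WHITESPACE, DOT and tokenize are made up for this illustration, and BASE is simply the "@base" keyword promoted to a named terminal; none of this is taken from any existing Turtle parser.

```python
import re

# Terminals in "grammar order". BASE is listed before LANGTAG so that it
# wins when both match the same number of characters (the proposed 3.5).
# DOT is a hypothetical name for the "." terminal.
TERMINALS = [
    ("BASE", re.compile(r"@base")),
    ("IRIREF", re.compile(r"<[^>]+>")),
    ("STRING", re.compile(r'"[^"]+"')),
    ("LANGTAG", re.compile(r"@[a-zA-Z]+")),
    ("DOT", re.compile(r"\.")),
]
WHITESPACE = re.compile(r"\s+")


def tokenize(text):
    """Return (terminal name, lexeme) pairs for the toy grammar."""
    tokens = []
    pos = 0
    while pos < len(text):
        ws = WHITESPACE.match(text, pos)
        if ws:
            pos = ws.end()
            continue
        best = None
        for name, pattern in TERMINALS:
            m = pattern.match(text, pos)
            # Strict ">" means a later terminal only replaces an earlier
            # one when it is longer: longest match, then earliest rule.
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise SyntaxError("no terminal matches at offset %d" % pos)
        tokens.append((best[0], best[1].group()))
        pos = best[1].end()
    return tokens


if __name__ == "__main__":
    for token in tokenize('@base <foo> .\n<s> <p> "o"@base .'):
        print(token)
```

With this ordering both occurrences of "@base" come out of the lexer as BASE, while something like "@basement" is still a LANGTAG because it is the longer match. The grammar therefore has to accept a BASE token wherever a language tag may appear after a STRING.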


> (and it works in parser generators I have used)

Which ones were those? How do they resolve the conflict? Do they detect intersecting lexical tokens and generate both tokens? (This wouldn't be hard in a recursive descent parser, but it isn't easy in yacc-like LALR(1) or LL(1) parsers.)
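For comparison, a hand-written recursive descent parser can sidestep the conflict by letting each production ask for the terminal it expects at the current position, so "@base" is only ever tried as LANGTAG directly after a STRING. The following is a rough Python sketch of that idea for the toy grammar above; the names PATTERNS, Parser, accept, expect and parse_doc are invented here, and this is not a claim about how any particular parser generator behaves.

```python
import re

# One pattern per terminal. Leading whitespace is skipped, except before
# LANGTAG, which has to follow its STRING immediately.
PATTERNS = {
    "BASE": re.compile(r"\s*(@base)"),
    "IRIREF": re.compile(r"\s*(<[^>]+>)"),
    "STRING": re.compile(r'\s*("[^"]+")'),
    "LANGTAG": re.compile(r"(@[a-zA-Z]+)"),
    "DOT": re.compile(r"\s*(\.)"),
}


class Parser:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def accept(self, name):
        """Try one named terminal at the current position."""
        m = PATTERNS[name].match(self.text, self.pos)
        if m is None:
            return None
        self.pos = m.end()
        return m.group(1)

    def expect(self, name):
        value = self.accept(name)
        if value is None:
            raise SyntaxError("expected %s at offset %d" % (name, self.pos))
        return value

    def parse_doc(self):
        """Doc ::= (Base | Triple)*"""
        statements = []
        while True:
            if self.accept("BASE") is not None:
                # Base ::= "@base" IRIREF "."
                iri = self.expect("IRIREF")
                self.expect("DOT")
                statements.append(("base", iri))
                continue
            subject = self.accept("IRIREF")
            if subject is None:
                break
            # Triple ::= IRIREF IRIREF RDFLiteral "."
            predicate = self.expect("IRIREF")
            literal = self.expect("STRING")
            lang = self.accept("LANGTAG")  # RDFLiteral ::= STRING LANGTAG?
            self.expect("DOT")
            statements.append(("triple", subject, predicate, literal, lang))
        if self.text[self.pos:].strip():
            raise SyntaxError("unexpected input at offset %d" % self.pos)
        return statements


if __name__ == "__main__":
    print(Parser('@base <foo> .\n<s> <p> "o"@base .').parse_doc())
```

Here the first "@base" is consumed as the directive keyword and the second as a language tag, with no ordering rule between terminals, because the parser never offers both candidates at the same position. A table-driven lexer that tokenizes ahead of the parser does not have that context available.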


>  Andy
> 
> On 15/06/12 19:56, Eric Prud'hommeaux wrote:
> >* Gavin Carothers<gavin@carothers.name>  [2012-06-15 10:44-0700]
> >>On Fri, Jun 15, 2012 at 9:48 AM, Eric Prud'hommeaux<eric@w3.org>  wrote:
> >>>+[20]   LANGTAG         ::=     BASE | PREFIX | '@' [a-zA-Z]+ ('-' [a-zA-Z0-9]+)*
> >>
> >>
> >>No, reverting back to the PREFIX BASE terminals is not acceptable.
> >>This was already the subject of review by Andy and Peter.
> >>
> >>Please see thread
> >>http://lists.w3.org/Archives/Public/public-rdf-wg/2012May/0347.html
> >>for discussion on the change from PREFIX BASE to a simpler LANGTAG.
> >
> >But that thread didn't terminate in consensus.
> >Andy's point
> >[[
> >     (to the casual reader : BASE is '@base' and PREFIX is '@prefix'
> >
> >     Which is ambiguous - as it says:
> >
> >     LANGTAG ::= ('@base' | '@prefix' | '@' ([a-zA-Z])+ ('-' ([a-zA-Z0-9])+)
> >
> >     so the string "@base" matches two ways.
> >
> >     But even if sorted out ... it means a tokenizer may well generate the
> >     token LANGTAG ... and then:
> >
> >     [5]  base  ::=  BASE IRIREF
> >
> >     does not match as the token is LANGTAG, not BASE.  Oops.
> >]]
> >
> >is addressed by moving the "BASE | PREFIX | " from LANGTAG to RDFLiteral:
> >
> >   RDFLiteral ::= String (BASE | PREFIX | LANGTAG | '^^' iri)?
> >
> >Turtle doesn't talk about parsing rules (perhaps it should); SPARQL's note 3 says [[
> >When tokenizing the input and choosing grammar rules, the longest match is chosen.
> >]] —<http://www.w3.org/2009/sparql/docs/query-1.1/rq25.xml#sparqlGrammar>
> >
> >This doesn't establish a relative order between terminals implied by ""'d strings in the productions vs. explicit terminals like "LANGTAG ::= '@' [a-zA-Z]+ ('-' [a-zA-Z0-9]+)*". After failing a few tests, people would likely add an order to make "@base" and "@prefix" parse as implicit terminals and never parse them as language tags. We can be much more explicit if we use the above production for RDFLiteral. An aesthetic option would be to break it up for semantic clarity:
> >
> >   RDFLiteral  ::= String (LanguageTag | '^^' iri)?
> >   LanguageTag ::= BASE | PREFIX | LANGTAG
> >
> >I've committed that for everyone's viewing pleasure.
> >
> >I also found some errors in STRING_LITERAL ("s vs. 's reversed, so 's not allowed within "" string). I'm now validating with this text (note the long quotes):
> >[[
> >[]<p>  <o1>, "o2", [<p2>  _:o3 ] ;
> >    <p3>  (<o4>  "o5"@base "o5"@prefix _:o6 [<p4>  <o8>  ] ),<o9>  .
> >[<p5>  """o10
> >""line"" '''2'''""", '''o11
> >''line'' """3"""'''^^<integer>  ;
> >   <p6>  12, +12, -12,                   # [+-]? [0-9]+
> >        13.0, +13.0, -13.0,             # [+-]? [0-9]* '.' [0-9]+ with *=2
> >        .0, +.0, -.0,                   # [+-]? [0-9]* '.' [0-9]+ with *=0
> >        14.E0, +14.E0, -14.E0,          # [+-]? [0-9]+ '.' [0-9]* EXPONENT with *=0
> >        14.0E0, +14.0E0,                # [+-]? [0-9]+ '.' [0-9]* EXPONENT with *=1
> >        .14E2, +.14E2, -.14E2, -14.0E0, # [+-]? '.' [0-9]+ EXPONENT
> >        1.4E1, +1.4E1, -1.4E1,          # [+-]? [0-9]+ EXPONENT)
> >        14e0, 14e+0, 14e-0              # [eE] [+-]? [0-9]+
> >].
> >]]
> >
> >
> >>Also please make sure updates to the grammar are also checked into the
> >>http://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/turtle.bnf not
> >>only the HTML.
> >
> >will do.

-- 
-ericP
Received on Friday, 15 June 2012 20:22:18 UTC