Re: mapping from Turtle grammar to RDF graph

* Eric Prud'hommeaux <eric@w3.org> [2010-02-02 16:50-0500]
> * Dave Beckett <dave@dajobe.org> [2010-02-02 07:54-0800]
> > Eric Prud'hommeaux wrote:
> > > Peter, all, anyone interested in debugging a mapping from a turtle
> > > grammar to triple production rules?
> > >   http://www.w3.org/2010/01/31-Turtle#⋈
> > > 
> > > I still need to stick encoding issues in there (like \"),
> > > but this should serve as a start.
> > 
> > I'm interested and it seems the right direction but I'm finding this a
> > little hard to understand.
> 
> I'm certainly sympathetic to that. Any ideas gratefully investigated.
> 
> >                              I'd hope that we can get out a strong
> > mapping (like this) which is sufficiently formal that it addresses the
> > concerns Peter raised in 2008 [1]
> 
> yeah, that's what motivated this. pfps outlines a recipe and i need to
> test my recipe against his. his target is ntriples, while i prefer to
> map to RDF terms and count on the ntriples spec to turn escaped URIs
> into IRIs.

Comparing pfps's recipe [1] against the recipe in [2], which unescapes a
set of terminals and defines the production of RDF terms from those
unescaped terminals:

[pfps] 0/ Handle escape characters and white space
[pfps] 0.2/ Turn each uriref into a URI references, handling escaping as in
[pfps]      S3.3 (and removing the enclosing <>).

[[
The characters between "<" and ">" are the unicode string of the
IRI. Relative IRI resolution is then performed per Relative IRI
Resolution.
]] — http://www.w3.org/2010/01/31-Turtle#handle-IRI_REF

[pfps] 0.3/ Turn each quotedString into a Normal Form C Unicode string,
[pfps]      handling escaping as in S3.3 (and removing the enclosing " or """).

(quoting 1 of 4 terminals for lexical forms) [[
The characters between the outermost "'"s are the unicode string of a
lexical form.
]] — http://www.w3.org/2010/01/31-Turtle#handle-STRING_LITERAL1
* as with SPARQL, this does not mandate normalization during parsing.
  A validating parser could, of course, do more.

[pfps] 0.4/ Discard any ws
[pfps] 1/ Turn each qname and URI reference into an RDF URI reference.
[pfps] 1.1/ Turn each URI reference into an RDF URI reference, as in S3.4.

I've copied the relative resolution code from SPARQL into
  http://www.w3.org/2010/01/31-Turtle#⋈
(quoting 1 of 3 terminals for URI production) [[
Relative IRI resolution is then performed per Relative IRI
Resolution.
]] — http://www.w3.org/2010/01/31-Turtle#handle-IRI_REF
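
A minimal sketch of that IRI_REF handling, assuming Python's
urllib.parse.urljoin as a stand-in for the resolution rules the draft
points to. The function name is hypothetical, and escape handling is
left out.

```python
from urllib.parse import urljoin


def iri_from_iri_ref(token: str, base: str) -> str:
    """Strip the enclosing <> from an IRI_REF token and resolve it against base."""
    if not (token.startswith("<") and token.endswith(">")):
        raise ValueError(f"not an IRI_REF token: {token!r}")
    # The characters between "<" and ">" are the (possibly relative) IRI string.
    return urljoin(base, token[1:-1])


if __name__ == "__main__":
    base = "http://example.org/dir/doc.ttl"  # assumed base, for illustration only
    print(iri_from_iri_ref("<../other#frag>", base))
    print(iri_from_iri_ref("<#s>", base))
```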

[pfps] 1.2/ Expand each qname into a uriref as in S2.1, which will be an
[pfps]      RDF URI reference (because all relative URIs have been dealt with
[pfps]      already). 
[pfps] 1.3/ Replace each occurence of 'a' as a verb with the RDF URI reference 
[pfps]  rdf:type

[[
If token matched was "a", curPredicate is bound to the IRI
http://www.w3.org/1999/02/22-rdf-syntax-ns#type (test: aVerb1).
]] — http://www.w3.org/2010/01/31-Turtle#curPredicate

[pfps] 1.4/ Discard any directive and trailing .
[pfps] 2/ Turn each literal into an RDF literal.  The only non-obvious part is
[pfps]    to add the appropriate datatype to integer, double, decimal, and
[pfps]    boolean.

[[
The literal has a lexical form of the input string, and a datatype of
xsd:integer.
]] — http://www.w3.org/2010/01/31-Turtle#handle-INTEGER
As in SPARQL, parsing demands neither canonicalization nor validation.
DECIMAL, DOUBLE and BooleanLiteral get similar treatment.
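
Roughly, in Python: the lexical form is kept exactly as written and only
the datatype is chosen. The regular expressions below are my
approximation of SPARQL-style numeric terminals, not copied from the
draft, and typed_literal is a hypothetical name.

```python
import re

XSD = "http://www.w3.org/2001/XMLSchema#"

# Approximate terminals (assumptions). Under fullmatch they are mutually exclusive.
_NUMERIC = [
    (re.compile(r"[+-]?(?:[0-9]+\.[0-9]*|\.[0-9]+|[0-9]+)[eE][+-]?[0-9]+"), XSD + "double"),
    (re.compile(r"[+-]?[0-9]*\.[0-9]+"), XSD + "decimal"),
    (re.compile(r"[+-]?[0-9]+"), XSD + "integer"),
]


def typed_literal(token: str) -> tuple:
    """Return (lexical form, datatype IRI) without canonicalizing the lexical form."""
    if token in ("true", "false"):
        return (token, XSD + "boolean")
    for pattern, datatype in _NUMERIC:
        if pattern.fullmatch(token):
            return (token, datatype)
    raise ValueError(f"not a numeric or boolean literal: {token!r}")
```

Note that "042" stays "042": no canonicalization happens at parse time.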

[pfps] From now on the process is working with a sequence of processed
[pfps] occurences of the triples production, i.e., pieces of the occurences may
[pfps] have been replaced with abstract objects.
[pfps] 
[pfps] 3/ Handle blank nodes
[pfps] 3.1/ For each name used in a nodeID in the document select a fresh blank
[pfps]      node and replace any occurence of nodeID of the form _:name with
[pfps]      that blank node.  This processes each of the occurences.
[pfps] 3.2/ Recursively, until no unprocessed blank is left in the document,
[pfps]      select an unprocessed blank that does not contain an unprocessed
[pfps]      blank, select a fresh blank node, and process the blank as follows:
[pfps]      a) If blank is of the form [] replace it with the fresh blank node.
[pfps]      b) If blank is of the form [ predicateObjectList ] replace it with
[pfps]  fresh blank node and add a new triples consisting of the fresh
[pfps]  blank node (as subject)  and the predicateObjectlist. 
[pfps]      c) If blank is of the form () replace it with the RDF URI
[pfps]  reference rdf:nil
[pfps]      e) If blank is of the form ( object1 ... objectn ) for n>=1
[pfps]  - select n fresh nodes, node1, ...., noden, 
[pfps]  - replace the blank with node1,
[pfps]  - add 2n-2 triples with triple 2i-1 having subject nodei,
[pfps]    verb rdf:first, and object objecti and triple 2i having
[pfps]    subject nodei, verb rdf:rest, and object nodei+1, and
[pfps]  - add two triples with the first having subject noden, verb
[pfps]    rdf:first, and object objectn and the second having subject
[pfps]    noden, verb rdf:rest, and object rdf:nil  (Yes, this is being
[pfps]    a bit sloppy.)
[pfps] 4/ Handle ; constructs
[pfps] 4.1/ Recursively replace any subject verb1 objectlist1 ; verb2 objectlist2
[pfps]    with subject verb1 objectlist1 . subject verb2 objectlist2
[pfps] 4.2/ Remove any remaining ;
[pfps] 5/ Handle , constructs
[pfps] 5.1/ Recursively replace any subject verb object1 , object2
[pfps]      with subject verb object1 . subject verb object2
[pfps] 6/ Turn each subject verb object . into an RDF triple.
[pfps] 
[pfps] Selecting a fresh blank node means to select a blank node (from the
[pfps] infinite collection of blank nodes available) that has not yet been used
[pfps] in the process so far.
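
For reference, here is how I read pfps's collection steps (3.2 c and e)
as a Python sketch. The "_:bN" labels are hypothetical stand-ins for
abstract fresh blank nodes, and both function names are mine.

```python
from itertools import count

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def fresh_bnodes():
    """Yield blank node labels that this generator has not handed out before."""
    for i in count():
        yield f"_:b{i}"


def expand_collection(objects, fresh):
    """Return (node replacing the collection, triples to add)."""
    if not objects:
        # () is replaced by rdf:nil and adds no triples.
        return RDF + "nil", []
    nodes = [next(fresh) for _ in objects]
    triples = []
    for i, (node, obj) in enumerate(zip(nodes, objects)):
        # The last node's rdf:rest points at rdf:nil, the others at the next node.
        rest = nodes[i + 1] if i + 1 < len(nodes) else RDF + "nil"
        triples.append((node, RDF + "first", obj))
        triples.append((node, RDF + "rest", rest))
    return nodes[0], triples
```

A collection of n objects yields 2n triples in total: the 2n-2 from the
first n-1 nodes plus the two for the last node.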

I took a different path here, specifying productions which generate
the subject, predicate and object of each triple.

[[
Each GraphNode in the document produces an RDF triple of the
curSubject, curPredicate and the GraphNode.
]] — http://www.w3.org/2010/01/31-Turtle#triples
Once we find an acceptable style for this, I'll add list generation.
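
To illustrate the style, a small Python sketch of the curSubject /
curPredicate idea. It assumes the parser has already reduced the
document to a flat stream of (kind, term) events, which is a
simplification I made up for this example. Terms are plain strings, and
the bare string "a" stands for the keyword.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def triples_from_events(events):
    """Yield one triple per object event, using the current subject and predicate."""
    cur_subject = None
    cur_predicate = None
    for kind, term in events:
        if kind == "subject":
            cur_subject = term
        elif kind == "verb":
            # The keyword "a" binds curPredicate to rdf:type.
            cur_predicate = RDF_TYPE if term == "a" else term
        elif kind == "object":
            yield (cur_subject, cur_predicate, term)
        else:
            raise ValueError(f"unknown event kind: {kind!r}")


if __name__ == "__main__":
    # Events for:  <s> a <C> ; <p> <o1> , <o2> .
    events = [
        ("subject", "s"), ("verb", "a"), ("object", "C"),
        ("verb", "p"), ("object", "o1"), ("object", "o2"),
    ]
    for triple in triples_from_events(events):
        print(triple)
```

The ";" and "," constructs need no rewriting step here: they just leave
curSubject (and, for ",", curPredicate) unchanged.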

> > It also might be worth starting to consider whether to align the terminals
> > (qnames) more with sparql first.
> 
> the productions ref'd in http://www.w3.org/2010/01/31-Turtle#⋈ are
> from a yacker mockup of "TurtleS" (Turtle using SPARQL terminals and
> productions, where applicable). it may still be too liberal -- needs
> some thought and testing against bad-\d\d.ttl.
> 
> > Dave
> > 
> > [1] http://lists.w3.org/Archives/Public/semantic-web/2008Jan/0128.html
> > via my Turtle issue list
> > http://github.com/dajobe/turtle/blob/master/ISSUES.md
[2] http://www.w3.org/2010/01/31-Turtle#⋈

> -- 
> -ericP

-- 
-ericP

Received on Wednesday, 3 February 2010 15:36:06 UTC