API for loading datasets; was Re: TriG being disjoint from Turtle from Sandro Hawke on 2013-05-21 (public-rdf-comments@w3.org from May 2013)

From: Sandro Hawke <sandro@w3.org>
Date: Tue, 21 May 2013 13:33:03 -0400
To: Jan Wielemaker <J.Wielemaker@vu.nl>
CC: Andy Seaborne <andy.seaborne@epimorphics.com>, public-rdf-comments@w3.org
Message-ID: <519BAFCF.1000403@w3.org>
On 05/17/2013 08:09 AM, Jan Wielemaker wrote:
> Hi Sandro,
>
> On 05/17/2013 01:38 PM, Sandro Hawke wrote:
>> On 05/17/2013 06:00 AM, Jan Wielemaker wrote:
>>> On 05/17/2013 11:49 AM, Andy Seaborne wrote:
>>>
>>> [this fragment is from Charles Greer, not answered by Andy]
>>>
>>>> 1.  Could the spec be modified to allow TriG to be a superset of
>>>> turtle?  Specifically, could the production rules be modified to allow
>>>> a set of triples outside of any '{'  '}' to be the same as triples in a
>>>> default anonymous graph?  It seems that even now, the rules allow
>>>> multiple anonymous graph productions, whose union would be the unnamed
>>>> graph.  It would be convenient if we could dispense with these anonymous
>>>> curly braces altogether if possible.
>>>
>>> Having implemented TriG yesterday on top of the Turtle parser, I must
>>> say that I was happily surprised that TriG does not allow for triples
>>> outside {}.  This means you can detect whether a document is a Turtle
>>> or TriG document at the first triple.
>>
>> Why do you want to do that?      I'm imagining a world where people load
>> data by URL, not necessarily knowing if it's going to have named graphs
>> in it.
>>
>> I'd think in a load_graph operation, you'd accept TriG as well, using
>> the default graph as the output graph.   Maybe have a flag about whether
>> to ignore or raise on error if there are some named graphs as well.
>>
>> And in a load_dataset operation, I'd think you'd accept Turtle as well,
>> and just not get any named graphs out of it.
>
> I am not yet sure.  Having to deal with files, loading of which can
> create or extend multiple graphs is something new in the design of
> SWI-Prolog's RDF store.  There are two things for which I do not yet
> have a good answer: implementing `unloading' the data and dealing with
> the persistent backup.
>
> The system currently loads a source into a named graph named after the
> source. After loading, the graph is saved in a fast and compact binary
> format into a file named after the graph-name. Subsequent modifications
> are saved in a `journal' file, also named after the graph-name.
> Unloading a source finds the graph, removes all triples from memory and
> deletes the backup files.
>

(Yes, I have fond memories of using swipl.)

> This schema won't fly easily with TriG files.  TriG files can create
> multiple graphs and/or add triples to multiple graphs.  TriG files are
> also likely to change the granularity of named graphs, which makes the
> file-per-named-graph backup module inadequate.  I don't know yet how I'm
> going to solve that, but I think it is likely that knowing beforehand
> that I'm dealing with a TriG file will be useful information.
>

Interesting problem.   Brainstorming a bit....

== Design-1 ==

Treat a TriG file as a set of Turtle files.   User loads x.trig:

    { <s> <p> 1 }
    <g1> { <s> <p> 1,2 }


So you treat that as if they had loaded a Turtle file called "x.trig"

    <s> <p> 1 .

and a Turtle file called "g1"

    <s> <p> 1,2 .

You cache and back them up just like that.  Somewhere internally you 
remember that unloading x.trig really means to also unload g1.
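
A rough sketch of that bookkeeping in Python, just to make the idea 
concrete.  Everything here is made up for illustration (the Store class, 
load_trig, unload), and it assumes the TriG has already been parsed into 
(graph, s, p, o) tuples, with graph set to None for the default graph.  
This is not how SWI-Prolog's store works.

```python
from collections import defaultdict


class Store:
    """Toy in-memory quad store; all names are hypothetical."""

    def __init__(self):
        self.graphs = defaultdict(set)  # graph name -> set of (s, p, o)
        self._loaded = {}               # source name -> graph names it touched

    def load_trig(self, source, quads):
        """Load already-parsed quads: (graph_or_None, s, p, o)."""
        touched = set()
        for g, s, p, o in quads:
            # The default graph is stored under the source's own name,
            # exactly as if it had been a plain Turtle file.
            name = source if g is None else g
            self.graphs[name].add((s, p, o))
            touched.add(name)
        self._loaded[source] = touched

    def unload(self, source):
        """Drop every graph that loading `source` touched."""
        for name in self._loaded.pop(source, ()):
            self.graphs.pop(name, None)
```

Note that this naive unload drops all of g1, including any triples that 
some other source added to it, which is part of what makes the problem 
interesting.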

== Design-2 ==

Explicit metadata.   User loads x.trig and ends up with a new graph 
called "x.trig" containing triples like:

      <x.trig> ds:defaultGraph <sk01>
      <g1> ds:nameFor <sk02>

and then graph <sk01> has the default graph triples in it, while <sk02> 
has the g1 triples in it.    <sk01> and <sk02> are system generated 
graph names, or could be blank nodes if that's something you support.

Now unloading doesn't need to remember anything internally.   When you 
unload a graph, if it has ds:defaultGraph or ds:nameFor triples in it, 
you unload the graphs named after the objects of those triples as well.
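
Sketched in Python, under the same assumptions as before: quads are 
already parsed into (graph, s, p, o) tuples with None for the default 
graph, and the store is a plain dict from graph name to a set of 
triples.  The function names and the skNN generator are hypothetical; 
only the two ds: predicates come from the description above.

```python
import itertools

DEFAULT_GRAPH = "ds:defaultGraph"
NAME_FOR = "ds:nameFor"

_counter = itertools.count(1)


def _fresh_name():
    """Return a system-generated graph name such as 'sk01'."""
    return "sk%02d" % next(_counter)


def load_with_metadata(graphs, source, quads):
    """Load quads into fresh graphs and describe them in graph `source`."""
    meta = graphs.setdefault(source, set())
    assigned = {}  # name used in the file (or None) -> generated name
    for g, s, p, o in quads:
        if g not in assigned:
            assigned[g] = _fresh_name()
            if g is None:
                meta.add((source, DEFAULT_GRAPH, assigned[g]))
            else:
                meta.add((g, NAME_FOR, assigned[g]))
        graphs.setdefault(assigned[g], set()).add((s, p, o))


def unload_with_metadata(graphs, source):
    """Unload `source` and every graph its metadata triples point to."""
    for s, p, o in graphs.pop(source, set()):
        if p in (DEFAULT_GRAPH, NAME_FOR):
            graphs.pop(o, None)
```

The unload step reads everything it needs out of the metadata graph, so 
nothing has to be remembered outside the store.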

== Design-3 ==

Use a different operation:

load_dataset acts like in design-1, but hands back the list of all 
graphs created.  That list has to be handed to unload_dataset, so no 
private internal storage is needed.

I'd also provide load_dataset_safe or a "safe=True" option on 
load_dataset which makes it behave like design-2 -- putting everything 
in newly named graphs.    I'd probably return a structure giving the 
mapping between the names used in the source and skNNN names assigned, 
rather than put that into the quadstore.
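
A minimal Python sketch of what that API could look like, again over a 
plain dict store and pre-parsed (graph, s, p, o) tuples with None for 
the default graph.  The signatures are guesses, not a settled design.  
One simplification of mine: both modes return a mapping from the name 
used in the source to the graph name actually used, so the "list of all 
graphs created" is just the mapping's values.

```python
import itertools

_ids = itertools.count(1)


def load_dataset(graphs, source, quads, safe=False):
    """Load quads; return {name in source (or None): graph name used}."""
    mapping = {}
    for g, s, p, o in quads:
        if g not in mapping:
            if safe:
                # design-2 behaviour: everything goes in newly named graphs
                mapping[g] = "sk%02d" % next(_ids)
            else:
                # design-1 behaviour: default graph is named after the source
                mapping[g] = source if g is None else g
        graphs.setdefault(mapping[g], set()).add((s, p, o))
    return mapping


def unload_dataset(graphs, mapping):
    """Remove every graph named in a mapping returned by load_dataset."""
    for name in mapping.values():
        graphs.pop(name, None)
```

The caller keeps the returned mapping and passes it back to 
unload_dataset, so the store itself holds no private record of the load.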

Maybe load_dataset is called load_multiple, and it can optionally take a 
list of sources.  Maybe it could even do some crawling while it's 
loading.  In either case, it'd have the same API options as load_dataset 
above, I think.

== == ==

Okay, I'm pretty happy with design-3.   What do you think?

           -- Sandro



>     Cheers --- Jan
>
> P.s.    still hoping for an
>         @format <http://www.w3.org/TR/2013/CR-turtle-20130219/> .
>     or similar.
>
>
>
Received on Tuesday, 21 May 2013 17:33:20 UTC