SPARQL, named graphs and default graph from Nuutti Kotivuori on 2006-09-11 (public-sparql-dev@w3.org from July to September 2006)

From: Nuutti Kotivuori <naked@iki.fi>
Date: Mon, 11 Sep 2006 20:51:51 +0300
To: public-sparql-dev@w3.org
Message-ID: <87irju9tqg.fsf@aka.i.naked.iki.fi>
This isn't exactly a SPARQL question, but it is very closely
related. I will first outline the question context.

Assume an RDF statement store, which has a mechanism for tracking
statement origin (scope, context, graph, source whatever). Many of the
statements have a distinct origin, or source graph, they were imported
from. But there are also those which either seemingly have no origin,
or the origin is not known. The origin of these statements have to be
handled somehow. We'll come to the specific choices later on.

This statement store offers a SPARQL query interface into it. The
facilities for querying named graphs in SPARQL would obviously be used
to query the different origins in the store. But there are two things
to decide. First, how should statements without an origin be accessed
in SPARQL? There are several choices on this, which I will outline
below. And related to the first one, second, what should the default
graph be for the queries if none is given explicitly.

I will list a few possibilities and mention the problems and benefits
that seem to result from them as a basis for discussion.

 1. Unknown origin is a distinct node, but separate from all uris,
    blank nodes or literals. The default graph for the query is the
    graph of the unknown origin nodes.

    - Separation of identifier spaces, no fear of any overlap. The
      graph of statements with unknown origin is separate from any
      named graph.

    - Since there is no way to represent the unknown origin in SPARQL
      syntax, the default graph is the only way to access the nodes in
      that graph.

    - The nodes in the unknown origin graph are not matched by any
      graph query, since the name of the graph could not be returned
      reasonably. That is:

      SELECT ?g ?s ?o ?p
      WHERE { GRAPH ?g { ?s ?p ?o } }

      cannot return ?g for the unknown origin graph.

 2. Unknown origin is a distinct node, as above. The default graph is
    the RDF merge of all graphs in the store, including the statements
    with an unknown origin.

    - The problems above.

    - In addition, there is no way to select nodes that explicitly
      have an unknown origin. (Or is there? Could one match all the
      statements for which there is no graph with the same statement? 
      In any case, this would be quite contorted.)

 3. Unknown origin is represented by a distinct blank node; that is,
    every statement has it's own blank node as the graph name, which
    is not shared with any of the other statements. The default graph
    is the RDF merge of all graphs in the store, including the
    statements with an unknown origin.

    - This is probably closest to accurate modelling of the
      situation. We know every statement has an origin, we just don't
      know what it is - a situation commonly modelled with a blank
      node. Also, we don't know which statements might share an
      origin, so until we know better, we make them all distinct.

    - The origin of the statements is nicely queryable with SPARQL
      queries and every statement has an origin, even if unknown.

    - Queries which specify several statements from a single graph
      will not match the statements with unknown origins as it cannot
      be confirmed that they would be from the same graph.

    - There is no way to match the origin of a single statement as
      there is no way to match a certain blank node explicitly. The
      current SPARQL treats it as an open variable(?).

    - There is no way to explicitly match statements that have an
      unknown origin, since the origins are just distinct blank nodes.

    - Possibly hard to implement, because of the number of distinct
      blank nodes.

 4. Unknown origin is represented by a singleton blank node; that is,
    every statement with an unknown origin shares one single blank
    node as the graph name. The default graph is the RDF merge of all
    graphs in the store.

    - Lumps all statements with an unknown origin under a single named
      graph. Queries which match several statements from a single
      graph will match statement sets from unknown origin as well.

    - The origin of the statements is nicely queryable with SPARQL
      queries and every statement has an origin, even if unknown.

    - There is no way to explicitly match statements that have an
      unknown origin, since the origin is a single blank node. If the
      application provided a magic type for this blank node (_:x a
      rdfx:UnknownOrigin), this could be matched with:

      SELECT ?s ?o ?p
      WHERE { ?g a rdfx:UnknownOrigin .
              GRAPH ?g { ?s ?o ?p } }

      But this again is quite contorted. (The same could be applied to
      the third case as well, but the implementation of that would be
      really tricky to be effecient.)

 5. Unknown origin is represented by a singleton blank node as
    above. The default graph is the singleton blank node of unknown
    origin.

    - Mostly as above, but in the common case, explictly matching
      statements that have an unknown origin would be easy in just
      matching the statements from the default graph.

 6. Unknown origin is represented by a well known URI that is shared
    universally. The default graph is the RDF merge of all graphs in
    the store.

    - Somewhat incorrectly asserts that the statements have a certain
      origin, even though we don't know the origin.

    - The origin of the statements is nicely queryable with SPARQL.

    - Statements with an unknown origin can be easily explicitly
      matched by comparing them against the well known URI.

    - Assigns a special meaning to an URI.

    - Hard to coordinate with a number of people implementing similar
      solutions if not standardized.

Some other variants of the above were omitted, since their problems
and benefits are easily reasoned.

On irc, 'chimenzie' outlined the problem as such:

17:35 chimezie:#swig => Hmm.. well, seems like what is missing is a good 
      definition of a 'name for nodes that don't have an explicit context'
17:36 chimezie:#swig => or rather 'a name for the context of nodes that aren't 
      assigned to a context explicitely'

So, I'm out for some input on what might be the sanest route to
through this.

TIA,
-- Naked
Received on Monday, 11 September 2006 21:44:01 UTC