Re: SPARQL, named graphs and default graph from Richard Cyganiak on 2006-09-13 (public-sparql-dev@w3.org from July to September 2006)

From: Richard Cyganiak <richard@cyganiak.de>
Date: Wed, 13 Sep 2006 17:49:23 +0200
To: Chimezie Ogbuji <ogbujic@bio.ri.ccf.org>
Cc: Nuutti Kotivuori <naked@iki.fi>, public-sparql-dev@w3.org
Message-Id: <1C7F291A-4EB6-4525-8A3A-094CDF6B098F@cyganiak.de>
Hi Chimezie,

On 13 Sep 2006, at 15:45, Chimezie Ogbuji wrote:
> On Wed, 13 Sep 2006, Richard Cyganiak wrote:
>> Some of your options are not really possible with named graphs  
>> because graphs need to be *named*, that is, the name *must* be a  
>> URI and not a blank node.
>
> I don't agree.  What's the source of this assertion?

The discussion is about SPARQL, so I assumed the definition of Named  
Graphs from the SPARQL spec would apply. See also various papers from  
Bizer et al., e.g. [1]. As Dan pointed out, there's no community  
consensus on whether Named Graphs are a good thing or not, but the  
definitions that use this very term seem to require URIs as graph  
names. Contexts are not Named Graphs.

[snip]
> Well, Blank nodes used within a graph can't be referred to directly  
> but they can still be matched by SPARQL - doesn't make them any  
> less useful.  The problem isn't the use of Blank nodes for graph  
> names but the lack of a mechanism [2] to match the graph name(s)
> associated with a node.  Given how closely coupled SPARQL is with
> (admittedly informal) named graph semantics, I would expect to be
> able to answer questions such as:
>
> "What are the graph names in which all the statements about  
> <someIRI> are asserted?"

I'm afraid I'm missing the point here. Why not this?

     SELECT DISTINCT ?graph WHERE { GRAPH ?graph { <someIRI> [] [] } }

(Now of course the problem is that when I allow blank nodes as graph  
labels, then the answer to this query might be: "a blank node, a  
blank node, and another blank node".)
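
To make that parenthetical concrete, here is a minimal sketch in plain Python (not a real SPARQL engine) of what the query above does over a store of quads. The BNode class, the example.org IRIs, the quads list and the graphs_mentioning helper are all made up for illustration. The only point is that blank-node graph labels come back as opaque values that tell the asker nothing and cannot be written into a follow-up query.

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)


@dataclass(frozen=True)
class BNode:
    """Hypothetical stand-in for a blank node: identity only, no global name."""
    local_id: int = field(default_factory=lambda: next(_ids))

    def __str__(self):
        # The store-internal id means nothing to whoever asked the query.
        return "a blank node"


SOME_IRI = "http://example.org/someIRI"

# Each quad is (graph label, subject, predicate, object).
quads = [
    ("http://example.org/graphA", SOME_IRI, "http://example.org/p", "1"),
    (BNode(), SOME_IRI, "http://example.org/p", "2"),
    (BNode(), SOME_IRI, "http://example.org/q", "3"),
    ("http://example.org/graphA", "http://example.org/other",
     "http://example.org/p", "4"),
]


def graphs_mentioning(subject, quads):
    """Rough analogue of
    SELECT DISTINCT ?graph WHERE { GRAPH ?graph { <subject> [] [] } }."""
    seen = []
    for graph, s, _p, _o in quads:
        if s == subject and graph not in seen:
            seen.append(graph)
    return seen


answers = [str(g) for g in graphs_mentioning(SOME_IRI, quads)]
print(answers)
```

The URI-labelled graph can be named again in a later GRAPH <...> pattern; the two blank-node labels cannot.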

[snip]
> If BNodes are used for existential assertions about nodes, why  
> wouldn't they be used as existential assertions about graphs?

I can offer my personal and subjective viewpoint: If you extend RDF
triples with a fourth element that works exactly like the others, then
it instantly raises the question: why not add a fifth element? Or a
sixth?

I think that three is the sweet spot, but in practice triples often  
occur in "bags", and sometimes it's useful to be able to talk about  
these "bags", and I find that Named Graphs provide exactly the  
minimum of machinery necessary to do that, and nothing more.

I'm sure that a full-blown fourth element (and fifth) would offer  
lots of interesting possibilities, but personally I haven't come  
across any urgent need for it. Named Graphs, as defined in [1] and  
SPARQL, work well for me. YMMV, of course.

Yours,
Richard

[1] http://www.wiwiss.fu-berlin.de/suhl/bizer/pub/Carroll_etall-TrustWorkshop-ISWC2004.pdf


> And if there is some semantic consequence, it furthers the argument  
> that the formalisms for named graphs should be well articulated  
> before they are tightly integrated into a query language.
>
>> I would suggest that Alice and Bob each mint a new URI for the  
>> graph containing the statements of unknown origin *in their own  
>> store*. Or mint a new URI to hold each individual statement, or  
>> anything in between. Since the owner of a URI gets to say what the  
>> meaning of the URI is, they can declare that this chunk of URI  
>> space is reserved for this purpose (assuming Alice and Bob each  
>> own a chunk of URI space).
>>
>> I wonder why you discounted this solution?
>
> I don't think it's an elegant solution when we already have the  
> means (within 'vanilla' RDF Model Theory) to express existential  
> assertions - which is exactly the scenario here.
>
> If a graph label is nothing but a name associated with a set of  
> graphs, why should it not behave the same as the name associated  
> with a node within a graph?
>
>> I also question the existence of "statements without a known  
>> origin". They surely didn't just pop up magically inside your  
>> triple store, eh? I guess it's more like "statements whose origin  
>> I don't want to model".
>
> How different is this from "nodes whose names I don't care to  
> maintain / model?"
>
> [1] http://ninebynine.org/RDFNotes/UsingContextsWithRDF.html#xtocid-6303976
> [2] http://copia.ogbuji.net/blog/2006-07-14/querying-named-rdf-graph-aggregate
>
> Chimezie Ogbuji
> Lead Systems Analyst
> Thoracic and Cardiovascular Surgery
> Cleveland Clinic Foundation
> 9500 Euclid Avenue/ W26
> Cleveland, Ohio 44195
> Office: (216)444-8593
> ogbujic@ccf.org
>
>
>>
>>
>> On 11 Sep 2006, at 19:51, Nuutti Kotivuori wrote:
>>
>>> This isn't exactly a SPARQL question, but it is very closely
>>> related. I will first outline the question context.
>>>
>>> Assume an RDF statement store, which has a mechanism for tracking
>>> statement origin (scope, context, graph, source, whatever). Many of
>>> the statements have a distinct origin, or source graph, they were
>>> imported from. But there are also those which either seemingly have
>>> no origin, or the origin is not known. The origin of these
>>> statements has to be handled somehow. We'll come to the specific
>>> choices later on.
>>>
>>> This statement store offers a SPARQL query interface into it. The
>>> facilities for querying named graphs in SPARQL would obviously be
>>> used to query the different origins in the store. But there are two
>>> things to decide. First, how should statements without an origin be
>>> accessed in SPARQL? There are several choices on this, which I will
>>> outline below. And related to the first one, second, what should the
>>> default graph be for the queries if none is given explicitly?
>>>
>>> I will list a few possibilities and mention the problems and
>>> benefits that seem to result from them as a basis for discussion.
>>>  1. Unknown origin is a distinct node, but separate from all uris,
>>>     blank nodes or literals. The default graph for the query is the
>>>     graph of the unknown origin nodes.
>>>     - Separation of identifier spaces, no fear of any overlap. The
>>>       graph of statements with unknown origin is separate from any
>>>       named graph.
>>>     - Since there is no way to represent the unknown origin in SPARQL
>>>       syntax, the default graph is the only way to access the nodes
>>>       in that graph.
>>>     - The nodes in the unknown origin graph are not matched by any
>>>       graph query, since the name of the graph could not be returned
>>>       reasonably. That is:
>>>       SELECT ?g ?s ?o ?p
>>>       WHERE { GRAPH ?g { ?s ?p ?o } }
>>>       cannot return ?g for the unknown origin graph.
>>>  2. Unknown origin is a distinct node, as above. The default graph is
>>>     the RDF merge of all graphs in the store, including the
>>>     statements with an unknown origin.
>>>     - The problems above.
>>>     - In addition, there is no way to select nodes that explicitly
>>>       have an unknown origin. (Or is there? Could one match all the
>>>       statements for which there is no graph with the same statement?
>>>       In any case, this would be quite contorted.)
>>>  3. Unknown origin is represented by a distinct blank node; that is,
>>>     every statement has its own blank node as the graph name, which
>>>     is not shared with any of the other statements. The default graph
>>>     is the RDF merge of all graphs in the store, including the
>>>     statements with an unknown origin.
>>>     - This is probably closest to accurate modelling of the
>>>       situation. We know every statement has an origin, we just don't
>>>       know what it is - a situation commonly modelled with a blank
>>>       node. Also, we don't know which statements might share an
>>>       origin, so until we know better, we make them all distinct.
>>>     - The origin of the statements is nicely queryable with SPARQL
>>>       queries and every statement has an origin, even if unknown.
>>>     - Queries which specify several statements from a single graph
>>>       will not match the statements with unknown origins as it cannot
>>>       be confirmed that they would be from the same graph.
>>>     - There is no way to match the origin of a single statement as
>>>       there is no way to match a certain blank node explicitly. The
>>>       current SPARQL treats it as an open variable(?).
>>>     - There is no way to explicitly match statements that have an
>>>       unknown origin, since the origins are just distinct blank nodes.
>>>     - Possibly hard to implement, because of the number of distinct
>>>       blank nodes.
>>>  4. Unknown origin is represented by a singleton blank node; that is,
>>>     every statement with an unknown origin shares one single blank
>>>     node as the graph name. The default graph is the RDF merge of all
>>>     graphs in the store.
>>>     - Lumps all statements with an unknown origin under a single
>>>       named graph. Queries which match several statements from a
>>>       single graph will match statement sets from unknown origin as
>>>       well.
>>>     - The origin of the statements is nicely queryable with SPARQL
>>>       queries and every statement has an origin, even if unknown.
>>>     - There is no way to explicitly match statements that have an
>>>       unknown origin, since the origin is a single blank node. If the
>>>       application provided a magic type for this blank node (_:x a
>>>       rdfx:UnknownOrigin), this could be matched with:
>>>       SELECT ?s ?p ?o
>>>       WHERE { ?g a rdfx:UnknownOrigin .
>>>               GRAPH ?g { ?s ?p ?o } }
>>>       But this again is quite contorted. (The same could be applied
>>>       to the third case as well, but the implementation of that would
>>>       be really tricky to make efficient.)
>>>  5. Unknown origin is represented by a singleton blank node as
>>>     above. The default graph is the singleton blank node of unknown
>>>     origin.
>>>     - Mostly as above, but in the common case, explicitly matching
>>>       statements that have an unknown origin would be as easy as
>>>       matching the statements from the default graph.
>>>  6. Unknown origin is represented by a well known URI that is shared
>>>     universally. The default graph is the RDF merge of all graphs in
>>>     the store.
>>>     - Somewhat incorrectly asserts that the statements have a certain
>>>       origin, even though we don't know the origin.
>>>     - The origin of the statements is nicely queryable with SPARQL.
>>>     - Statements with an unknown origin can be easily explicitly
>>>       matched by comparing them against the well known URI.
>>>     - Assigns a special meaning to a URI.
>>>     - Hard to coordinate with a number of people implementing similar
>>>       solutions if not standardized.
>>> Some other variants of the above were omitted, since their problems
>>> and benefits are easily reasoned.
>>> On irc, 'chimezie' outlined the problem as such:
>>> 17:35 chimezie:#swig => Hmm.. well, seems like what is missing is a good
>>>       definition of a 'name for nodes that don't have an explicit context'
>>> 17:36 chimezie:#swig => or rather 'a name for the context of nodes that aren't
>>>       assigned to a context explicitly'
>>> So, I'm out for some input on what might be the sanest route
>>> through this.
>>> TIA,
>>> -- Naked
>>
>>
>
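
As a footnote to options 3 and 4 in Nuutti's message quoted above, the difference between "one blank node per statement" and "one singleton blank node" can be sketched in plain Python. This is only a toy model under stated assumptions, not a SPARQL implementation: the ex: names, new_bnode, load and same_graph_matches are all hypothetical, and same_graph_matches stands in for a query of the shape GRAPH ?g { ?s <p1> ?x . ?s <p2> ?y }.

```python
from itertools import count

_fresh = count(1)


def new_bnode():
    """Hypothetical blank node label, unique within this toy store."""
    return ("bnode", next(_fresh))


# Two statements of unknown origin about the same subject.
unknown = [
    ("ex:alice", "ex:name", "Alice"),
    ("ex:alice", "ex:mbox", "mailto:alice@example.org"),
]


def load(statements, label_for):
    """Builds (graph, s, p, o) quads, calling label_for() per statement."""
    return [(label_for(), s, p, o) for s, p, o in statements]


# Option 3: every statement gets its own blank node as graph label.
option3 = load(unknown, new_bnode)

# Option 4: all statements share one singleton blank node.
singleton = new_bnode()
option4 = load(unknown, lambda: singleton)


def same_graph_matches(quads, p1, p2):
    """Pairs of triples with the same subject, predicates p1 and p2,
    and both found under the same graph label."""
    return [
        (g1, s1)
        for g1, s1, q1, _ in quads
        for g2, s2, q2, _ in quads
        if g1 == g2 and s1 == s2 and q1 == p1 and q2 == p2
    ]


matches3 = same_graph_matches(option3, "ex:name", "ex:mbox")
matches4 = same_graph_matches(option4, "ex:name", "ex:mbox")
print(len(matches3), len(matches4))
```

Under option 3 the two-triple pattern finds nothing, because no two statements share a graph label; under option 4 it finds the pair, at the cost of asserting that all unknown-origin statements belong together.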
Received on Wednesday, 13 September 2006 15:49:31 UTC