Re: Querying only the default graph from the data store from Barry Bishop on 2012-09-07 (public-rdf-dawg-comments@w3.org from September 2012)

From: Barry Bishop <barry.bishop@ontotext.com>
Date: Fri, 07 Sep 2012 16:30:06 +0200
To: "public-rdf-dawg-comments@w3.org" <public-rdf-dawg-comments@w3.org>
Message-ID: <504A04EE.2030300@ontotext.com>

Dear WG,

Please note that I do not expect a formal reply to my comments, but 
would welcome the opportunity to continue discussions in some future 
incarnation of the WG.

Regards,
barry

On 07/09/12 13:44, Barry Bishop wrote:
> Hello Axel,
>
> On 05/09/12 21:14, Polleres, Axel wrote:
>> Thanks Barry,
>>
>> Since you confirm that the response addresses your comment, please 
>> consider this reply informal (chair-hat off).
>>
>>> I feel this is a shame, as two different implementations can
>>> produce different output from the simplest of queries, e.g.
>>> SELECT * { ?s ?p ?o }
>> I personally find this quite normal... different endpoints
>> respond differently to such query since they refer to different 
>> default datasets, i.e.
>> Naturally when I query dbpedia.org I query a different dataset than 
>> data.semanticweb.org, etc.
>
> Well, dbpedia.org and data.semanticweb.org sparql endpoints make 
> different data available, so I suppose you would naturally get 
> different results to the same query. However, this is not what I was 
> getting at. In fact, I'm not sure I have managed to get my point 
> across at all. Perhaps another hypothetical example:
>
> Suppose you run a development team that builds an application that 
> interacts with some public sparql endpoint, say http://xyz.org/sparql 
> - then one day xyz.org start to have scalability problems and decide 
> to upgrade their RDF database to some expensive new thing. Both old 
> and new RDF databases are fully compliant with W3C, but after they 
> upgrade your application is completely broken only because the two 
> database implementations construct their RDF dataset differently when 
> no FROM clauses are given. I am sure you wouldn't find it so natural 
> in this case.
>
> There are some workarounds as you say, but not in all cases. When you 
> are using someone else's database and don't get to decide how they 
> partition their data into separate graphs, then you can be completely 
> stuck. As fabulous as the query language is (and I do think it is a 
> tremendous achievement), this ambiguity over constructing a dataset 
> when there are no FROMs is a bit of a hole.
>
>>
>> Notably, I'd like to also point you to another document within 
>> the SPARQL1.1 specification,
>> i.e. the service-description document at
>> http://www.w3.org/TR/sparql11-service-description/
>> which provides means to describe which graphs compose the default
>> dataset of a particular service endpoint.
>> Particularly, the property
>> http://www.w3.org/TR/sparql11-service-description/#sd-defaultDataset
>> is intended to provide a description of the default dataset that an 
>> endpoint uses.
>> Note also that the service description vocabulary is extensible, and 
>> what we specify now is only a core, but other vocabulary can be used 
>> to extend this (e.g. VoID)
>
> All well and good, if this feature is actually provided by an 
> endpoint. However, it requires quite a lot of programming for a client 
> to work all this out and re-write queries accordingly. And actually, 
> it still doesn't help - e.g. if the endpoint you want to use 
> constructs the dataset as an RDF merge of all graphs (when no FROM 
> clauses are given [I need to find an abbreviation for this]) and you 
> only want to query the default graph, then you just can't do it. There 
> is no way to tell such an endpoint that you only want the default 
> graph using the query language.
>
> The problem is basically that the default graph is special - because 
> it doesn't have an identifier it can not be used in the same way as 
> named graphs....
>
> ... in the query language. However, in the update language the 
> appropriate syntax has already been created and would be the perfect 
> complement to the query language, e.g. if I can do this:
>
>     CLEAR DEFAULT
>
> why can't I do this:
>
>     SELECT *
>     FROM DEFAULT
>     {...}
>
> and specify absolutely unambiguously that I want my query to execute 
> *only* over the default graph in the database. No matter how an 
> implementation constructs its dataset when no FROM clauses are given, 
> this syntax should always work in the expected way.
>
> Since I am rambling on, the related keywords from the update language 
> would also be very useful, e.g. one can clear all graphs like this:
>
>     CLEAR ALL
>
> so why not be able to do this:
>
>     SELECT *
>     FROM ALL
>     {...}
>
> This would help in the opposite case, when an implementation 
> constructs the dataset using only the default graph (when no FROM 
> clauses are given). In this situation, it is not possible to query for 
> the graph names (using select distinct ?g {graph ?g {?s ?p ?o}}), so 
> the above would say: "please merge all graphs for input to my query, 
> even though I don't know what their names are and have no way of 
> finding out (using the query language)".
>
> These things might not seem important, but they are life and death to 
> application programmers. Right now, to build an application that needs 
> to interact with a sparql endpoint that is only known at runtime is 
> fraught with difficulties. Not the least of which is that if your 
> application is required to query data only from the default graph, 
> then there is no way to write a query that is guaranteed to do this on 
> all (W3C compliant) sparql endpoints.
>
> Which I still feel is a bit of a shame.
>
> barry
>
>
>>
>> As for the rest of your response, we seem to agree that what you're 
>> aiming at
>> is rather a new feature than something this working group can address 
>> within its current
>> charter and resources.
>>
>> Best regards,
>> Axel
>>
>>> -----Original Message-----
>>> From: Barry Bishop [mailto:barry.bishop@ontotext.com]
>>> Sent: Mittwoch, 05. September 2012 19:49
>>> To: Polleres, Axel
>>> Cc: public-rdf-dawg-comments@w3.org
>>> Subject: Re: Querying only the default graph from the data store
>>>
>>> Hello Axel,
>>>
>>> Thanks for taking the time to reply. I realise this thread is
>>> somewhat out of place given the status/progress of the WG.
>>>
>>> Your reply does address my initial post. It does not resolve
>>> it, but this is perhaps not the time. However, for the
>>> purpose of clarity I will make further comments inline:
>>>
>>> On 05/09/12 04:11, Polleres, Axel wrote:
>>>> Hi Barry,
>>>>
>>>> This is in response to
>>>> http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2012Aug/0011.html
>>>>
>>>>> The working draft does not specify how the RDF dataset is
>>>>> constructed when no FROM and FROM NAMED clauses are present in the
>>>>> SPARQL query. Implementations are therefore able to construct the
>>>>> dataset differently, e.g.
>>>>> a. dataset default graph contains only the data store's default
>>>>>    graph
>>>>> b. dataset default graph contains the RDF merge of all graphs in
>>>>>    the data store
>>>> It is correct that how the concrete default dataset of a SPARQL
>>>> endpoint is constructed is left open to implementations. Since
>>>> different endpoints and implementations support different
>>>> behaviours in this regard (e.g. in some implementations the default
>>>> graph of the default dataset is the union of all named graphs
>>>> whereas in others this is not the case), the working group does not
>>>> feel that there is a unique standard behavior to be advocated this
>>>> time around.
>>>
>>> I feel this is a shame, as two different implementations can
>>> produce different output from the simplest of queries, e.g.
>>> SELECT * { ?s ?p ?o }
>>>
>>> However, this is a separate issue.
>>>
>>>>> As soon as a single FROM or FROM NAMED clause is used then the
>>>>> data store's default graph is excluded from the query's dataset.
>>>>>
>>>>> Which means that there is no portable way to define a SPARQL query
>>>>> so that it executes only against the default graph in the data
>>>>> store - or even against a combination of the default graph and one
>>>>> or more named graphs.
>>>> Please note that a) querying the default graph in the datastore is
>>>> the standard behavior when no explicit FROM or FROM NAMED clauses
>>>> are given. b) the combination of querying named graphs and the
>>>> default graph of the endpoint's default dataset is supported via
>>>> GRAPH graph patterns.
>>>
>>> a) This is rather inconsistent. Above you say that the
>>> construction of the default RDF dataset (when no FROM/FROM
>>> NAMED clauses are given) is not defined, but here you say
>>> constructing it using the default graph only is the 'standard
>>> behaviour'. One of the motivations for this post is that
>>> there are good reasons not to have only the default graph in
>>> the 'default dataset', e.g. you wouldn't be able to do this
>>> to find out the graph names when presented with an unknown endpoint:
>>>
>>> SELECT DISTINCT ?g WHERE { GRAPH ?g {?s ?p ?o } }
>>>
>>> Anyway, the point here is that there is no *portable* way to
>>> query just the default graph.
>>>
>>> b) yes, but you can't query the RDF merge of the default
>>> graph and a named graph in the same way with two named
>>> graphs, e.g. FROM ex:g1 FROM ex:g2. Instead one would need to
>>> use a triple and graph pattern union, which for complex
>>> queries becomes cumbersome. Put another way, any combination
>>> of named graphs can be merged and explored with query triple
>>> patterns, but this can't be done with any combination of
>>> named graphs and the default graph.
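>>>
>>> To illustrate, a rough sketch (the ex: prefix, the graph names and
>>> the triple pattern are only placeholders): with two named graphs the
>>> merge is expressed purely in the dataset clauses,
>>>
>>>     PREFIX ex: <http://example.com#>
>>>     SELECT *
>>>     FROM ex:g1
>>>     FROM ex:g2
>>>     WHERE { ?s ?p ?o }
>>>
>>> whereas mixing the default graph with ex:g1 needs a union around
>>> every triple pattern, and only helps if the endpoint exposes ex:g1
>>> as a named graph in its default dataset:
>>>
>>>     PREFIX ex: <http://example.com#>
>>>     SELECT *
>>>     WHERE {
>>>       { ?s ?p ?o }
>>>       UNION
>>>       { GRAPH ex:g1 { ?s ?p ?o } }
>>>     }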
>>>
>>>
>>>> See also examples below.
>>>>
>>>>> This is a problem that often confuses users of RDF data stores and
>>>>> is likely to lead to implementations that provide their own
>>>>> specific means to achieve this, e.g.
>>>>> http://www.openrdf.org/issues/browse/SES-850
>>>>>
>>>>> Inspired by the update language's use of the 'DEFAULT' keyword for
>>>>> graph manipulation, I suggest an extension to the query language
>>>>> that allows "FROM DEFAULT" to be used, e.g.
>>>>>
>>>>> SELECT *
>>>>> FROM DEFAULT
>>>>> WHERE { ..... }
>>>>>
>>>>> => dataset contains a default graph made up of the data store's
>>>>> default graph only
>>>> Please note that this is the standard behaviour when no FROM clause is
>>>> given, i.e. this corresponds to
>>>>
>>>> SELECT *
>>>> WHERE { ..... }       <--- (no use of GRAPH keyword)
>>> I don't think this is "standard behaviour", rather it is
>>> common behaviour. It can not be standard when the
>>> construction of the dataset is implementation dependent when
>>> no FROM clause is given.
>>>
>>>>> This construct can be used with any number of FROM <uri> or FROM
>>>>> NAMED <uri> clauses, e.g.
>>>>>
>>>>> SELECT *
>>>>> FROM DEFAULT
>>>>> FROM <http://example.com#g1>
>>>>> WHERE { ..... }
>>>>>
>>>>> => dataset contains a default graph made up of the data store's
>>>>> default graph merged with the contents of the data store's g1 graph
>>>>>
>>>>> This would be a fairly trivial change for existing sparql processor
>>>>> implementations, but would provide a big improvement in
>>>>> functionality/flexibility by allowing a data store's default graph
>>>>> to be used/queried/merged in the same way as any of its named
>>>>> graphs.
>>>> Note that similar to the example above, you can query the default
>>>> graph and named graphs within the default dataset in a data store
>>>> side by side by using GRAPH graph patterns, i.e.
>>>>    SELECT *
>>>>    WHERE
>>>>    {
>>>>      .....                              <-- (no use of GRAPH) matches the default graph
>>>>      GRAPH <http://ex.com#g1> { .... }  <-- matches named graph g1 (assuming g1 is a named graph in the default dataset)
>>>>    }
>>> Consider an application that needs to execute queries over various
>>> subsets of a database's contents, where the subsets are defined using
>>> various combinations of named graphs. It would certainly be useful to
>>> have standard queries which only required the appropriate "FROM g1
>>> FROM g2 etc" prepended. This is easy to do, unless one of the graphs
>>> is the default graph.
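>>>
>>> As a sketch (the graph names and the pattern are made up for
>>> illustration), the application would keep one query body and only
>>> vary the dataset clauses:
>>>
>>>     PREFIX ex: <http://example.com#>
>>>     SELECT *
>>>     FROM ex:g1
>>>     FROM ex:g2
>>>     WHERE { ?s ?p ?o }
>>>
>>> There is no dataset clause that could be added to this list to pull
>>> in the data store's default graph as well.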
>>>
>>>> Finally, note that it is not possible in SPARQL1.1 to construct a
>>>> *new* dataset composed of *parts* of the default dataset of an
>>>> endpoint plus possible external graphs; such a feature is currently
>>>> not foreseen in the features addressed in this round of SPARQL, but
>>>> had been suggested before [1].
>>>> The features being worked on in this round of standardization have
>>>> been decided in a voting process at the beginning of the WG and are
>>>> documented in the following document:
>>>> http://www.w3.org/TR/sparql-features/
>>>> Additionally, a list of work items and features postponed to a
>>>> future working group are being collected by the group in a
>>>> dedicated wiki page [2] which also contains the features discussed
>>>> in the beginning of the WG which have not been considered for this
>>>> round [3].
>>>
>>> Yes, I will be more timely next time and will endeavour to progress
>>> this topic in the proper way. My apologies for the 'noise'.
>>>
>>> Regards,
>>> barry
>>>
>>>> Among this list, the feature "Composite Datasets" [1] might
>>>> partially capture what you have in mind and a future WG might
>>>> possibly work out the details of such feature.
>>>> We'd kindly ask you to confirm by a reply to this list that this
>>>> addresses your comment.
>>>> Axel Polleres, on behalf of the SPARQL WG
>>>>
>>>> 1. http://www.w3.org/2009/sparql/wiki/Feature:CompositeDatasets
>>>> 2. http://www.w3.org/2009/sparql/wiki/Future_Work_Items
>>>> 3. http://www.w3.org/2009/sparql/wiki/Category:Features
>>>
>
>
Received on Friday, 7 September 2012 14:30:36 UTC