Re: SD vocab updates: dataset descriptions from Andy Seaborne on 2009-11-22 (public-rdf-dawg@w3.org from October to December 2009)

From: Andy Seaborne <andy.seaborne@talis.com>
Date: Sun, 22 Nov 2009 22:35:05 +0000
To: Gregory Williams <greg@evilfunhouse.com>
CC: SPARQL Working Group <public-rdf-dawg@w3.org>
Message-ID: <4B09BC99.8030505@talis.com>
On 21/11/2009 00:08, Gregory Williams wrote:
> I've been updating the service description vocabulary based on the
> discussion at the F2F, and wanted to run the changes by everyone
> (since many of you weren't at the F2F).
>
> The primary change is related to the link between a SPARQL service and
> a dataset description. This is where we're going to be punting a bit
> to other vocabularies such as voiD, and letting them do the actual
> dataset descriptions. However, as I discussed at the F2F[1], I thought
> we needed a way to link a dataset to its default graph since the
> default graph is very SPARQL-specific and not something likely to show
> up in a more general dataset description vocabulary. Given this, I've
> changed the SD vocab in the following ways:

The diagram [1] has dataset(s) but also the available universe.
Is that related to describing update services as well?

> * Added a sd:Dataset class. I'm hoping we can work with the voiD group
> to make sure this aligns with their notion of a dataset so voiD
> properties could be attached to a sd:Dataset.

+1 to working with the voiD group on this.

voiD defines a dataset as:
"""
A dataset is a collection of data, published and maintained by a single 
provider, available as RDF on the Web, where at least some of the 
resources in the dataset are identified by dereferencable URIs.
"""
and there are properties that imply it's one graph: void:dataDump points 
to an RDF graph so can't be applied to a whole RDF dataset.

I think we will need to define the structural elements of an RDF dataset 
(sparql:Dataset) if that isn't available elsewhere.

Reusing is ideal but an RDF dataset is something from this WG (DAWG++) 
so defining its structural elements seems reasonable.  Better than 
publishing SD without there being any candidate vocabulary.

I have used, in Joseki and ARQ, the following:

----------
## Dataset

:RDFDataset     rdf:type    rdfs:Class .

# Points to the description.
:defaultGraph   rdf:type    rdf:Property .

# Points to the name + description
:namedGraph     rdf:type    rdf:Property .

## .. given by:
:graphName     rdf:type     rdf:Property .
:graphData     rdf:type     rdf:Property .
----------

Example 1:: default graph only:

# A dataset of one model as the default graph, data loaded from a file.
<#books>   rdf:type ja:RDFDataset ;
     rdfs:label "Books" ;
     ja:defaultGraph
       [ rdfs:label "books.n3" ;
         ... load content from ...
       ]
     .

Example 2:: with named graphs:

<#ds1>   rdf:type ja:RDFDataset ;
     ja:defaultGraph    <#model1> ;
     rdfs:label "Dataset 1" ;
     ja:namedGraph
         [ ja:graphName      <http://example.org/name1> ;
           ja:graph          <#model1> ] ;
     ja:namedGraph
         [ ja:graphName      <http://example.org/name2> ;
           ja:graph          <#model2>
         ] ;
     .

ARQ and Joseki use this to assemble datasets.  Our graph descriptions 
are declarative descriptions of how to build a graph so not applicable 
here.  That's why it's "graphData" but the name isn't so appropriate here.

FYI: http://openjena.org/assembler/

but the general structure of:


[] a sparql:Dataset ;
    :defaultGraph [ a :Graph  ; ... ] ;
    :namedGraph [ :name "http://example/graph1" ;
                  :contents
                  [ a :Graph ;
                    ....
                  ]
                ] ;
    :namedGraph [ ... ]
    .

is applicable and has worked well for us.

I used sparql:Dataset rather than sd: as it's a general concept in 
SPARQL.  Looking at the SD doc, it seems to me that the sd: classes are 
about SPARQL concepts.

> * Replaced sd:datasetDescription with two new terms: sd:defaultDataset
> and sd:availableDataset. sd:defaultDataset links a sd:Service with a
> description of the default dataset used for query answering if none is
> provided by the query or protocol. It may use the defaultGraph
> property described below. sd:availableDataset links a sd:Service with
> a description of a dataset containing named graphs that may be used in
> FROM/FROM NAMED clauses.

Probably needed but does it not come through lists of all graphs at a 
service?

The concept of availableDataset seems to mix two things.

1/ It's enumerating the datasets that can be formed from limited choices 
of FROM/FROM NAMED:

   FROM one of ...
   FROM NAMED one of ....

If you have 5 graphs then I made it 150 possible distinct datasets :-) 
without allowing for union graphs. (Sum of the n-th row of Pascal's 
triangle for named graphs * N for choices of default graph. Not all 
combinations are useful in practice.)

Maybe describing the range of values for FROM/FROM NAMED is better.

2/ There are some preselected ways to name the dataset description with 
only some of all datasets possible from all graph names over FROM NAMED.
Without some extension, this can only be done by conditional choice of 
FROM, FROM NAMED, e.g. reject queries that don't meet additional rules like 
at most 3 FROM NAMED.  In this case, different service URLs for 
different datasets can be used.  Otherwise we are defining a new 1st 
class concept and that might be better done waiting until the next WG. 
(Not sure.)
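
Working the count in 1/ through, as a sketch: assuming any subset of the 
N = 5 graphs (including the empty one) can be the FROM NAMED set, and 
exactly one of the 5 is chosen for FROM, the formula in parentheses gives

\[
  N \cdot \sum_{k=0}^{N} \binom{N}{k} \;=\; N \cdot 2^{N} \;=\; 5 \cdot 32 \;=\; 160
\]

A figure of 150 results if the empty and the all-five FROM NAMED choices 
are left out, (2^5 - 2) * 5 = 150; whether that exclusion was intended is 
a guess.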

> * Added URL variants of the above two terms: sd:defaultDatasetURL and
> sd:availableDatasetURL. These are meant to allow linking not to the
> dataset description directly but to a dereferencable document that
> contains such descriptions. This allows the service description to be
> kept small while providing access to very large dataset descriptions.
>
> * Added sd:defaultGraph term for linking a sd:Dataset with a
> description of the default graph in a dataset. For now I'm leaving the
> rdfs:range of this term open, allowing vocabularies like voiD to do it
> themselves.

See above. And add sd:namedGraph too, as presumably the named graphs may 
not be the entire available universe (or an end point) if I read the 
diagram correctly.
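
A sketch of how the pieces might fit together in a service description. 
Assumptions: the sd: namespace IRI shown, the sd:defaultDataset and 
sd:defaultGraph terms as described in the quoted text, the sd:namedGraph 
property suggested here, and a :name property borrowed from the general 
structure above. All example.org IRIs are made up. Attaching 
void:dataDump to a single graph follows the voiD reading given earlier.

----------
@prefix sd:   <http://www.w3.org/ns/sparql-service-description#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix :     <http://example.org/ns#> .   # hypothetical namespace

<http://example.org/sparql> a sd:Service ;
    sd:defaultDataset
        [ a sd:Dataset ;
          # Default graph, described with a voiD property.
          sd:defaultGraph
              [ void:dataDump <http://example.org/data/default.rdf> ] ;
          # One named graph: its name plus a description.
          sd:namedGraph
              [ :name <http://example.org/graph1> ;
                void:dataDump <http://example.org/data/graph1.rdf> ]
        ] .
----------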

>
>
> I'd like to get some feedback on these changes from the group. In
> particular, I'm curious about people's feelings on two issues:
>
> (1) should the use of sd:availableDataset imply that the endpoint will
> only allow use of the named graphs in FROM/FROM NAMED clauses, or
> could it be used simply to link to locally cached/generated
> descriptions of commonly used datasets? If the latter, we (or somebody
> else) could coin a sd:feature IRI to indicate that an endpoint has the
> ability to dereference graph URLs.

See above and +1 to a feature URI for the offer to dereference.

>
> (2) How do people feel about the URL variants of the dataset
> properties? I know several people had indicated that they wanted a way
> to link to dataset descriptions without including them in the service
> description and these terms were created to satisfy that
> need. However, the logistics of actually using the terms feel a bit
> strange to me (do you just search for any dataset instance in the
> retrieved RDF similar to how foaf:PersonalProfileDocument is used?)
> and there are other ways this could be handled (we might assume that
> if a dataset description isn't in the SD RDF then the dataset IRI is
> dereferencable and will return the description).

Feels strange to me too but I do see the practical issue of the whole 
description being quite large and a core smaller set being what most 
apps need to start with.  Short of SPARQL queries over the SD, this 
split seems the way forward IMHO.

sd:defaultDatasetURL and sd:availableDatasetURL are like rdfs:seeAlso, 
but more specific.
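
For illustration, a sketch of the split, with made-up example.org IRIs 
and an assumed sd: namespace IRI: the service description carries only a 
link, and the linked document holds the sd:Dataset description.

----------
@prefix sd:   <http://www.w3.org/ns/sparql-service-description#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# In the (small) service description:
<http://example.org/sparql> a sd:Service ;
    sd:defaultDatasetURL <http://example.org/desc/default-dataset> ;
    # Generic alternative, saying less about what is found there:
    rdfs:seeAlso         <http://example.org/desc/default-dataset> .

# In the document retrieved from that URL:
[] a sd:Dataset ;
    sd:defaultGraph [ ... ] .
----------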

>
> Thoughts?
>
> thanks,
> .greg
>
>
> [1] Whiteboard drawing of the dataset description modeling from F2F2:
> <http://thefigtrees.net/lee/dl/sparql-IMG00009-20091103-1508.jpg>. The
> unlabeled blue arc should be "default graph".
>
>

	Andy
Received on Sunday, 22 November 2009 22:35:38 UTC