RDF dataset/SPARQL endpoint descriptions

Hello,
(CCed to public-lod; I suggest continuing the discussion on the DAWG
list)

I would like to kick off a general discussion regarding RDF dataset  
metadata (e.g. voiD [1]), SPARQL endpoint descriptions [2] and the
possible integration of statistics (e.g. RDFStats [3]) to support  
Semantic Web middleware/applications. The general opinion regarding  
SPARQL extensions for the next REC seems to be to keep things as simple
as possible (see e.g. the recent discussion regarding full-text search
[4]). I
think, while the standard should remain simple (SPARQL-compliant  
implementations should be possible with little effort and small size),  
it would be very useful to provide at least extension points that can  
be further standardized by the community. For instance, full-text  
search capability, aggregates, initial bindings, etc. could be  
announced by the endpoint such as here: http://kasei.us/sparql?about=1

Similarly, it should be possible to announce the availability of voiD  
metadata and statistics. While [2] targets SPARQL endpoints only
(including those reached via non-HTTP protocols such as ODBC in
Virtuoso), voiD [1] targets LOD datasets in general, which may be
available in different forms (single RDF documents, data dumps, RDFa,
and SPARQL endpoints). I think it is very important to find a best-practice
solution which integrates voiD, future SPARQL endpoint descriptions,  
and a consensus on statistics (possibly based on SCOVO [5] - which  
should be improved, see [6]).

Main questions include:

Q1) How to provide/consume endpoint descriptions in general?
     The authors of voiD [1] suggest back-linking from resources  
(documents, dumps, etc.) to a voiD dataset.
     In case of a SPARQL endpoint they suggest discovery via  
sitemap.xml ([1] 5.2.)

     Problem:
       Only works via HTTP, only works for 1 endpoint per domain.
     Sub-questions:
       a) Should non-HTTP protocols be supported?
       b) Should multiple SPARQL endpoints per domain be possible? -  
In my opinion it should.
     Other suggestions based on [2]:
       1. SPARQL extension, like "DESCRIBE SELF" (by AndyS)
          1.1. could return a resolvable URI of the void:Dataset
          1.2. could return the URI of a named graph to query (works  
with non-HTTP protocols)
       2. HTTP header, e.g. X-endpoint-description: http://kasei.us/sparql?about=1
       3. new protocol operation: HTTP OPTIONS for returning the  
description
       4. Named graph
          4.1. graph IRI retrieved with DESCRIBE SELF (see 1.2.)
          4.2. graph IRI == SPARQL endpoint URI
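
     To make options 2 and 4.2 a bit more concrete, here is a rough
     Python sketch of how a client might decide where to look for a
     description. It is only an illustration: the header name is the
     one proposed in option 2 (not a standard), the function name
     find_description_uri is made up, and the fallback assumes option
     4.2 (graph IRI == endpoint URI). No network access is involved;
     the caller passes in whatever response headers it already has.

```python
from typing import Mapping

# Header name taken from option 2 above; it is a proposal, not a standard.
DESCRIPTION_HEADER = "X-endpoint-description"


def find_description_uri(endpoint_uri: str, headers: Mapping[str, str]) -> str:
    """Return the URI where the endpoint description should be looked up.

    Prefers the proposed HTTP header (option 2). If the header is absent
    or empty, falls back to using the endpoint URI itself as the graph
    IRI (option 4.2). Both behaviours are assumptions for illustration.
    """
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    announced = lowered.get(DESCRIPTION_HEADER.lower())
    if announced:
        return announced
    return endpoint_uri
```

     A client would call this with the headers of any response from the
     endpoint and then either dereference the result or query it as a
     named graph.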

Q2) Which metadata (w/o statistics) to include?
     1. Is there any problem with voiD or what would be missing in  
voiD for SPARQL endpoints?

Q3) Which statistics to include?
     1. simple counts for resources in total, per class / untyped
     2. number of documents in case of data dumps
     3. selectivities for properties (untyped and with given class)
     4. histograms for property values (untyped and with given class,  
can be generated with [3])
     5. Is SCOVO sufficient?
        5.1. A bit verbose: dimensions should be simplified [6].
        5.2. Encoding histograms not trivial: SCOVO min/max only  
useful for integers and nominal scales, RDFStats uses base64-encoded  
literals
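
     As a small illustration of item 1 (simple counts), the following
     Python sketch computes the total, per-class and untyped counts
     over an in-memory list of triples. It is not how RDFStats [3]
     works; the function name class_counts and the plain-tuple triple
     representation are assumptions, and "resource" is taken to mean
     "distinct subject" purely to keep the example short.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def class_counts(triples):
    """Count distinct subjects in total, per class, and without rdf:type.

    `triples` is an iterable of (subject, predicate, object) string tuples.
    """
    subjects = set()
    members_by_class = {}  # class IRI -> set of subject IRIs
    for s, p, o in triples:
        subjects.add(s)
        if p == RDF_TYPE:
            members_by_class.setdefault(o, set()).add(s)

    # A subject with several rdf:type statements is counted once per class.
    typed_subjects = set().union(*members_by_class.values())
    return {
        "total": len(subjects),
        "per_class": {cls: len(m) for cls, m in members_by_class.items()},
        "untyped": len(subjects - typed_subjects),
    }
```

     The resulting numbers would still have to be published somehow
     (e.g. as SCOVO items), which is exactly the open question in
     item 5.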

Q4) Some metadata is specific to SPARQL endpoints and irrelevant for
datasets in general (e.g. collections, dumps, etc.), but common metadata
should be reused from voiD for maximum interoperability. We should
define a separate vocabulary for SPARQL endpoint descriptions, but
re-use common properties from voiD.
     Which metadata is SPARQL specific?
     a) full-text search...

Since this may become a larger mindmap, it would be better to work it  
out in a wiki. Any suggestions where to continue?

Regards,
AndyL


[1] http://rdfs.org/ns/void-guide
[2] http://www.w3.org/2009/sparql/wiki/Feature:ServiceDescriptions
[3] http://rdfstats.sourceforge.net
[4] http://www.nabble.com/Free-text-search-and-SPARQL-New-Features-and-Rationale-draft-to24324606.html
[5] http://purl.org/NET/scovo#
[6] http://code.google.com/p/void-impl/issues/detail?id=18

http://www.langegger.at
----------------------------------------------------------------------
Dipl.-Ing.(FH) Andreas Langegger
FAW - Institute for Application-oriented Knowledge Processing
Johannes Kepler University Linz
A-4040 Linz, Altenberger Straße 69

Received on Friday, 10 July 2009 17:04:02 UTC