Re: RDF dataset/SPARQL endpoint descriptions from Benjamin Nowack on 2009-07-27 (public-rdf-dawg-comments@w3.org from July 2009)

From: Benjamin Nowack <bnowack@semsol.com>
Date: Mon, 27 Jul 2009 17:40:31 +0200
To: Andreas Langegger <al@jku.at>
Cc: public-rdf-dawg-comments@w3.org
Message-ID: <PM-GA.20090727174031.AA329.1.1D@semsol.com>

Hi Andreas,

Thanks for starting this thread.

Completing the endpoint description feature is still on my list for 
ARC. What I've got so far WRT Q1 is that the endpoint returns the 
description directly at its URL when the Accept header indicates RDF 
as the preferred result format and no query parameter is sent. It was 
pretty easy to implement and didn't require the invention of a syntax 
extension or magic queries/graphs. If the header checker ranks HTML 
over RDF, the usual input form is served.
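
To make the decision logic concrete, here is a minimal sketch in Python. It is not ARC's actual code; the function names, the set of RDF media types, and the choice to serve the form on a tie are all assumptions made for illustration.

```python
# Hypothetical sketch of the endpoint-level check described above.
# RDF_TYPES, HTML_TYPES and both function names are illustrative assumptions.
RDF_TYPES = {"application/rdf+xml", "text/turtle"}
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def parse_accept(header):
    """Return a list of (media_type, q) pairs from an Accept header value."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # q defaults to 1 when the parameter is absent
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        prefs.append((media_type, q))
    return prefs


def choose_response(accept_header, params):
    """Decide what the endpoint URL serves: 'query', 'description' or 'form'."""
    if "query" in params:
        return "query"  # a query parameter always goes to the query processor
    prefs = parse_accept(accept_header or "")
    best_rdf = max((q for t, q in prefs if t in RDF_TYPES), default=0.0)
    best_html = max((q for t, q in prefs if t in HTML_TYPES), default=0.0)
    # Assumption: ties and wildcard-only headers fall back to the HTML form.
    if best_rdf > 0 and best_rdf > best_html:
        return "description"
    return "form"
```

With these assumptions, a request with `Accept: application/rdf+xml` and no query parameter gets the description, while a browser-style header that ranks `text/html` higher gets the input form.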

I'm not so sure about using a special DESCRIBE as a basis as it would 
require squeezing the endpoint description feature into the query 
processor instead of being able to directly catch the description
request at the endpoint level. It would be doable, though, and would
enable the non-HTTP access you mentioned. Messing with headers (as in
my approach) can be fragile.

In case of DESCRIBE, I think I'd personally prefer a "DESCRIBE <>" 
instead of "DESCRIBE SELF" and then directly serve the endpoint 
description, not some intermediate URI (but I see the potential 
"too much data" problem with a single description).
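
As an illustration of why "DESCRIBE <>" fits here: the empty relative IRI resolves to the base IRI, which for a query sent to an endpoint can be taken to be the endpoint's own URL. The Python sketch below rests on that assumption. The regular expression and the function name are hypothetical, and a real implementation would rely on the SPARQL parser (including any BASE declaration) rather than on pattern matching.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: a bare DESCRIBE with a single IRI and nothing else.
DESCRIBE_IRI = re.compile(r"^\s*DESCRIBE\s*<([^<>\s]*)>\s*$", re.IGNORECASE)


def describes_endpoint(query, endpoint_url):
    """True if the query is a bare DESCRIBE whose IRI resolves to the endpoint."""
    match = DESCRIBE_IRI.match(query)
    if match is None:
        return False
    # Assumption: the endpoint URL acts as the base IRI, so "<>" resolves to it.
    return urljoin(endpoint_url, match.group(1)) == endpoint_url
```

For an endpoint at `http://example.org/sparql` (a placeholder URL), both `DESCRIBE <>` and `DESCRIBE <http://example.org/sparql>` would be recognised as requests for the endpoint description.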

I'm totally undecided re the actual data to serve, though. My immediate
preference/need is statistics, namespaces, extensions (like COUNT, LOAD
etc.) and resource types, but I'm missing a vocabulary that is optimized
for the given use case and which also feels stable and/or maintained.

Looking forward to the outcome of the service description task.

Cheers,
Benji

--
Benjamin Nowack
http://bnode.org/
http://semsol.com/




>Hello,
>(CCed to public-lod, I suggest to continue the discussion in the DAWG  
>list)
>
>I would like to kick off a general discussion regarding RDF dataset  
>meta data (e.g. voiD [1]), SPARQL endpoint descriptions [2] and the  
>possible integration of statistics (e.g. RDFStats [3]) to support  
>Semantic Web middleware/applications. The general opinion regarding  
>SPARQL extensions for the next REC seems to be staying as simple as  
>possible (e.g. recent discussion regarding fulltext search [4]). I  
>think, while the standard should remain simple (SPARQL-compliant  
>implementations should be possible with little effort and small size),  
>it would be very useful to provide at least extension points that can  
>be further standardized by the community. For instance, full-text  
>search capability, aggregates, initial bindings, etc. could be  
>announced by the endpoint such as here: http://kasei.us/sparql?about=1
>
>Similarly, it should be possible to announce the availability of voiD  
>metadata and statistics. While [2] is targeted to SPARQL endpoints  
>only, including non-HTTP protocols such as ODBC (Virtuoso), voiD [1] is  
>targeted to LOD datasets in general which may be available in  
>different forms (single RDF documents, data dumps, RDFa, and SPARQL  
>endpoints). I think it is very important to find a best-practice  
>solution which integrates voiD, future SPARQL endpoint descriptions,  
>and a consensus on statistics (possibly based on SCOVO [5] - which  
>should be improved, see [6]).
>
>Main questions include:
>
>Q1) How to provide/consume endpoint descriptions in general?
>     The authors of voiD [1] suggest back-linking from resources  
>(documents, dumps, etc.) to a voiD dataset.
>     In case of a SPARQL endpoint they suggest discovery via  
>sitemap.xml ([1] 5.2.)
>
>     Problem:
>       Only works via HTTP, only works for 1 endpoint per domain.
>     Sub-questions:
>       a) Should non-HTTP protocols be supported?
>       b) Should multiple SPARQL endpoints per domain be possible? -  
>In my opinion it should.
>     Other suggestions based on [2]:
>       1. SPARQL extension, like "DESCRIBE SELF" (by AndyS)
>          1.1. could return a resolvable URI of the void:Dataset
>          1.2. could return the URI of a named graph to query (works  
>with non-HTTP protocols)
>       2. HTTP header, e.g. X-endpoint-description:
>http://kasei.us/sparql?about=1
>       3. new protocol operation: HTTP OPTIONS for returning the  
>description
>       4. Named graph
>          4.1. graph IRI retrieved with DESCRIBE SELF (see 1.2.)
>          4.2. graph IRI == SPARQL endpoint URI
>
>Q2) Which metadata (w/o statistics) to include?
>     1. Is there any problem with voiD or what would be missing in  
>voiD for SPARQL endpoints?
>
>Q3) Which statistics to include?
>     1. simple counts for resources in total, per class / untyped
>     2. number of documents in case of data dumps
>     3. selectivities for properties (untyped and with given class)
>     4. histograms for property values (untyped and with given class,  
>can be generated with [3])
>     5. Is SCOVO sufficient?
>        5.1. A bit verbose: dimensions should be simplified [6].
>        5.2. Encoding histograms not trivial: SCOVO min/max only  
>useful for integers and nominal scales, RDFStats uses base64-encoded  
>literals
>
>Q4) Some metadata are SPARQL-endpoint specific and irrelevant for  
>datasets in general (i.e. collections/dumps/etc) - but common metadata  
>should be reused from voiD for maximum interoperability. We should  
>define a separate vocabulary for SPARQL endpoint descriptions, but re- 
>use common properties from voiD.
>     Which metadata is SPARQL specific?
>     a) full-text search...
>
>Since this may become a larger mindmap, it would be better to work it  
>out in a wiki. Any suggestions where to continue?
>
>Regards,
>AndyL
>
>
>[1] http://rdfs.org/ns/void-guide
>[2] http://www.w3.org/2009/sparql/wiki/Feature:ServiceDescriptions
>[3] http://rdfstats.sourceforge.net
>[4] http://www.nabble.com/Free-text-search-and-SPARQL-New-Features-and-Rationale-draft-to24324606.html
>[5] http://purl.org/NET/scovo#
>[6] http://code.google.com/p/void-impl/issues/detail?id=18
>
>http://www.langegger.at
>----------------------------------------------------------------------
>Dipl.-Ing.(FH) Andreas Langegger
>FAW - Institute for Application-oriented Knowledge Processing
>Johannes Kepler University Linz
>A-4040 Linz, Altenberger Straße 69
Received on Monday, 27 July 2009 15:41:13 UTC