Re: RDF dataset/SPARQL endpoint descriptions from Andreas Langegger on 2009-08-04 (public-rdf-dawg-comments@w3.org from August 2009)

From: Andreas Langegger <al@jku.at>
Date: Tue, 4 Aug 2009 22:39:38 +0200
To: bnowack@semsol.com
Cc: public-rdf-dawg-comments@w3.org
Message-Id: <A2C110F8-7BDC-4613-B1DD-0DDBD3705B0B@jku.at>
## CONTINUED ##

Hi Benji,

argh, sorry, accidentally hit the send button before...

On Jul 27, 2009, at 5:40 PM, Benjamin Nowack wrote:

> Completing the endpoint description feature is still on my list for
> ARC, what I've got so far WRT to Q1 is that the endpoint returns the
> description directly at its URL when the accept header indicates RDF
> as preferred result format and when no query parameter is sent. It was
> pretty easy to implement and didn't require the invention of a syntax
> extension or magic queries/graphs. If the header checker ranks HTML
> over RDF, the usual input form is served.

Sounds reasonable, that's the HTTP header approach.
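
A minimal sketch of that dispatch, in Python for illustration only (ARC's actual code is not shown in this thread). The function names, the two sets of media types, and the returned labels are assumptions made for the example:

```python
# Hypothetical media type sets; a real endpoint may accept more.
RDF_TYPES = {"application/rdf+xml", "text/turtle"}
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def parse_accept(header):
    """Return (media_type, q) pairs from an Accept header, highest q first."""
    ranked = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        ranked.append((media_type, q))
    # sorted() is stable, so equal q-values keep their header order
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


def choose_response(accept_header, has_query):
    """Decide what a request to the endpoint URL should be answered with."""
    if has_query:
        return "query-result"
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type in RDF_TYPES:
            return "description"
        if media_type in HTML_TYPES:
            return "html-form"
    return "html-form"  # nothing recognised: serve the usual input form
```

For example, choose_response("text/html,application/rdf+xml;q=0.5", False) selects the input form, because HTML is ranked above RDF.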

> I'm not so sure about using a special DESCRIBE as a basis as it would
> require squeezing the endpoint description feature into the query
> processor instead of being able to directly catch the description
> request at the endpoint level. It would be doable, though, and would
> enable the non-HTTP access you mentioned. Messing with headers (as in
> my approach) can be fragile.

You're right, but I suggest we discuss this now, while SPARQL/Query 1.1
is in progress, and leave implementation effort aside in order to find
the best approach in terms of architectural style. I implemented
DESCRIBE SELF and DESCRIBE DATASET in Joseki and will explain my
findings. After some experiments, I came to the following conclusions:

1. I advocate for a separation of service description and dataset  
description:

The service description includes information such as supported query  
syntax, result formats, features such as fulltext search, entailment  
regimes, etc. and won't change in the short run as it depends on the  
implementation of the processor and SPARQL endpoint.

The dataset description is only related to the data behind the  
endpoint, independent of the access interface. It will change more  
frequently, especially if it contains dc:subjects and detailed  
statistics.

2. Differences between endpoint/dataset description for SPARQL  
endpoints and Linked Datasets:

Although both use cases have much in common, differences exist in  
terms of resource discovery and the kind of meta data. While  
describing SPARQL endpoints is similar to describing Web services with  
WSDL, describing Linked Datasets is similar to content discovery  
(robots.txt => sitemap.xml => void:Dataset...). As explained in 1.)  
SPARQL endpoints in addition have a technical service description,  
while Linked Datasets only have dataset descriptions (e.g. in voiD).  
Both use cases should be aligned though.

3. Requirements that should be fulfilled:

I.   It should be possible to fetch SPARQL endpoint meta data via the
endpoint itself and not via some "well-known" HTTP URI such as
/robots.txt etc.
II.  It should be possible for clients to just look at the HTTP
headers and status code and see if endpoint meta data have changed
(Last-Modified). This is especially important when low-level
statistics are provided and periodically fetched by clients (we will
provide more details in future in the context of federation).
III. It should be possible to either completely fetch SPARQL endpoint
and dataset description based on HTTP URI resolution or query it via
the SPARQL endpoint (e.g. by supporting sub-queries in upcoming
SPARQL/Query 1.1.)
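
Requirement II corresponds to an ordinary HTTP conditional GET. A rough server-side sketch in Python, assuming the endpoint knows when its meta data last changed; the function name and return shape are made up for illustration:

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_response(last_modified, if_modified_since=None):
    """Return (status, headers) for a GET of the endpoint description.

    last_modified: UTC-aware datetime of the last meta data change.
    if_modified_since: raw If-Modified-Since header value, or None.
    """
    # HTTP dates have one-second resolution, so drop microseconds first.
    last_modified = last_modified.replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparsable header: fall through to a full response
        if since is not None and since.tzinfo is not None and last_modified <= since:
            return 304, headers  # unchanged: the client can keep its copy
    return 200, headers
```

A client that periodically fetches statistics would send back the Last-Modified value it saw last time and only re-download the description when the status is 200.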

4. My proposal (which I've implemented in Joseki/ARQ):

--------------------
"DESCRIBE SELF" => returns SPARQL endpoint service description, e.g.:

<?xml version="1.0"?>
<rdf:RDF ...namespace declarations....>
  <sd:Service>
    <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Joseki SPARQL Endpoint</rdfs:label>
    <saddle:resultFormat rdf:parseType="Resource">
      <saddle:spec rdf:resource="http://www.w3.org/TR/rdf-sparql-json-res/"/>
      <saddle:mediaType rdf:datatype="http://www.w3.org/2001/XMLSchema#string">application/sparql-results+json</saddle:mediaType>
      <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">SPARQL/JSON</rdfs:label>
    </saddle:resultFormat>
    <saddle:queryLanguage rdf:parseType="Resource">
      <saddle:spec rdf:resource="http://www.w3.org/TR/rdf-sparql-query/"/>
      <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">SPARQL</rdfs:label>
    </saddle:queryLanguage>
    <!-- Link to the dataset-specific meta data in voiD -->
    <saddle:dataSet rdf:resource="http://midearth:8900/void"/>
...
  </sd:Service>
</rdf:RDF>

As explained in 1. only the service description is returned, but a  
resolvable URI for the dataset description is included.
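
A client-side sketch of following that link, in Python. It reads a trimmed copy of the example with the standard library's XML parser rather than a real RDF parser, so it only works for this particular serialization. The sd: and saddle: namespace URIs are placeholders, since the example above elides the namespace declarations:

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
# Placeholder namespace URIs (assumption): the real declarations are elided.
SD = "http://example.org/sd#"
SADDLE = "http://example.org/saddle#"

# Trimmed copy of the DESCRIBE SELF result shown above.
SERVICE_DESCRIPTION = f"""<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="{RDF}" xmlns:rdfs="{RDFS}"
         xmlns:sd="{SD}" xmlns:saddle="{SADDLE}">
  <sd:Service>
    <saddle:resultFormat rdf:parseType="Resource">
      <saddle:mediaType>application/sparql-results+json</saddle:mediaType>
    </saddle:resultFormat>
    <saddle:dataSet rdf:resource="http://midearth:8900/void"/>
  </sd:Service>
</rdf:RDF>"""


def summarize(xml_text):
    """Extract result media types and the dataset description URI."""
    root = ET.fromstring(xml_text)
    service = root.find(f"{{{SD}}}Service")
    media_types = [el.text.strip() for el in service.iter(f"{{{SADDLE}}}mediaType")]
    dataset = service.find(f"{{{SADDLE}}}dataSet")
    dataset_uri = dataset.get(f"{{{RDF}}}resource") if dataset is not None else None
    return {"result_media_types": media_types, "dataset": dataset_uri}
```

The returned dataset URI is the one a client would then resolve to obtain the voiD and RDFStats description.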

--------------------
"DESCRIBE DATASET" => returns data set description in voiD and  
RDFStats, e.g.:

<?xml version="1.0"?>
<rdf:RDF
     xmlns:scovo="http://purl.org/NET/scovo#"
     xmlns:stats="http://rdfstats.sourceforge.net/vocab/rdfstats#"
     xmlns:void="http://rdfs.org/ns/void#" ...other namespaces...>
   <void:Dataset rdf:about="http://midearth:8900/void#">
     <foaf:homepage rdf:resource="http://midearth:8900/"/>
     <dcterms:description>A SemWIQ-powered Linked Data endpoint  
providing RDF data via SPARQL at http://midearth:8900/sparql and  
linked data at http://midearth:8900/resource/...</dcterms:description>
     <dcterms:title>BSBM</dcterms:title>
   </void:Dataset>
   <stats:PropertyHistogram>
     <rdf:value>ATKM/WrFOrYAAAAUaHR0cDovL3d3dy53My5vcmcvMjAwMS9YTUxTY2hlbWEjZGF0ZVRpbWUDAAAB
Fy1IeYAAAAEaS0s/AAAAAAAAAAACAAAAAAAAAAQAAAAAAAAAAgAAAAAAAAABAAAAAAAAAAMAAAAA
AAAABAAAAAAAAAADAAAAAAAAAAQAAAAAAAAAAwAAAAAAAAAKAAAAAAAAAAYAAAAAAAAACQAAAAAA
AAAGAAAAAAAAAAYAAAAAAAAAAQAAAAAAAAABAAAAAAAAAAYAAAAAAAAABAAAAAAAAAAEAAAAAAAA
AAEAAAAAAAAAOw==</rdf:value>
     <scovo:dataset>
       <stats:RDFStatsDataset rdf:nodeID="A0">
         <dc:creator>dorgon@midearth</dc:creator>
         <dc:date rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2009-08-04T20:28:58.084Z</dc:date>
         <stats:sourceType rdf:resource="http://rdfstats.sourceforge.net/vocab/rdfstats#SPARQLEndpoint"/>
         <stats:sourceUrl rdf:resource="http://midearth:8900/sparql"/>
       </stats:RDFStatsDataset>
     </scovo:dataset>
     <stats:propertyDimension rdf:resource="http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/validFrom"/>
     <stats:rangeDimension rdf:resource="http://www.w3.org/2001/XMLSchema#dateTime"/>
   </stats:PropertyHistogram>
...

To query the meta data, I currently allow sources to be specified
using FROM:
SELECT * FROM <http://midearth:8900/void> WHERE { ?s a stats:RDFStatsDataset ; ?p ?o }

In future this could be supported without an external FROM source,
using sub-queries such as
SELECT * FROM { DESCRIBE DATASET } WHERE { ... }
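
For the FROM variant that works today, a client request could be built as in the following Python sketch. The endpoint and graph URLs are taken from the example description above; the PREFIX line is added here because the one-line query omits it, and the helper name is made up:

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "http://midearth:8900/sparql"  # stats:sourceUrl in the example
VOID_GRAPH = "http://midearth:8900/void"  # saddle:dataSet in the example

QUERY = f"""PREFIX stats: <http://rdfstats.sourceforge.net/vocab/rdfstats#>
SELECT * FROM <{VOID_GRAPH}>
WHERE {{ ?s a stats:RDFStatsDataset ; ?p ?o }}"""


def query_url(endpoint, query):
    """Build the GET URL for a SPARQL protocol query request."""
    # The SPARQL protocol passes the query text in the 'query' parameter.
    return f"{endpoint}?{urlencode({'query': query})}"


url = query_url(ENDPOINT, QUERY)
```

Fetching that URL with an Accept header for a SPARQL result format would return the bindings for the statistics dataset node.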

> In case of DESCRIBE, I think I'd personally prefer a "DESCRIBE <>"
> instead of "DESCRIBE SELF" and then directly serve the endpoint
> description, not some intermediate URI (but I see the potential
> "too much data" problem with a single description).

Dataset meta data can include statistics in order to support  
estimation of expected result sizes and federation in future - we will  
provide details soon. Thus, the result can be several kilobytes to  
megabytes if the dataset is huge. The service description however will  
be rather small in general.

> I'm totally undecided re the actual data to serve, though. My immediate
> preference/need is statistics, namespaces, extensions (like COUNT, LOAD
> etc.) and resource types, but I miss a vocabulary that is optimized for
> the given use case and which also feels stable and/or maintained.

I also miss such a vocabulary. I would be happy to address this at a  
VoCamp or at ISWC together with you, Michael Hausenblas, Jürgen  
Umbrich, Axel Polleres, and others...

Regards,
Andy

>> Hello,
>> (CCed to public-lod, I suggest to continue the discussion in the DAWG
>> list)
>>
>> I would like to kick off a general discussion regarding RDF dataset
>> meta data (e.g. voiD [1]), SPARQL endpoint descriptions [2] and the
>> possible integration of statistics (e.g. RDFStats [3]) to support
>> Semantic Web middleware/applications. The general opinion regarding
>> SPARQL extensions for the next REC seems to be staying as simple as
>> possible (e.g. recent discussion regarding fulltext search [4]). I
>> think, while the standard should remain simple (SPARQL-compliant
>> implementations should be possible with little effort and small size),
>> it would be very useful to provide at least extension points that can
>> be further standardized by the community. For instance, full-text
>> search capability, aggregates, initial bindings, etc. could be
>> announced by the endpoint such as here: http://kasei.us/sparql?about=1
>>
>> Similarly, it should be possible to announce the availability of voiD
>> metadata and statistics. While [2] is targeted to SPARQL endpoints
>> only, including non-HTTP protocols such as ODBC (Virtuoso), voiD [1] is
>> targeted to LOD datasets in general, which may be available in
>> different forms (single RDF documents, data dumps, RDFa, and SPARQL
>> endpoints). I think it is very important to find a best-practice
>> solution which integrates voiD, future SPARQL endpoint descriptions,
>> and a consensus on statistics (possibly based on SCOVO [5] - which
>> should be improved, see [6]).
>>
>> Main questions include:
>>
>> Q1) How to provide/consume endpoint descriptions in general?
>>   The authors of voiD [1] suggest back-linking from resources
>> (documents, dumps, etc.) to a voiD dataset.
>>   In case of a SPARQL endpoint they suggest discovery via
>> sitemap.xml ([1] 5.2.)
>>
>>   Problem:
>>     Only works via HTTP, only works for 1 endpoint per domain.
>>   Sub-questions:
>>     a) Should non-HTTP protocols be supported?
>>     b) Should multiple SPARQL endpoints per domain be possible? -
>> In my opinion it should.
>>   Other suggestions based on [2]:
>>     1. SPARQL extension, like "DESCRIBE SELF" (by AndyS)
>>        1.1. could return a resolvable URI of the void:Dataset
>>        1.2. could return the URI of a named graph to query (works
>> with non-HTTP protocols)
>>     2. HTTP header, e.g. X-endpoint-description:
>> http://kasei.us/sparql?about=1
>>     3. new protocol operation: HTTP OPTIONS for returning the
>> description
>>     4. Named graph
>>        4.1. graph IRI retrieved with DESCRIBE SELF (see 1.2.)
>>        4.2. graph IRI == SPARQL endpoint URI
>>
>> Q2) Which metadata (w/o statistics) to include?
>>   1. Is there any problem with voiD or what would be missing in
>> voiD for SPARQL endpoints?
>>
>> Q3) Which statistics to include?
>>   1. simple counts for resources in total, per class / untyped
>>   2. number of documents in case of data dumps
>>   3. selectivities for properties (untyped and with given class)
>>   4. histograms for property values (untyped and with given class,
>> can be generated with [3])
>>   5. Is SCOVO sufficient?
>>      5.1. A bit verbose: dimensions should be simplified [6].
>>      5.2. Encoding histograms not trivial: SCOVO min/max only
>> useful for integers and nominal scales, RDFStats uses base64-encoded
>> literals
>>
>> Q4) Some metadata are SPARQL-endpoint specific and irrelevant for
>> datasets in general (i.e. collections/dumps/etc) - but common metadata
>> should be reused from voiD for maximum interoperability. We should
>> define a separate vocabulary for SPARQL endpoint descriptions, but
>> reuse common properties from voiD.
>>   Which metadata is SPARQL specific?
>>   a) full-text search...
>>
>> Since this may become a larger mindmap, it would be better to work it
>> out in a wiki. Any suggestions where to continue?
>>
>> Regards,
>> AndyL
>>
>>
>> [1] http://rdfs.org/ns/void-guide
>> [2] http://www.w3.org/2009/sparql/wiki/Feature:ServiceDescriptions
>> [3] http://rdfstats.sourceforge.net
>> [4] http://www.nabble.com/Free-text-search-and-SPARQL-New-Features-and-Rationale-draft-to24324606.html
>> [5] http://purl.org/NET/scovo#
>> [6] http://code.google.com/p/void-impl/issues/detail?id=18
>>


http://www.langegger.at
----------------------------------------------------------------------
Dipl.-Ing.(FH) Andreas Langegger
FAW - Institute for Application-oriented Knowledge Processing
Johannes Kepler University Linz
A-4040 Linz, Altenberger Straße 69
Received on Tuesday, 4 August 2009 20:40:24 UTC