Re: RDF dataset/SPARQL endpoint descriptions

Hi Andy,

Thanks for sharing the experience. Separating the service from the
dataset makes a lot of sense, I like that idea.

Cheers,
Benji


On 04.08.2009 22:39:38, Andreas Langegger wrote:
>## CONTINUED ##
>
>Hi Benji,
>
>argh, sorry, accidentally hit the send button before...
>
>On Jul 27, 2009, at 5:40 PM, Benjamin Nowack wrote:
>
>> Completing the endpoint description feature is still on my list for
>> ARC, what I've got so far WRT Q1 is that the endpoint returns the
>> description directly at its URL when the accept header indicates RDF
>> as preferred result format and when no query parameter is sent. It was
>> pretty easy to implement and didn't require the invention of a syntax
>> extension or magic queries/graphs. If the header checker ranks HTML
>> over RDF, the usual input form is served.
>
>Sounds reasonable, that's the HTTP header approach.
>
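For illustration, the dispatch rule described above (serve the description when no query parameter is sent and RDF outranks HTML in the Accept header, otherwise serve the input form) could be sketched roughly as follows. This is only a sketch in Python, not ARC's actual code; the function names and the two media-type sets are assumptions made up for the example.

```python
# Hypothetical media-type sets; a real endpoint may recognise more.
RDF_TYPES = {"application/rdf+xml", "text/turtle"}
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def parse_accept(header):
    """Parse an Accept header into (media_type, q) pairs, highest q first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # default quality when no q parameter is given
        for field in fields[1:]:
            name, _, value = field.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q))
    # sorted() is stable, so equal q values keep their header order
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def choose_response(accept_header, query_params):
    """Return 'query', 'description' or 'form' for an incoming request."""
    if "query" in query_params:
        return "query"  # a normal SPARQL request
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type in RDF_TYPES:
            return "description"
        if media_type in HTML_TYPES:
            return "form"
    return "form"  # fall back to the usual input form
```

A request with "Accept: text/html,application/rdf+xml;q=0.9" and no query parameter would get the form; with "Accept: application/rdf+xml" it would get the description.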
>> I'm not so sure about using a special DESCRIBE as a basis as it would
>> require squeezing the endpoint description feature into the query
>> processor instead of being able to directly catch the description
>> request at the endpoint level. It would be doable, though, and would
>> enable the non-HTTP access you mentioned. Messing with headers (as in
>> my approach) can be fragile.
>
>You're right, but I suggest discussing this now, while SPARQL/Query 1.1
>is in progress, and leaving implementation effort aside in order to
>find the best approach in terms of architectural style. I implemented
>DESCRIBE SELF and DESCRIBE DATASET in Joseki and will explain my
>findings. After some experiments, I came to the following conclusions:
>
>1. I advocate for a separation of service description and dataset  
>description:
>
>The service description includes information such as supported query  
>syntax, result formats, features such as fulltext search, entailment  
>regimes, etc. and won't change in the short run as it depends on the  
>implementation of the processor and SPARQL endpoint.
>
>The dataset description is only related to the data behind the  
>endpoint, independent of the access interface. It will change more  
>frequently, especially if it contains dc:subjects and detailed  
>statistics.
>
>2. Differences between endpoint/dataset description for SPARQL  
>endpoints and Linked Datasets:
>
>Although both use cases have much in common, differences exist in  
>terms of resource discovery and the kind of meta data. While  
>describing SPARQL endpoints is similar to describing Web services with  
>WSDL, describing Linked Datasets is similar to content discovery  
>(robots.txt => sitemap.xml => void:Dataset...). As explained in 1.)  
>SPARQL endpoints in addition have a technical service description,  
>while Linked Datasets only have dataset descriptions (e.g. in voiD).  
>Both use cases should be aligned though.
>
>3. Requirements that should be fulfilled:
>
>I.   It should be possible to fetch SPARQL endpoint meta data via the
>endpoint itself and not via some "well-known" HTTP URI such as
>/robots.txt etc.
>II.  It should be possible for clients to just look at the HTTP
>headers and status code and see if endpoint meta data have changed
>(Last-Modified). This is especially important when low-level
>statistics are provided and periodically fetched by clients (we will
>provide more details in future in the context of federation).
>III. It should be possible either to fetch the SPARQL endpoint and
>dataset descriptions completely via HTTP URI resolution or to query
>them via the SPARQL endpoint (e.g. by supporting sub-queries in the
>upcoming SPARQL/Query 1.1).
>
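Requirement II could be met on the client side with a plain conditional GET. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and it assumes the server honours If-Modified-Since and answers 304 when the description is unchanged.

```python
import urllib.error
import urllib.request
from email.utils import formatdate


def build_conditional_request(url, last_fetch_epoch):
    """Build a GET request that asks for the description only if it changed."""
    req = urllib.request.Request(url)
    req.add_header("Accept", "application/rdf+xml")
    # HTTP dates are always expressed in GMT
    req.add_header("If-Modified-Since", formatdate(last_fetch_epoch, usegmt=True))
    return req


def fetch_if_changed(url, last_fetch_epoch):
    """Return the new description bytes, or None if the server answers 304."""
    req = build_conditional_request(url, last_fetch_epoch)
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: keep the cached statistics
            return None
        raise
```

A federation client would call fetch_if_changed() periodically with the time of its last successful fetch and re-parse the statistics only when it gets bytes back.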
>4. My proposal (which I've implemented in Joseki/ARQ):
>
>--------------------
>"DESCRIBE SELF" => returns SPARQL endpoint service description, e.g.:
>
><?xml version="1.0"?>
><rdf:RDF ...namespace declarations....>
>  <sd:Service>
>    <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Joseki SPARQL Endpoint</rdfs:label>
>    <saddle:resultFormat rdf:parseType="Resource">
>      <saddle:spec rdf:resource="http://www.w3.org/TR/rdf-sparql-json-res/"/>
>      <saddle:mediaType rdf:datatype="http://www.w3.org/2001/XMLSchema#string">application/sparql-results+json</saddle:mediaType>
>      <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">SPARQL/JSON</rdfs:label>
>    </saddle:resultFormat>
>    <saddle:queryLanguage rdf:parseType="Resource">
>      <saddle:spec rdf:resource="http://www.w3.org/TR/rdf-sparql-query/"/>
>      <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">SPARQL</rdfs:label>
>    </saddle:queryLanguage>
>    <!-- LINK to the dataset-specific meta data in voiD -->
>    <saddle:dataSet rdf:resource="http://midearth:8900/void"/>
>...
>  </sd:Service>
></rdf:RDF>
>
>As explained in 1. only the service description is returned, but a  
>resolvable URI for the dataset description is included.
>
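A client could follow the link from the service description to the dataset description along these lines. This is a rough sketch with Python's standard XML parser rather than a real RDF library. The namespace URIs for saddle: and sd: are assumptions, since the example above elides the namespace declarations, and the trimmed SAMPLE document is made up for the demonstration.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SADDLE_NS = "http://www.w3.org/2005/03/saddle/#"  # assumed SADDLE namespace
SD_NS = "http://example.org/sd#"  # hypothetical placeholder for sd:

# Trimmed-down service description, shaped like the Joseki output above.
SAMPLE = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:saddle="http://www.w3.org/2005/03/saddle/#"
         xmlns:sd="http://example.org/sd#">
  <sd:Service>
    <saddle:resultFormat rdf:parseType="Resource">
      <saddle:mediaType>application/sparql-results+json</saddle:mediaType>
    </saddle:resultFormat>
    <saddle:dataSet rdf:resource="http://midearth:8900/void"/>
  </sd:Service>
</rdf:RDF>"""


def dataset_links(service_description_xml):
    """Return the URIs of all dataset descriptions linked from the service."""
    root = ET.fromstring(service_description_xml)
    links = []
    for elem in root.iter("{%s}dataSet" % SADDLE_NS):
        uri = elem.get("{%s}resource" % RDF_NS)
        if uri:
            links.append(uri)
    return links


def result_media_types(service_description_xml):
    """Return the result media types announced by the service."""
    root = ET.fromstring(service_description_xml)
    return [e.text.strip()
            for e in root.iter("{%s}mediaType" % SADDLE_NS) if e.text]
```

Matching on element names only works for this particular RDF/XML serialization; a robust client would parse the triples with an RDF toolkit instead.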
>--------------------
>"DESCRIBE DATASET" => returns data set description in voiD and  
>RDFStats, e.g.:
>
><?xml version="1.0"?>
><rdf:RDF
>     xmlns:scovo="http://purl.org/NET/scovo#"
>     xmlns:stats="http://rdfstats.sourceforge.net/vocab/rdfstats#"
>     xmlns:void="http://rdfs.org/ns/void#" ...other namespaces...>
>   <void:Dataset rdf:about="http://midearth:8900/void#">
>     <foaf:homepage rdf:resource="http://midearth:8900/"/>
>     <dcterms:description>A SemWIQ-powered Linked Data endpoint  
>providing RDF data via SPARQL at http://midearth:8900/sparql and  
>linked data at http://midearth:8900/resource/...</dcterms:description>
>     <dcterms:title>BSBM</dcterms:title>
>   </void:Dataset>
>   <stats:PropertyHistogram>
>     <rdf:value>ATKM/ 
>WrFOrYAAAAUaHR0cDovL3d3dy53My5vcmcvMjAwMS9YTUxTY2hlbWEjZGF0ZVRpbWUDAAAB
>Fy1IeYAAAAEaS0s/ 
>AAAAAAAAAAACAAAAAAAAAAQAAAAAAAAAAgAAAAAAAAABAAAAAAAAAAMAAAAA
>=
>AAAABAAAAAAAAAADAAAAAAAAAAQAAAAAAAAAAwAAAAAAAAAKAAAAAAAAAAYAAAAAAAAACQAAAAAA
>=
>AAAGAAAAAAAAAAYAAAAAAAAAAQAAAAAAAAABAAAAAAAAAAYAAAAAAAAABAAAAAAAAAAEAAAAAAAA
>AAEAAAAAAAAAOw==</rdf:value>
>     <scovo:dataset>
>       <stats:RDFStatsDataset rdf:nodeID="A0">
>         <dc:creator>dorgon@midearth</dc:creator>
>         <dc:date rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2009-08-04T20:28:58.084Z</dc:date>
>         <stats:sourceType rdf:resource="http://rdfstats.sourceforge.net/vocab/rdfstats#SPARQLEndpoint"/>
>         <stats:sourceUrl rdf:resource="http://midearth:8900/sparql"/>
>       </stats:RDFStatsDataset>
>     </scovo:dataset>
>     <stats:propertyDimension rdf:resource="http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/validFrom"/>
>     <stats:rangeDimension rdf:resource="http://www.w3.org/2001/XMLSchema#dateTime"/>
>   </stats:PropertyHistogram>
>...
>
>To query the meta data, I currently allow sources to be specified
>using FROM:
>PREFIX stats: <http://rdfstats.sourceforge.net/vocab/rdfstats#>
>SELECT * FROM <http://midearth:8900/void>
>WHERE { ?s a stats:RDFStatsDataset ; ?p ?o }
>
>In future this could be supported without an external FROM source  
>using sub queries such as
>SELECT * FROM { DESCRIBE DATASET } WHERE { ... }
>
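From a client's point of view, the FROM-based query above is just an ordinary SPARQL protocol request. Here is a small sketch of how the request URL could be put together; the helper name is hypothetical, and it assumes the endpoint accepts the query as a GET parameter named "query", as in the SPARQL protocol.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

STATS_NS = "http://rdfstats.sourceforge.net/vocab/rdfstats#"


def metadata_query_url(endpoint, metadata_graph):
    """Build a GET URL that queries the dataset description via FROM."""
    query = (
        "PREFIX stats: <%s>\n"
        "SELECT * FROM <%s> WHERE { ?s a stats:RDFStatsDataset ; ?p ?o }"
        % (STATS_NS, metadata_graph)
    )
    # urlencode() percent-encodes the query text for use in the URL
    return endpoint + "?" + urlencode({"query": query})


url = metadata_query_url("http://midearth:8900/sparql",
                         "http://midearth:8900/void")
# Decoding the parameter again gives back the original query text
decoded = parse_qs(urlsplit(url).query)["query"][0]
```

The sub-query form (FROM { DESCRIBE DATASET }) is not covered here since it is only a proposal at this point.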
>> In case of DESCRIBE, I think I'd personally prefer a "DESCRIBE <>"
>> instead of "DESCRIBE SELF" and then directly serve the endpoint
>> description, not some intermediate URI (but I see the potential
>> "too much data" problem with a single description).
>
>Dataset meta data can include statistics in order to support  
>estimation of expected result sizes and federation in future - we will  
>provide details soon. Thus, the result can be several kilobytes to  
>megabytes if the dataset is huge. The service description however will  
>be rather small in general.
>
>> I'm totally undecided re the actual data to serve, though. My immediate
>> preference/need is statistics, namespaces, extensions (like COUNT, LOAD
>> etc.) and resource types, but I miss a vocabulary that is optimized for
>> the given use case and which also feels stable and/or maintained.
>
>I also miss such a vocabulary. I would be happy to address this at a  
>VoCamp or at ISWC together with you, Michael Hausenblas, Jürgen  
>Umbrich, Axel Polleres, and others...
>
>Regards,
>Andy
>
>>> Hello,
>>> (CCed to public-lod; I suggest continuing the discussion on the DAWG
>>> list)
>>>
>>> I would like to kick off a general discussion regarding RDF dataset
>>> meta data (e.g. voiD [1]), SPARQL endpoint descriptions [2] and the
>>> possible integration of statistics (e.g. RDFStats [3]) to support
>>> Semantic Web middleware/applications. The general opinion regarding
>>> SPARQL extensions for the next REC seems to be staying as simple as
>>> possible (e.g. recent discussion regarding fulltext search [4]). I
>>> think, while the standard should remain simple (SPARQL-compliant
>>> implementations should be possible with little effort and small size),
>>> it would be very useful to provide at least extension points that can
>>> be further standardized by the community. For instance, full-text
>>> search capability, aggregates, initial bindings, etc. could be
>>> announced by the endpoint such as here:
>>> http://kasei.us/sparql?about=1
>>>
>>> Similarly, it should be possible to announce the availability of voiD
>>> metadata and statistics. While [2] is targeted to SPARQL endpoints
>>> only, including non-HTTP protocols such as ODBC (Virtuoso), voiD [1] is
>>> targeted to LOD datasets in general, which may be available in
>>> different forms (single RDF documents, data dumps, RDFa, and SPARQL
>>> endpoints). I think it is very important to find a best-practice
>>> solution which integrates voiD, future SPARQL endpoint descriptions,
>>> and a consensus on statistics (possibly based on SCOVO [5] - which
>>> should be improved, see [6]).
>>>
>>> Main questions include:
>>>
>>> Q1) How to provide/consume endpoint descriptions in general?
>>>   The authors of voiD [1] suggest back-linking from resources
>>> (documents, dumps, etc.) to a voiD dataset.
>>>   In case of a SPARQL endpoint they suggest discovery via
>>> sitemap.xml ([1] 5.2.)
>>>
>>>   Problem:
>>>     Only works via HTTP, only works for 1 endpoint per domain.
>>>   Sub-questions:
>>>     a) Should non-HTTP protocols be supported?
>>>     b) Should multiple SPARQL endpoints per domain be possible? -
>>> In my opinion it should.
>>>   Other suggestions based on [2]:
>>>     1. SPARQL extension, like "DESCRIBE SELF" (by AndyS)
>>>        1.1. could return a resolvable URI of the void:Dataset
>>>        1.2. could return the URI of a named graph to query (works
>>> with non-HTTP protocols)
>>>     2. HTTP header, e.g. X-endpoint-description:
>>> http://kasei.us/sparql?about=1
>>>     3. new protocol operation: HTTP OPTIONS for returning the
>>> description
>>>     4. Named graph
>>>        4.1. graph IRI retrieved with DESCRIBE SELF (see 1.2.)
>>>        4.2. graph IRI == SPARQL endpoint URI
>>>
>>> Q2) Which metadata (w/o statistics) to include?
>>>   1. Is there any problem with voiD or what would be missing in
>>> voiD for SPARQL endpoints?
>>>
>>> Q3) Which statistics to include?
>>>   1. simple counts for resources in total, per class / untyped
>>>   2. number of documents in case of data dumps
>>>   3. selectivities for properties (untyped and with given class)
>>>   4. histograms for property values (untyped and with given class,
>>> can be generated with [3])
>>>   5. Is SCOVO sufficient?
>>>      5.1. A bit verbose: dimensions should be simplified [6].
>>>      5.2. Encoding histograms not trivial: SCOVO min/max only
>>> useful for integers and nominal scales, RDFStats uses base64-encoded
>>> literals
>>>
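The simple counts under Q3, item 1, are cheap to compute. The sketch below works on plain (subject, predicate, object) tuples and assumes, as a simplification for the example, that "resources" means distinct subjects; the function name and the result layout are made up and not taken from RDFStats or voiD.

```python
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def resource_counts(triples):
    """Count distinct subjects in total, per class, and without any type."""
    subjects = set()
    typed = set()
    seen_typings = set()
    per_class = Counter()
    for s, p, o in triples:
        subjects.add(s)
        # Count each (subject, class) pair once, even if it is repeated
        if p == RDF_TYPE and (s, o) not in seen_typings:
            seen_typings.add((s, o))
            typed.add(s)
            per_class[o] += 1
    return {
        "total": len(subjects),
        "untyped": len(subjects - typed),
        "per_class": dict(per_class),
    }


example = resource_counts([
    ("a", RDF_TYPE, "C1"),
    ("a", "p", "x"),
    ("b", RDF_TYPE, "C1"),
    ("b", RDF_TYPE, "C2"),
    ("c", "p", "y"),
])
# example == {"total": 3, "untyped": 1, "per_class": {"C1": 2, "C2": 1}}
```

Selectivities and histograms (items 3 and 4) need more than this and are not covered here.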
>>> Q4) Some metadata are SPARQL-endpoint specific and irrelevant for
>>> datasets in general (i.e. collections/dumps/etc) - but common metadata
>>> should be reused from voiD for maximum interoperability. We should
>>> define a separate vocabulary for SPARQL endpoint descriptions, but
>>> re-use common properties from voiD.
>>>   Which metadata is SPARQL specific?
>>>   a) full-text search...
>>>
>>> Since this may become a larger mindmap, it would be better to work it
>>> out in a wiki. Any suggestions where to continue?
>>>
>>> Regards,
>>> AndyL
>>>
>>>
>>> [1] http://rdfs.org/ns/void-guide
>>> [2] http://www.w3.org/2009/sparql/wiki/Feature:ServiceDescriptions
>>> [3] http://rdfstats.sourceforge.net
>>> [4] http://www.nabble.com/Free-text-search-and-SPARQL-New-Features-and-Rationale-draft-to24324606.html
>>> [5] http://purl.org/NET/scovo#
>>> [6] http://code.google.com/p/void-impl/issues/detail?id=18
>>>
>>> http://www.langegger.at
>>> ----------------------------------------------------------------------
>>> Dipl.-Ing.(FH) Andreas Langegger
>>> FAW - Institute for Application-oriented Knowledge Processing
>>> Johannes Kepler University Linz
>>> A-4040 Linz, Altenberger Straße 69
>>
>
>
>http://www.langegger.at
>----------------------------------------------------------------------
>Dipl.-Ing.(FH) Andreas Langegger
>FAW - Institute for Application-oriented Knowledge Processing
>Johannes Kepler University Linz
>A-4040 Linz, Altenberger Straße 69

Received on Wednesday, 5 August 2009 11:27:23 UTC