- From: Benjamin Nowack <bnowack@semsol.com>
- Date: Wed, 5 Aug 2009 13:26:43 +0200
- To: Andreas Langegger <al@jku.at>
- Cc: public-rdf-dawg-comments@w3.org
Hi Andy,
Thanks for sharing the experience. Separating the service from the
dataset makes a lot of sense, I like that idea.
Cheers,
Benji
On 04.08.2009 22:39:38, Andreas Langegger wrote:
>## CONTINUED ##
>
>Hi Benji,
>
>argh, sorry, accidentally hit the send button before...
>
>On Jul 27, 2009, at 5:40 PM, Benjamin Nowack wrote:
>
>> Completing the endpoint description feature is still on my list for
>> ARC. What I've got so far WRT Q1 is that the endpoint returns the
>> description directly at its URL when the accept header indicates RDF
>> as preferred result format and when no query parameter is sent. It was
>> pretty easy to implement and didn't require the invention of a syntax
>> extension or magic queries/graphs. If the header checker ranks HTML
>> over RDF, the usual input form is served.
>
>Sounds reasonable, that's the HTTP header approach.
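A minimal sketch of the dispatch described in the quoted paragraph, not ARC's actual code. The function names, the set of media types treated as RDF, and the handling of ties (the form wins) are assumptions; wildcards such as */* are ignored for brevity.

```python
# Assumption: which media types count as "RDF" and which as "HTML".
RDF_TYPES = {"application/rdf+xml", "text/turtle"}
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def parse_accept(header):
    """Return a list of (media_type, q) pairs from an Accept header value."""
    prefs = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # unparsable q value: treat the type as not acceptable
        prefs.append((media_type, q))
    return prefs


def best_q(prefs, types):
    """Highest q value the client gave to any media type in `types`."""
    return max((q for t, q in prefs if t in types), default=0.0)


def choose_response(accept_header, has_query):
    """Return 'query', 'description' or 'form' for a request to the endpoint URL."""
    if has_query:
        return "query"  # a query parameter was sent: run it as usual
    prefs = parse_accept(accept_header or "")
    rdf_q = best_q(prefs, RDF_TYPES)
    html_q = best_q(prefs, HTML_TYPES)
    if rdf_q > 0 and rdf_q > html_q:
        return "description"  # RDF preferred: serve the endpoint description
    return "form"  # HTML ranked at least as high: serve the input form
```

With this, choose_response("text/html,application/rdf+xml;q=0.5", False) yields "form", while a plain "application/rdf+xml" Accept header without a query yields "description".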
>
>> I'm not so sure about using a special DESCRIBE as a basis as it would
>> require squeezing the endpoint description feature into the query
>> processor instead of being able to directly catch the description
>> request at the endpoint level. It would be doable, though, and would
>> enable the non-HTTP access you mentioned. Messing with headers (as in
>> my approach) can be fragile.
>
>You're right, but I suggest discussing this now, while SPARQL/Query 1.1
>is in progress, without taking implementation effort into account, in
>order to find the best approach in terms of architectural style. I
>implemented DESCRIBE SELF and DESCRIBE DATASET in Joseki and will
>explain my findings. After some experiments, I came to the following
>conclusions:
>
>1. I advocate for a separation of service description and dataset
>description:
>
>The service description includes information such as supported query
>syntax, result formats, features such as fulltext search, entailment
>regimes, etc. and won't change in the short run as it depends on the
>implementation of the processor and SPARQL endpoint.
>
>The dataset description is only related to the data behind the
>endpoint, independent of the access interface. It will change more
>frequently, especially if it contains dc:subjects and detailed
>statistics.
>
>2. Differences between endpoint/dataset description for SPARQL
>endpoints and Linked Datasets:
>
>Although both use cases have much in common, differences exist in
>terms of resource discovery and the kind of meta data. While
>describing SPARQL endpoints is similar to describing Web services with
>WSDL, describing Linked Datasets is similar to content discovery
>(robots.txt => sitemap.xml => void:Dataset...). As explained in 1.,
>SPARQL endpoints additionally have a technical service description,
>while Linked Datasets only have dataset descriptions (e.g. in voiD).
>The two use cases should be aligned, though.
>
>3. Requirements that should be fulfilled:
>
>I. It should be possible to fetch SPARQL endpoint meta data via the
>endpoint itself and not via some "well-known" HTTP URI such as /
>robots.txt etc.
>II. It should be possible for clients to just look at the HTTP
>headers and status code and see whether the endpoint meta data have
>changed (Last-Modified). This is especially important when low-level
>statistics are provided and periodically fetched by clients (we will
>provide more details in the future in the context of federation).
>III. It should be possible to either completely fetch SPARQL endpoint
>and dataset description based on HTTP URI resolution or query it via
>the SPARQL endpoint (e.g. by supporting sub-queries in upcoming SPARQL/
>Query 1.1.)
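Requirement II could be checked on the client side roughly as follows. This is only a sketch: the function name is made up, the headers are assumed to arrive as a plain dict with the exact key "Last-Modified", and last_seen is assumed to be a timezone-aware datetime remembered from the previous fetch.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def metadata_changed(status, headers, last_seen):
    """Decide from status code and headers alone whether to re-fetch the meta data."""
    if status == 304:
        return False  # Not Modified: the server says nothing changed
    value = headers.get("Last-Modified")
    if value is None or last_seen is None:
        return True  # no basis for comparison: assume the description changed
    try:
        modified = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return True  # unparsable date: be conservative and re-fetch
    if modified.tzinfo is None:
        # Dates without a usable zone are treated as UTC (assumption).
        modified = modified.replace(tzinfo=timezone.utc)
    return modified > last_seen
```

A client polling statistics would issue a HEAD or conditional GET, pass the status and headers to this function, and only download the (possibly large) body when it returns True.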
>
>4. My proposal (which I've implemented in Joseki/ARQ):
>
>--------------------
>"DESCRIBE SELF" => returns SPARQL endpoint service description, e.g.:
>
><?xml version="1.0"?>
><rdf:RDF ...namespace declarations....>
> <sd:Service>
> <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Joseki SPARQL Endpoint</rdfs:label>
> <saddle:resultFormat rdf:parseType="Resource">
> <saddle:spec rdf:resource="http://www.w3.org/TR/rdf-sparql-json-res/"/>
> <saddle:mediaType rdf:datatype="http://www.w3.org/2001/XMLSchema#string">application/sparql-results+json</saddle:mediaType>
> <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">SPARQL/JSON</rdfs:label>
> </saddle:resultFormat>
> <saddle:queryLanguage rdf:parseType="Resource">
> <saddle:spec rdf:resource="http://www.w3.org/TR/rdf-sparql-query/"/>
> <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">SPARQL</rdfs:label>
> </saddle:queryLanguage>
> <!-- LINK to the dataset-specific meta data in voiD -->
> <saddle:dataSet rdf:resource="http://midearth:8900/void"/>
>...
> </sd:Service>
></rdf:RDF>
>
>As explained in 1. only the service description is returned, but a
>resolvable URI for the dataset description is included.
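A client could follow that link along these lines. This is a hedged sketch: the sd: and saddle: namespace URIs in the sample are placeholders (the declarations are elided in the example above), so the code matches the dataSet element by local name only; dataset_links is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Placeholder namespace URIs: the real sd: and saddle: declarations are
# not shown in the mail, so example.org stands in for them here.
SAMPLE = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:sd="http://example.org/sd#"
         xmlns:saddle="http://example.org/saddle#">
 <sd:Service>
  <saddle:dataSet rdf:resource="http://midearth:8900/void"/>
 </sd:Service>
</rdf:RDF>"""


def dataset_links(rdf_xml):
    """Collect the rdf:resource URIs of all dataSet elements in a service description."""
    root = ET.fromstring(rdf_xml)
    links = []
    for elem in root.iter():
        # Tags look like "{namespace}local"; compare the local part only.
        local = elem.tag.rsplit("}", 1)[-1]
        if local == "dataSet":
            resource = elem.get("{%s}resource" % RDF_NS)
            if resource:
                links.append(resource)
    return links
```

The returned URI is what the client would then resolve to obtain the dataset description. A real client would more likely use an RDF parser than raw XML matching; the sketch only shows the two-step lookup.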
>
>--------------------
>"DESCRIBE DATASET" => returns data set description in voiD and
>RDFStats, e.g.:
>
><?xml version="1.0"?>
><rdf:RDF
> xmlns:scovo="http://purl.org/NET/scovo#"
> xmlns:stats="http://rdfstats.sourceforge.net/vocab/rdfstats#"
> xmlns:void="http://rdfs.org/ns/void#" ...other namespaces...>
> <void:Dataset rdf:about="http://midearth:8900/void#">
> <foaf:homepage rdf:resource="http://midearth:8900/"/>
> <dcterms:description>A SemWIQ-powered Linked Data endpoint
>providing RDF data via SPARQL at http://midearth:8900/sparql and
>linked data at http://midearth:8900/resource/...</dcterms:description>
> <dcterms:title>BSBM</dcterms:title>
> </void:Dataset>
> <stats:PropertyHistogram>
> <rdf:value>ATKM/WrFOrYAAAAUaHR0cDovL3d3dy53My5vcmcvMjAwMS9YTUxTY2hlbWEjZGF0ZVRpbWUDAAAB
>Fy1IeYAAAAEaS0s/AAAAAAAAAAACAAAAAAAAAAQAAAAAAAAAAgAAAAAAAAABAAAAAAAAAAMAAAAA
>AAAABAAAAAAAAAADAAAAAAAAAAQAAAAAAAAAAwAAAAAAAAAKAAAAAAAAAAYAAAAAAAAACQAAAAAA
>AAAGAAAAAAAAAAYAAAAAAAAAAQAAAAAAAAABAAAAAAAAAAYAAAAAAAAABAAAAAAAAAAEAAAAAAAA
>AAEAAAAAAAAAOw==</rdf:value>
> <scovo:dataset>
> <stats:RDFStatsDataset rdf:nodeID="A0">
> <dc:creator>dorgon@midearth</dc:creator>
> <dc:date rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2009-08-04T20:28:58.084Z</dc:date>
> <stats:sourceType rdf:resource="http://rdfstats.sourceforge.net/vocab/rdfstats#SPARQLEndpoint"/>
> <stats:sourceUrl rdf:resource="http://midearth:8900/sparql"/>
> </stats:RDFStatsDataset>
> </scovo:dataset>
> <stats:propertyDimension rdf:resource="http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/validFrom"/>
> <stats:rangeDimension rdf:resource="http://www.w3.org/2001/XMLSchema#dateTime"/>
> </stats:PropertyHistogram>
>...
>
>In order to query the meta data, I currently allow sources to be
>specified using FROM:
>SELECT * FROM <http://midearth:8900/void> WHERE { ?s a
>stats:RDFStatsDataset ; ?p ?o }
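For illustration, a client might send that query over HTTP GET roughly like this. It is a sketch under assumptions: the helper name is made up, the endpoint URL is assumed to carry no query string of its own, and the stats: prefix is bound to the RDFStats namespace shown in the DESCRIBE DATASET example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Namespace taken from the xmlns:stats declaration in the example above.
STATS_NS = "http://rdfstats.sourceforge.net/vocab/rdfstats#"


def metadata_query_url(endpoint, void_uri):
    """Build a GET URL that asks `endpoint` to query the graph at `void_uri`."""
    query = (
        "PREFIX stats: <%s>\n"
        "SELECT * FROM <%s> WHERE { ?s a stats:RDFStatsDataset ; ?p ?o }"
        % (STATS_NS, void_uri)
    )
    # The SPARQL protocol passes the query text in the "query" parameter.
    return endpoint + "?" + urlencode({"query": query})


url = metadata_query_url("http://midearth:8900/sparql",
                         "http://midearth:8900/void")
```

The resulting URL can be fetched with any HTTP client; whether the endpoint is willing to load an external FROM source is of course up to its configuration.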
>
>In future this could be supported without an external FROM source
>using sub queries such as
>SELECT * FROM { DESCRIBE DATASET } WHERE { ... }
>
>> In case of DESCRIBE, I think I'd personally prefer a "DESCRIBE <>"
>> instead of "DESCRIBE SELF" and then directly serve the endpoint
>> description, not some intermediate URI (but I see the potential
>> "too much data" problem with a single description).
>
>Dataset meta data can include statistics in order to support
>estimation of expected result sizes and federation in future - we will
>provide details soon. Thus, the result can be several kilobytes to
>megabytes if the dataset is huge. The service description however will
>be rather small in general.
>
>> I'm totally undecided re the actual data to serve, though. My immediate
>> preference/need is statistics, namespaces, extensions (like COUNT,
>> LOAD etc.) and resource types, but I miss a vocabulary that is optimized
>> for the given use case and which also feels stable and/or maintained.
>
>I also miss such a vocabulary. I would be happy to address this at a
>VoCamp or at ISWC together with you, Michael Hausenblas, Jürgen
>Umbrich, Axel Polleres, and others...
>
>Regards,
>Andy
>
>>> Hello,
>>> (CCed to public-lod, I suggest continuing the discussion on the DAWG
>>> list)
>>>
>>> I would like to kick off a general discussion regarding RDF dataset
>>> meta data (e.g. voiD [1]), SPARQL endpoint descriptions [2] and the
>>> possible integration of statistics (e.g. RDFStats [3]) to support
>>> Semantic Web middleware/applications. The general opinion regarding
>>> SPARQL extensions for the next REC seems to be staying as simple as
>>> possible (e.g. recent discussion regarding fulltext search [4]). I
>>> think, while the standard should remain simple (SPARQL-compliant
>>> implementations should be possible with little effort and small size),
>>> it would be very useful to provide at least extension points that can
>>> be further standardized by the community. For instance, full-text
>>> search capability, aggregates, initial bindings, etc. could be
>>> announced by the endpoint such as here:
>>> http://kasei.us/sparql?about=1
>>>
>>> Similarly, it should be possible to announce the availability of voiD
>>> metadata and statistics. While [2] targets SPARQL endpoints only,
>>> including non-HTTP protocols such as ODBC (Virtuoso), voiD [1]
>>> targets LOD datasets in general, which may be available in
>>> different forms (single RDF documents, data dumps, RDFa, and SPARQL
>>> endpoints). I think it is very important to find a best-practice
>>> solution which integrates voiD, future SPARQL endpoint descriptions,
>>> and a consensus on statistics (possibly based on SCOVO [5] - which
>>> should be improved, see [6]).
>>>
>>> Main questions include:
>>>
>>> Q1) How to provide/consume endpoint descriptions in general?
>>> The authors of voiD [1] suggest back-linking from resources
>>> (documents, dumps, etc.) to a voiD dataset.
>>> In case of a SPARQL endpoint they suggest discovery via
>>> sitemap.xml ([1] 5.2.)
>>>
>>> Problem:
>>> Only works via HTTP, only works for 1 endpoint per domain.
>>> Sub-questions:
>>> a) Should non-HTTP protocols be supported?
>>> b) Should multiple SPARQL endpoints per domain be possible? -
>>> In my opinion it should.
>>> Other suggestions based on [2]:
>>> 1. SPARQL extension, like "DESCRIBE SELF" (by AndyS)
>>> 1.1. could return a resolvable URI of the void:Dataset
>>> 1.2. could return the URI of a named graph to query (works
>>> with non-HTTP protocols)
>>> 2. HTTP header, e.g. X-endpoint-description:
>>> http://kasei.us/sparql?about=1
>>> 3. new protocol operation: HTTP OPTIONS for returning the
>>> description
>>> 4. Named graph
>>> 4.1. graph IRI retrieved with DESCRIBE SELF (see 1.2.)
>>> 4.2. graph IRI == SPARQL endpoint URI
>>>
>>> Q2) Which metadata (w/o statistics) to include?
>>> 1. Is there any problem with voiD or what would be missing in
>>> voiD for SPARQL endpoints?
>>>
>>> Q3) Which statistics to include?
>>> 1. simple counts for resources in total, per class / untyped
>>> 2. number of documents in case of data dumps
>>> 3. selectivities for properties (untyped and with given class)
>>> 4. histograms for property values (untyped and with given class,
>>> can be generated with [3])
>>> 5. Is SCOVO sufficient?
>>> 5.1. A bit verbose: dimensions should be simplified [6].
>>> 5.2. Encoding histograms not trivial: SCOVO min/max only
>>> useful for integers and nominal scales, RDFStats uses base64-encoded
>>> literals
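To make item 1 of Q3 concrete, simple counts could be derived from a list of triples as sketched below. The assumptions are mine: "resources" is read as distinct subjects, triples are plain (s, p, o) string tuples, and the function name is hypothetical; this is not how RDFStats computes its figures.

```python
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def simple_counts(triples):
    """Count distinct subjects in total, per rdf:type class, and untyped."""
    subjects = set()
    types = {}  # subject -> set of classes it is typed with
    for s, p, o in triples:
        subjects.add(s)
        if p == RDF_TYPE:
            types.setdefault(s, set()).add(o)
    per_class = Counter()
    for classes in types.values():
        per_class.update(classes)  # a subject counts once per class
    return {
        "total": len(subjects),
        "per_class": dict(per_class),
        "untyped": len(subjects) - len(types),
    }


example = [
    ("a", RDF_TYPE, "C"),
    ("a", "p", "x"),
    ("b", RDF_TYPE, "C"),
    ("b", RDF_TYPE, "D"),
    ("c", "p", "y"),
]
counts = simple_counts(example)
```

For the example data this gives a total of 3 subjects, 2 instances of C, 1 of D, and 1 untyped subject.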
>>>
>>> Q4) Some metadata are SPARQL-endpoint specific and irrelevant for
>>> datasets in general (i.e. collections/dumps/etc) - but common metadata
>>> should be reused from voiD for maximum interoperability. We should
>>> define a separate vocabulary for SPARQL endpoint descriptions, but
>>> re-use common properties from voiD.
>>> Which metadata is SPARQL specific?
>>> a) full-text search...
>>>
>>> Since this may become a larger mindmap, it would be better to work it
>>> out in a wiki. Any suggestions where to continue?
>>>
>>> Regards,
>>> AndyL
>>>
>>>
>>> [1] http://rdfs.org/ns/void-guide
>>> [2] http://www.w3.org/2009/sparql/wiki/Feature:ServiceDescriptions
>>> [3] http://rdfstats.sourceforge.net
>>> [4] http://www.nabble.com/Free-text-search-and-SPARQL-New-Features-and-Rationale-draft-to24324606.html
>>> [5] http://purl.org/NET/scovo#
>>> [6] http://code.google.com/p/void-impl/issues/detail?id=18
>>>
>>> http://www.langegger.at
>>> ----------------------------------------------------------------------
>>> Dipl.-Ing.(FH) Andreas Langegger
>>> FAW - Institute for Application-oriented Knowledge Processing
>>> Johannes Kepler University Linz
>>> A-4040 Linz, Altenberger Straße 69
>>
>
>
Received on Wednesday, 5 August 2009 11:27:23 UTC