- From: Benjamin Nowack <bnowack@semsol.com>
- Date: Wed, 5 Aug 2009 13:26:43 +0200
- To: Andreas Langegger <al@jku.at>
- Cc: public-rdf-dawg-comments@w3.org
Hi Andy,

Thanks for sharing the experience. Separating the service from the
dataset makes a lot of sense; I like that idea.

Cheers,
Benji

On 04.08.2009 22:39:38, Andreas Langegger wrote:
>## CONTINUED ##
>
>Hi Benji,
>
>argh, sorry, accidentally hit the send button before...
>
>On Jul 27, 2009, at 5:40 PM, Benjamin Nowack wrote:
>
>> Completing the endpoint description feature is still on my list for
>> ARC. What I've got so far WRT Q1 is that the endpoint returns the
>> description directly at its URL when the accept header indicates RDF
>> as preferred result format and when no query parameter is sent. It was
>> pretty easy to implement and didn't require the invention of a syntax
>> extension or magic queries/graphs. If the header checker ranks HTML
>> over RDF, the usual input form is served.
>
>Sounds reasonable, that's the HTTP header approach.
>
>> I'm not so sure about using a special DESCRIBE as a basis as it would
>> require squeezing the endpoint description feature into the query
>> processor instead of being able to directly catch the description
>> request at the endpoint level. It would be doable, though, and would
>> enable the non-HTTP access you mentioned. Messing with headers (as in
>> my approach) can be fragile.
>
>You're right, but I suggest discussing this now, while SPARQL/Query 1.1
>is in progress, without taking implementation effort into account, in
>order to find the best approach in terms of architectural style. I
>implemented DESCRIBE SELF and DESCRIBE DATASET in Joseki and will
>explain my findings. After some experiments, I came to the following
>conclusions:
>
>1. I advocate for a separation of service description and dataset
>description:
>
>The service description includes information such as supported query
>syntax, result formats, features such as fulltext search, entailment
>regimes, etc. and won't change in the short run as it depends on the
>implementation of the processor and SPARQL endpoint.
>
>The dataset description is only related to the data behind the
>endpoint, independent of the access interface. It will change more
>frequently, especially if it contains dc:subjects and detailed
>statistics.
>
>2. Differences between endpoint/dataset description for SPARQL
>endpoints and Linked Datasets:
>
>Although both use cases have much in common, differences exist in
>terms of resource discovery and the kind of meta data. While
>describing SPARQL endpoints is similar to describing Web services with
>WSDL, describing Linked Datasets is similar to content discovery
>(robots.txt => sitemap.xml => void:Dataset...). As explained in 1.,
>SPARQL endpoints in addition have a technical service description,
>while Linked Datasets only have dataset descriptions (e.g. in voiD).
>Both use cases should be aligned though.
>
>3. Requirements that should be fulfilled:
>
>I. It should be possible to fetch SPARQL endpoint meta data via the
>endpoint itself and not via some "well-known" HTTP URI such as
>/robots.txt etc.
>II. It should be possible for clients to just look at the HTTP
>headers and status code and see if endpoint meta data have changed
>(Last-Modified). This is especially important when low-level
>statistics are provided and periodically fetched by clients (we will
>provide more details in future in the context of federation).
>III. It should be possible to either completely fetch SPARQL endpoint
>and dataset description based on HTTP URI resolution or query it via
>the SPARQL endpoint (e.g. by supporting sub-queries in upcoming
>SPARQL/Query 1.1.)
>
>4. My proposal (which I've implemented in Joseki/ARQ):
>
>--------------------
>"DESCRIBE SELF" => returns SPARQL endpoint service description, e.g.:
>
><?xml version="1.0"?>
><rdf:RDF ...namespace declarations....>
>  <sd:Service>
>    <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Joseki SPARQL Endpoint</rdfs:label>
>    <saddle:resultFormat rdf:parseType="Resource">
>      <saddle:spec rdf:resource="http://www.w3.org/TR/rdf-sparql-json-res/"/>
>      <saddle:mediaType rdf:datatype="http://www.w3.org/2001/XMLSchema#string">application/sparql-results+json</saddle:mediaType>
>      <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">SPARQL/JSON</rdfs:label>
>    </saddle:resultFormat>
>    <saddle:queryLanguage rdf:parseType="Resource">
>      <saddle:spec rdf:resource="http://www.w3.org/TR/rdf-sparql-query/"/>
>      <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">SPARQL</rdfs:label>
>    </saddle:queryLanguage>
>    <!-- LINK to the dataset-specific meta data in voiD -->
>    <saddle:dataSet rdf:resource="http://midearth:8900/void"/>
>    ...
>  </sd:Service>
></rdf:RDF>
>
>As explained in 1. only the service description is returned, but a
>resolvable URI for the dataset description is included.
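
To make the two-step lookup concrete (fetch the service description, then follow its link to the dataset description), here is a minimal Python sketch. It is only an illustration, not how Joseki does it: the sd: and saddle: namespace URIs are hypothetical placeholders, because the quoted example elides its namespace declarations, and the document is read as plain XML rather than with a real RDF parser.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Hypothetical placeholder namespaces: the quoted example elides its
# namespace declarations, so these URIs are assumptions for illustration.
SD_NS = "http://example.org/sd#"
SADDLE_NS = "http://example.org/saddle#"

# Trimmed-down version of the quoted "DESCRIBE SELF" result.
SERVICE_DESCRIPTION = f"""<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:sd="{SD_NS}" xmlns:saddle="{SADDLE_NS}">
  <sd:Service>
    <saddle:dataSet rdf:resource="http://midearth:8900/void"/>
  </sd:Service>
</rdf:RDF>"""


def dataset_links(rdf_xml: str) -> list[str]:
    """Return the rdf:resource URIs of all saddle:dataSet elements."""
    root = ET.fromstring(rdf_xml)
    links = []
    for element in root.iter(f"{{{SADDLE_NS}}}dataSet"):
        uri = element.get(f"{{{RDF_NS}}}resource")
        if uri is not None:
            links.append(uri)
    return links


print(dataset_links(SERVICE_DESCRIPTION))
```

A client would then dereference each returned URI to obtain the (possibly much larger) dataset description separately.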
>
>--------------------
>"DESCRIBE DATASET" => returns data set description in voiD and
>RDFStats, e.g.:
>
><?xml version="1.0"?>
><rdf:RDF
>    xmlns:scovo="http://purl.org/NET/scovo#"
>    xmlns:stats="http://rdfstats.sourceforge.net/vocab/rdfstats#"
>    xmlns:void="http://rdfs.org/ns/void#" ...other namespaces...>
>  <void:Dataset rdf:about="http://midearth:8900/void#">
>    <foaf:homepage rdf:resource="http://midearth:8900/"/>
>    <dcterms:description>A SemWIQ-powered Linked Data endpoint
>providing RDF data via SPARQL at http://midearth:8900/sparql and
>linked data at http://midearth:8900/resource/...</dcterms:description>
>    <dcterms:title>BSBM</dcterms:title>
>  </void:Dataset>
>  <stats:PropertyHistogram>
>    <rdf:value>ATKM/
>WrFOrYAAAAUaHR0cDovL3d3dy53My5vcmcvMjAwMS9YTUxTY2hlbWEjZGF0ZVRpbWUDAAAB
>Fy1IeYAAAAEaS0s/
>AAAAAAAAAAACAAAAAAAAAAQAAAAAAAAAAgAAAAAAAAABAAAAAAAAAAMAAAAA
>AAAABAAAAAAAAAADAAAAAAAAAAQAAAAAAAAAAwAAAAAAAAAKAAAAAAAAAAYAAAAAAAAACQAAAAAA
>AAAGAAAAAAAAAAYAAAAAAAAAAQAAAAAAAAABAAAAAAAAAAYAAAAAAAAABAAAAAAAAAAEAAAAAAAA
>AAEAAAAAAAAAOw==</rdf:value>
>    <scovo:dataset>
>      <stats:RDFStatsDataset rdf:nodeID="A0">
>        <dc:creator>dorgon@midearth</dc:creator>
>        <dc:date rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2009-08-04T20:28:58.084Z</dc:date>
>        <stats:sourceType rdf:resource="http://rdfstats.sourceforge.net/vocab/rdfstats#SPARQLEndpoint"/>
>        <stats:sourceUrl rdf:resource="http://midearth:8900/sparql"/>
>      </stats:RDFStatsDataset>
>    </scovo:dataset>
>    <stats:propertyDimension rdf:resource="http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/validFrom"/>
>    <stats:rangeDimension rdf:resource="http://www.w3.org/2001/XMLSchema#dateTime"/>
>  </stats:PropertyHistogram>
>...
>
>In order to query the meta data I currently allow sources to be
>specified using FROM:
>SELECT * FROM <http://midearth:8900/void> WHERE { ?s a stats:RDFStatsDataset ; ?p ?o }
>
>In future this could be supported without an external FROM source
>using sub queries such as
>SELECT * FROM { DESCRIBE DATASET } WHERE { ... }
>
>> In case of DESCRIBE, I think I'd personally prefer a "DESCRIBE <>"
>> instead of "DESCRIBE SELF" and then directly serve the endpoint
>> description, not some intermediate URI (but I see the potential
>> "too much data" problem with a single description).
>
>Dataset meta data can include statistics in order to support
>estimation of expected result sizes and federation in future - we will
>provide details soon. Thus, the result can be several kilobytes to
>megabytes if the dataset is huge. The service description however will
>be rather small in general.
>
>> I'm totally undecided re the actual data to serve, though. My immediate
>> preference/need is statistics, namespaces, extensions (like COUNT, LOAD
>> etc.) and resource types, but I miss a vocabulary that is optimized for
>> the given use case and which also feels stable and/or maintained.
>
>I also miss such a vocabulary. I would be happy to address this at a
>VoCamp or at ISWC together with you, Michael Hausenblas, Jürgen
>Umbrich, Axel Polleres, and others...
>
>Regards,
>Andy
>
>>> Hello,
>>> (CCed to public-lod, I suggest continuing the discussion on the DAWG
>>> list)
>>>
>>> I would like to kick off a general discussion regarding RDF dataset
>>> meta data (e.g. voiD [1]), SPARQL endpoint descriptions [2] and the
>>> possible integration of statistics (e.g. RDFStats [3]) to support
>>> Semantic Web middleware/applications. The general opinion regarding
>>> SPARQL extensions for the next REC seems to be staying as simple as
>>> possible (e.g. recent discussion regarding fulltext search [4]). I
>>> think, while the standard should remain simple (SPARQL-compliant
>>> implementations should be possible with little effort and small size),
>>> it would be very useful to provide at least extension points that can
>>> be further standardized by the community. For instance, full-text
>>> search capability, aggregates, initial bindings, etc. could be
>>> announced by the endpoint such as here:
>>> http://kasei.us/sparql?about=1
>>>
>>> Similarly, it should be possible to announce the availability of voiD
>>> metadata and statistics. While [2] is targeted to SPARQL endpoints
>>> only, including non-HTTP protocols such as ODBC (Virtuoso), voiD [1] is
>>> targeted to LOD datasets in general which may be available in
>>> different forms (single RDF documents, data dumps, RDFa, and SPARQL
>>> endpoints). I think it is very important to find a best-practice
>>> solution which integrates voiD, future SPARQL endpoint descriptions,
>>> and a consensus on statistics (possibly based on SCOVO [5] - which
>>> should be improved, see [6]).
>>>
>>> Main questions include:
>>>
>>> Q1) How to provide/consume endpoint descriptions in general?
>>>     The authors of voiD [1] suggest back-linking from resources
>>>     (documents, dumps, etc.) to a voiD dataset.
>>>     In case of a SPARQL endpoint they suggest discovery via
>>>     sitemap.xml ([1] 5.2.)
>>>
>>>     Problem:
>>>       Only works via HTTP, only works for 1 endpoint per domain.
>>>     Sub-questions:
>>>       a) Should non-HTTP protocols be supported?
>>>       b) Should multiple SPARQL endpoints per domain be possible? -
>>>          In my opinion it should.
>>>     Other suggestions based on [2]:
>>>       1. SPARQL extension, like "DESCRIBE SELF" (by AndyS)
>>>         1.1. could return a resolvable URI of the void:Dataset
>>>         1.2. could return the URI of a named graph to query (works
>>>              with non-HTTP protocols)
>>>       2. HTTP header, e.g. X-endpoint-description:
>>>          http://kasei.us/sparql?about=1
>>>       3. new protocol operation: HTTP OPTIONS for returning the
>>>          description
>>>       4. Named graph
>>>         4.1. graph IRI retrieved with DESCRIBE SELF (see 1.2.)
>>>         4.2. graph IRI == SPARQL endpoint URI
>>>
>>> Q2) Which metadata (w/o statistics) to include?
>>>     1. Is there any problem with voiD or what would be missing in
>>>        voiD for SPARQL endpoints?
>>>
>>> Q3) Which statistics to include?
>>>     1. simple counts for resources in total, per class / untyped
>>>     2. number of documents in case of data dumps
>>>     3. selectivities for properties (untyped and with given class)
>>>     4. histograms for property values (untyped and with given class,
>>>        can be generated with [3])
>>>     5. Is SCOVO sufficient?
>>>       5.1. A bit verbose: dimensions should be simplified [6].
>>>       5.2. Encoding histograms not trivial: SCOVO min/max only
>>>            useful for integers and nominal scales, RDFStats uses
>>>            base64-encoded literals
>>>
>>> Q4) Some metadata are SPARQL-endpoint specific and irrelevant for
>>>     datasets in general (i.e. collections/dumps/etc) - but common
>>>     metadata should be reused from voiD for maximum interoperability.
>>>     We should define a separate vocabulary for SPARQL endpoint
>>>     descriptions, but re-use common properties from voiD.
>>>     Which metadata is SPARQL specific?
>>>     a) full-text search...
>>>
>>> Since this may become a larger mindmap, it would be better to work it
>>> out in a wiki. Any suggestions where to continue?
>>>
>>> Regards,
>>> AndyL
>>>
>>>
>>> [1] http://rdfs.org/ns/void-guide
>>> [2] http://www.w3.org/2009/sparql/wiki/Feature:ServiceDescriptions
>>> [3] http://rdfstats.sourceforge.net
>>> [4] http://www.nabble.com/Free-text-search-and-SPARQL-New-Features-and-Rationale-draft-to24324606.html
>>> [5] http://purl.org/NET/scovo#
>>> [6] http://code.google.com/p/void-impl/issues/detail?id=18
>>>
>>> http://www.langegger.at
>>> ----------------------------------------------------------------------
>>> Dipl.-Ing.(FH) Andreas Langegger
>>> FAW - Institute for Application-oriented Knowledge Processing
>>> Johannes Kepler University Linz
>>> A-4040 Linz, Altenberger Straße 69
>>
>
>http://www.langegger.at
>----------------------------------------------------------------------
>Dipl.-Ing.(FH) Andreas Langegger
>FAW - Institute for Application-oriented Knowledge Processing
>Johannes Kepler University Linz
>A-4040 Linz, Altenberger Straße 69
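
For reference, the header-based behaviour quoted near the top of this thread (serve the description when RDF is preferred and no query parameter is sent, otherwise the usual form) could look roughly like the Python sketch below. This is not ARC's actual code. The function names, the `query` parameter name, the choice of application/rdf+xml as "RDF", and the rule that ties fall back to the form are all assumptions made for illustration; wildcards such as */* are ignored.

```python
def parse_accept(header: str) -> dict[str, float]:
    """Map each media type in an Accept header to its q-value (default 1.0)."""
    prefs = {}
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0
        for field in fields[1:]:
            name, _, value = field.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # unparsable q-value: treat as not acceptable
        prefs[media_type] = q
    return prefs


def choose_response(accept: str, params: dict[str, str]) -> str:
    """Decide what the endpoint URL serves: 'query', 'description' or 'form'."""
    # Assumption: the query string parameter is called "query".
    if "query" in params:
        return "query"
    prefs = parse_accept(accept)
    rdf_q = prefs.get("application/rdf+xml", 0.0)
    html_q = prefs.get("text/html", 0.0)
    # Assumption: ties and missing headers fall back to the input form.
    return "description" if rdf_q > html_q else "form"


print(choose_response("application/rdf+xml", {}))
print(choose_response("text/html,application/rdf+xml;q=0.9", {}))
```

The fragility mentioned in the thread shows up here too: a client that sends no Accept header, or only */*, gets the form rather than the description.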
Received on Wednesday, 5 August 2009 11:27:23 UTC