- From: Alistair Miles <alistair.miles@zoo.ox.ac.uk>
- Date: Mon, 23 Mar 2009 11:53:15 +0000
- To: Simon.Cox@csiro.au
- Cc: public-esw-thes@w3.org, jena-dev@yahoogroups.com
Hi Simon,

A few comments, not a complete answer but hopefully useful. I haven't tested any of the queries below, so caveat emptor.

On Mon, Mar 16, 2009 at 11:04:08AM +0900, Simon.Cox@csiro.au wrote:
> We are looking to implement some basic discovery operations over SKOS datasets using SPARQL.
> The hypothesis is that, given the relatively strong and stable RDF data model that SKOS provides, some ops could be bundled into a standard 'concept retrieval interface' and implemented by standard vocabulary services.
>
> However, a couple of what seem (to us) to be obvious ops seem to require multiple requests to implement.
> We are interested to know if
> (a) we are missing a trick or two in SPARQL
> (b) we are deluded in our assessment that these use-cases are common, and thus should be supported more directly
>
>
> Case 1 - 'simple search'
> 'Get Concept By Label'
> - We want to get a hit if the value of any prefLabel = the target string, regardless of language qualifier.
> - since a language qualified value will not match a request for a language-unqualified value (http://www.w3.org/TR/2008/REC-rdf-sparql-query-20080115/#matchLangTags) we must use a SPARQL filter expression, e.g.
>
> SELECT $concept WHERE
> {
> $concept skos:prefLabel $conceptName.
> FILTER regex(str($conceptName), "^Ludlow$")
> }
>
> - this generates multiple hits if the language-qualified labels are the same string (e.g. in our case one concept has prefLabel="Ludlow"@en, prefLabel="Ludlow"@fr, prefLabel="Ludlow"@it, prefLabel="Ludlow"@sv)
> - the DISTINCT operator will strip the duplicates - great!
>
> SELECT DISTINCT $concept WHERE
> {
> $concept skos:prefLabel $conceptName.
> FILTER regex(str($conceptName), "^Ludlow$")
> }
>
> - but we want the response to contain all the details of the SKOS Concept, so we need to use DESCRIBE, but
> - DISTINCT is only available on SELECT :-(
> - SPARQL doesn't support nested queries - i.e.
> we can't wrap a DESCRIBE around a SELECT DISTINCT
>
> It looks like we have to add some logic to the client to run a DESCRIBE on each result from the SELECT DISTINCT operation - correct?

There are a few options here.

First let me say that I think your DESCRIBE query should work fine. E.g.

  DESCRIBE $concept $broader $narrower $related
  WHERE {
    $concept skos:prefLabel $prefLabel .
    FILTER (str($prefLabel) = "Ludlow")
    OPTIONAL { $concept skos:broader $broader . }
    OPTIONAL { $concept skos:narrower $narrower . }
    OPTIONAL { $concept skos:related $related . }
  }

I don't think you need a distinct keyword here, because any duplicate bindings will disappear as the result graph is constructed (an RDF graph is a *set* of triples and therefore cannot contain duplicates).

Also note that I think you can simplify the filter clause, i.e. you can use the equality operator rather than a regex, which may (or may not) be more efficient. (Btw I placed the filter clause at the earliest possible position. A half-decent optimiser should figure this out anyway, but some query engines may not.)

Another option would be to stick with SELECT and select more variables, e.g.

  SELECT * WHERE {
    $concept skos:prefLabel $prefLabel .
    FILTER (str($prefLabel) = "Ludlow")
    OPTIONAL { $concept skos:altLabel $altLabel . }
    OPTIONAL { $concept skos:hiddenLabel $hiddenLabel . }
    OPTIONAL { $concept skos:definition $definition . }
    OPTIONAL { $concept skos:broader $broader . $broader skos:prefLabel $broaderLabel . }
    OPTIONAL { $concept skos:narrower $narrower . $narrower skos:prefLabel $narrowerLabel . }
    OPTIONAL { $concept skos:related $related . $related skos:prefLabel $relatedLabel . }
  }

Note that I haven't included a distinct keyword. Even if I did, the result set for this query would still contain multiple rows for each distinct concept URI, and so will need a small amount of post-processing to turn it into something more usable. This is very typical for apps using SPARQL, e.g.
in flyui [1] we build a set of objects from the raw result set (see e.g. [2] esp. the classes Gene and GenePool and the function newInstancesFromSPARQLResults). Usually trivial to implement.

Also note that neither this query nor the one before will be very efficient, because I believe it will scan through all possible matches to the first triple pattern in the query, passing them through the filter. For a dataset with tens of thousands of concepts, this may not be too bad (<10s?).

If you are interested in matching prefLabels in any language, you may need to explore other strategies to get very fast query performance (<1s).

One such strategy would be to construct a dataset using a custom predicate, e.g. my:prefLabelAnyLang, where the values for this predicate are constructed from the values of skos:prefLabel triples but with the lang tag made empty. You could then ask the query

  SELECT * WHERE {
    $concept my:prefLabelAnyLang "Ludlow" ;
             skos:prefLabel $prefLabel .
    OPTIONAL { $concept skos:altLabel $altLabel . }
    OPTIONAL { $concept skos:hiddenLabel $hiddenLabel . }
    OPTIONAL { $concept skos:definition $definition . }
    OPTIONAL { $concept skos:broader $broader . $broader skos:prefLabel $broaderLabel . }
    OPTIONAL { $concept skos:narrower $narrower . $narrower skos:prefLabel $narrowerLabel . }
    OPTIONAL { $concept skos:related $related . $related skos:prefLabel $relatedLabel . }
  }

Personally, I would go for this strategy, because it should give you query performance in the milliseconds. Our experience (with Jena TDB [3]) is that when a query has a ground node to start from (in this case the plain literal "Ludlow") it is generally *much* faster (<<1s) than passing a large number of triple pattern matches through a filter.
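
To make the my:prefLabelAnyLang idea a bit more concrete, here is a minimal Python sketch (untested against a real store) of how the extra triples might be generated as N-Triples. It assumes you already have the prefLabels as (concept URI, label, language tag) rows, e.g. from a SELECT over skos:prefLabel. The namespace http://example.org/my#, the example concept URIs and the function names are made up for illustration.

```python
# Hypothetical namespace for the custom predicate; the text above only
# names it my:prefLabelAnyLang and does not say what "my:" expands to.
MY_PREF_LABEL_ANY_LANG = "http://example.org/my#prefLabelAnyLang"


def escape_literal(text):
    """Escape a string for use inside an N-Triples literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
        .replace("\t", "\\t")
    )


def any_lang_triples(pref_labels):
    """Yield one N-Triples line per distinct (concept, label string) pair.

    pref_labels is an iterable of (concept_uri, label, lang) tuples. The
    language tag is dropped, so labels that differ only in language
    collapse to a single plain literal.
    """
    seen = set()
    for concept_uri, label, _lang in pref_labels:
        key = (concept_uri, label)
        if key in seen:
            continue
        seen.add(key)
        yield '<%s> <%s> "%s" .' % (
            concept_uri,
            MY_PREF_LABEL_ANY_LANG,
            escape_literal(label),
        )


# Example rows with made-up concept URIs.
rows = [
    ("http://example.org/concept/ludlow", "Ludlow", "en"),
    ("http://example.org/concept/ludlow", "Ludlow", "fr"),
    ("http://example.org/concept/wenlock", "Wenlock", "en"),
]
any_lang_lines = list(any_lang_triples(rows))
```

The resulting lines can be loaded alongside the SKOS data, so that the ground literal "Ludlow" in the query above has something to match. The sketch assumes the concept URIs contain no characters that need escaping in N-Triples.
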
In general, on a dataset of any significant size (>100,000 triples), queries relying on equality or regex filters to do text-based searching are unlikely to perform well enough to be used in real-time user interaction apps, although it can be ok if you have to do a batch process and can leave it running for a while.

Another strategy would be to use an external index over literals in the graph, and then a property function to access the external index from within SPARQL. E.g. Jena provides a bridge between ARQ and Lucene called LARQ [4], which would allow you to do

  SELECT * WHERE {
    $prefLabel pf:textMatch "Ludlow" .
    $concept skos:prefLabel $prefLabel .
    OPTIONAL { $concept skos:altLabel $altLabel . }
    OPTIONAL { $concept skos:hiddenLabel $hiddenLabel . }
    OPTIONAL { $concept skos:definition $definition . }
    OPTIONAL { $concept skos:broader $broader . $broader skos:prefLabel $broaderLabel . }
    OPTIONAL { $concept skos:narrower $narrower . $narrower skos:prefLabel $narrowerLabel . }
    OPTIONAL { $concept skos:related $related . $related skos:prefLabel $relatedLabel . }
  }

This example assumes a LARQ index built using the string index builder (i.e. an index over plain literals and string-typed literals directly). There is also the possibility with LARQ to build an index of subjects on values of some property (e.g. skos:prefLabel) which may be even faster, although I haven't tried that yet.

We have some good experience with LARQ, although query performance with a string index can be variable (e.g. 1-10s) for larger datasets (>100 million triples).

> Case 2 - 'all my relations'
> 'Get Concept And Its Relations'
> - we want to get the concept, and also all narrower, or all broader, or maybe just all related - out to some graph radius.
> i.e. transitive success.
> This is to support some matching or portrayal functions where the objects known to a client application are classified to either more or less detail than the reference system supports.
>
> Again, it looks like we have to add an iterator to the client, to work through the results of a DESCRIBE request.

If you know the graph radius in advance, you can build an (albeit unwieldy) query to that radius. E.g. if radius is 2...

  DESCRIBE * WHERE {
    $concept skos:prefLabel $prefLabel .
    FILTER (str($prefLabel) = "Ludlow")
    $concept skos:broader $broader1 ;
             skos:narrower $narrower1 ;
             skos:related $related1 .
    $broader1 skos:broader $broader2 .
    $narrower1 skos:narrower $narrower2 .
    $related1 skos:related $related2 .
  }

Of course, this query only follows paths of the same predicate, i.e. it doesn't expand the graph along paths with mixed predicates. For that you could do something like

  DESCRIBE * WHERE {
    $concept skos:prefLabel $prefLabel .
    FILTER (str($prefLabel) = "Ludlow")
    $concept $p1 $concept1 .
    FILTER ( $p1 = skos:broader || $p1 = skos:narrower || $p1 = skos:related )
    $concept1 $p2 $concept2 .
    FILTER ( $p2 = skos:broader || $p2 = skos:narrower || $p2 = skos:related )
  }

although we have found queries with variable predicates to be up to an order of magnitude slower than similar queries with ground predicates.

Some query engines (e.g. ARQ [5,6]) also have non-standard SPARQL extensions to do graph path expansion and transitive computation, although I haven't tried those yet.

Another place you might ask these sorts of questions is the jena-dev mailing list. They're a very helpful bunch and know a *lot* more about SPARQL query processing than I do.
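
Going back to the fixed-radius query for Case 2: since the query text grows with the radius, it is probably easier to generate it than to write it by hand. Below is a minimal Python sketch of one way to do that (untested against a query engine). The function name radius_query and its parameters are invented for illustration. Unlike the radius-2 query above, it wraps each hop in a nested OPTIONAL, which is an assumption of this sketch, so that a concept lacking one of the relations can still match. Prefix declarations are left out, as in the queries above.

```python
def radius_query(label, radius, relations=("broader", "narrower", "related")):
    """Build a DESCRIBE query following each relation out to `radius` hops.

    Each chain of hops is wrapped in nested OPTIONALs (a choice made in
    this sketch) so that a missing relation does not rule out the concept.
    """
    if radius < 1:
        raise ValueError("radius must be at least 1")
    if '"' in label or "\\" in label:
        # Keep the sketch simple: refuse labels that would need escaping.
        raise ValueError("label needs escaping before use in a query")

    lines = [
        "DESCRIBE * WHERE {",
        "  $concept skos:prefLabel $prefLabel .",
        '  FILTER (str($prefLabel) = "%s")' % label,
    ]
    for rel in relations:
        subject = "$concept"
        parts = []
        for hop in range(1, radius + 1):
            # Variables are named after the relation and hop, e.g. $broader2.
            obj = "$%s%d" % (rel, hop)
            parts.append("OPTIONAL { %s skos:%s %s ." % (subject, rel, obj))
            subject = obj
        # Close every OPTIONAL opened for this relation.
        lines.append("  " + " ".join(parts) + " }" * radius)
    lines.append("}")
    return "\n".join(lines)


query = radius_query("Ludlow", 2)
```

With radius 2 this produces one line per relation, e.g. OPTIONAL { $concept skos:broader $broader1 . OPTIONAL { $broader1 skos:broader $broader2 . } }. It still only follows paths of a single predicate, so the caveat about mixed-predicate paths applies here too.
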
Hth,

Alistair

[1] http://code.google.com/p/flyui/
[2] http://code.google.com/p/flyui/source/browse/trunk/src/flyui/flybase/flybase.js
[3] http://jena.hpl.hp.com/wiki/TDB
[4] http://jena.sourceforge.net/ARQ/lucene-arq.html
[5] http://jena.sourceforge.net/ARQ/documentation.html
[6] http://jena.sourceforge.net/ARQ/property_paths.html

>
>
> Simon
> ______
> Simon.Cox@csiro.au CSIRO Exploration & Mining
> 26 Dick Perry Avenue, Kensington WA 6151
> PO Box 1130, Bentley WA 6102 AUSTRALIA
> T: +61 (0)8 6436 8639 Cell: +61 (0) 403 302 672
> Polycom PVX: 130.116.146.28
> <http://www.csiro.au/>
>
> ABN: 41 687 119 230
>

-- 
Alistair Miles
Senior Computing Officer
Image Bioinformatics Research Group
Department of Zoology
The Tinbergen Building
University of Oxford
South Parks Road
Oxford OX1 3PS
United Kingdom
Web: http://purl.org/net/aliman
Email: alistair.miles@zoo.ox.ac.uk
Tel: +44 (0)1865 281993
Received on Monday, 23 March 2009 11:53:53 UTC