Re: SKOS and SPARQL - what seems like an obvious use-case is difficult to implement from Alistair Miles on 2009-03-23 (public-esw-thes@w3.org from March 2009)

From: Alistair Miles <alistair.miles@zoo.ox.ac.uk>
Date: Mon, 23 Mar 2009 11:53:15 +0000
To: Simon.Cox@csiro.au
Cc: public-esw-thes@w3.org, jena-dev@yahoogroups.com
Message-ID: <20090323115313.GA11262@skiathos>

Hi Simon,

A few comments, not a complete answer but hopefully useful. I haven't
tested any of the queries below, so caveat emptor.

On Mon, Mar 16, 2009 at 11:04:08AM +0900, Simon.Cox@csiro.au wrote:
> We are looking to implement some basic discovery operations over SKOS datasets using SPARQL.
> The hypothesis is that, given the relatively strong and stable RDF data model that SKOS provides, some ops could be bundled into a standard 'concept retrieval interface' and implemented by standard vocabulary services.
> 
> However, a couple of what seem (to us) to be obvious ops seem to require multiple requests to implement.
> We are interested to know if
> (a) we are missing a trick or two in SPARQL
> (b) we are deluded in our assessment that these use-cases are common, and thus should be supported more directly
> 
> 
> Case 1 - 'simple search'
> 'Get Concept By Label'
> - We want to get a hit if the value of any prefLabel = the target string, regardless of language qualifier.
> - since a language qualified value will not match a request for a language-unqualified value (http://www.w3.org/TR/2008/REC-rdf-sparql-query-20080115/#matchLangTags) we must use a SPARQL filter expression, e.g.
> 
> SELECT $concept WHERE
> {
> $concept skos:prefLabel $conceptName.
> FILTER regex(str($conceptName), "^Ludlow$")
> }
> 
> - this generates multiple hits if the language-qualified labels are the same string (e.g. in our case one concept has prefLabel="Ludlow"@en, prefLabel="Ludlow"@fr, prefLabel="Ludlow"@it, prefLabel="Ludlow"@sv)
> - the DISTINCT operator will strip the duplicates - great!
> 
> SELECT DISTINCT $concept WHERE
> {
> $concept skos:prefLabel $conceptName.
> FILTER regex(str($conceptName), "^Ludlow$")
> }
> 
> - but we want the response to contain all the details of the SKOS Concept, so we need to use DESCRIBE, but
>     - DISTINCT is only available on SELECT :-(
>     - SPARQL doesn't support nested queries - i.e. we can't wrap a DESCRIBE around a SELECT DISTINCT
> 
> It looks like we have to add some logic to the client to run a DESCRIBE on each result from the SELECT DISTINCT operation - correct?

There are a few options here. 

First let me say that I think your DESCRIBE query should work fine. E.g.

DESCRIBE $concept $broader $narrower $related WHERE 
{
  $concept skos:prefLabel $prefLabel .
  FILTER (str($prefLabel) = "Ludlow")
  OPTIONAL { $concept skos:broader $broader . }
  OPTIONAL { $concept skos:narrower $narrower . }
  OPTIONAL { $concept skos:related $related . }
}

I don't think you need a distinct keyword here, because any duplicate
bindings will disappear as the result graph is constructed (an RDF
graph is a *set* of triples and therefore cannot contain duplicates).
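
To make the set argument concrete, here is a minimal Python sketch that models a graph as a set of (subject, predicate, object) tuples. The example.org URIs and the describe() helper are made up for illustration; a real engine decides for itself what a description contains.

```python
# Toy model: an RDF graph is a set of (subject, predicate, object) triples.
SKOS = "http://www.w3.org/2004/02/skos/core#"
LUDLOW = "http://example.org/concept/ludlow"      # hypothetical URI
SILURIAN = "http://example.org/concept/silurian"  # hypothetical URI

DATA = {
    (LUDLOW, SKOS + "prefLabel", '"Ludlow"@en'),
    (LUDLOW, SKOS + "prefLabel", '"Ludlow"@fr'),
    (LUDLOW, SKOS + "prefLabel", '"Ludlow"@it'),
    (LUDLOW, SKOS + "broader", SILURIAN),
}


def describe(resource, data):
    """Hypothetical description: every triple with the resource as subject."""
    return {triple for triple in data if triple[0] == resource}


# The WHERE clause matches once per language-tagged label, so the same
# concept is bound three times.
bindings = [s for (s, p, o) in DATA
            if p == SKOS + "prefLabel" and o.startswith('"Ludlow"')]

# Building the result graph is a set union, so the repeats collapse.
result_graph = set()
for concept in bindings:
    result_graph |= describe(concept, DATA)

print(len(bindings), len(result_graph))  # 3 bindings, 4 triples
```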

Also note that I think you can simplify the filter clause, i.e. you
can use the equality operator rather than a regex, which may (or may
not) be more efficient. (Btw I placed the filter clause at the
earliest possible position. A half-decent optimiser should figure this
out anyway, but some query engines may not.)
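
If the label comes from user input, the query text has to be assembled by the client. A rough Python sketch of that step follows; the function names are invented for illustration, and it adds the skos: PREFIX declaration that the queries in this message leave out. The escaping covers the characters that cannot appear raw inside a double-quoted SPARQL string (backslash, double quote, newline, carriage return).

```python
SKOS_PREFIX = "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>"


def escape_literal(text):
    """Escape a string for use inside a double-quoted SPARQL literal."""
    return (
        text.replace("\\", "\\\\")  # backslash first, so later escapes survive
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def describe_by_label(label):
    """Build the 'Get Concept By Label' DESCRIBE query for one label."""
    return "\n".join([
        SKOS_PREFIX,
        "DESCRIBE $concept $broader $narrower $related WHERE",
        "{",
        "  $concept skos:prefLabel $prefLabel .",
        '  FILTER (str($prefLabel) = "%s")' % escape_literal(label),
        "  OPTIONAL { $concept skos:broader $broader . }",
        "  OPTIONAL { $concept skos:narrower $narrower . }",
        "  OPTIONAL { $concept skos:related $related . }",
        "}",
    ])


print(describe_by_label("Ludlow"))
```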

Another option would be to stick with SELECT and select more variables, e.g.

SELECT * WHERE
{
  $concept skos:prefLabel $prefLabel .
  FILTER (str($prefLabel) = "Ludlow")
  OPTIONAL { $concept skos:altLabel $altLabel . }
  OPTIONAL { $concept skos:hiddenLabel $hiddenLabel . }
  OPTIONAL { $concept skos:definition $definition . }
  OPTIONAL { $concept skos:broader $broader . $broader skos:prefLabel $broaderLabel . }
  OPTIONAL { $concept skos:narrower $narrower . $narrower skos:prefLabel $narrowerLabel . }
  OPTIONAL { $concept skos:related $related . $related skos:prefLabel $relatedLabel . }
}

Note that I haven't included a distinct keyword. Even if I did, the
result set for this query will contain multiple rows for each distinct
concept URI anyway, and so will need a small amount of post-processing
to turn into something more usable. This is very typical for apps
using SPARQL, e.g. in flyui [1] we build a set of objects from the raw
result set (see e.g. [2] esp. the classes Gene and GenePool and the
function newInstancesFromSPARQLResults). This is usually trivial to
implement.
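
As an illustration of that kind of post-processing (this is not the flyui code), a minimal Python sketch that collapses flat result rows into one record per concept. It assumes each row has already been reduced to a dict of variable name to string value, with unbound OPTIONAL variables simply absent; the URIs and label strings are toy values.

```python
from collections import defaultdict


def group_by_concept(rows):
    """Collapse flat result rows into one record per concept URI.

    Each record maps a variable name to the sorted list of distinct
    values seen for that variable across the concept's rows.
    """
    concepts = defaultdict(lambda: defaultdict(set))
    for row in rows:
        record = concepts[row["concept"]]
        for var, value in row.items():
            if var != "concept":
                record[var].add(value)
    return {
        uri: {var: sorted(values) for var, values in record.items()}
        for uri, record in concepts.items()
    }


# Toy rows: one concept, two labels, one broader concept.
rows = [
    {"concept": "http://example.org/concept/ludlow",
     "prefLabel": "Ludlow@en",
     "broader": "http://example.org/concept/silurian"},
    {"concept": "http://example.org/concept/ludlow",
     "prefLabel": "Ludlow@fr",
     "broader": "http://example.org/concept/silurian"},
]

for uri, record in group_by_concept(rows).items():
    print(uri, record)
```

Note that this groups each variable independently, so the pairing between, say, $broader and $broaderLabel is lost; keeping it would mean grouping on the pair instead.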

Also note that neither this query nor the one before will be very
efficient, because I believe each will scan through all possible
matches to the first triple pattern in the query, passing them through
the filter. For a dataset with tens of thousands of concepts, this may
not be too bad (<10s?). If you are interested in matching prefLabels
in any language, you may need to explore other strategies to get very
fast query performance (<1s).

One such strategy would be to construct a dataset using a custom
predicate e.g. my:prefLabelAnyLang where the values for this predicate
are constructed from the values of skos:prefLabel triples but with the
lang tag made empty. You could then ask the query

SELECT * WHERE
{
  $concept my:prefLabelAnyLang "Ludlow" ;
    skos:prefLabel $prefLabel .
  OPTIONAL { $concept skos:altLabel $altLabel . }
  OPTIONAL { $concept skos:hiddenLabel $hiddenLabel . }
  OPTIONAL { $concept skos:definition $definition . }
  OPTIONAL { $concept skos:broader $broader . $broader skos:prefLabel $broaderLabel . }
  OPTIONAL { $concept skos:narrower $narrower . $narrower skos:prefLabel $narrowerLabel . }
  OPTIONAL { $concept skos:related $related . $related skos:prefLabel $relatedLabel . }
}

Personally, I would go for this strategy, because it should give you
query performance in the milliseconds. Our experience (with Jena TDB
[3]) is that when a query has a ground node to start from (in this
case the plain literal "Ludlow") it is generally *much* faster (<<1s)
than passing a large number of triple pattern matches through a
filter. In general, on a dataset of any size (>100,000 triples),
queries relying on equality or regex filters to do text-based
searching are unlikely to perform well enough to be used in real-time
user interaction apps, although it can be ok if you have to do a batch
process and can leave it running for a while.
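
The extra triples could be generated in a one-off batch step. Here is a small Python sketch, assuming the existing labels have been exported as (concept URI, label text, language tag) tuples; the my: namespace URI is a placeholder, and the output is N-Triples lines with the language tag dropped.

```python
# Hypothetical namespace for the custom predicate.
MY_PREF_LABEL_ANY_LANG = "http://example.org/my#prefLabelAnyLang"


def nt_escape(text):
    """Escape backslash, quote, newline and carriage return for N-Triples."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def any_lang_triples(pref_labels):
    """Yield one plain-literal triple per distinct (concept, text) pair.

    pref_labels: iterable of (concept_uri, label_text, lang_tag) tuples.
    """
    seen = set()
    for concept, text, _lang in pref_labels:
        if (concept, text) in seen:
            continue  # same string in another language: already emitted
        seen.add((concept, text))
        yield '<%s> <%s> "%s" .' % (
            concept, MY_PREF_LABEL_ANY_LANG, nt_escape(text))


labels = [
    ("http://example.org/concept/ludlow", "Ludlow", "en"),
    ("http://example.org/concept/ludlow", "Ludlow", "fr"),
    ("http://example.org/concept/ludlow", "Ludlow", "it"),
    ("http://example.org/concept/ludlow", "Ludlow", "sv"),
]
for line in any_lang_triples(labels):
    print(line)
```

The four language-tagged labels from the example collapse to a single triple, which is the ground node the query above starts from.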

Another strategy would be to use an external index over literals in
the graph, and then a property function to access the external index
from within SPARQL. E.g. Jena provide a bridge between ARQ and Lucene
called LARQ [4], which would allow you to do

SELECT * WHERE
{
  $prefLabel pf:textMatch "Ludlow" .
  $concept skos:prefLabel $prefLabel .
  OPTIONAL { $concept skos:altLabel $altLabel . }
  OPTIONAL { $concept skos:hiddenLabel $hiddenLabel . }
  OPTIONAL { $concept skos:definition $definition . }
  OPTIONAL { $concept skos:broader $broader . $broader skos:prefLabel $broaderLabel . }
  OPTIONAL { $concept skos:narrower $narrower . $narrower skos:prefLabel $narrowerLabel . }
  OPTIONAL { $concept skos:related $related . $related skos:prefLabel $relatedLabel . }
}

This example assumes a LARQ index built using the string index builder
(i.e. an index over plain literals and string-typed literals
directly). There is also the possibility with LARQ to build an index
of subjects on values of some property (e.g. skos:prefLabel) which may
be even faster, although I haven't tried that yet.

We have some good experience with LARQ, although query performance
with a string index can be variable (e.g. 1-10s) for larger datasets
(>100 million triples).
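
To show the general idea of an external index over literals, here is a toy inverted index in Python. It is not the LARQ or Lucene API, and the whitespace tokenisation is a deliberately crude assumption; it only illustrates why a lookup by term can avoid scanning every label.

```python
from collections import defaultdict


def build_label_index(labels):
    """Map each lower-cased label token to the concepts whose label has it.

    labels: iterable of (concept_uri, label_text) pairs.
    """
    index = defaultdict(set)
    for concept, text in labels:
        for token in text.lower().split():
            index[token].add(concept)
    return index


def text_match(index, query):
    """Return the concepts whose labels contain every token of the query."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    hits = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        hits &= index.get(token, set())
    return hits


# Toy data with made-up concept identifiers.
index = build_label_index([
    ("c1", "Ludlow"),
    ("c2", "Ludlow Bone Bed"),
    ("c3", "Wenlock"),
])
print(sorted(text_match(index, "ludlow")))
```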

> Case 2 - 'all my relations'
> 'Get Concept And Its Relations'
> - we want to get the concept, and also all narrower, or all broader, or maybe just all related - out to some graph radius.
> i.e. transitive success.
> This is to support some matching or portrayal functions where the objects known to a client application are classified to either more or less detail than the reference system supports.
> 
> Again, it looks like we have to add an iterator to the client, to work through the results of a DESCRIBE request.

If you know the graph radius in advance, you can build an (albeit
unwieldy) query to that radius. E.g. if radius is 2...

DESCRIBE * WHERE
{
  $concept skos:prefLabel $prefLabel .
  FILTER (str($prefLabel) = "Ludlow")
  $concept skos:broader $broader1 ; skos:narrower $narrower1 ; skos:related $related1 .
  $broader1 skos:broader $broader2 .
  $narrower1 skos:narrower $narrower2 .
  $related1 skos:related $related2 .
}

Of course, this query only follows paths of the same predicate,
i.e. it doesn't expand the graph along paths with mixed
predicates. For that you could do something like

DESCRIBE * WHERE
{
  $concept skos:prefLabel $prefLabel .
  FILTER (str($prefLabel) = "Ludlow")
  $concept $p1 $concept1 .
  FILTER ( $p1 = skos:broader || $p1 = skos:narrower || $p1 = skos:related )
  $concept1 $p2 $concept2 .
  FILTER ( $p2 = skos:broader || $p2 = skos:narrower || $p2 = skos:related )
}

although we have found queries with variable predicates to be up to an
order of magnitude slower than similar queries with ground predicates.
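
Since the fixed-radius query gets unwieldy quickly, a client could generate it. A Python sketch of a generator for the same-predicate form, with invented function names: like the hand-written version above, every hop is a required pattern, so a concept lacking any one of the chains would not match at all, and wrapping the hops in OPTIONAL blocks would relax that.

```python
RELATIONS = ("broader", "narrower", "related")


def escape_literal(text):
    """Escape a string for use inside a double-quoted SPARQL literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def radius_query(label, radius):
    """Build a DESCRIBE query following each relation out to `radius` hops."""
    if radius < 1:
        raise ValueError("radius must be at least 1")
    lines = [
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>",
        "DESCRIBE * WHERE",
        "{",
        "  $concept skos:prefLabel $prefLabel .",
        '  FILTER (str($prefLabel) = "%s")' % escape_literal(label),
    ]
    for relation in RELATIONS:
        previous = "$concept"
        for hop in range(1, radius + 1):
            current = "$%s%d" % (relation, hop)
            lines.append("  %s skos:%s %s ." % (previous, relation, current))
            previous = current
    lines.append("}")
    return "\n".join(lines)


print(radius_query("Ludlow", 2))
```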

Some query engines (e.g. ARQ [5,6]) also have non-standard SPARQL
extensions to do graph path expansion and transitive computation,
although I haven't tried those yet.
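
If neither a fixed-radius query nor an engine-specific extension fits, the client-side iteration mentioned in the question can be kept small. Here is a Python sketch of a breadth-first expansion out to a given radius; the neighbours callable is an assumption standing in for whatever fetches the directly related concepts (for instance one SPARQL request per call), and the toy graph is made up.

```python
from collections import deque


def expand(start, neighbours, radius):
    """Breadth-first expansion from `start` out to `radius` hops.

    neighbours: callable taking a concept URI and returning an iterable
    of directly related concept URIs.
    Returns a dict mapping each reached concept to its hop distance.
    """
    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distance[current] == radius:
            continue  # do not look beyond the requested radius
        for nxt in neighbours(current):
            if nxt not in distance:  # visit each concept once
                distance[nxt] = distance[current] + 1
                queue.append(nxt)
    return distance


# Toy graph standing in for skos:narrower links.
toy_graph = {"a": ["b"], "b": ["a", "c"], "c": ["d"], "d": []}
print(expand("a", lambda concept: toy_graph.get(concept, []), 2))
```

Tracking visited concepts matters here because skos:related links and broader/narrower pairs naturally lead back to where you started.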

Another place you might ask these sorts of questions is the jena-dev
mailing list. They're a very helpful bunch and know a *lot* more about
SPARQL query processing than I do. 

Hth,

Alistair

[1] http://code.google.com/p/flyui/
[2] http://code.google.com/p/flyui/source/browse/trunk/src/flyui/flybase/flybase.js
[3] http://jena.hpl.hp.com/wiki/TDB
[4] http://jena.sourceforge.net/ARQ/lucene-arq.html
[5] http://jena.sourceforge.net/ARQ/documentation.html
[6] http://jena.sourceforge.net/ARQ/property_paths.html



-- 
Alistair Miles
Senior Computing Officer
Image Bioinformatics Research Group
Department of Zoology
The Tinbergen Building
University of Oxford
South Parks Road
Oxford
OX1 3PS
United Kingdom
Web: http://purl.org/net/aliman
Email: alistair.miles@zoo.ox.ac.uk
Tel: +44 (0)1865 281993
Received on Monday, 23 March 2009 11:53:53 UTC