Re: SKOS Computability Levels? XML Serialization?

Hi all,

I got asked off-list about how I got the SPARQL performance results I
mentioned previously, and thought some of this might be of interest to the
rest of the list, so here are a few tidbits...

Last year I did some benchmarking using an RDF dataset generated from the
FlyBase database. Details of the dataset are at [1]. Some notes on query
benchmarking comparing performance of a Jena TDB-backed SPARQL endpoint against
performance of a Postgres SQL database for roughly equivalent queries are at
[2]. It's not very detailed, but it should give an idea.

I found that query performance of Jena TDB was very good, but it depends on
the design of the query and the data. Roughly speaking, you want the query
engine to evaluate the most selective parts of the query as early as possible,
so that you reduce the amount of work it has to do. TDB has a couple of
different optimisers which use some basic heuristics plus some statistics
about the data to re-order the query for you, but query design still matters,
as does data design (generally, fewer triple patterns and joins mean less
work, so shorter paths in the RDF data help).
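
To illustrate what I mean by "selective parts first", here's a rough sketch
(not taken from the actual FlyBase queries, just an illustration using SKOS
vocabulary):

    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

    # Slower ordering: the unselective pattern comes first, so the engine
    # enumerates every triple before filtering on the label.
    SELECT ?concept ?p ?o WHERE {
      ?concept ?p ?o .
      ?concept skos:prefLabel "Telecommuting"@en .
    }

    # Faster ordering: the selective pattern (matching a single literal)
    # comes first, so only the matching concepts are joined against ?p ?o.
    SELECT ?concept ?p ?o WHERE {
      ?concept skos:prefLabel "Telecommuting"@en .
      ?concept ?p ?o .
    }

A good optimiser will often do this re-ordering for you, but it's worth
knowing what it's trying to achieve.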

Generally, I think similar concerns apply both to relational databases and
to XML databases (e.g., see [4]). So having at least a vague idea of the
strategies that the query engine will use to evaluate the query helps.

I also did some work with TDB and Lucene (LARQ) to investigate free-text
queries within a SPARQL query, which worked well, but I didn't manage to
achieve the sub-second results I was hoping for on larger datasets (100s of
millions of triples). If you want to play, it may be worth checking out
sparqlite [3], although it's a bit behind recent TDB developments now. Joachim
Neubert used sparqlite to implement a service (http://zbw.eu/beta/stw-ws#combined1)
which offers additional search terms for user queries
(e.g. http://econstor.eu/dspace/simple-search?query=telecommuting), using
data from the STW thesaurus.
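
For reference, a LARQ free-text query looks roughly like this, assuming a
Lucene index has been built over the literals (pf:textMatch is the property
function LARQ provides; the label value here is just an example):

    PREFIX pf:   <http://jena.hpl.hp.com/ARQ/property#>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

    # Match labels against the Lucene index, then join back to the concepts.
    SELECT ?concept ?label WHERE {
      ?label pf:textMatch "telecommut*" .
      ?concept skos:prefLabel ?label .
    }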

Graham Klyne and some of the ex-Jena folks at Epimorphics are taking the
work on SPARQL and Lucene forward in the milarq project, so contact them if
you're interested.

For loading data, TDB goes much faster on a 64-bit OS, so I used m1.xlarge
EC2 instances (64-bit) to perform the data load onto an EBS volume, then
detached the volume and attached it to an m1.small instance (32-bit) where the
SPARQL endpoint is hosted (openflydata.org runs on an m1.small EC2 instance).

FWIW, I would have thought that even the largest thesauri could probably
be handled in memory on a fairly modest machine. Running SPARQL queries
against an in-memory RDF graph should be even faster than against a
TDB-backed persistent graph.

Cheers

Alistair

[1] http://code.google.com/p/openflydata/wiki/FlyBaseMilestone3 
[2] http://code.google.com/p/openflydata/wiki/FlyBaseBenchmark
[3] http://code.google.com/p/sparqlite/
[4] http://exist.sourceforge.net/tuning.html#N10266

> -----Original Message-----
> From: public-esw-thes-request@w3.org [mailto:public-esw-thes-request@w3.org]
> On Behalf Of Alistair Miles
> Sent: 19 July 2010 13:50
> To: Christophe Dupriez
> Cc: SKOS
> Subject: Re: SKOS Computability Levels? XML Serialization?
> 
> Hi Christophe,
> 
> Interesting questions. I don't have complete answers, but here are a few
> thoughts...
> 
> On Fri, Jul 16, 2010 at 01:01:31PM +0200, Christophe Dupriez wrote:
> > In the discussion about "validation" (including different KQIs:Key
> > Quality Indicators or Exception listings), one aspect is very
> > important for me as an implementor: computability...
> > 
> > I see that SKOS, compared to ISO 25964 or zThes, is very expandable.
> > But will it remain computable in PRACTICE, for the big thesauri
> > (Agrovoc, MeSH including Substances...) that we MUST manage?
> > 
> > To parse SKOS in RDFS, using its (sub-)class and (sub-)property
> > definition possibilities, you need OWL artillery:
> > Jena, Sesame/Elmo/AliBaba, Manchester OWL API / Protege SKOSed, others?
> 
> Well, no, you don't need OWL artillery. You just need an RDF rifle :) 
> 
> I did some work with Jena TDB last year, which is a native RDF triple store,
> and had load speeds of 15,000 to 30,000 triples per second. That's not just
> parsing speed, that's parsing and storing in a persistent triple store
> (and indexing too). I was working with RDF datasets in the order of 200
> million triples, which was quite manageable. This is at least an order of
> magnitude bigger than the number of triples required to represent even the
> biggest thesaurus. So if all you are doing is parsing and storing RDF, then
> even the biggest thesauri should be no problem for RDF tools like Jena TDB,
> Sesame, Mulgara or Virtuoso.
> 
> > I have not tested everything, but I am still unaware of an OWL
> > framework able to handle BIG thesauri linked with BIG information
> > databases (with reasonable hardware and response times: my
> > applications are used in medical emergencies).
> 
> Again, if you are storing and querying RDF, then I would think you should be
> able to scale to hundreds of millions of triples, and still get sub-second
> SPARQL query times, if you design the queries and the data well.
> 
> E.g., try this link:
> 
> http://openflydata.org/flyui/build/apps/expressionmashup/#query=schuy
> 
> The initial query to find the gene matching "schuy" is against an RDF graph
> that is ~170 million triples.
> 
> Having said that, we found that query time is very sensitive to the structure
> of the data and the structure of the query. This is not unique to RDF; you'll
> find similar general considerations when designing data structures and queries
> for relational databases and XML databases. So I recommend benchmarking your
> queries early on to get a feel for scalability with your data.
> 
> > As a (less but still) flexible alternative, I see XSLT as a
> > serialization tool for a SKOS file into an XML representation of
> > this SKOS data.
> 
> Well, I'm not sure what you mean by "serialization" here. If you mean
> serialising an RDF graph which you have stored in memory or in a persistent
> triplestore to RDF/XML, then you need a library like Jena or Sesame. But if
> you mean *transforming* from RDF/XML to another XML-based representation,
> then XSLT is appropriate.
> 
> > For instance, from my test of an XSLT to make a nice presentation of a
> > SKOS file (http://www.askosi.org/xcss/skosrdf2html.xslt),
> > a serialization in HTML rather than XML, I noticed it is easy to make a
> > transformation for one RDF flavour (usage pattern) but not for all.
> > 
> > XSLT itself is not very good for very big data files unless you can
> > split the data into chunks (transform concept by concept).
> > A specialized parser would do better.
> > 
> > My proposal: to define "computability levels" for SKOS files (like
> > the one existing for OWL)
> > 1) linear: an XML serialization (ISO 25964, zThes or a SKOS XSD to
> > standardize) is possible in a linear way (by applying simple
> > replacements based on easy pattern matching)
> > 2) serializable but not linear: the whole SKOS file must be read in
> > memory to access the necessary data for XML serialization. A generic
> > XSLT program is able to do the transformation.
> > 3) limited inference: a specialized XSLT program (which is adapted
> > to sub-classes and sub-properties defined in the SKOS file) is able
> > to do an adequate and faithful serialization.
> > 4) OWL Lite
> > 5) OWL DL
> > 6) OWL Full
> > and to implement a tool to check the computability level of any
> > given SKOS file.
> 
> I like the idea, but I'm not sure I understand what you mean by levels (1),
> (2) and (3). E.g., what do you mean by "linear"? 
> 
> "Computability" also depends greatly on what sorts of "computation" you want
> to do. What did you have in mind?
> 
> > My opinion is that SKOS is for humans who have to build efficient
> > (search) user interfaces.
> > OWL is for humans who have to model data to automate (subtle) processes.
> > 
> > Computability is IMHO an important issue for SKOS: when you restart
> > an information server, you want it to be ready to serve in seconds,
> > not hours.
> > Java loads a faithful and complete AGROVOC XML serialization (all
> > languages) in 30 seconds, building an appropriate memory structure to
> > serve users.
> > Can we hope to do that if a reasoner has to infer relations from an
> > unsorted bag of RDF definitions?
> 
> Well, when working with RDF data, you often *don't* need to involve a
> reasoner. E.g., the applications at openflydata.org all use SPARQL to query
> RDF data in real time, without doing any reasoning at all -- try watching
> the net traffic in Firebug while you interact with the application.
> 
> When you absolutely have to use a reasoner, you can often precompute the
> set of inferences you need, then store the computed output as triples in
> a normal triplestore. E.g., if you wanted to query the sub-property and
> sub-class closure of a SKOS RDF dataset, that is what I would do. These
> inferences are straightforward to generate (probably a matter of minutes),
> and once you've stored them, you'll probably get sub-second SPARQL query
> times, depending on the query.
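> 
> For example, the transitive closure of skos:broader can be materialised with
> a couple of CONSTRUCT queries along these lines (a sketch only -- run the
> second query repeatedly, adding its output to the store, until it produces
> no new triples):
> 
>     PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
> 
>     # Seed the closure from the asserted broader links.
>     CONSTRUCT { ?a skos:broaderTransitive ?b }
>     WHERE     { ?a skos:broader ?b }
> 
>     # Then extend it one step at a time until a fixpoint is reached.
>     CONSTRUCT { ?a skos:broaderTransitive ?c }
>     WHERE     { ?a skos:broaderTransitive ?b .
>                 ?b skos:broaderTransitive ?c }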
> 
> Bottom line - there is *a lot* you can do with RDF and SPARQL without needing
> to involve a reasoner at all. And there is even more you can do with a few
> simple tractable inferences computed ahead of time, so you don't need to
> involve a reasoner at runtime.
> 
> But, it all depends on what you want to do with your data :)
>  
> > Should a SKOS validation process (optionally) generate an XML
> > serialization of the SKOS definitions for faster processing?
> 
> The way I implemented the original SKOS validation service was to use
> SPARQL. Each validation test was implemented as a SPARQL query against the
> data. I stored the RDF data in a persistent triple store (then Jena RDB,
> although now I would use Jena TDB), then executed the SPARQL queries.
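> 
> For example, a check for the SKOS integrity condition that a concept should
> have at most one preferred label per language can be written roughly as
> follows (each result row is a violation):
> 
>     PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
> 
>     # Concepts carrying two different skos:prefLabel values with the same
>     # language tag.
>     SELECT DISTINCT ?concept WHERE {
>       ?concept skos:prefLabel ?l1 .
>       ?concept skos:prefLabel ?l2 .
>       FILTER ( ?l1 != ?l2 && lang(?l1) = lang(?l2) )
>     }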
> 
> A few of the queries required some inference prior to executing. In that
> case, I computed the set of inferences needed, then queried the inferred
> model with SPARQL.
> 
> So I think you could probably do most of what you need by (1) storing
> your RDF data in a triple store, (2) pre-computing sub-class, sub-property
> closure, and transitive closure of skos:broader, then (3) executing a bunch
> of SPARQL queries.
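> 
> Once the closures are materialised, queries that would otherwise need
> inference become plain triple patterns, e.g. (the concept URI is just a
> made-up example):
> 
>     PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
> 
>     # All ancestors of a concept, read straight from the precomputed closure.
>     SELECT ?ancestor WHERE {
>       <http://example.org/concepts/telecommuting> skos:broaderTransitive ?ancestor .
>     }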
> 
> > Please find here my proposal for the XML Schema (XSD) definition for
> > SKOS serialization:
> > http://www.askosi.org/ConceptScheme.xsd
> > A readable version is produced using an XSLT from the XS3P project:
> > http://www.askosi.org/example/ConceptScheme.xml
> > 
> > This XSLT was very hard to find, but the effort was well rewarded:
> > http://sourceforge.net/projects/xs3p/
> > If only we could reach the same quality when displaying a SKOS file!
> 
> I think an XSD for SKOS data is a very useful thing! So please don't take this
> as any lack of enthusiasm, because I can see many situations where having
> a SKOS XSD would be great (e.g., if I wanted to work with SKOS within an
> XForms application, or use XSLT to generate reports).
> 
> However, there will be other situations where working against the RDF triples
> will be easier, more efficient, and scale better. It all depends on what
> you want to do with the data! What did you have in mind?
> 
> Cheers
> 
> Alistair
> 
> -- 
> Alistair Miles
> Centre for Genomics and Global Health <http://cggh.org>
> The Wellcome Trust Centre for Human Genetics
> Roosevelt Drive
> Oxford
> OX3 7BN
> United Kingdom
> Web: http://purl.org/net/aliman
> Email: alimanfoo@gmail.com
> Tel: +44 (0)1865 287669
> 

-- 
Alistair Miles
Centre for Genomics and Global Health <http://cggh.org>
The Wellcome Trust Centre for Human Genetics
Roosevelt Drive
Oxford
OX3 7BN
United Kingdom
Web: http://purl.org/net/aliman
Email: alimanfoo@gmail.com
Tel: +44 (0)1865 287669
