Re: SKOS Computability Levels? XML Serialization?

Hi Christophe,

Interesting questions. I don't have complete answers, but here are a few
thoughts...

On Fri, Jul 16, 2010 at 01:01:31PM +0200, Christophe Dupriez wrote:
> In the discussion about "validation" (including different KQIs: Key
> Quality Indicators or Exception listings), one aspect is very
> important for me as an implementor: computability...
> 
> I see that SKOS, compared to ISO 25964 or zThes, is very expandable.
> But will it remain computable in PRACTICE? For the big thesauri
> (Agrovoc, MeSH including Substances...) that we MUST manage?
> 
> To parse SKOS in RDFS, using its (sub-)class and (sub-)property
> definition possibilities, you need OWL artillery:
> JENA, Sesame/Elmo/AliBaba, Manchester OWL API / Protege SKOSed, others?

Well, no, you don't need OWL artillery. You just need an RDF rifle :) 

I did some work with Jena TDB last year, which is a native RDF triple store,
and had load speeds of 15,000 to 30,000 triples per second. That's not just
parsing speed, that's parsing and storing in a persistent triple store
(and indexing too). I was working with RDF datasets in the order of 200
million triples, which was quite manageable. This is at least an order of
magnitude bigger than the number of triples required to represent even the
biggest thesaurus. So if all you are doing is parsing and storing RDF, then
even the biggest thesauri should be no problem for RDF tools like Jena TDB,
Sesame, Mulgara or Virtuoso.
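
For example, here's a minimal sketch of the loading step with Jena TDB
(using the com.hp.hpl.jena packages; the store directory and the input
file name are hypothetical):

    // Sketch: bulk-load a SKOS RDF/XML file into a persistent Jena TDB
    // store. TDB parses, stores and indexes as it loads.
    import com.hp.hpl.jena.query.Dataset;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.tdb.TDBFactory;

    public class LoadSkos {
        public static void main(String[] args) {
            Dataset dataset = TDBFactory.createDataset("/data/tdb-skos");
            Model model = dataset.getDefaultModel();
            // Hypothetical input file -- substitute your own dump.
            model.read("file:///data/agrovoc.rdf", "RDF/XML");
            System.out.println("Triples stored: " + model.size());
            dataset.close();
        }
    }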

> I have not tested everything, but I am still unaware of an OWL
> framework able to handle BIG thesauri linked with BIG information
> databases
> (with reasonable hardware and response time: my applications are
> used in medical emergencies).

Again, if you are storing and querying RDF, then I would think you should be
able to scale to hundreds of millions of triples, and still get sub-second
SPARQL query times, if you design the queries and the data well.

E.g., try this link:

http://openflydata.org/flyui/build/apps/expressionmashup/#query=schuy

The initial query to find the gene matching "schuy" is against an RDF graph
that is ~170 million triples.

Having said that, we found query time is very sensitive to the structure of
the data and the structure of the query. This is not unique to RDF; you'll
find similar general considerations when designing data structures and queries
for relational databases and XML databases. So I recommend benchmarking your
queries early on to get a feel for scalability with your data.
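
To make the benchmarking point concrete, here's a rough sketch of timing
a query with Jena (the store path is hypothetical and the query is
illustrative; the regex filter is deliberately the sort of construct that
can dominate query time, compared to an indexed lookup on an exact value):

    import com.hp.hpl.jena.query.QueryExecution;
    import com.hp.hpl.jena.query.QueryExecutionFactory;
    import com.hp.hpl.jena.query.ResultSet;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.tdb.TDBFactory;

    public class BenchmarkQuery {
        public static void main(String[] args) {
            Model model = TDBFactory.createDataset("/data/tdb-skos")
                    .getDefaultModel();
            // Label search with a (slow) regex filter.
            String q =
                "PREFIX skos: <http://www.w3.org/2004/02/skos/core#> "
                + "SELECT ?concept WHERE { "
                + "  ?concept skos:prefLabel ?label . "
                + "  FILTER regex(str(?label), 'malaria', 'i') }";
            long start = System.currentTimeMillis();
            QueryExecution qe = QueryExecutionFactory.create(q, model);
            ResultSet results = qe.execSelect();
            int hits = 0;
            while (results.hasNext()) { results.next(); hits++; }
            System.out.println(hits + " hits in "
                    + (System.currentTimeMillis() - start) + " ms");
            qe.close();
        }
    }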

> As a (less, but still) flexible alternative, I see XSLT as a tool
> for serializing a SKOS file into an XML representation of the SKOS
> data.

Well, I'm not sure what you mean by "serialization" here. If you mean
serialising an RDF graph which you have stored in memory or in a persistent
triplestore to RDF/XML, then you need a library like Jena or Sesame. But if
you mean *transforming* from RDF/XML to another XML-based representation,
then XSLT is appropriate.
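
For the latter case, a minimal sketch using the standard JAXP transform
API (the file names are hypothetical, and -- as you note below -- this
only works well when the RDF/XML follows a predictable usage pattern):

    import java.io.File;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;

    public class SkosRdf2Html {
        public static void main(String[] args) throws Exception {
            // Sketch: apply an XSLT stylesheet to an RDF/XML file to
            // produce another XML (or HTML) representation.
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(
                        new StreamSource(new File("skosrdf2html.xslt")));
            t.transform(new StreamSource(new File("scheme.rdf")),
                    new StreamResult(new File("scheme.html")));
        }
    }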

> For instance, with my test of an XSLT to make a nice presentation of
> a SKOS file (http://www.askosi.org/xcss/skosrdf2html.xslt),
> a serialization in HTML rather than XML, I noticed it is easy to make
> a transformation for one RDF flavour (usage pattern) but not for all.
> 
> XSLT itself is not very good for very big data files unless you can
> split the data into chunks (transforming concept by concept).
> A specialized parser would do better.
> 
> My proposal: to define "computability levels" for SKOS files (like
> the ones existing for OWL):
> 1) linear: an XML serialization (ISO 25964, zThes or a SKOS XSD to
> standardize) is possible in a linear way (by applying simple
> replacements based on easy pattern matching)
> 2) serializable but not linear: the whole SKOS file must be read in
> memory to access the necessary data for XML serialization. A generic
> XSLT program is able to do the transformation.
> 3) limited inference: a specialized XSLT program (which is adapted
> to sub-classes and sub-properties defined in the SKOS file) is able
> to do an adequate and faithful serialization.
> 4) OWL Lite
> 5) OWL DL
> 6) OWL Full
> and to implement a tool to check the computability level of any
> given SKOS file.

I like the idea, but I'm not sure I understand what you mean by levels (1),
(2) and (3). E.g., what do you mean by "linear"? 

"Computability" also depends greatly on what sorts of "computation" you want
to do. What did you have in mind?

> My opinion is that SKOS is for humans having to efficiently make
> efficient (Search) User Interfaces.
> OWL is for humans having to model data to automate (subtle) processes.
> 
> Computability is IMHO an important issue for SKOS: when you restart
> an information server, you want it to be ready to serve in seconds,
> not hours.
> Java loads (to get an appropriate memory structure to serve users)
> a faithful and complete AGROVOC XML serialization in 30 seconds
> (all languages).
> Can we hope to do that if a reasoner has to infer relations from an
> unsorted bag of RDF definitions?

Well, when working with RDF data, you often *don't* need to involve a
reasoner. E.g., the applications at openflydata.org all use SPARQL to query
RDF data in real time, without doing any reasoning at all -- try watching
the net traffic in Firebug while you interact with the application.
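
For example, a direct lookup like this needs no reasoner at all (a sketch
against the Jena API; the store path and the concept URI are hypothetical):

    import com.hp.hpl.jena.query.QueryExecution;
    import com.hp.hpl.jena.query.QueryExecutionFactory;
    import com.hp.hpl.jena.query.ResultSet;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.tdb.TDBFactory;

    public class BroaderLookup {
        public static void main(String[] args) {
            Model model = TDBFactory.createDataset("/data/tdb-skos")
                    .getDefaultModel();
            // Fetch only the directly asserted broader concepts --
            // plain graph matching, no inference involved.
            String q =
                "PREFIX skos: <http://www.w3.org/2004/02/skos/core#> "
                + "SELECT ?broader WHERE { "
                + "  <http://example.org/concept/1234> skos:broader ?broader }";
            QueryExecution qe = QueryExecutionFactory.create(q, model);
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                System.out.println(results.next().getResource("broader"));
            }
            qe.close();
        }
    }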

When you absolutely have to use a reasoner, you can often precompute the
set of inferences you need, then store the computed output as triples in
a normal triplestore. E.g., if you wanted to query the sub-property and
sub-class closure of a SKOS RDF dataset, that is what I would do. These
inferences are straightforward to generate (a matter of minutes, probably),
and once you've stored them, you'll probably get sub-second SPARQL query
times, depending on the query.
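
As a rough sketch of that approach with Jena's built-in RDFS reasoner
(store locations hypothetical; note RDFS gives you the sub-class and
sub-property entailments, while the transitive closure of skos:broader
would need a transitive or OWL reasoner, which I haven't shown):

    import com.hp.hpl.jena.rdf.model.InfModel;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;
    import com.hp.hpl.jena.tdb.TDBFactory;

    public class MaterialiseInferences {
        public static void main(String[] args) {
            Model raw = TDBFactory.createDataset("/data/tdb-skos")
                    .getDefaultModel();
            // Compute the RDFS closure once, up front...
            InfModel inf = ModelFactory.createRDFSModel(raw);
            // ...then materialise asserted + inferred triples into a
            // second store, so no reasoner is needed at query time.
            Model closed = TDBFactory.createDataset("/data/tdb-skos-inf")
                    .getDefaultModel();
            closed.add(inf);
        }
    }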

Bottom line - there is *a lot* you can do with RDF and SPARQL without needing
to involve a reasoner at all. And there is even more you can do with a few
simple tractable inferences computed ahead of time, so you don't need to
involve a reasoner at runtime.

But, it all depends on what you want to do with your data :)
 
> Should a SKOS validation process (optionally) generate an XML
> serialization of SKOS definitions for faster processing?

The way I implemented the original SKOS validation service was to use
SPARQL. Each validation test was implemented as a SPARQL query against the
data. I stored the RDF data in a persistent triple store (then Jena RDB,
although now I would use Jena TDB), then executed the SPARQL queries.

A few of the queries required some inference prior to execution. In those
cases, I computed the set of inferences needed, then queried the inferred
model with SPARQL.

So I think you could probably do most of what you need by (1) storing
your RDF data in a triple store, (2) pre-computing sub-class, sub-property
closure, and transitive closure of skos:broader, then (3) executing a bunch
of SPARQL queries.
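
To illustrate, one such validation test might look like this (a sketch;
the store path is hypothetical, and the check is SKOS integrity condition
S14 -- no more than one skos:prefLabel per language):

    import com.hp.hpl.jena.query.QueryExecution;
    import com.hp.hpl.jena.query.QueryExecutionFactory;
    import com.hp.hpl.jena.query.ResultSet;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.tdb.TDBFactory;

    public class CheckPrefLabels {
        public static void main(String[] args) {
            Model model = TDBFactory.createDataset("/data/tdb-skos")
                    .getDefaultModel();
            // Find concepts with two distinct preferred labels
            // sharing the same language tag.
            String check =
                "PREFIX skos: <http://www.w3.org/2004/02/skos/core#> "
                + "SELECT DISTINCT ?concept WHERE { "
                + "  ?concept skos:prefLabel ?l1, ?l2 . "
                + "  FILTER (?l1 != ?l2 && lang(?l1) = lang(?l2)) }";
            QueryExecution qe = QueryExecutionFactory.create(check, model);
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                System.out.println("S14 violation: "
                        + results.next().getResource("concept"));
            }
            qe.close();
        }
    }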

> Please find here my proposal for the XML Schema (XSD) definition for
> SKOS serialization:
> http://www.askosi.org/ConceptScheme.xsd
> A readable version is produced using an XSLT from the XS3P project:
> http://www.askosi.org/example/ConceptScheme.xml
> 
> This XSLT was very hard to find, but the effort was well rewarded:
> http://sourceforge.net/projects/xs3p/
> If only we could reach the same quality when displaying a SKOS file!

I think an XSD for SKOS data is a very useful thing! So please don't take this
as any lack of enthusiasm, because I can see many situations where having
a SKOS XSD would be great (e.g., if I wanted to work with SKOS within an
XForms application, or use XSLT to generate reports). 

However, there will be other situations where working against the RDF triples
will be easier, more efficient, and scale better. It all depends on what
you want to do with the data! What did you have in mind?

Cheers

Alistair

-- 
Alistair Miles
Centre for Genomics and Global Health <http://cggh.org>
The Wellcome Trust Centre for Human Genetics
Roosevelt Drive
Oxford
OX3 7BN
United Kingdom
Web: http://purl.org/net/aliman
Email: alimanfoo@gmail.com
Tel: +44 (0)1865 287669
