- From: Andrew Newman <anewman@pisoftware.com>
- Date: Fri, 05 Sep 2003 08:04:26 +1000
- To: Dave Beckett <dave.beckett@bristol.ac.uk>
- Cc: www-rdf-interest@w3.org
Report on TKS (Tucana Knowledge Store)
======================================

Implementation
==============

TKS is a generic store for graph-like data structures and has been
optimised for use with RDF, although it could be applied to other
applications such as Topic Maps. It is written in Java 1.4 with
extensive use of NIO to provide a fast, reliable, transactional data
store. TKS has been available commercially for two years, and is in the
final stages of release under the Mozilla Public License, version 1.1.
Release under the MPL is anticipated in October 2003.

RDF Import
==========

A modified Jena RDF parser, version 1.4, is used to import RDF. A port
to Jena 2.0 is underway. It currently implements a draft datatyping
standard rather than the current recommendation. TKS supports URIs,
strings, numbers (floats), dates and date/times. Timezone support has
not yet been added. The addition of further datatype support is part of
the future development roadmap. We plan to implement at least a subset
of the XML Schema datatypes.

TKS has been used to import various large graphs. We regularly use
Wordnet, a modified dmoz, and our own production applications on a
daily basis for the storage, querying, back-up and maintenance of this
data.

The TKS datastore contains a node pool, a string pool and a graph. With
these data structures it is difficult to say what effect the amount of
RDF has on disk usage. Factors that can affect it include the number of
unique strings, the number of distinct predicates and the number of
blank nodes.

The following analysis is based on "world.rdf", which contains RDF/XML
generated from publicly available geographic data from the USGS. This
data is probably atypical of most RDF as it contains a high percentage
of unique literals and no blank nodes. "world.rdf" contains: 36332560
triples, 9431582 RDF nodes, 4227899 literals (SPStrings), 5203683
resources (SPURIs) and 0 blank nodes.
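The node-pool/string-pool split described above can be sketched roughly
as follows: every RDF node value is interned once and given a numeric
id, so the graph itself need only store fixed-width id triples. This is
our own toy illustration of the idea (class and method names are
invented, not the TKS API, and it is written with modern Java
collections for brevity even though TKS itself targets Java 1.4):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a node pool: each distinct node value (URI or
// literal string) is stored once and mapped to a numeric id, so the
// graph stores compact id triples instead of repeated strings.
public class NodePool {
    private final Map<String, Long> ids = new HashMap<>();
    private final List<String> values = new ArrayList<>();

    // Return the existing id for a node value, or allocate a new one.
    public long intern(String value) {
        Long id = ids.get(value);
        if (id != null) return id;
        long next = values.size();
        values.add(value);
        ids.put(value, next);
        return next;
    }

    // Map an id back to its value (used when materialising results).
    public String lookup(long id) {
        return values.get((int) id);
    }

    // Number of distinct node values stored.
    public int size() {
        return values.size();
    }
}
```

Under a scheme like this, disk usage depends on the number of unique
strings rather than the raw triple count, which is why the figures
above single out unique literals and resources.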
On the 32-bit version of TKS this created the following file sizes:

  59040410    graph AVL files
  4650567680  graph block files
  4709608090  graph total
  113178984   stringPool.sp_ns
  382465468   stringPool.sp_avl
  433066624   string block files
  5638319166  total
  554684862   total mapped on 32-bit platforms

This was loaded in ~240 minutes, which gives 2523 triples/second, on a
1GHz Pentium III with 512MB RAM running Sun's JDK 1.4.0. Restoring from
backup, using TKS's own data structures, results in load times
approximately 2-3 times faster than directly parsing RDF/XML. There is
quite a lot of room for optimisation when loading data.

The current version of TKS has roughly double the storage requirements
for the graph (as a result of going from 32 to 64 bits). Of the 105
files that comprise a TKS database, only the six triple block files and
the twenty string block files use explicit (seek/read/write) I/O while
TKS is running; all the rest use memory-mapped I/O. On 64-bit
platforms, all 105 files are memory mapped by default.

64-bit platforms
----------------

Windows, Solaris and Linux all support 64-bit offsets and files larger
than the old 2GB limit. Linux supports files up to nearly 2TB in size
on 32-bit architectures, although other problems currently limit
filesystem sizes to 1TB. All relevant fields of the current store's
in-memory and on-disk data structures are 64 bits wide, ensuring that
TKS can store very large amounts of data, up to the limits imposed by
the host operating system.

RDF Export
==========

We currently only provide query results as a sum of products rather
than as RDF/XML. We expect proper export of RDF/XML to happen
relatively soon.

Local APIs
==========

The main interfaces to the database are a query interpreter and a Java
API. The local API also provides some graph-like querying. We have a
"weighted related to" query which finds resources that are similar to
other resources based on the common arcs that they each have.
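The memory-mapped I/O mentioned above is what java.nio provides via
FileChannel.map. A minimal sketch (the file name and record layout are
illustrative only, not the TKS file format, and it uses
try-with-resources for brevity rather than Java 1.4 idiom):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Minimal sketch of memory-mapped file I/O with java.nio, the
// mechanism TKS uses for most of its on-disk structures.
public class MappedBlockFile {
    // Write a long through a memory mapping and read it back.
    public static long roundTrip(String path, long value) {
        try (RandomAccessFile raf = new RandomAccessFile(path, "rw")) {
            FileChannel channel = raf.getChannel();
            // The mapping goes through the OS page cache rather than
            // explicit seek/read/write calls; READ_WRITE mode extends
            // the file to the mapped length if necessary.
            MappedByteBuffer buf =
                channel.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            buf.putLong(0, value);
            return buf.getLong(0);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("blocks.dat", 36332560L));
    }
}
```

Mapping a file trades explicit I/O calls for page-cache access, which
is why only the largest block files above fall back to explicit I/O on
32-bit platforms, where mappable address space is scarce.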
Queries can be applied to filter the arcs, reducing the
"predicate-object" pairs. The "predicate-object" pairs can be one, two
or more levels deep. For example, if a date is a resource which has a
day, month and year, you can pick out resources with the same day and
month, or the same month and year, in common. Or you can find documents
which have the same day and month in common. The "similar to" query is
the reverse of "weighted related to": a given literal is used to find
other literals that are similar via other resources.

A Jena layer is a high priority in a future release of TKS.

Remote Interfaces
=================

Queries can be issued directly using the Java RMI layer or through a
number of interfaces including Perl, SOAP, COM, HTTP (via a servlet),
or JSP tags. The current development version of TKS provides a query
engine that streams results to the client in fixed-size pages. A
Web-based query interface is also provided.

Entailments
===========

We currently do not support an inferencing layer. We anticipate that
with the inclusion of Jena 2.0 we will get most of the functionality it
provides. We also anticipate that optimisations built on top of our
triple layer should offer improvements over the existing Jena
implementation. We intend to apply some of these entailment rules at
store time and others at query time.

Query Support
=============

iTQL is a Squish-like language, with an emphasis on grouping statements
using models. It supports aliasing, backups, transactions, creation,
deletion, and distributed queries. iTQL commands are optimised using
heuristics built up from our current use cases, based on application
requirements. We've implemented predicates in the query language which
provide support for datatypes, using proprietary tokens such as
"<tks:lt>" for numbers and "<tks:before>" for dates. We support
querying Lucene indexes as RDF models.
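The "weighted related to" idea described above can be reconstructed as
counting the predicate-object pairs two resources share. The following
is our own toy version of that counting, not the actual TKS algorithm
or API:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy reconstruction of a "weighted related to" query: the weight of a
// candidate resource is the number of predicate-object arcs it shares
// with the subject resource.  Names and representation are invented.
public class WeightedRelatedTo {
    // resource -> set of "predicate|object" arc keys
    private final Map<String, Set<String>> arcs = new HashMap<>();

    public void add(String subject, String predicate, String object) {
        arcs.computeIfAbsent(subject, k -> new HashSet<>())
            .add(predicate + "|" + object);
    }

    // Count the arcs a candidate shares with the given resource;
    // higher weights mean "more related".
    public int weight(String resource, String candidate) {
        Set<String> a = arcs.getOrDefault(resource, new HashSet<>());
        Set<String> b = arcs.getOrDefault(candidate, new HashSet<>());
        int shared = 0;
        for (String arc : a) {
            if (b.contains(arc)) shared++;
        }
        return shared;
    }
}
```

In the date example above, two documents sharing both day and month
arcs would score 2, while documents sharing only the day would score 1.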
This allows joins across RDF data and free-text documents using
Lucene's searching capabilities such as fuzzy matching, word proximity,
Boolean operations, etc. Other specialised external models could be
implemented, such as a mapping to a relational schema, an ISAM file or
other data structures.

TKS supports views of RDF models. This is similar to the functionality
provided by traditional SQL databases. Views are defined as Boolean
combinations of models.

Queries executed on the server are all saved to disk and are streamed
to the user as required. This means that querying over large datasets
(as large as the 64-bit data structures allow) is possible.

Queries in TKS are transactional. A single writing session, in addition
to multiple reading sessions, can access TKS concurrently without the
reading sessions being required to acquire a global lock while
processing a query. This completely avoids the possibility of lock
contention. In general, each session executes in its own thread. The
lack of lock contention means that the maximum number of active reading
sessions is limited only by the concurrency of the host operating
system and I/O subsystem.

When a session initiates a query, which may involve multiple requests
to the triplestore, it first takes a snapshot of the entire database
(an extremely fast operation which requires no I/O). This ensures that
all requests to the triplestore during the processing of the query see
the database in a consistent state. TKS allows modifications and
queries to proceed concurrently with a backup operation. The session
performing the backup acquires a snapshot of the entire database, just
as it would when performing a query.

Security is also applied at the model level. TKS uses Java's JAAS API
to provide authentication and authorization for models. Users can be
given the ability to read, write, create and delete triples on a given
model.
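The single-writer/many-readers snapshot scheme described above can be
modelled in miniature: a reader's "snapshot" is simply a reference to
the immutable current version, taken without any lock or I/O, while the
one writer publishes a new version atomically. This is a toy model of
the idea only, not the TKS implementation:

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.atomic.AtomicReference;

// Toy model of snapshot isolation: readers capture the version current
// at query start and see a consistent state for the whole query, while
// the single writer copies, modifies and republishes atomically.
public class SnapshotStore {
    private final AtomicReference<Set<String>> current =
        new AtomicReference<>(new TreeSet<>());

    // Taking a snapshot is just a pointer read: no lock, no I/O.
    // Later writes publish new versions and cannot change it.
    public Set<String> snapshot() {
        return current.get();
    }

    // The single writing session: copy-on-write plus an atomic swap,
    // so readers never observe a half-applied modification.
    public void addTriple(String triple) {
        Set<String> next = new TreeSet<>(current.get());
        next.add(triple);
        current.set(next);
    }
}
```

A backup session in this model is just another reader: it holds one
snapshot for the duration of the backup while writes continue against
newer versions.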
We do not secure individual triples, as this would provide too fine a
granularity for reasonable maintenance.

Deployment
==========

TKS has been used in a number of commercial and military deployments in
America. The current software has been developed over a 2-year period.
The longest continuously running instance has been up for over a year.
TKS has been tested using 250 simulated clients performing random
real-world recorded queries. The TKS codebase includes over 155 unit
tests, which are run automatically at build time.

Feedback on Recommendations
===========================

We don't have any feedback on the recommendations. We've been fairly
happy with the implementations of the standards so far. The work done
on the Jena framework has led to us being able to focus on other areas
that are probably not a focus of the working groups.

A standard query language would be welcome, although it may be a little
early, as developers are still in the exploration phase. The commercial
requirements placed on TKS have meant that iTQL is more expressive in
certain ways than other languages, but also that its syntax is quite
ugly. A standard query language should include not only Squish/SQL-like
queries but also graph-like queries. Also, we've found giving models
names (contexts) to be highly valuable, and this should be considered
for standardization.

We also hybridised our query language with XSLT in order to drive
queries which change based on the type of the object returned. An XSL
template is applied, on a per-context basis, to the results of a query.
We have also focused on building security applied to models, which has
also not been considered for standardization.
Received on Thursday, 4 September 2003 18:04:47 UTC