- From: Andrew Newman <anewman@pisoftware.com>
- Date: Fri, 05 Sep 2003 08:04:26 +1000
- To: Dave Beckett <dave.beckett@bristol.ac.uk>
- Cc: www-rdf-interest@w3.org
Report on TKS (Tucana Knowledge Store)
======================================

Implementation
==============

TKS is a generic store for graph-like data structures and has been
optimised for use with RDF, although it could be applied to other
applications such as Topic Maps. It is written in Java 1.4 with
extensive use of NIO to provide a fast, reliable, transactional data
store. TKS has been available commercially for two years, and is in the
final stages of release under the Mozilla Public License, version 1.1.
Release under the MPL is anticipated in October 2003.

RDF Import
==========

A modified Jena RDF parser, version 1.4, is used to import RDF. A port
to Jena 2.0 is underway. It currently implements a draft datatyping
standard rather than the current recommendation. TKS supports URIs,
strings, numbers (floats), dates and date/times. Timezone support has
not yet been added. The addition of further datatype support is part of
the future development roadmap. We plan to implement at least a subset
of the XML Schema datatypes.

TKS has been used to import various large graphs. We regularly use
Wordnet, a modified dmoz, and our own production applications on a
daily basis for the storage, querying, back-up and maintenance of this
data.

The TKS datastore contains a node pool, a string pool and a graph. With
these data structures it is difficult to say what effect the amount of
RDF has on disk usage. Factors that can affect it include the number of
unique strings, the number of distinct predicates and the number of
blank nodes.

The following analysis is based on "world.rdf", which contains RDF/XML
generated from publicly available geographic data from the USGS. This
data is probably atypical of most RDF as it contains a high percentage
of unique literals and no blank nodes. "world.rdf" contains: 36332560
triples, 9431582 RDF nodes, 4227899 literals (SPStrings), 5203683
resources (SPURIs) and 0 blank nodes.
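The node-pool/string-pool split described above can be sketched roughly
as follows: every RDF node value is interned once and given a numeric
id, so the graph itself need only store fixed-width id triples. This is
our own toy illustration of the idea (class and method names are
invented, not the TKS API, and it is written with modern Java
collections for brevity even though TKS itself targets Java 1.4):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a node pool: each distinct node value (URI or
// literal string) is stored once and mapped to a numeric id, so the
// graph stores compact id triples instead of repeated strings.
public class NodePool {
    private final Map<String, Long> ids = new HashMap<>();
    private final List<String> values = new ArrayList<>();

    // Return the existing id for a node value, or allocate a new one.
    public long intern(String value) {
        Long id = ids.get(value);
        if (id != null) return id;
        long next = values.size();
        values.add(value);
        ids.put(value, next);
        return next;
    }

    // Map an id back to its value (used when materialising results).
    public String lookup(long id) {
        return values.get((int) id);
    }

    // Number of distinct node values stored.
    public int size() {
        return values.size();
    }
}
```

Under a scheme like this, disk usage depends on the number of unique
strings rather than the raw triple count, which is why the figures
above single out unique literals and resources.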
On the 32-bit version of TKS this created the following file sizes:

  59040410    graph AVL files
  4650567680  graph block files
  4709608090  graph total
  113178984   stringPool.sp_ns
  382465468   stringPool.sp_avl
  433066624   string block files
  5638319166  total
  554684862   total mapped on 32-bit platforms

This was loaded in ~240 minutes, which gives 2523 triples/second, on a
1GHz Pentium III with 512MB RAM running Sun's JDK 1.4.0. Restoring from
backup, using TKS's own data structures, results in load times
approximately 2-3 times faster than directly parsing RDF/XML. There is
quite a lot of room for optimisation when loading data.

The current version of TKS has roughly double the storage requirements
for the graph (as a result of going from 32 to 64 bits). Of the 105
files that comprise a TKS database, only the six triple block files and
the twenty string block files use explicit (seek/read/write) I/O while
TKS is running; all the rest use memory-mapped I/O. On 64-bit
platforms, all 105 files are memory mapped by default.

64-bit platforms
----------------

Windows, Solaris and Linux all support 64-bit offsets and files larger
than the old 2GB limit. Linux supports files up to nearly 2TB in size
on 32-bit architectures, although other problems currently limit
filesystem sizes to 1TB. All relevant fields of the current store's
in-memory and on-disk data structures are 64 bits wide, ensuring that
TKS can store very large amounts of data, up to the limits imposed by
the host operating system.

RDF Export
==========

We currently only provide query results as a sum of products rather
than as RDF/XML. We expect proper export of RDF/XML to happen
relatively soon.

Local APIs
==========

The main interfaces to the database are a query interpreter and a Java
API. The local API also provides some graph-like querying. We have a
"weighted related to" query which finds resources that are similar to
other resources based on the common arcs that they each have.
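The memory-mapped I/O mentioned above is what java.nio provides via
FileChannel.map. A minimal sketch (the file name and record layout are
illustrative only, not the TKS file format, and it uses
try-with-resources for brevity rather than Java 1.4 idiom):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Minimal sketch of memory-mapped file I/O with java.nio, the
// mechanism TKS uses for most of its on-disk structures.
public class MappedBlockFile {
    // Write a long through a memory mapping and read it back.
    public static long roundTrip(String path, long value) {
        try (RandomAccessFile raf = new RandomAccessFile(path, "rw")) {
            FileChannel channel = raf.getChannel();
            // The mapping goes through the OS page cache rather than
            // explicit seek/read/write calls; READ_WRITE mode extends
            // the file to the mapped length if necessary.
            MappedByteBuffer buf =
                channel.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            buf.putLong(0, value);
            return buf.getLong(0);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("blocks.dat", 36332560L));
    }
}
```

Mapping a file trades explicit I/O calls for page-cache access, which
is why only the largest block files above fall back to explicit I/O on
32-bit platforms, where mappable address space is scarce.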
Queries can be applied to filter the arcs, reducing the
"predicate-object" pairs. The "predicate-object" pairs can be one, two
or more levels deep. For example, if a date is a resource which has a
day, month and year, you can pick out resources with the same day and
month, or the same month and year, in common. Or you can find documents
which have the same day and month in common. The "similar to" query is
the reverse of "weighted related to": a given literal is used to find
other literals that are similar via other resources.

A Jena layer is a high priority in a future release of TKS.

Remote Interfaces
=================

Queries can be issued directly using the Java RMI layer or through a
number of interfaces including Perl, SOAP, COM, HTTP (via a servlet),
or JSP tags. The current development version of TKS provides a query
engine that streams results to the client in fixed-size pages. A
Web-based query interface is also provided.

Entailments
===========

We currently do not support an inferencing layer. We anticipate that
with the inclusion of Jena 2.0 we will get most of the functionality it
provides. We also anticipate that optimisations built on top of our
triple layer should offer improvements over the existing Jena
implementation. We intend to apply some of these entailment rules at
store time and others at query time.

Query Support
=============

iTQL is a Squish-like language, with an emphasis on grouping statements
using models. It supports aliasing, backups, transactions, creation,
deletion, and distributed queries. iTQL commands are optimised using
heuristics built up from our current use cases, based on application
requirements. We've implemented predicates in the query language which
provide support for datatypes, using proprietary tokens such as
"<tks:lt>" for numbers and "<tks:before>" for dates. We support
querying Lucene indexes as RDF models.
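The "weighted related to" idea described above can be reconstructed as
counting the predicate-object pairs two resources share. The following
is our own toy version of that counting, not the actual TKS algorithm
or API:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy reconstruction of a "weighted related to" query: the weight of a
// candidate resource is the number of predicate-object arcs it shares
// with the subject resource.  Names and representation are invented.
public class WeightedRelatedTo {
    // resource -> set of "predicate|object" arc keys
    private final Map<String, Set<String>> arcs = new HashMap<>();

    public void add(String subject, String predicate, String object) {
        arcs.computeIfAbsent(subject, k -> new HashSet<>())
            .add(predicate + "|" + object);
    }

    // Count the arcs a candidate shares with the given resource;
    // higher weights mean "more related".
    public int weight(String resource, String candidate) {
        Set<String> a = arcs.getOrDefault(resource, new HashSet<>());
        Set<String> b = arcs.getOrDefault(candidate, new HashSet<>());
        int shared = 0;
        for (String arc : a) {
            if (b.contains(arc)) shared++;
        }
        return shared;
    }
}
```

In the date example above, two documents sharing both day and month
arcs would score 2, while documents sharing only the day would score 1.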
This allows joins across RDF data and free-text documents using
Lucene's searching capabilities such as fuzzy matching, word proximity,
Boolean operations, etc. Other specialised external models could be
implemented, such as a mapping to a relational schema, an ISAM file or
other data structures.

TKS supports views of RDF models. This is similar to the functionality
provided by traditional SQL databases. Views are defined as Boolean
combinations of models.

Queries executed on the server are all saved to disk and are streamed
to the user as required. This means that querying over large datasets
(as large as the 64-bit data structures allow) is possible.

Queries in TKS are transactional. A single writing session, in addition
to multiple reading sessions, can access TKS concurrently without the
reading sessions being required to acquire a global lock while
processing a query. This completely avoids the possibility of lock
contention. In general, each session executes in its own thread. The
lack of lock contention means that the maximum number of active reading
sessions is limited only by the concurrency of the host operating
system and I/O subsystem.

When a session initiates a query, which may involve multiple requests
to the triplestore, it first takes a snapshot of the entire database
(an extremely fast operation which requires no I/O). This ensures that
all requests to the triplestore during the processing of the query see
the database in a consistent state. TKS allows modifications and
queries to proceed concurrently with a backup operation. The session
performing the backup acquires a snapshot of the entire database, just
as it would when performing a query.

Security is also applied at the model level. TKS uses Java's JAAS API
to provide authentication and authorization for models. Users can be
given the ability to read, write, create and delete triples on a given
model.
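The single-writer/many-readers snapshot scheme described above can be
modelled in miniature: a reader's "snapshot" is simply a reference to
the immutable current version, taken without any lock or I/O, while the
one writer publishes a new version atomically. This is a toy model of
the idea only, not the TKS implementation:

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.atomic.AtomicReference;

// Toy model of snapshot isolation: readers capture the version current
// at query start and see a consistent state for the whole query, while
// the single writer copies, modifies and republishes atomically.
public class SnapshotStore {
    private final AtomicReference<Set<String>> current =
        new AtomicReference<>(new TreeSet<>());

    // Taking a snapshot is just a pointer read: no lock, no I/O.
    // Later writes publish new versions and cannot change it.
    public Set<String> snapshot() {
        return current.get();
    }

    // The single writing session: copy-on-write plus an atomic swap,
    // so readers never observe a half-applied modification.
    public void addTriple(String triple) {
        Set<String> next = new TreeSet<>(current.get());
        next.add(triple);
        current.set(next);
    }
}
```

A backup session in this model is just another reader: it holds one
snapshot for the duration of the backup while writes continue against
newer versions.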
We do not secure individual triples, as this would provide too fine a
granularity for reasonable maintenance.

Deployment
==========

TKS has been used in a number of commercial and military deployments in
America. The current software has been developed over a 2-year period.
The longest continuously running instance has been up for over a year.
TKS has been tested using 250 simulated clients performing random
real-world recorded queries. The TKS codebase includes over 155 unit
tests, which are run automatically at build time.

Feedback on Recommendations
===========================

We don't have any feedback on the recommendations. We've been fairly
happy with the implementations of the standards so far. The work done
on the Jena framework has led to us being able to focus on other areas
that are probably not a focus of the working groups.

A standard query language would be welcome, although it may be a little
early, as developers are still in the exploration phase. The commercial
requirements placed on TKS have meant that iTQL is more expressive in
certain ways than other languages, but also that its syntax is quite
ugly. A standard query language should include not only Squish/SQL-like
queries but also graph-like queries. Also, we've found giving models
names (contexts) to be highly valuable, and this should be considered
for standardization.

We also hybridised our query language with XSLT in order to drive
queries which change based on the type of the object returned. An XSL
template is applied, on a per-context basis, to the results of a query.
We have also focused on building security applied to models, which has
also not been considered for standardization.
Received on Thursday, 4 September 2003 18:04:47 UTC