- From: Stephen Reed <reed@cyc.com>
- Date: Thu, 10 Oct 2002 14:33:11 -0500 (CDT)
- To: Jeremy Carroll <jjc@hplb.hpl.hp.com>
- cc: Alexander Maedche <Maedche@fzi.de>, "John F. Sowa" <sowa@bestweb.net>, <www-rdf-logic@w3.org>, <www-rdf-interest@w3.org>, <seweb-list@cs.vu.nl>, <kaw@swi.psy.uva.nl>
Jeremy,

I used the Jena RDF parser (ARP) to parse the Open Directory Project RDF
structure file in DAML form. As the original RDF file is not
RDF-compliant, I first translated it into DAML with a Java
string-substitution program. The resulting document is a DAML file of
over 4 million triples and 400 thousand terms. For my OpenCyc importation
experiments I used a small subset of the Open Directory Project RDF
structure for "kids and teens".

http://dmoz.org/rdf.html

I am pleased with the Jena/ARP speed, as importing the triples into
OpenCyc takes by far the most time. I especially like the streaming
nature of Jena/ARP, as it allows processing of very large RDF/DAML
documents without having to build an in-memory model (which would
overflow available Java virtual machine memory). I simply add
forward-referenced objects to Cyc's knowledge base as named entities as
returned by ARP and await their later full definition in the input DAML
document. Jena/ARP's convenient identification of anonymous nodes
facilitated my handling of DAML restriction objects.

My Java source code is available from OpenCyc's CVS repository at
SourceForge.

http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/opencyc/org/opencyc/xml/

-Steve

On Thu, 10 Oct 2002, Jeremy Carroll wrote:

> Alexander Maedche wrote:
>
> > With respect to existing RDF parsers we were
> > confronted with serious performance problems.
> > Thus, we implemented a new one being compliant
> > to the W3C specification.
>
> As the developer of the Jena RDF parser (ARP) I read this paragraph with
> interest.
>
> I am aware that my work has some performance issues; however I have
> never had a user request to work on the performance. Our analysis has
> been that a typical RDF application spends a relatively small percentage
> of time in parsing. Thus we have put our development effort in an
> emphasis on correct behaviour, tracking the RDFCore WG recommendations
> and passing all the new RDF Core parser test cases.
>
> There are at least two major optimizations missing from the Jena parser:
> - in lax mode, switching off the extensive error checking rather than
>   merely switching off error messages
> - using the Xerces pull parsing interface to allow single threaded
>   operation (while retaining the architectural advantages of the coparsing
>   design of the Jena parser)
>
> I would welcome changes to the Jena code to include these improvements
> from anyone who is interested in faster, correct RDF parsing. I look
> forward to greater cooperation between the community of open source
> semantic web developers.
>
> At the moment the Jena team would welcome ideas about open source (BSD
> license compatible) reasoners that can cope with large subsets of DAML
> or OWL.

--
===========================================================
Stephen L. Reed                  phone:  512.342.4036
Cycorp, Suite 100                  fax:  512.342.4040
3721 Executive Center Drive      email:  reed@cyc.com
Austin, TX 78731                   web:  http://www.cyc.com
         download OpenCyc at http://www.opencyc.org
===========================================================
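
The forward-reference handling described in the message (register an object by name when it is first mentioned, and fill in its definition when the parser reaches it later) can be sketched roughly as below. This is only an illustration under stated assumptions: it uses neither the Jena/ARP API nor the OpenCyc API, and the class `StreamingImportSketch`, its `onTriple` callback, and the in-memory map standing in for the knowledge base are all hypothetical names invented for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; not the OpenCyc importer and not the Jena/ARP API. */
class StreamingImportSketch {

    /** A term in the toy knowledge base: a name plus the assertions made about it so far. */
    static final class Term {
        final String name;
        final List<String> assertions = new ArrayList<>();
        Term(String name) { this.name = name; }
    }

    // Stand-in for the knowledge base (an assumption for this sketch).
    private final Map<String, Term> knowledgeBase = new LinkedHashMap<>();
    private long triplesSeen = 0;

    /** Called once per triple as a streaming parser emits it; the triple itself is not retained. */
    void onTriple(String subject, String predicate, String object, boolean objectIsResource) {
        triplesSeen++;
        Term s = knowledgeBase.computeIfAbsent(subject, Term::new);
        if (objectIsResource) {
            // Forward reference: create a named placeholder now, define it when its triples arrive.
            knowledgeBase.computeIfAbsent(object, Term::new);
        }
        s.assertions.add(predicate + " " + object);
    }

    /** True if the term is known by name only, with no assertions about it yet. */
    boolean isStub(String name) {
        Term t = knowledgeBase.get(name);
        return t != null && t.assertions.isEmpty();
    }

    int termCount() { return knowledgeBase.size(); }

    long triplesSeen() { return triplesSeen; }

    public static void main(String[] args) {
        StreamingImportSketch importer = new StreamingImportSketch();
        // "Top" is mentioned before it is defined, so it starts out as a placeholder.
        importer.onTriple("Kids_and_Teens", "subClassOf", "Top", true);
        System.out.println("Top is stub: " + importer.isStub("Top"));
        // The later definition of "Top" completes the placeholder.
        importer.onTriple("Top", "label", "Top category", false);
        System.out.println("Top is stub: " + importer.isStub("Top"));
    }
}
```

In this sketch each triple is handled as it arrives and then discarded, so memory use grows with the number of distinct terms rather than with the number of triples; a real importer would write into the knowledge base instead of a map.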
Received on Thursday, 10 October 2002 15:33:30 UTC