Re: AW: (SeWeb) KAON - KArlsruhe ONtology and Semantic Web Infrastructure from Stephen Reed on 2002-10-10 (www-rdf-interest@w3.org from October 2002)

From: Stephen Reed <reed@cyc.com>
Date: Thu, 10 Oct 2002 14:33:11 -0500 (CDT)
To: Jeremy Carroll <jjc@hplb.hpl.hp.com>
cc: Alexander Maedche <Maedche@fzi.de>, "John F. Sowa" <sowa@bestweb.net>, <www-rdf-logic@w3.org>, <www-rdf-interest@w3.org>, <seweb-list@cs.vu.nl>, <kaw@swi.psy.uva.nl>
Message-ID: <Pine.LNX.4.33.0210101211040.2786-100000@crapgame.cyc.com>

Jeremy,

I used the Jena RDF parser (ARP) to parse the Open Directory Project
RDF structure file in DAML form.  As the original RDF file is not
RDF-compliant, I first translated it into DAML with a Java
string-substitution program.  The resulting document is a DAML file of
over 4 million triples and 400 thousand terms.  For my OpenCyc
importation experiments I used a small subset of the Open Directory
Project RDF structure for "kids and teens".

http://dmoz.org/rdf.html
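
A minimal sketch of a line-by-line string-substitution pass of the kind
mentioned above.  It is not the actual translation program: the class
name, the method names and the single replacement pair are hypothetical
and only illustrate the approach.  Reading and writing one line at a
time keeps memory use flat however large the input file is.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a streaming string-substitution pass. */
public class SubstitutionSketch {

    /** Applies every replacement pair to one line, in insertion order. */
    static String substitute(String line, Map<String, String> replacements) {
        String result = line;
        for (Map.Entry<String, String> pair : replacements.entrySet()) {
            result = result.replace(pair.getKey(), pair.getValue());
        }
        return result;
    }

    /** Copies the input to the output one line at a time, rewriting each line. */
    static void rewrite(Reader in, Writer out, Map<String, String> replacements)
            throws IOException {
        BufferedReader reader = new BufferedReader(in);
        BufferedWriter writer = new BufferedWriter(out);
        String line;
        while ((line = reader.readLine()) != null) {
            writer.write(substitute(line, replacements));
            writer.newLine();
        }
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: this pair is a placeholder, not the real rule set.
        Map<String, String> replacements = new LinkedHashMap<>();
        replacements.put("r:id=", "rdf:ID=");

        StringWriter out = new StringWriter();
        rewrite(new StringReader("<Topic r:id=\"Top\">\n"), out, replacements);
        System.out.print(out);
    }
}
```

For a real file the StringReader and StringWriter would be replaced by
file readers and writers; the rewrite method itself does not change.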

I am pleased with the Jena/ARP speed: importing the triples into OpenCyc
takes by far the largest share of the time.  I especially like the streaming
nature of Jena/ARP, as it allows processing of very large RDF/DAML documents
without having to build an in-memory model (which would overflow the
available Java virtual machine memory).  I simply add forward-referenced
objects to Cyc's knowledge base as named entities as returned by ARP and
await their later full definition in the input DAML document.  Jena/ARP's
convenient identification of anonymous nodes facilitated my handling of DAML
restriction objects.
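
A minimal sketch of the forward-reference handling described above,
assuming a parser that delivers one triple at a time to a callback.  It
uses neither the Jena/ARP API nor the OpenCyc API: TermStore, onTriple
and the ex: names are hypothetical stand-ins.  A term counts as
"pending" when it has been seen only as the object of a statement, and
as defined once it has appeared as a subject.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: register forward references while streaming triples. */
public class ForwardReferenceSketch {

    /** Stand-in for a knowledge base that tracks which terms are defined. */
    static final class TermStore {
        // term name -> true once the term has appeared as a subject
        private final Map<String, Boolean> defined = new LinkedHashMap<>();

        /** Creates the term as a bare named entity if it is not yet known. */
        void ensureExists(String term) {
            defined.putIfAbsent(term, Boolean.FALSE);
        }

        /** Records that statements about the term itself have arrived. */
        void markDefined(String term) {
            defined.put(term, Boolean.TRUE);
        }

        /** Terms referenced so far but not yet defined. */
        Set<String> pending() {
            Set<String> result = new LinkedHashSet<>();
            for (Map.Entry<String, Boolean> entry : defined.entrySet()) {
                if (!entry.getValue()) {
                    result.add(entry.getKey());
                }
            }
            return result;
        }

        int size() {
            return defined.size();
        }
    }

    /** Called once per triple; no model of the whole document is kept. */
    static void onTriple(TermStore store, String subject, String predicate,
                         String object, boolean objectIsResource) {
        store.markDefined(subject);
        if (objectIsResource) {
            // The object may not have been defined yet: add it as a stub.
            store.ensureExists(object);
        }
    }

    public static void main(String[] args) {
        TermStore store = new TermStore();
        onTriple(store, "ex:Kids", "ex:narrow", "ex:Games", true);
        System.out.println("pending: " + store.pending());
        onTriple(store, "ex:Games", "ex:title", "Games", false);
        System.out.println("pending: " + store.pending());
    }
}
```

Only the set of term names is held in memory, not the triples
themselves, which is what keeps this approach workable for documents
with millions of triples.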

My Java source code is available from OpenCyc's CVS repository at
SourceForge.

http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/opencyc/org/opencyc/xml/

-Steve

On Thu, 10 Oct 2002, Jeremy Carroll wrote:
> Alexander Maedche wrote:
> > With respect to existing RDF parsers we were
> > confronted with serious performance problems.
> > Thus, we implemented a new one being compliant
> > to the W3C specification.
>
> As the developer of the Jena RDF parser (ARP) I read this paragraph with
> interest.
>
> I am aware that my work has some performance issues; however I have
> never had a user request to work on the performance. Our analysis has
> been that a typical RDF application spends a relatively small percentage
> of time in parsing. Thus we have focused our development effort on
> correct behaviour, tracking the RDFCore WG recommendations
> and passing all the new RDF Core parser test cases.
>
> There are at least two major optimizations missing from the Jena parser:
> - in lax mode, switching off the extensive error checking rather than
> merely switching off error messages
> - using the Xerces pull parsing interface to allow single threaded
> operation (while retaining the architectural advantages of the coparsing
> design of the Jena parser)
>
> I would welcome changes to the Jena code to include these improvements
> from anyone who is interested in faster, correct RDF parsing. I look
> forward to greater cooperation between the community of open source
> semantic web developers.
>
> At the moment the Jena team would welcome ideas about open source (BSD
> license compatible) reasoners that can cope with large subsets of DAML
> or OWL.

-- 
===========================================================
Stephen L. Reed                  phone:  512.342.4036
Cycorp, Suite 100                  fax:  512.342.4040
3721 Executive Center Drive      email:  reed@cyc.com
Austin, TX 78731                   web:  http://www.cyc.com
         download OpenCyc at http://www.opencyc.org
===========================================================

Received on Thursday, 10 October 2002 15:33:30 UTC