
Re: N-Triples Parser for Python

From: Chris Purcell <cjp39@cam.ac.uk>
Date: Thu, 21 Oct 2004 12:15:15 +0100
Cc: www-rdf-interest@w3.org
To: "Phil Dawes" <pdawes@users.sourceforge.net>
Message-Id: <7B81780C-2352-11D9-8F0F-000A957B97EE@cam.ac.uk>

I prune out duplicates in-database (see the sketch after this list), for the following reasons:

* Table indices allow constant-time cost per new triple (I think)
* I don't need memory proportional to the existing database size
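
Concretely, I mean a unique index plus MySQL's INSERT IGNORE, so the database itself does the existence check. A minimal sketch with MySQLdb follows; the triples table, its column layout, and the connection details are my own invention, not a fixed schema:

    # Sketch only: duplicate pruning delegated to the database.
    # Assumed (hypothetical) schema:
    #   CREATE TABLE triples (
    #       subject   VARCHAR(255) NOT NULL,
    #       predicate VARCHAR(255) NOT NULL,
    #       object    VARCHAR(255) NOT NULL,
    #       UNIQUE KEY spo (subject, predicate, object)
    #   );
    import MySQLdb

    conn = MySQLdb.connect(db="rdfstore", user="rdf", passwd="secret")
    cur = conn.cursor()

    def add_triple(s, p, o):
        # INSERT IGNORE turns a duplicate row into a no-op: the
        # unique index does the existence check, so no Python-side
        # copy of the database is needed, whatever its size.
        cur.execute(
            "INSERT IGNORE INTO triples (subject, predicate, object) "
            "VALUES (%s, %s, %s)", (s, p, o))

    add_triple("<http://example.org/a>",
               "<http://example.org/knows>",
               "<http://example.org/b>")
    conn.commit()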

As you say, there needs to be locking!

How long does a batch import take with your scheme? (Python may well 
outperform MySQL!) Does it identify meta-statements?

Cheers,
Chris

> Depends on what is being input - if it's an insert/update of a small
> set of assertions, it just uses SQL INSERTs. If it's a large job
> (e.g. a batch import of millions of statements), it writes them to a
> file and then uses 'LOAD DATA LOCAL INFILE' to bulk import them.
>
> I had a quick look at your weblog post - I assumed from it that you
> are bulk importing as well. I attempt to solve the duplicate-id
> problem by pre-loading the existing ids into memory, along with hashes
> of their values. I can then check each literal/URI value asserted
> against the hash to see if it already exists in the database. N.B. you
> need to lock the table to do this, otherwise you can easily get
> consistency problems.
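
For concreteness, here's roughly how I read your bulk path - a sketch only, with the file handling, the triples table, and the local_infile setting all guessed by me, not taken from your code:

    import os
    import tempfile
    import MySQLdb

    def bulk_import(triples, conn):
        # Write the whole batch to a tab-separated temporary file...
        fd, path = tempfile.mkstemp(suffix=".tsv")
        try:
            f = os.fdopen(fd, "w")
            try:
                for s, p, o in triples:
                    f.write("%s\t%s\t%s\n" % (s, p, o))
            finally:
                f.close()
            # ...then hand the file to MySQL in a single statement,
            # which beats millions of individual INSERTs. The
            # connection must allow LOCAL INFILE (e.g. pass
            # local_infile=1 to MySQLdb.connect).
            cur = conn.cursor()
            cur.execute(
                "LOAD DATA LOCAL INFILE %s INTO TABLE triples "
                "(subject, predicate, object)", (path,))
            conn.commit()
        finally:
            os.remove(path)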
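And the in-memory duplicate check I take you to be describing would be something like the following - again a sketch, where the resources table, the choice of md5, and the use of lastrowid are my assumptions:

    import hashlib
    import MySQLdb

    def preload(cur):
        # Pull every existing (hash, id) pair into a dict up front,
        # so each incoming value costs one in-memory lookup. Note the
        # memory use proportional to the existing database size.
        cur.execute("SELECT value_hash, id FROM resources")
        return dict(cur.fetchall())

    def resource_id(cur, known, value):
        h = hashlib.md5(value.encode("utf-8")).hexdigest()
        if h in known:
            return known[h]
        # First sighting: insert and remember the new id. The table
        # should stay locked (LOCK TABLES resources WRITE) for the
        # whole batch, or a concurrent writer can insert the same
        # value between our check and our INSERT.
        cur.execute("INSERT INTO resources (value, value_hash) "
                    "VALUES (%s, %s)", (value, h))
        known[h] = cur.lastrowid
        return known[h]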
