Re: N-Triples Parser for Python

I prune out duplicates in-database, for the following reasons (a sketch follows the list):

* Table indices keep the cost per new triple roughly constant (a B-tree lookup is logarithmic in table size, strictly speaking, but effectively flat in practice)
* I don't need memory proportional to the existing database size
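
A minimal sketch of what I mean, assuming MySQLdb; the table name,
column sizes, and sample triple are made up for the example:

    import MySQLdb

    conn = MySQLdb.connect(db="rdfstore")  # connection details assumed
    cur = conn.cursor()

    # The unique key lets the database reject duplicates itself: each
    # insert costs one index lookup, not a table scan, and the Python
    # process holds nothing in memory.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS triples (
            subject   VARCHAR(200) NOT NULL,
            predicate VARCHAR(200) NOT NULL,
            object    VARCHAR(200) NOT NULL,
            UNIQUE KEY spo (subject, predicate, object)
        )
    """)

    def add_triple(s, p, o):
        # INSERT IGNORE silently drops rows that would violate the
        # unique key, so duplicates are pruned in-database.
        cur.execute(
            "INSERT IGNORE INTO triples (subject, predicate, object)"
            " VALUES (%s, %s, %s)", (s, p, o))

    add_triple("_:a", "<http://example.org/p>", '"hello"')
    conn.commit()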

As you say, there needs to be locking!

How long does a batch import take with your scheme? (Python may well 
outperform MySQL!) Does it identify meta-statements?

Cheers,
Chris

> Depends on what is being input - if it's an insert/update of a small
> set of assertions, it just uses SQL INSERTs. If it's a large job
> (e.g. a batch import of millions of statements) it writes them to a
> file and then uses 'LOAD DATA LOCAL INFILE' to bulk import them.
>
> I had a quick look at your weblog post - I assumed from it that you
> are bulk importing as well. I attempt to solve the duplicate-id
> problem by pre-loading the existing ids into memory, along with
> hashes of their values. I can then check each asserted literal/URI
> value against those hashes to see whether it already exists in the
> database. N.B. you need to lock the table to do this, otherwise you
> can easily get consistency problems.
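
P.S. For the archive, here is my reading of the scheme above as a
sketch; the table layout, file path, hash choice, and sample data are
guesses on my part, not the actual code:

    import hashlib
    import MySQLdb

    # local_infile must be enabled for LOAD DATA LOCAL INFILE to work.
    conn = MySQLdb.connect(db="rdfstore", local_infile=1)
    cur = conn.cursor()

    new_values = ["<http://example.org/a>", '"a literal"']  # from the parser

    # Lock the table: between snapshotting the existing ids and bulk
    # loading, a concurrent insert would corrupt the id allocation.
    cur.execute("LOCK TABLES nodes WRITE")
    try:
        # Pre-load existing ids, keyed by a hash of their values.
        cur.execute("SELECT value_hash, id FROM nodes")
        known = dict(cur.fetchall())
        next_id = max(known.values() or [0]) + 1

        # Write only genuinely new values to a file...
        out = open("/tmp/nodes.tsv", "w")
        for value in new_values:
            h = hashlib.md5(value.encode("utf-8")).hexdigest()
            if h in known:
                continue            # duplicate of an existing row
            known[h] = next_id
            # (real code would escape tabs/newlines in value)
            out.write("%d\t%s\t%s\n" % (next_id, h, value))
            next_id += 1
        out.close()

        # ...then bulk import, which beats per-row INSERTs at this scale.
        cur.execute(
            "LOAD DATA LOCAL INFILE '/tmp/nodes.tsv' INTO TABLE nodes"
            " (id, value_hash, value)")
    finally:
        cur.execute("UNLOCK TABLES")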

Received on Thursday, 21 October 2004 11:15:18 UTC