- From: Chris Purcell <cjp39@cam.ac.uk>
- Date: Thu, 21 Oct 2004 12:15:15 +0100
- To: "Phil Dawes" <pdawes@users.sourceforge.net>
- Cc: www-rdf-interest@w3.org
I prune out duplicates in-database, for the following reasons:

* Table indices allow constant-time cost per new triple (I think)
* I don't need memory proportional to the existing database size

As you say, there needs to be locking!

How long does a batch import take with your scheme? (Python may well
out-perform MySQL!) Does it identify meta-statements?

Cheers,
Chris

> Depends on what is being input - if it's an insert/update of a small
> set of assertions, it just uses sql inserts. If it's a large job
> (e.g. a batch import of 1000000's of statements) it writes them to a
> file and then uses 'LOAD DATA LOCAL INFILE' to bulk import them.
>
> I had a quick look at your weblog post - I assumed from that that you
> are bulk importing as well. I attempt to solve the duplicate id
> problem by pre-loading the existing ids into memory, along with hashes
> of their values. I can then check each literal/uri value asserted
> against the hash to see if it exists in the database. N.B. you need to
> lock the table to do this, otherwise you can easily get consistency
> problems.
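To make the in-database pruning concrete: a composite UNIQUE index lets
MySQL reject duplicates itself, at the cost of one index probe per
insert. A minimal sketch (the schema isn't shown in this thread, so the
table and column names below are only illustrative):

    import MySQLdb

    conn = MySQLdb.connect(db="rdfstore")   # connection details assumed
    cur = conn.cursor()

    # The composite UNIQUE key is what makes per-triple pruning cheap:
    # each insert costs one index probe, independent of table size.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS triples (
            subject   VARCHAR(255) NOT NULL,
            predicate VARCHAR(255) NOT NULL,
            object    VARCHAR(255) NOT NULL,
            UNIQUE KEY spo (subject, predicate, object)
        )
    """)

    # INSERT IGNORE silently drops any row that would violate the
    # UNIQUE key, so duplicates never reach the table.
    cur.execute(
        "INSERT IGNORE INTO triples (subject, predicate, object)"
        " VALUES (%s, %s, %s)",
        ("http://example.org/s", "http://example.org/p", "o"))
    conn.commit()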
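For comparison, the preload-and-check scheme described in the quoted
message might look roughly like the following. The resources table, its
columns, and the hashing details are guesses rather than the actual
code; the essential steps are the lock around the whole pass, the
in-memory hash check, and the single LOAD DATA LOCAL INFILE at the end:

    import md5
    import MySQLdb

    conn = MySQLdb.connect(db="rdfstore")   # connection details assumed
    cur = conn.cursor()

    incoming = ["http://example.org/a", "some literal"]  # stand-in input

    # Lock out other writers so the preloaded map can't go stale
    # between the SELECT and the bulk load.
    cur.execute("LOCK TABLES resources WRITE")
    try:
        # Preload existing ids, keyed by a hash of their lexical value.
        cur.execute("SELECT id, MD5(value) FROM resources")
        known = dict([(h, i) for (i, h) in cur.fetchall()])

        next_id = max(known.values() or [0]) + 1
        out = open("new_resources.tsv", "w")
        for value in incoming:
            h = md5.new(value).hexdigest()
            if h not in known:           # only novel values are written
                known[h] = next_id
                out.write("%d\t%s\n" % (next_id, value))
                next_id += 1
        out.close()

        # One bulk pass loads everything that survived the check.
        cur.execute(
            "LOAD DATA LOCAL INFILE 'new_resources.tsv'"
            " INTO TABLE resources (id, value)")
    finally:
        cur.execute("UNLOCK TABLES")

The in-memory map is what the in-database approach avoids paying for,
at the price of an index probe per insert.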
Received on Thursday, 21 October 2004 11:15:18 UTC