I prune out duplicates in-database, for the following reasons:

* Table indices allow constant-time cost per new triple (I think)
* I don't need memory proportional to the existing database size

As you say, there needs to be locking!

How long does a batch import take with your scheme? (Python may well out-perform MySQL!) Does it identify meta-statements?

Cheers,
Chris

> Depends on what is being input - if it's an insert/update of a small
> set of assertions, it just uses sql inserts. If it's a large job
> (e.g. a batch import of 1000000's of statements) it writes them to a
> file and then uses 'LOAD DATA LOCAL INFILE' to bulk import them.
>
> I had a quick look at your weblog post - I assumed from that that you
> are bulk importing as well. I attempt to solve the duplicate id
> problem by pre-loading the existing ids into memory, along with hashes
> of their values. I can then check each literal/uri value asserted
> against the hash to see if it exists in the database. N.B. you need to
> lock the table to do this, otherwise you can easily get consistency
> problems.

Received on Thursday, 21 October 2004 11:15:18 GMT
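For readers comparing the two de-duplication schemes discussed in this message, here is a rough sketch of both. It is not either poster's actual code: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table name `resources`, the column `value`, and all function names are hypothetical. The first function relies on a unique index inside the database; the second preloads hashes of the existing values into memory and filters before inserting.

```python
import hashlib
import sqlite3


def make_db():
    # In-memory SQLite stands in for MySQL; the schema is an assumption.
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE resources ("
        "id INTEGER PRIMARY KEY, value TEXT NOT NULL UNIQUE)"
    )
    return db


def insert_in_database(db, values):
    # Scheme 1: the UNIQUE index rejects duplicates, so no copy of the
    # existing values is held in memory.
    with db:
        db.executemany(
            "INSERT OR IGNORE INTO resources (value) VALUES (?)",
            [(v,) for v in values],
        )


def _digest(value):
    return hashlib.sha1(value.encode("utf-8")).digest()


def insert_with_preloaded_hashes(db, values):
    # Scheme 2: preload hashes of existing values, then filter in Python.
    # With several writers the table would need to be locked for the whole
    # read-then-insert step; a single connection is assumed here.
    with db:
        seen = {_digest(v) for (v,) in db.execute("SELECT value FROM resources")}
        fresh = []
        for v in values:
            h = _digest(v)
            if h not in seen:
                seen.add(h)
                fresh.append((v,))
        db.executemany("INSERT INTO resources (value) VALUES (?)", fresh)


def count(db):
    return db.execute("SELECT COUNT(*) FROM resources").fetchone()[0]
```

The trade-off matches the one raised in the thread: the first scheme pays an index lookup per row but needs no extra memory, while the second needs memory proportional to the number of existing values and has to guard against concurrent writers.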