- From: Alisdair Owens <alisdair.owens@googlemail.com>
- Date: Thu, 1 Oct 2009 12:25:36 +0100
- To: Andreas Langegger <al@jku.at>
- Cc: semantic-web@w3.org
- Message-ID: <70db3d940910010425n5df3775am5e852ca447b8405d@mail.gmail.com>
Hi AndyL,

At the moment this is just for static dataset analysis: I wanted to be able to find out some basic stats about datasets that run into the billions (or trillions, if such a dataset exists) of triples, and realised that extracting that information from an RDF store would be a really slow task. I doubt this would be especially useful as a stats generation tool for a standalone RDF store, although it might be. I don't know how long comprehensive stats generation for query optimisation typically takes in an RDF store: if it's longer than a couple of hours per billion triples, then this might be helpful :-). It does have the potential to be made substantially quicker: I haven't multithreaded it yet, as I haven't really had the need, but it would parallelise easily.

As for open-sourcing the code, I'm more than happy to share it with anyone - just email me and I'll send it off to you. It would be nice to know what people would use it for (just out of curiosity :-) ). In terms of making an actual release, I'd like to tidy up the code a bit more and have more of a 'finished product' before I do that - hence my sending it off to this list for suggestions!

-Alisdair

On Tue, Sep 29, 2009 at 7:39 PM, Andreas Langegger <al@jku.at> wrote:
> Dear Alisdair,
> that's great! Do you use these stats for clustered TDB query optimization?
> Will you open-source the code? It would be nice to experiment with that and
> reuse your stats for estimating expected triple pattern cardinalities.
>
> Regards,
> AndyL
>
>
> On Sep 28, 2009, at 6:33 PM, Alisdair Owens wrote:
>
> [apologies if this ends up as a double post. I sent out this message a
> couple of days ago but it doesn't seem to have shown up in the mailing list,
> so resending]
>
> Hi there,
>
> During the course of my PhD work I've been working on a tool to produce
> stats about RDF files that I thought you guys might find interesting/useful.
> You can see some example datasets at: http://www.zaltys.net/examineRDF/ .
> It's mostly designed for RDF store creators/maintainers, to validate (or
> challenge :-) ) their assumptions about the structure and characteristics of
> common RDF datasets, and identify unusual edge cases that may result in
> abnormal behaviour. Hopefully it will also be useful for identifying flaws
> in the realism of automatic data generators, and allow people to better tune
> adaptive data structures in their stores.
>
> I'm aware that the clarity and explanation of the graphs could be rather
> better, but I find myself struggling to find the right words at the moment.
> If you have any suggestions for improving this (or the output as a whole)
> I'd really appreciate it!
>
> Thanks,
> -Alisdair
>
>
> http://www.langegger.at
> ----------------------------------------------------------------------
> Dipl.-Ing.(FH) Andreas Langegger
> FAW - Institute for Application-oriented Knowledge Processing
> Johannes Kepler University Linz
> A-4040 Linz, Altenberger Straße 69
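
[Editor's note: as a minimal, illustrative sketch only (not Alisdair's examineRDF tool, whose code was not released on this thread), the following shows the sort of single-pass predicate-frequency count over an N-Triples file that a store could use to estimate the cardinality of a triple pattern with a bound predicate. The file name "dataset.nt" is a hypothetical placeholder.]

    # Single pass over an N-Triples file, counting how often each predicate occurs.
    # A pattern like (?s <p> ?o) then has estimated cardinality counts[<p>].
    from collections import Counter

    def predicate_counts(path):
        counts = Counter()
        total = 0
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue  # skip blank lines and comments
                parts = line.split(None, 2)  # subject, predicate, rest of triple
                if len(parts) < 3:
                    continue  # malformed line; a real tool would report it
                counts[parts[1]] += 1
                total += 1
        return counts, total

    if __name__ == "__main__":
        counts, total = predicate_counts("dataset.nt")  # hypothetical input file
        for pred, n in counts.most_common(10):
            print(f"{pred}\t{n}\t({n / total:.2%} of {total} triples)")
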
Received on Thursday, 1 October 2009 11:26:10 UTC