Re: RDF file statistics from Alisdair Owens on 2009-10-01 (semantic-web@w3.org from October 2009)

From: Alisdair Owens <alisdair.owens@googlemail.com>
Date: Thu, 1 Oct 2009 12:25:36 +0100
To: Andreas Langegger <al@jku.at>
Cc: semantic-web@w3.org
Message-ID: <70db3d940910010425n5df3775am5e852ca447b8405d@mail.gmail.com>

Hi AndyL,
At the moment this is just for static dataset analysis: I wanted to be able
to find out some basic stats about datasets that run into the billions (or
trillions, if such a dataset exists) of triples, and realised that
extracting that information from an RDF store would be a really slow task.
I doubt this would be especially useful as a stats generation tool for a
standalone RDF store, although it might be.  I don't know how long
comprehensive stats generation for query optimisation typically takes in an
RDF store: if it's longer than a couple of hours per billion triples, then
this might be helpful :-).  It does have the potential to be made
substantially quicker: I haven't multithreaded it yet, as I haven't really
had the need, but it would parallelise easily.

As for open-sourcing the code, I'm more than happy to share with anyone -
just email me and I'll send it off to you.  It would be nice to know what
people would use it for (just out of curiosity :-) ).  In terms of making an
actual release, I'd like to tidy up the code somewhat, and have more of a
'finished product' before I do that - hence my sending it off to this list
for suggestions!

-Alisdair

On Tue, Sep 29, 2009 at 7:39 PM, Andreas Langegger <al@jku.at> wrote:

> Dear Alisdair,
> that's great! Do you use these stats for clustered TDB query optimization?
> Will you open-source the code? It would be nice to experiment with that and
> reuse your stats for estimating expected triple pattern cardinalities.
>
> Regards,
> AndyL
>
>
> On Sep 28, 2009, at 6:33 PM, Alisdair Owens wrote:
>
> [apologies if this ends up as a double post.  I sent out this message a
> couple of days ago but it doesn't seem to have shown up in the mailing list,
> so resending]
>
> Hi there,
>
> During the course of my PhD work I've been working on a tool to produce
> stats about RDF files that I thought you guys might find interesting/useful.
> You can see some example datasets at: http://www.zaltys.net/examineRDF/ .
> It's mostly designed for RDF store creators/maintainers, to validate (or
> challenge :-) ) their assumptions about the structure and characteristics of
> common RDF datasets, and identify unusual edge cases that may result in
> abnormal behaviour.  Hopefully it will also be useful for identifying flaws
> in the realism of automatic data generators, and allow people to better tune
> adaptive data structures in their stores.
>
> I'm aware that the clarity and explanation of the graphs could be rather
> better, but I find myself struggling to find the right words at the moment.
> If you have any suggestions for improving this (or the output as a whole)
> I'd really appreciate it!
>
> Thanks,
> -Alisdair
>
>
>
> http://www.langegger.at
> ----------------------------------------------------------------------
> Dipl.-Ing.(FH) Andreas Langegger
> FAW - Institute for Application-oriented Knowledge Processing
> Johannes Kepler University Linz
> A-4040 Linz, Altenberger Straße 69
>

Received on Thursday, 1 October 2009 11:26:10 UTC