Re: RDF file statistics from Andreas Langegger on 2009-10-06 (semantic-web@w3.org from October 2009)

From: Andreas Langegger <al@jku.at>
Date: Tue, 6 Oct 2009 12:32:50 +0200
To: Alisdair Owens <alisdair.owens@googlemail.com>
Cc: semantic-web@w3.org
Message-Id: <FAD5CCE6-6248-4D45-9443-46E773A198FF@jku.at>
Dear Alisdair,

On Oct 1, 2009, at 1:25 PM, Alisdair Owens wrote:
> At the moment this is just for static dataset analysis: I wanted to  
> be able to find out some basic stats about datasets that run into  
> the billions (or trillions, if such a dataset exists) of triples,  
> and realised that extracting that information from an RDF store  
> would be a really slow task.  I doubt this would be especially  
> useful as a stats generation tool for a standalone RDF store,  
> although it might be.  I don't know how long comprehensive stats  
> generation for query optimisation typically takes in an RDF

well, it's a general problem to get up-to-date stats from stores and  
implementations if you need them externally, e.g. for query federation.
In an RDBMS not all attributes are indexed, so it needs stats about  
selectivities/distributions of all the attributes for optimizing  
queries. But in an RDF store you have the index over all triples/quads,  
and most RDF stores don't use stats because they can also get  
cardinality estimates for triple patterns from the index, e.g. they just  
scan and count the entries in the B-tree. I don't know what they do in  
the case of hash indexes, though.
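
To illustrate the "scan and count" idea, here is a minimal sketch. It assumes a toy SPO index held as a sorted Python list in place of a real B-tree; the name estimate_cardinality and the sample triples are made up for illustration and are not taken from any particular store.

```python
from bisect import bisect_left

# Hypothetical toy SPO index: a sorted list of (subject, predicate, object)
# tuples standing in for the B-tree a store would maintain.
SPO_INDEX = sorted([
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:knows", "ex:carol"),
    ("ex:alice", "foaf:name", '"Alice"'),
    ("ex:bob", "foaf:knows", "ex:carol"),
    ("ex:bob", "foaf:name", '"Bob"'),
])


def estimate_cardinality(index, prefix):
    """Count index entries whose leading components equal `prefix`.

    `prefix` holds the bound leading components of a triple pattern, e.g.
    ("ex:alice",) for (ex:alice ?p ?o). We seek to the first candidate
    entry and then scan forward, counting until the prefix stops matching.
    """
    n = len(prefix)
    i = bisect_left(index, prefix)  # seek: first entry >= prefix
    count = 0
    while i < len(index) and index[i][:n] == prefix:  # scan and count
        count += 1
        i += 1
    return count


if __name__ == "__main__":
    print(estimate_cardinality(SPO_INDEX, ("ex:alice",)))
    print(estimate_cardinality(SPO_INDEX, ("ex:alice", "foaf:knows")))
```

A pattern whose bound components are not a prefix of SPO (say, only the object is bound) would need a differently ordered index such as OSP, which is presumably why stores keep several orderings.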

If you calculate stats from outside you can only take snapshots, and if  
the data is highly dynamic the stats might not be accurate at all times.

> As for open-sourcing the code, I'm more than happy to share with  
> anyone - just email me and I'll send it off to you.  It would be  
> nice to know what people would use it for (just out of  
> curiosity :-) ).  In terms of making an actual release, I'd like to  
> tidy up the code rather, and have more of a 'finished product'  
> before I do that - hence my sending it off to this list for  
> suggestions!

since I'm very busy right now and I don't have time to seriously play  
around with it, I'll come back to you at the end of the year if that  
is okay.

Thanks and best regards,
AndyL

>
> -Alisdair
>
>
> On Tue, Sep 29, 2009 at 7:39 PM, Andreas Langegger <al@jku.at> wrote:
> Dear Alisdair,
>
> that's great! Do you use these stats for clustered TDB query  
> optimization? Will you open-source the code? It would be nice to  
> experiment with that and reuse your stats for estimating expected  
> triple pattern cardinalities.
>
> Regards,
> AndyL
>
>
> On Sep 28, 2009, at 6:33 PM, Alisdair Owens wrote:
>
>> [apologies if this ends up as a double post.  I sent out this  
>> message a couple of days ago but it doesn't seem to have shown up  
>> in the mailing list, so resending]
>>
>> Hi there,
>>
>> During the course of my PhD work I've been working on a tool to  
>> produce stats about RDF files that I thought you guys might find  
>> interesting/useful.  You can see some example datasets at:  
>> http://www.zaltys.net/examineRDF/. It's mostly designed for RDF  
>> store creators/maintainers, to  
>> validate (or challenge :-) ) their assumptions about the structure  
>> and characteristics of common RDF datasets, and identify unusual  
>> edge cases that may result in abnormal behaviour.  Hopefully it  
>> will also be useful for identifying flaws in the realism of  
>> automatic data generators, and allow people to better tune adaptive  
>> data structures in their stores.
>>
>> I'm aware that the clarity and explanation of the graphs could be  
>> rather better, but I find myself struggling to find the right words  
>> at the moment.  If you have any suggestions for improving this (or  
>> the output as a whole) I'd really appreciate it!
>>
>> Thanks,
>> -Alisdair
>
>
> http://www.langegger.at
> ----------------------------------------------------------------------
> Dipl.-Ing.(FH) Andreas Langegger
> FAW - Institute for Application-oriented Knowledge Processing
> Johannes Kepler University Linz
> A-4040 Linz, Altenberger Straße 69


http://www.langegger.at
----------------------------------------------------------------------
Dipl.-Ing.(FH) Andreas Langegger
FAW - Institute for Application-oriented Knowledge Processing
Johannes Kepler University Linz
A-4040 Linz, Altenberger Straße 69
Received on Tuesday, 6 October 2009 10:33:27 UTC