ANN: RDFStats for SPARQL endpoints

Hello,

For those interested in statistics on RDF data behind SPARQL
endpoints: today I released a request-for-comments 0.1-alpha version
of RDFStats. It is a sub-project of SemWIQ, the Semantic Web
Integrator and Query Engine, which will be released when it is ready
to go public ;-)

http://semwiq.faw.uni-linz.ac.at/node/9

Some facts:

     * RDFStats generates statistics for datasets accessible over the
SPARQL protocol. To stay as flexible as possible, the generator runs
as a stand-alone process (e.g. beside a native RDF store, a D2R-Server
instance or any other SPARQL endpoint).
     * It is based on the Jena Semantic Web Framework.
     * It is part of SemWIQ, where the query optimizer uses it, but it
is released separately because it may be useful for other
applications.
     * The current statistics are aimed at query optimizers and
similar programs. They are not yet very usable for visualizing data
distributions for humans. Additional histogram generators could be
added in the future to support that.
     * Statistics are generated by executing several SPARQL queries
against the endpoint, which is approx. 33% faster than naively pulling
out all triples. Nevertheless, generation is costly, so RDFStats
should run as close as possible to the endpoint (ideally on the same
host or subnet).
     * Statistics generated by RDFStats could complement SPARQL
endpoint descriptions and capabilities (see voiD).
     * The generated data includes, for each class:
           o the total number of instances and, optionally, their URIs
           o property statistics for each (class, property, datatype
range) combination: depending on the range, different histograms are
available (e.g. byte, short, integer, long, float, double, boolean,
dateTime and string histograms)
     * Histograms are Base64-encoded. As part of the JAR, there is a  
special RDFStatsModel which should be used to access histogram data.
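
To give a rough idea of what "several SPARQL queries" could look like,
here is a minimal sketch in Python. It is not the set of queries
RDFStats actually issues (RDFStats is Java code on top of Jena); the
function names are hypothetical, and it assumes an endpoint without
aggregate support, as in SPARQL 1.0, so instances are counted on the
client side.

```python
# Illustrative only: hypothetical helpers, not part of RDFStats.

def classes_query() -> str:
    """Query listing every class that has at least one instance."""
    return "SELECT DISTINCT ?class WHERE { ?s a ?class }"


def instances_query(class_uri: str) -> str:
    """Query returning the instance URIs of one class."""
    return f"SELECT DISTINCT ?s WHERE {{ ?s a <{class_uri}> }}"


def property_values_query(class_uri: str, property_uri: str) -> str:
    """Query returning the values of one property for instances of one class."""
    return (
        "SELECT ?value WHERE { "
        f"?s a <{class_uri}> ; <{property_uri}> ?value "
        "}"
    )


def count_instances(rows) -> int:
    """Count distinct ?s bindings client-side (no COUNT aggregate assumed).

    `rows` is assumed to be a list of dicts, one per result row,
    mapping variable names to their bound values.
    """
    return len({row["s"] for row in rows})
```

The values returned by the third query would then feed a histogram for
the corresponding (class, property, datatype range) combination.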

The string histograms in particular are only useful for some
applications, such as the SemWIQ optimizer. The algorithm is a
trade-off between speed and maximum information (using at most the
preferred number of histogram bins).
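
As an illustration of a bin limit combined with Base64 encoding, here
is a small Python sketch. It is an assumption-laden example, not the
RDFStats algorithm: it builds a plain equal-width histogram over
numeric values (not strings), and the binary layout is invented for
this example. Real RDFStats histograms should be read through
RDFStatsModel.

```python
import base64
import struct


def build_histogram(values, preferred_bins):
    """Equal-width histogram using at most `preferred_bins` bins.

    `values` must be a non-empty sequence of numbers.
    """
    lo, hi = min(values), max(values)
    # Never use more bins than there are distinct values.
    bins = max(1, min(preferred_bins, len(set(values))))
    # Fall back to width 1.0 when all values are equal.
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for v in values:
        # Clamp so that the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return lo, hi, counts


def encode_histogram(lo, hi, counts):
    """Pack (min, max, bin count, counts...) and Base64-encode it.

    The layout (big-endian: two doubles, then unsigned ints) is an
    assumption made for this example only.
    """
    payload = struct.pack(f">ddI{len(counts)}I", lo, hi, len(counts), *counts)
    return base64.b64encode(payload).decode("ascii")


def decode_histogram(text):
    """Inverse of encode_histogram."""
    payload = base64.b64decode(text)
    lo, hi, n = struct.unpack_from(">ddI", payload, 0)
    counts = list(struct.unpack_from(f">{n}I", payload, struct.calcsize(">ddI")))
    return lo, hi, counts
```

A higher bin limit keeps more detail about the distribution but costs
more time and space, which is the kind of trade-off mentioned above.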

Because this is only a side-project for SemWIQ, support will be very
limited, and I hope that somebody else will take up statistics for
RDF and SPARQL endpoints in the future. There are so many open issues
that I cannot work further in this direction. For SemWIQ, the current
implementation is sufficient.

Regards,
AndyL

Hey! And be sure to check out http://www.webofdata.info ;-)

 > > > Web of Data Practitioners Days / Oct 22-23 / Vienna < < <

----------------------------------------------------------------------
Dipl.-Ing.(FH) Andreas Langegger
Institute for Applied Knowledge Processing
Johannes Kepler University Linz
A-4040 Linz, Altenberger Straße 69
http://www.langegger.at

Received on Saturday, 9 August 2008 13:38:27 UTC