- From: Andreas Langegger <al@jku.at>
- Date: Sat, 9 Aug 2008 15:36:06 +0200
- To: Semantic Web <semantic-web@w3.org>, public-lod@w3.org, public-rdf-dawg-comments-request@w3.org
Hello,
for those interested in statistics for RDF data behind SPARQL
endpoints: today I released a request-for-comments 0.1-alpha version
of RDFStats. It is a sub-project of SemWIQ, the Semantic Web
Integrator and Query Engine, which will be released when it is ready
to go public ;-):
http://semwiq.faw.uni-linz.ac.at/node/9
Some facts:
* RDFStats generates statistics for datasets accessible over the
SPARQL protocol. To stay as flexible as possible, the generator runs
as a stand-alone process (e.g. beside a native RDF store, a D2R-Server
instance or any other SPARQL endpoint).
* It is based on the Jena Semantic Web Framework.
* It is basically part of SemWIQ and used by the query optimizer,
but it is released separately because it is regarded as useful for
other applications.
* The current statistics are aimed at query optimizers and similar
programs. They are not yet well suited to visualizing data
distributions for humans. Additional histogram generators could be
added in the future to support that.
* Statistics are generated by executing several SPARQL queries
against the endpoint, which is approx. 33% faster than naively pulling
out all triples. Nevertheless, generation is costly, and RDFStats
should run as close as possible to the endpoint (ideally on the same
host or subnet).
* Statistics generated by RDFStats could complement SPARQL endpoint
descriptions and capabilities (see voiD).
* The generated data includes, for each class:
  o the total number of instances and, optionally, their URIs
  o property statistics for each (class, property, datatype range)
    combination: depending on the range, different histograms are
    available (e.g. byte/short/integer/long/float/double/boolean/
    dateTime/string histograms)
* Histograms are Base64-encoded. As part of the JAR, there is a
special RDFStatsModel which should be used to access histogram data.
The string histograms in particular are only useful for some
applications, such as the SemWIQ optimizer. The algorithm is a
trade-off between speed and max. information (using up to the
preferred number of histogram bins).
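
To make the histogram idea more concrete, here is a minimal sketch in Python of how a numeric histogram for one (class, property, range) combination might be built, Base64-encoded, and then used by an optimizer to estimate selectivity. It is an illustration under assumptions only: the equal-width binning, the byte layout (min, max, bin count, then the counts) and all function names are hypothetical and are not taken from RDFStats, whose histograms should be read through RDFStatsModel.

```python
import base64
import struct
from typing import List, Sequence, Tuple

Histogram = Tuple[float, float, List[int]]  # (min, max, counts per bin)


def build_histogram(values: Sequence[float], bins: int) -> Histogram:
    """Count values into `bins` equal-width buckets spanning [min, max]."""
    if not values or bins < 1:
        raise ValueError("need at least one value and one bin")
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # If all values are equal the width is 0; use bucket 0 for everything.
        idx = int((v - lo) / width) if width > 0 else 0
        counts[min(idx, bins - 1)] += 1  # the maximum lands in the last bucket
    return lo, hi, counts


def encode_histogram(lo: float, hi: float, counts: Sequence[int]) -> str:
    """Pack the histogram big-endian (hypothetical layout) and Base64-encode it."""
    payload = struct.pack(f">ddI{len(counts)}I", lo, hi, len(counts), *counts)
    return base64.b64encode(payload).decode("ascii")


def decode_histogram(text: str) -> Histogram:
    """Inverse of encode_histogram."""
    payload = base64.b64decode(text)
    lo, hi, n = struct.unpack_from(">ddI", payload, 0)
    counts = list(struct.unpack_from(f">{n}I", payload, struct.calcsize(">ddI")))
    return lo, hi, counts


def estimate_fraction(lo: float, hi: float, counts: Sequence[int],
                      a: float, b: float) -> float:
    """Estimate the share of values in [a, b], assuming a uniform spread per bucket."""
    total = sum(counts)
    if total == 0:
        return 0.0
    width = (hi - lo) / len(counts)
    if width == 0:
        return 1.0 if a <= lo <= b else 0.0
    covered = 0.0
    for i, count in enumerate(counts):
        left = lo + i * width
        overlap = max(0.0, min(b, left + width) - max(a, left))
        covered += count * overlap / width
    return covered / total


if __name__ == "__main__":
    lo, hi, counts = build_histogram([1, 2, 2, 3, 5, 8, 9, 10], bins=3)
    encoded = encode_histogram(lo, hi, counts)
    print(encoded)
    print(decode_histogram(encoded))
    print(estimate_fraction(lo, hi, counts, 1, 4))
```

More bins give better estimates but take longer to compute and produce larger encoded strings, which is the kind of speed versus information trade-off mentioned above.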
Because this is only a side project of SemWIQ, support will be very
limited, and I hope that somebody else will take up statistics for RDF
and SPARQL endpoints in the future. There are so many open issues that
I cannot work further in this direction. For SemWIQ, the current
implementation is sufficient.
Regards,
AndyL
Hey! And be sure to check out http://www.webofdata.info ;-)
> > > Web of Data Practitioners Days / Oct 22-23 / Vienna < < <
----------------------------------------------------------------------
Dipl.-Ing.(FH) Andreas Langegger
Institute for Applied Knowledge Processing
Johannes Kepler University Linz
A-4040 Linz, Altenberger Straße 69
http://www.langegger.at
Received on Saturday, 9 August 2008 13:38:27 UTC