- From: Andreas Langegger <al@jku.at>
- Date: Sat, 9 Aug 2008 15:36:06 +0200
- To: Semantic Web <semantic-web@w3.org>, public-lod@w3.org, public-rdf-dawg-comments-request@w3.org
Hello,

for those interested in statistics for RDF data behind SPARQL endpoints. Today I released a request-for-comments 0.1-alpha version of RDFStats. It's actually a sub-project of SemWIQ, the Semantic Web Integrator and Query Engine to be released when it's ready to go public ;-)):

http://semwiq.faw.uni-linz.ac.at/node/9

Some facts:

* RDFStats generates statistics for datasets accessible over the SPARQL protocol. To stay as flexible as possible, the generator runs as a stand-alone process (e.g. beside a native RDF Store, D2R-Server instance or any other SPARQL end-point).
* It is based on the Jena Semantic Web Framework.
* It is basically part of SemWIQ and used by the query optimizer, but it is released separately because it is regarded as useful for other applications.
* The focus of the current statistics is for query optimizers and similar programs. It's currently not very usable for visualization of data distributions for humans. Additional histogram generators could be added in future to support that.
* Statistics are generated by executing several SPARQL queries against the end-point, which is approx. 33% faster than pulling out all triples in a naive way. Nevertheless, generation is costly and RDFStats should run as close as possible to the endpoint (best on the same host or subnet).
* Statistics generated by RDFStats could complement SPARQL end-point descriptions and capabilities (see voiD).
* The generated data includes for each class:
    o the total number of instances and optionally the URIs of them
    o property statistics for each (class, property, datatype range): depending on the range, there are different histograms available (e.g. byte/short/integer/long/float/double/boolean/dateTime/string histogram)
* Histograms are Base64-encoded. As part of the JAR, there is a special RDFStatsModel which should be used to access histogram data. Especially the string histograms are only useful for some applications like the SemWIQ optimizer.
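To illustrate the idea of a binned histogram that is stored as a Base64 string, here is a minimal Python sketch. It is not the RDFStats implementation (which is Java/Jena and should be accessed through RDFStatsModel): the function names, the equal-width binning and the byte layout are all assumptions made for this example only.

```python
import base64
import struct


def build_histogram(values, num_bins):
    """Equal-width histogram over numeric values (hypothetical helper).

    Returns (lo, hi, counts), where counts has num_bins entries.
    """
    if not values:
        raise ValueError("need at least one value")
    lo, hi = min(values), max(values)
    counts = [0] * num_bins
    width = (hi - lo) / num_bins
    for v in values:
        # If all values are equal, width is 0 and everything lands in bin 0.
        # The maximum value is clamped into the last bin.
        idx = 0 if width == 0 else min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return lo, hi, counts


def encode_histogram(lo, hi, counts):
    """Pack the histogram and Base64-encode it.

    Assumed layout (NOT the RDFStats format): two big-endian doubles
    (lo, hi), one uint32 bin count, then one uint32 per bin.
    """
    payload = struct.pack(f">ddI{len(counts)}I", lo, hi, len(counts), *counts)
    return base64.b64encode(payload).decode("ascii")


def decode_histogram(text):
    """Inverse of encode_histogram for the assumed layout."""
    raw = base64.b64decode(text)
    lo, hi, n = struct.unpack_from(">ddI", raw, 0)
    counts = list(struct.unpack_from(f">{n}I", raw, struct.calcsize(">ddI")))
    return lo, hi, counts
```

A consumer such as a query optimizer would decode the string once and then estimate selectivity from the bin counts; more bins give more information but cost more to generate and store.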
The algorithm is a trade-off between speed and maximum information (using the maximum of the preferred number of histogram bins).

Because this is only a side-project for SemWIQ, support will be very limited, and I hope that somebody else will stick to statistics for RDF and SPARQL endpoints in future. There are so many issues that I cannot work further in this direction. For SemWIQ, the current implementation is sufficient.

Regards,
AndyL

Hey! And be sure to check out http://www.webofdata.info ;-)

> > > Web of Data Practitioners Days / Oct 22-23 / Vienna < < <

----------------------------------------------------------------------
Dipl.-Ing.(FH) Andreas Langegger
Institute for Applied Knowledge Processing
Johannes Kepler University Linz
A-4040 Linz, Altenberger Straße 69
http://www.langegger.at
Received on Saturday, 9 August 2008 13:38:27 UTC