- From: Orri Erling <erling@xs4all.nl>
- Date: Mon, 5 Jan 2009 21:39:52 +0100
- To: "'Yves Raimond'" <yves.raimond@gmail.com>, "'Bernhard Schandl'" <bernhard.schandl@univie.ac.at>
- Cc: "'Richard Cyganiak'" <richard@cyganiak.de>, "'Aldo Bucchi'" <aldo.bucchi@gmail.com>, <public-lod@w3.org>
Hi,

I think a reasonable level of detail for our present case is whatever a query optimizer can use for guessing cardinalities. In the case of the New York birthplace question, this would mean stating how many triples with a birthplace predicate exist in the set. If we wish to be fancier, we can include the count of distinct birthplaces. For a scalar data type, we can add the maximum and minimum. Histograms of the distribution are probably more trouble than they are worth at this stage. The count of triples with a given predicate and the count of distinct values are, however, of prime importance for query optimization. So if VOID is to enable federation of SPARQL, it ought to expose these. Exposing them is not a lot of trouble, but it is of make-or-break value as concerns the possibility of VOID guiding federated SPARQL queries.

As pointed out earlier, a machine will not know whether something is about New York, Al Pacino, or birthplaces if the example says that Al Pacino was born in New York. Again, for query optimization, it is useful to say that for 10 million people there are 10,000 birthplaces: that is, given the predicate, the distinct S and O counts. Since there are relatively few predicates but a great many S and O values, VOID should likely limit itself to giving S and O cardinalities per predicate, and not, for example, the number of distinct predicates for a literal S.

So for machine processing, our greatest wish is to have the cardinalities mentioned above in the spec now; otherwise we will have to add them ourselves as extensions when we do SPARQL end point federation in Virtuoso, not so far from now. If there is a question of approximate query evaluation, even this can be guided by the cardinalities, going first to those sources which have the most matches for a predicate. Further precalculating and storing cardinalities of joins (like all x with birthplace NY and birth year 1965) is likely not profitable.
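To make the proposal concrete, here is a small sketch (in Python, with invented helper names; this is neither VOID vocabulary nor Virtuoso code) of the statistics argued for above: per predicate, the total triple count and the distinct subject and object counts, plus a helper that uses the triple counts to rank federated sources so an approximate evaluation visits the richest source first.

```python
from collections import defaultdict

def predicate_stats(triples):
    """Per predicate: triple count and distinct S/O counts --
    the cardinalities the mail argues VOID should expose."""
    counts = defaultdict(int)
    subjects = defaultdict(set)
    objects = defaultdict(set)
    for s, p, o in triples:
        counts[p] += 1
        subjects[p].add(s)
        objects[p].add(o)
    return {p: {"triples": counts[p],
                "distinct_subjects": len(subjects[p]),
                "distinct_objects": len(objects[p])}
            for p in counts}

def rank_sources(stats_per_source, predicate):
    """Order sources by how many triples they hold for a predicate,
    so a federated planner queries the most promising ones first."""
    return sorted(
        stats_per_source,
        key=lambda src: stats_per_source[src]
            .get(predicate, {}).get("triples", 0),
        reverse=True)
```

For the birthplace example, `predicate_stats` over a people data set would report something like 10,000,000 triples with the birthplace predicate but only 10,000 distinct objects, which is exactly the asymmetry an optimizer needs in order to guess join cardinalities.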
The latter point brings us to the question of human discovery and quick assessment of the contents of a data set. The human user may be looking for a combination of properties. For this, faceted browsing has for some time been a popular approach. Its downside has been high computation cost. As we are about to show, using partial query evaluation it is possible to provide relevant faceted views over arbitrarily much data, with accuracy proportional to the time expended. This touches primarily on human discovery and browsing of data sets, but we could also imagine an automatic agent submitting complex queries to remote end points just for the purpose of retrieving an approximate count of answers, or for knowing how many answers per second an end point is likely to produce for a given query. All of these things can be used for query planning.

I think that VOID, in addition to serving query planning, should be renderable into a couple of human-readable report formats: for example, a map of data set interlinkage, which it provides right now. Further, inside the data set, the count of distinct predicates, and of distinct objects and subjects per predicate, should be quite easy to visualize as regular business graphics. Such graphics would answer questions like whether a data set is about New York, birthplaces in general, or celebrities in general. The machine will likely not have the contextual acuity for this at first, but the human user will.

Anyway, OpenLink's next promise is low-cost facets on infinite data, just what we need for the present range of discovery issues. Stay tuned.

Regards,
Orri
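The idea of facet counts with accuracy proportional to time expended can be illustrated with a toy sketch (again Python, and again not OpenLink's implementation): scan at most a fixed budget of triples, count the facet values seen, and scale the counts up by the sampled fraction. A larger budget yields exact counts; a smaller one yields a cheap estimate.

```python
import itertools
from collections import Counter

def approximate_facets(triples, predicate, budget):
    """Scan at most `budget` triples and scale the observed facet
    counts by the sampled fraction: accuracy grows with the budget,
    in the spirit of partial query evaluation."""
    total = len(triples)
    sample = list(itertools.islice(triples, budget))
    counts = Counter(o for s, p, o in sample if p == predicate)
    scale = total / max(len(sample), 1)
    return {value: round(n * scale) for value, n in counts.items()}
```

A real engine would sample randomly rather than taking a prefix, and would stop on a time budget rather than a row budget, but the proportionality between work done and accuracy is the same.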
Received on Monday, 5 January 2009 20:41:00 UTC