RE: Granular dereferencing ( prop by prop ) using REST + LinkedData; Ideas?

Hi

I think a reasonable level of detail for our present case is whatever a query optimizer can use for guessing cardinalities.

In the case of the New York birthplace question, this would mean saying how many triples with a birthplace predicate exist in the set. If we wish to be fancier, we can include the count of distinct birthplaces. For a scalar data type, we can add the maximum and minimum. Histograms of the distribution are probably more trouble than they are worth at this stage.
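
To make this concrete, the statistics in question are computable with simple aggregate queries. A minimal sketch, assuming SPARQL aggregate support (a vendor extension at the time of writing, later standardized in SPARQL 1.1) and a hypothetical ex:birthPlace / ex:birthYear vocabulary:

    PREFIX ex: <http://example.org/>   # hypothetical prefix for the example

    # How many triples use the predicate, and how many distinct values occur?
    SELECT (COUNT(*) AS ?triples) (COUNT(DISTINCT ?o) AS ?distinctValues)
    WHERE { ?s ex:birthPlace ?o }

    # For a scalar-valued predicate, the minimum and maximum as well
    SELECT (MIN(?y) AS ?minYear) (MAX(?y) AS ?maxYear)
    WHERE { ?s ex:birthYear ?y }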

Counts of triples with a given predicate and counts of distinct values are, however, of prime importance for query optimization. So if VOID is to enable federation of SPARQL, it ought to expose these. Exposing them is not a lot of trouble, but it is of make-or-break value for the possibility of VOID guiding federated SPARQL queries.
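
To show what exposing this might look like, here is a minimal sketch of a VOID description carrying per-predicate statistics. The partitioning and the statistic property names are illustrative, not taken from the current draft:

    @prefix void: <http://rdfs.org/ns/void#> .
    @prefix ex:   <http://example.org/> .

    ex:peopleDataset a void:Dataset ;
        void:propertyPartition [
            # illustrative statistic properties, one block per predicate
            void:property         ex:birthPlace ;
            void:triples          10000000 ;
            void:distinctSubjects 10000000 ;
            void:distinctObjects  10000
        ] .

An optimizer reading this needs nothing more than these few numbers per predicate to start estimating cardinalities.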

As pointed out earlier, a machine will not know if something is about New York, Al Pacino or birthplaces if the example says Al Pacino was born in New York.

Again, for query optimization, it is useful to say that for 10 million people there are 10,000 birthplaces. Thus, given the predicate, we want the distinct S and O counts. Since there are relatively few predicates but a great many S and O values, VOID should likely limit itself to giving S and O cardinalities per predicate, not, for example, the number of distinct predicates for each individual S or O value.
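
To spell out the arithmetic, assuming a uniform distribution of values (a simplification; real optimizers may add histograms later) and the hypothetical ex:birthPlace numbers above:

    expected matches for { ?x ex:birthPlace <NewYork> }
        = triples / distinct objects
        = 10,000,000 / 10,000
        = 1,000

So the optimizer guesses about a thousand rows for a fixed birthplace, which is exactly the kind of estimate join ordering needs.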

So, for machine processing, our greatest wish is to have the cardinalities mentioned above in the spec now; otherwise we will have to add them ourselves as extensions when we do SPARQL end point federation in Virtuoso, not so far from now.

If we have a question of approximate query evaluation, even this can be guided by the cardinalities, going first to those sources which have the most matches for a predicate. Going further and precalculating and storing cardinalities of joins (like all x with birthplace NY and birth year 1965) is likely not profitable.


The latter point brings us to the question of human discovery and quick assessment of the contents of a data set. The human user may be looking for a combination of properties. For this, faceted browsing has for some time been a popular approach. Its downside has been high computation cost. As we are about to show, using partial query evaluation, it is possible to provide relevant faceted views over arbitrarily much data, with accuracy proportional to the time expended. This touches primarily on the human discovery and browsing of data sets, but we could also imagine an automatic agent submitting complex queries to remote end points just for the purpose of retrieving an approximate count of answers, or for knowing how many answers per second an end point is likely to produce for a given query. All these things can be used for query planning.
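
As a sketch of such a probe, again assuming aggregate support and the hypothetical ex: vocabulary, the agent could send a cheap counting query and time the response:

    PREFIX ex: <http://example.org/>

    # How many answers does this end point hold for the query?
    SELECT (COUNT(*) AS ?answers)
    WHERE { ?x ex:birthPlace ?place .
            ?x ex:birthYear  1965 }

The returned count bounds the result size, and the probe's wall-clock time gives a first estimate of the answers-per-second rate.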

I think that VOID, in addition to serving query planning, should be renderable into a couple of human-readable report formats. For example, a map of data set interlinkage; this it provides right now. Further, inside a data set, the count of distinct predicates, and the counts of distinct objects and subjects per predicate, should be quite easy to visualize as regular business graphics. Such graphics would answer questions like whether some data set is about New York, about birthplaces in general, or about celebrities in general. The machine will likely not have the contextual acuity for this at first, but the human user will.


Anyway, OpenLink's next promise is low-cost facets on infinite data, just what we need for the present range of discovery issues. Stay tuned.



Regards
Orri

Received on Monday, 5 January 2009 20:41:00 UTC