- From: Orri Erling <erling@xs4all.nl>
- Date: Tue, 14 Apr 2009 09:03:54 +0200
- To: "'Bijan Parsia'" <bparsia@cs.manchester.ac.uk>, "'RDF Data Access Working Group'" <public-rdf-dawg@w3.org>
Hi,

We might see the rationale for parametrized inference like this: since the web of data is about exchange, integration and unexpected or serendipitous repurposing of data, most of which does not originate with the user or the user's organization, we expect the data to be of variable quality and to use variable terms.

There are two angles for justifying inference control: 1. compensating for heterogeneity and 2. increasing abstraction. [...] and owl:sameAs at the instance level.

I have at various times blogged against indiscriminately materializing the entailment of data that is likely of uncertain reliability and can be malicious. When dealing with web data, like social networks, where the volume can be large and the reliability so-so, we cannot very well expect the publisher of a data set to publish it with different materialized entailments covering the diverse things people might possibly wish to have inferred. For example, the Myspace mapping to RDF was estimated to be 12 bn triples. In these cases, we see a practical justification for leaving the choice of what inference to apply up to the query writer, not the data publisher.

We have for a long time done basic inference like subclasses, subproperties and sameAs at run time, and more recently also identity of things sharing an IFP value. We have not striven to be complete with respect to RDFS or any level of OWL but have done what we found straightforward and efficient. At some point we will add Prolog-style backward chaining to SQL/SPARQL: a SPARQL query can mention an n-tuple and have that matched against rule heads.

In all the cases that we have tried, except for some cases of subclass membership checking, inference has some cost. Either it kills the working set by doubling the data volume, or it requires doing joins at least in duplicate. Transitivity also makes query plans somewhat harder to parallelize.
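To make the trade-off above concrete, here is a minimal sketch of the two options for subclass inference: materializing the entailed triples up front (more data) versus expanding the class at query time (more lookups). The toy data, the `ex:` names and all function names are assumptions made up for illustration; this is not how any particular store implements it.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

# Hypothetical toy data set of physical triples.
TRIPLES = {
    ("ex:alice", RDF_TYPE, "ex:Student"),
    ("ex:bob", RDF_TYPE, "ex:Person"),
    ("ex:Student", SUBCLASS, "ex:Person"),
}


def instances_physical(cls, triples):
    """Match (?s rdf:type cls) against physically present triples only."""
    return {s for s, p, o in triples if p == RDF_TYPE and o == cls}


def subclasses_of(cls, triples):
    """Return cls together with all of its transitive subclasses."""
    seen = {cls}
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if p == SUBCLASS and o == current and s not in seen:
                seen.add(s)
                frontier.append(s)
    return seen


def instances_query_time(cls, triples):
    """Query-time inference: one extra lookup per subclass, no extra data."""
    result = set()
    for c in subclasses_of(cls, triples):
        result |= instances_physical(c, triples)
    return result


def materialize(triples):
    """Forward chaining: store (s rdf:type super) for every entailed type."""
    out = set(triples)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(out):
            if p != RDF_TYPE:
                continue
            for sub, p2, sup in list(out):
                if p2 == SUBCLASS and sub == o and (s, RDF_TYPE, sup) not in out:
                    out.add((s, RDF_TYPE, sup))
                    changed = True
    return out
```

With materialization the plain lookup sees both people but the store grows; with query-time expansion the store stays as loaded but each type pattern costs one lookup per subclass.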
So, since performance is our primary interest, we would say that the most important and easiest to agree upon aspect of inference control is a means of saying that a triple pattern may only be matched by a physically existing triple. For big queries where the developer knows the data this can be useful, but only for knowledgeable users.

Our experience with declaring triple by triple whether subclasses and such will be considered is that such declarations are tedious to write and think through, and easy to get wrong. So in practice, the declarations are at the query level. For special cases, they should still be possible triple by triple.

A lot depends on the use case. If one has an online application where simple queries must run fast, then a stored procedure is likely best, so any inference can be procedurally stated there. If one has analytics queries over 100s of G of data, then whether sameAs is considered for matching a particular pattern can make an order of magnitude of difference. A join considering sameAs takes about 5x the time of a join that does not, and loss of locality can drive things further out of whack. Thus, for big queries to be able to use any inference at all, it will be important to have fine-grained control.

Furthermore, speaking of web data and the identities therein, which sameAs statements are believed will in practice have to be qualified further, for example so that only sameAs from graphs I say are reliable will enter into the result.

For data web things, subclasses and subproperties are sometimes nice, and sometimes sameAs is. In practice we have what we find useful, but standardizing this will be quite impossible. For one thing, it does not correspond to any of the defined entailment regimes. Applications will probably rely on task-specific rules.

Maybe the following extensibility hook is possible: some way to say, for anything delimited by braces, that

1. no inferred triples of any sort will be considered in addition to physical triples in the evaluation, or
2. inferred triples derived from inferences described in xx will be considered in addition to physical triples.

A physical triple is a triple loaded into the store by SPARUL, an implementation-dependent CLI or some other means of loading a file. If this is a mapped relational database, then triples declared in the mapping will be considered physical even though they are not in the strict sense.

xx will be a URI whose interpretation is implementation dependent. If xx refers to an RDFS or OWL ontology, then this ontology itself consists of triples. These triples will not be visible to the query unless they are explicitly loaded into one of the graphs considered for the query. Using entailment described in a graph does not per se add the triples of said graph to the data set of the query. The URI xx might also refer to something that is not represented as triples, such as RIF rules.

Orri
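As a rough illustration of the hook proposed above, the following sketch evaluates a single triple pattern either against physical triples only or with inference named by a URI. Everything here is an assumption for the sake of the example: the `ONTOLOGIES` registry, the example URI, the `match` function and the choice to support only direct subclass entailment. It is not a proposed syntax or an existing API.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

# Hypothetical registry: the URI "xx" names an ontology whose triples are
# kept outside the data set that the query can see.
ONTOLOGIES = {
    "http://example.org/rules/people": {("ex:Student", SUBCLASS, "ex:Person")},
}


def match(pattern, data, inference=None):
    """Match one (s, p, o) pattern, with None as a wildcard.

    inference=None: only physical triples in `data` can match (option 1).
    inference=<URI>: direct subclass entailment from the named ontology is
    considered in addition to physical triples (option 2, simplified).
    """
    def physical(pat):
        return {
            t for t in data
            if all(q is None or q == v for q, v in zip(pat, t))
        }

    results = physical(pattern)
    if inference is None:
        return results

    s, p, o = pattern
    if p == RDF_TYPE and o is not None:
        # Look up subclasses in the ontology, never in the query's data set.
        for sub, pred, sup in ONTOLOGIES[inference]:
            if pred == SUBCLASS and sup == o:
                for subj, _, _ in physical((s, RDF_TYPE, sub)):
                    results.add((subj, RDF_TYPE, o))
    return results
```

For example, with `data = {("ex:alice", "rdf:type", "ex:Student")}`, the pattern `(None, "rdf:type", "ex:Person")` matches nothing when `inference` is left as `None` and matches alice when the ontology URI is passed. The pattern `(None, "rdfs:subClassOf", None)` matches nothing in either case, since the ontology's own triples are not part of the data set.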
Received on Tuesday, 14 April 2009 07:04:56 UTC