RE: Parameterized Inference - starting mail discussion from Orri Erling on 2009-04-14 (public-rdf-dawg@w3.org from April to June 2009)

From: Orri Erling <erling@xs4all.nl>
Date: Tue, 14 Apr 2009 09:03:54 +0200
To: "'Bijan Parsia'" <bparsia@cs.manchester.ac.uk>, "'RDF Data Access Working Group'" <public-rdf-dawg@w3.org>
Message-Id: <200904140704.n3E74Huv070387@smtp-vbr12.xs4all.nl>
 


Hi

We might see the rationale for parameterized inference like this: since the
web of data is about exchange, integration and unexpected or serendipitous
repurposing of data, most of which does not originate with the user or the
user's organization, we expect the data to be of variable quality and to
use variable terms.

There are two angles for justifying inference control: 1. compensating for
heterogeneity and 2. increasing abstraction.
and owl:sameAs at the instance level.

I have at various times blogged against indiscriminately materializing the
entailments of data that is likely to be of uncertain reliability and may
even be malicious.

If we are dealing with web data, like social networks, where the volume can
be large and the reliability so-so, we cannot very well expect the publisher
of the data set to publish the data with different materialized entailments
for all the diverse things people might possibly wish to have inferred.  For
example, the Myspace mapping to RDF was estimated to be 12 billion triples.
In these cases, we see a practical justification for leaving the choice of
what inference to apply up to the query writer, not the data publisher.

We have for a long time done basic inference at run time, such as
subclasses, subproperties and sameAs, and more recently identity of things
sharing an IFP (inverse functional property) value.  We have not striven to
be complete with respect to RDFS or any level of OWL but have done what we
found straightforward and efficient.  At some point we will add Prolog-style
backward chaining to SQL/SPARQL: a SPARQL query can mention an n-tuple and
have that matched against rule heads.

In all the cases that we have tried, except for some cases of subclass
membership checking, inference has some cost.  Either it hurts the working
set by doubling the data volume, or it requires doing the joins at least in
duplicate.  Transitivity also makes query plans somewhat harder to
parallelize.

So, since performance is our primary interest, we would say that the most
important and easiest to agree upon aspect of inference control is a means
of saying that a triple pattern may only be matched by a physically existing
triple.  For big queries where the developer knows the data this can be
useful, but only for knowledgeable users.
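
As a rough illustration of the distinction (a minimal sketch, not how any
particular store implements it), the following Python models a store as a
set of triples and matches a class membership pattern either with run-time
subclass inference or against physical triples only.  The names match_type,
physical_only and the ex: data are hypothetical.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def superclasses(store: Set[Triple], cls: str) -> Set[str]:
    """Return cls plus every class reachable from it via rdfs:subClassOf."""
    seen = {cls}
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in store:
            if s == current and p == SUBCLASS_OF and o not in seen:
                seen.add(o)
                frontier.append(o)
    return seen


def match_type(store: Set[Triple], cls: str, physical_only: bool = False) -> Set[str]:
    """Subjects matching the pattern '?s rdf:type cls'.

    With physical_only=True (hypothetical flag), only stored triples match.
    Otherwise instances of subclasses are also returned, computed at query
    time without materializing anything.
    """
    result = set()
    for s, p, o in store:
        if p != RDF_TYPE:
            continue
        if o == cls:
            result.add(s)
        elif not physical_only and cls in superclasses(store, o):
            result.add(s)
    return result


# Made-up example data.
store = {
    ("ex:rex", RDF_TYPE, "ex:Dog"),
    ("ex:tom", RDF_TYPE, "ex:Animal"),
    ("ex:Dog", SUBCLASS_OF, "ex:Animal"),
}
```

Here match_type(store, "ex:Animal") considers ex:rex through the subclass
link, while match_type(store, "ex:Animal", physical_only=True) only sees
ex:tom.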

Our experience with being able to declare, triple pattern by triple pattern,
whether subclasses and such will be considered is that such declarations are
tedious to write and think through, and also easy to get wrong.  So in
practice, the declarations are made at the query level.  For special cases,
they should still be possible triple pattern by triple pattern.

A lot depends on the use case.  If one has an online application where
simple queries must run fast, then a stored procedure is likely best, so any
inference can be stated procedurally there.  If one has analytics queries
over hundreds of gigabytes of data, then whether sameAs is considered for
matching a particular pattern can make an order of magnitude of difference.
A join considering sameAs takes about 5x the time of a join that does not,
and loss of locality can make things worse still.  Thus, for big queries to
be able to use any inference at all, it will be important to have
fine-grained control.

Furthermore, speaking of web data and the identities therein, which sameAs
statements are believed will in practice have to be qualified further.  For
example, only sameAs from graphs I say are reliable will enter into the
result.
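
One way to picture this qualification, as a sketch under assumptions rather
than a description of an actual implementation: keep the graph with each
statement and build sameAs equivalence classes only from graphs the query
writer has declared trusted.  The function names, the trusted_graphs
parameter and the sample data are all hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Quad = Tuple[str, str, str, str]  # subject, predicate, object, graph

SAME_AS = "owl:sameAs"


def same_as_classes(quads: Iterable[Quad], trusted_graphs: Set[str]) -> Dict[str, str]:
    """Map each node to a representative, using only trusted sameAs quads."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, p, o, g in quads:
        if p == SAME_AS and g in trusted_graphs:
            parent[find(s)] = find(o)
    return {x: find(x) for x in list(parent)}


def objects_of(quads: Iterable[Quad], subject: str, predicate: str,
               trusted_graphs: Set[str]) -> Set[str]:
    """Objects of 'subject predicate ?o', merging trusted sameAs aliases."""
    quads = list(quads)
    rep = same_as_classes(quads, trusted_graphs)
    target = rep.get(subject, subject)
    return {o for s, p, o, g in quads
            if p == predicate and rep.get(s, s) == target}


# Made-up example data: one trusted identity claim, one from a spam graph.
quads = [
    ("ex:alice", "foaf:name", "Alice", "g:a"),
    ("ex:a1", "foaf:name", "A. Smith", "g:b"),
    ("ex:alice", SAME_AS, "ex:a1", "g:trusted"),
    ("ex:alice", SAME_AS, "ex:mallory", "g:spam"),
    ("ex:mallory", "foaf:name", "Mallory", "g:spam"),
]
```

With trusted_graphs={"g:trusted"} the names of ex:alice and ex:a1 are
merged, while the sameAs claim from g:spam is ignored.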

For data web things, subclasses and subproperties are sometimes nice, and
sometimes sameAs is.  In practice we have what we find useful, but
standardizing this will be quite impossible.  For one thing, it does not
correspond to any of the defined entailment regimes.  Applications will
probably rely on task-specific rules.

Maybe the following extensibility hook is possible: some way to say, for
anything delimited by braces, either
1. that no inferred triples of any sort will be considered in addition to
physical triples in the evaluation, or
2. that inferred triples derived from inferences described in xx will be
considered in addition to physical triples.
A physical triple is a triple loaded into the store by SPARUL, by an
implementation-dependent CLI or by some means of loading a file.  If this is
a mapped relational database, then triples declared in the mapping will be
considered physical even though they are not in the strict sense.
xx will be a URI whose interpretation is implementation dependent.
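
The hook could be modeled roughly as follows.  This is only a sketch under
stated assumptions: the registry, the URI urn:example:rules/knows and the
single subproperty-style rule are all hypothetical, and for brevity the
inferred triples are computed eagerly here, whereas a real store would more
likely apply the rules at run time.

```python
from typing import Callable, Dict, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]  # None = variable
RuleSet = Callable[[Set[Triple]], Set[Triple]]


def subproperty_rule(sub: str, sup: str) -> RuleSet:
    """Rule set inferring (s, sup, o) from every physical (s, sub, o)."""
    def derive(physical: Set[Triple]) -> Set[Triple]:
        return {(s, sup, o) for s, p, o in physical if p == sub}
    return derive


# Hypothetical, implementation-dependent interpretation of the URI "xx".
RULE_REGISTRY: Dict[str, RuleSet] = {
    "urn:example:rules/knows": subproperty_rule("ex:friendOf", "foaf:knows"),
}


def visible_triples(physical: Set[Triple], inference: Optional[str]) -> Set[Triple]:
    """Triples a brace-delimited group may match.

    inference=None corresponds to option 1 (physical triples only);
    a URI corresponds to option 2 (physical plus triples inferred via xx).
    """
    if inference is None:
        return set(physical)
    return set(physical) | RULE_REGISTRY[inference](physical)


def eval_group(physical: Set[Triple], pattern: Pattern,
               inference: Optional[str] = None) -> Set[Triple]:
    """Match one triple pattern inside a group with the given inference scope."""
    return {t for t in visible_triples(physical, inference)
            if all(q is None or q == v for q, v in zip(pattern, t))}


# Made-up example data.
physical = {("ex:a", "ex:friendOf", "ex:b")}
```

Evaluating the pattern (None, "foaf:knows", None) with inference=None
matches nothing, while passing "urn:example:rules/knows" makes the inferred
triple visible within that group only.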

If xx refers to an RDFS or OWL ontology, then this ontology itself consists
of triples.  These triples will not be visible to the query unless they are
explicitly loaded into one of the graphs considered for the query.  Using
entailment described in a graph does not per se add the triples of said
graph to the data set of the query.  xx might also refer to something that
is not represented as triples, such as RIF rules.



Orri
Received on Tuesday, 14 April 2009 07:04:56 UTC