Re: Parameterized Inference - starting mail discussion

<chair-hat-off>
Orri, all,

I agree with the observation that full inference (even RDFS) may be 
harmful in the context of Web data, see also [1].

Unfortunately, I haven't yet managed to separate the issues on the wiki, 
but in connection with parameterized inference I suggest putting the 
following four items to a strawpoll, trying to summarize Bijan's and 
Andy's suggestions:

- ADVERTISE ENTAILMENT: should we work on a mechanism to specify the 
entailment regime supported by an engine (endpoint-side parameterized 
inference, i.e. the endpoint being able to specify what entailment it 
supports)?

- REQUEST ENTAILMENT: should we work on a mechanism to request an 
entailment regime in a query (query-side parameterized inference, i.e. 
the requester being able to specify what entailment it expects; Bijan 
seemed to suggest that the engine may respond by falling back to another 
entailment regime, which probably should be indicated in the query 
response)?

- SUPPORTED ENTAILMENT REGIMES: should we work on defining a fixed set 
of supported entailment regimes (suggested were: OWL RL, OWL EL, OWL QL, 
OWL DL, "finite RDFS") plus an extensibility mechanism for custom 
entailment regimes (Orri's mail seems to support this, i.e. not all 
inferences are wanted in all situations; <rifruleset> has been suggested 
so far, but maybe even more flexibility is needed)?

- EXTENDED DATASETS: should we work on defining an extended mechanism 
for specifying datasets that allows named graphs to be merged/composed? 
This relates to parameterized inference because you need to be able to 
"merge" an ontology into a "named data graph" to get inferred answers.
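To make the merge idea concrete, here is a minimal sketch in plain Python 
(sets standing in for graphs; all URIs invented for illustration): an 
ontology graph is merged into a data graph, and one RDFS rule (rdfs9, 
subclass membership) is applied to fixpoint to obtain an inferred answer.

```python
data = {  # a "named data graph"
    ("ex:alice", "rdf:type", "ex:Student"),
}
ontology = {  # an ontology merged in to enable inference
    ("ex:Student", "rdfs:subClassOf", "ex:Person"),
}

def rdfs9_closure(graph):
    """Apply rdfs9 (?x type ?c, ?c subClassOf ?d => ?x type ?d) to fixpoint."""
    g = set(graph)
    while True:
        new = {(x, "rdf:type", d)
               for (x, p, c) in g if p == "rdf:type"
               for (c2, p2, d) in g if p2 == "rdfs:subClassOf" and c2 == c}
        if new <= g:
            return g
        g |= new

merged = rdfs9_closure(data | ontology)
# The Person membership is only derivable after the merge:
print(("ex:alice", "rdf:type", "ex:Person") in merged)  # True
```

Without the merge, rdfs9_closure(data) yields no new triples, which is 
exactly why the dataset mechanism needs to support the composition.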


My strawpoll vote would be +1 for all of these, although I could imagine 
that e.g. SUPPORTED ENTAILMENT REGIMES could go into a note rather than 
Rec track, if that is preferred.
</chair-hat-off>

Axel


Orri Erling wrote:
> Hi
> 
> We might see the rationale for parametrized inference like this:  Since the
> web of data is about exchange, integration and unexpected or serendipitous
> repurposing of data, most of which does not originate with the user or the
> user's organization, we expect the data to be of variable quality and to
> use variable terms.
> 
> There are two angles for justifying inference control: 1. compensating for
> heterogeneity, e.g. via owl:sameAs at the instance level, and 2. increasing
> abstraction.
> 
> I have at various times blogged against indiscriminately materializing
> entailments over data that is likely of uncertain reliability and can
> be malicious.
> 
> When dealing with web data, like social networks, where the volume can be
> large and the reliability variable, we cannot very well expect the
> publisher of the data set to publish the data with different materialized
> entailments of the diverse things people might possibly wish to have
> inferred.  For example, the Myspace mapping to RDF was estimated at 12 bn
> triples.  In these cases, we see a practical justification for leaving
> what inference to apply up to the query writer, not the data publisher.
> 
> We have for a long time done basic inference like subclasses,
> subproperties, sameAs and, more recently, identity of things sharing an
> IFP value, at run time.  We have not striven to be complete with respect
> to RDFS or any level of OWL but have done what we found straightforward
> and efficient.  At some point we will add Prolog-style backward chaining
> to SQL/SPARQL: a SPARQL query can mention an n-tuple and have that matched
> against rule heads.
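As an aside, the rule-head matching described here can be sketched in plain
Python (a toy backward chainer over triples; variables start with '?', and
all URIs, facts and rules are made up for illustration):

```python
import itertools

facts = {("ex:a", "ex:broader", "ex:b"), ("ex:b", "ex:broader", "ex:c")}
rules = [  # head <- body: transitive closure of ex:broader
    (("?x", "ex:reachable", "?y"), [("?x", "ex:broader", "?y")]),
    (("?x", "ex:reachable", "?y"),
     [("?x", "ex:broader", "?z"), ("?z", "ex:reachable", "?y")]),
]
fresh = itertools.count()

def walk(t, env):          # follow variable bindings
    while t.startswith("?") and t in env:
        t = env[t]
    return t

def unify(a, b, env):      # unify two triple patterns, or return None
    env = dict(env)
    for s, t in zip(a, b):
        s, t = walk(s, env), walk(t, env)
        if s == t:
            continue
        if s.startswith("?"):
            env[s] = t
        elif t.startswith("?"):
            env[t] = s
        else:
            return None
    return env

def rename(head, body):    # standardize rule variables apart
    n = str(next(fresh))
    r = lambda t: t + n if t.startswith("?") else t
    return tuple(map(r, head)), [tuple(map(r, b)) for b in body]

def solve(goal, env={}):   # yield binding environments for one goal
    for fact in facts:                      # match physical triples
        e = unify(goal, fact, env)
        if e is not None:
            yield e
    for head, body in rules:                # match rule heads, prove body
        h, b = rename(head, body)
        e = unify(goal, h, env)
        if e is not None:
            yield from solve_all(b, e)

def solve_all(goals, env):
    if not goals:
        yield env
    else:
        for e in solve(goals[0], env):
            yield from solve_all(goals[1:], e)

print({walk("?y", e) for e in solve(("ex:a", "ex:reachable", "?y"))})
# -> {'ex:b', 'ex:c'}
```

A real engine would of course add cycle detection and indexing; the sketch
only shows the shape of matching a query pattern against rule heads.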
> 
> In all the cases that we have tried, except for some cases of subclass
> membership checking, inference has some cost: either it kills the working
> set by doubling the data volume, or it takes doing joins at least in
> duplicate.  Transitivity also makes query plans somewhat harder to
> parallelize.
> 
> So, since performance is our primary interest, we would say that the most
> important and easiest-to-agree-upon aspect of inference control is a means
> of saying that a triple pattern may only be matched by a physically
> existing triple.  For big queries where the developer knows the data this
> can be useful, but only for knowledgeable users.
> 
> Our experience with allowing declarations, triple pattern by triple
> pattern, of whether subclasses and such will be considered is that these
> are tedious to write and think through, and easy to get wrong.  So in
> practice the declarations are at the query level.  For special cases,
> they should still be possible triple by triple.
> 
> A lot depends on the use case.  If one has an online application where
> simple queries must run fast, then a stored procedure is likely best, so
> any inference can be procedurally stated there.  If one has analytics
> queries over 100's of G of data, then whether sameAs is considered for
> matching a particular pattern can make an order of magnitude of
> difference.  A join considering sameAs takes about 5x the time of a join
> that does not, and loss of locality can drive things further out of
> whack.  Thus, for big queries to be able to use any inference at all, it
> will be important to be able to have fine-grained control.
> 
> Furthermore, talking of web data and identities therein, which sameAs
> assertions are believed will in practice have to be qualified further;
> for example, only sameAs from graphs I say are reliable will enter into
> the result.
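A toy illustration of that qualification in plain Python (graph names and
URIs invented): only owl:sameAs assertions originating in trusted graphs
are believed, so a malicious identity claim from an untrusted graph has no
effect on matching.

```python
# Quads as (graph, subject, predicate, object); all data is hypothetical.
quads = [
    ("ex:trusted",   "ex:a", "owl:sameAs", "ex:b"),
    ("ex:malicious", "ex:a", "owl:sameAs", "ex:evil"),
]
trusted = {"ex:trusted"}

# Believe only sameAs assertions stated in trusted graphs.
believed = {(s, o) for (g, s, p, o) in quads
            if p == "owl:sameAs" and g in trusted}

print(believed)  # {('ex:a', 'ex:b')} -- the malicious assertion is ignored
```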
> 
> For data web things, subclasses and subproperties are sometimes nice,
> sometimes sameAs.  In practice we have what we find useful, but
> standardizing this will be quite impossible; for one thing, it does not
> correspond to any of the defined entailment regimes.  Applications will
> probably rely on task-specific rules.
> 
> Maybe the following extensibility hook is possible: some way to say, for
> anything delimited by braces,
> that 1. no inferred triples of any sort will be considered in addition to
> physical triples in the evaluation, or 2. inferred triples derived from
> the inferences described in xx will be considered in addition to physical
> triples.  A physical triple is a triple loaded into the store by SPARUL,
> an implementation-dependent CLI, or a means of loading a file.  If this
> is a mapped relational database, then triples declared in the mapping
> will be considered physical even though they are not in the strict sense.
> xx would be a URI whose interpretation is implementation dependent.
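For what it's worth, the two modes of the proposed hook can be modeled in a
few lines of Python (everything here is hypothetical: the URI, the ruleset,
and the match function are illustrative stand-ins, not a proposed API):

```python
physical = {("ex:a", "rdf:type", "ex:Student")}

# Implementation-dependent mapping from the URI xx to an inference function.
rulesets = {
    "ex:myRules": lambda g: {(s, "rdf:type", "ex:Person")
                             for (s, p, o) in g
                             if p == "rdf:type" and o == "ex:Student"},
}

def match(pattern, ruleset_uri=None):
    """Mode 1 (default): match against physical triples only.
    Mode 2: also consider triples inferred by the ruleset xx names."""
    graph = set(physical)
    if ruleset_uri is not None:
        graph |= rulesets[ruleset_uri](graph)
    return {t for t in graph
            if all(q.startswith("?") or q == v
                   for q, v in zip(pattern, t))}

print(match(("ex:a", "rdf:type", "ex:Person")))                # set()
print(match(("ex:a", "rdf:type", "ex:Person"), "ex:myRules"))  # one match
```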
> 
> If xx refers to an RDFS or OWL ontology, then this ontology itself
> consists of triples.  These triples will not be visible to the query
> unless they are explicitly loaded into one of the graphs considered for
> the query; using the entailment described in a graph does not per se add
> the triples of that graph to the data set of the query.  xx might also
> refer to something that is not represented as triples, such as RIF rules.
> 
> 
> 
> Orri


-- 
Dr. Axel Polleres
Digital Enterprise Research Institute, National University of Ireland, 
Galway
email: axel.polleres@deri.org  url: http://www.polleres.net/

Received on Tuesday, 14 April 2009 08:14:23 UTC