Potential performance problems with easy keys from Boris Motik on 2008-06-02 (public-owl-wg@w3.org from June 2008)

From: Boris Motik <boris.motik@comlab.ox.ac.uk>
Date: Mon, 2 Jun 2008 20:04:28 +0100
To: <public-owl-wg@w3.org>
Message-ID: <004c01c8c4e3$7ba4fe50$7212a8c0@wolf>
Hello,

As promised at the last teleconf, here is a description of potential performance problems with easy keys. I apologize in advance for
a rather long e-mail.

I'll explain everything by an example of an inverse-functional datatype property R, which is the simplest form of easy keys.
Logically, the statement that R is inverse-functional corresponds to the following implication (all variables are implicitly
universally quantified):

(1) R(x_1,y) & R(x_2,y) & HU(x_1) & HU(x_2) -> x_1 = x_2

Here, HU is the Herbrand universe -- that is, it is the predicate containing all constants explicitly occurring in the knowledge
base. I'll omit the assertions involving this predicate for brevity.



1. Where does the problem originate from?
-----------------------------------------

Before I can explain where the potential performance problems are, I shall first present an example that shows why reasoning with
(1) is not straightforward.

Let KB be the following knowledge base containing (1) and the following assertions (I'm using here the description logics notation
for brevity):

(2) R(a,"5"^^xsd:integer)
(3) \exists R.=_5(b)

In (2), "5"^^xsd:integer is an integer constant with the value five; in (2), =_5 is a data range that has the value as the singleton
5. Such a data range can be constructed using the oneOf construct of OWL 2 or using a suitable DatatypeRestriction.

Now consider how a tableau algorithm might reason with KB. In (3), we have the existential quantifier, so a tableau algorithm will
introduce a fresh individual c and expand (3) into assertions (4)--(5):

(4) R(b,c)
(5) =_5(c)

Note that, due to (5), c is semantically equal to "5"^^xsd:integer. Therefore, rule (1) should "fire" and we should derive that a
and b are the same objects.

Since c is syntactically different from the constant "5"^^xsd:integer, however, it is not obvious that the rule should "fire": there
is no way to match the variables from (1) to individuals and constants in (2)--(5) such that the body of the rule is satisfied.

Tableau algorithms need to deal with this problem if they were to be sound and complete. I'll next describe two ways of handling
this problem and discuss the benefits and drawbacks of each of them.

Note also that tableau algorithms usually use a "datatype oracle" for reasoning with datatypes -- that is, they assume that there is
some procedure that can detect inconsistency in the datatype domain. I'll use such an oracle in my examples.



2. Solution 1: Modify the Rule (1)
----------------------------------

The first solution to this problem is to modify the rule (1) such that we *can* match the variables in to syntactically different
but semantically equal constants. Therefore, we first transform (1) into (6):

(6) R(x_1,y_1) & R(x_2,y_2) & HU(x_1) & HU(x_2) & y_1 = y_2 -> x_1 = x_2

The equality y_1 = y_2 in (6) is still difficult to handle because it occurs in the body of (6); however, by the standard
first-order equivalences, we can transform (6) into the following rule:

(7) R(x_1,y_1) & R(x_2,y_2) & HU(x_1) & HU(x_2) -> x_1 = x_2  |  y_1 != y_2

Here, | is disjunction and != is inequality. This transformation now solves our problem. We can apply (7) to (2) and (4) and derive
the following disjunction:

(8) a = b  |  "5"^^xsd:integer != c 

To deal with this disjunction, we need to guess which disjunct is true. If we guess that "5"^^xsd:integer != c is true, we get a
contradiction due to (5); this can be detected by standard datatype oracles. Therefore, we need to backtrack and guess a = b; well,
this is exactly what we wanted, so our problem is solved.



2.1 Performance Problems
------------------------

The problem with (7) is that the rule consists of two parts -- R(x_1,y_1) & HU(x_1) and R(x_2,y_2) & HU(x_2) -- in which the
variables are completely independent. Thus, if we have n assertions in our ABox involving the role R, this rule derives n^2
disjunctions. This might be not a good idea for at least two reasons:

- If n is large, n^2 is even larger. One might object and say that many rules in OWL require a potentially quadratic number of
inferences. For example, functionality of an object property S is axiomatized using the following rule:

(9) S(x,y_1) & S(x,y_2) -> y_1 = y_2

This rule is also quadratic in the worst case; however, this is rarely a problem in practice. Since the two atoms in the body of (9)
share the variable x, any reasonable implementation would match the rule in the following way: if would sequentially pass through
all the assertion for S to match the first atom, and then, for each matched assertion, it would use the matched value of x to look
up the potential matches of the second atom by means of an index. Usually, there are not all that many assertions with S that share
the individual in the first position, so this algorithm is "almost" linear. In (7), however, there is no way to use the index
because the two parts are completely disjoint in variables. Therefore, we *must* go through the whole extension of R twice, so we
are doomed to a quadratic number of inferences.

- A quadratic number of disjunctions might lead to quite a bit of backtracking.


2.2 Why not all is lost
-----------------------

It might be possible to devise ways of implementing this efficiently. In particular, whenever y_1 and y_2 in (7) are matched to
constants, we can determine the value of y_1 != y_2 in constant time. Thus, we can eliminate quite a few disjunctions, and only
disjunctions containing disjuncts of the form <constant> != <fresh individual> and <fresh individual> != <fresh individual> will
"survive".

Furthermore, the quadratic blowup in the rule applications might be alleviated by keeping the constants normalized and then matching
the rules in some smart way.

Still, if there are quite a few fresh individuals introduced by the existential quantifier, one might be left with quite a few
disjunctions. Hence, it is not clear to me immediately that this will "just work".



3. Solution 2: Modify the Datatype Oracle
-----------------------------------------

Approaches to dealing with this problem have been proposed in which the oracle is modified such that it generates, whenever
possible, all the necessary equalities. Thus, by looking at (2)--(5), such an oracle would generate the following assertion:

(10) "5"^^xsd:integer = c

Then, the tableau algorithm would replace one individual with the other, which would then enable the application of the rule (1).


3.1 Performance Problems
------------------------

The problem with this solution is that the existing formulations of the interface between the tableau systems and the datatype
oracle work with quite large search spaces. In general, the oracle is required to return ALL POSSIBLE WAYS to equate datatype
individuals and satisfy the datatype constraints.

Thus, given an ABox consisting of n datatype assertions of the form R(a,c_i) where c_i is a datatype individual, there are
potentially 2^{n^2} ways of equating different individuals. If some way of equating the individuals fails, we need to backtrack and
try another way of doing this, which might take up quite a bit of time.

This search space might be reduced. In particular, it may be possible to show that only *minimal* ways of equating individuals must
be considered. It is, however, unclear to me yet how and whether this can be implemented efficiently.



4. What to do?
--------------

The problems I outlined here might significantly impair scalability of OWL reasoners. Therefore, it might be a good idea to first
implement easy keys and see how they fare on reasonably sized ABoxes. After all, this was the principle that we applied to all other
features in OWL 2.


Regards,

	Boris
Received on Monday, 2 June 2008 19:06:02 UTC