Re: Implementations of LCS for OWL from Chris Mungall on 2010-05-04 (public-owl-dev@w3.org from April to June 2010)

From: Chris Mungall <cjm@berkeleybop.org>
Date: Tue, 4 May 2010 09:06:37 -0700
To: Phillip Lord <phillip.lord@newcastle.ac.uk>
Cc: Alan Rector <rector@cs.man.ac.uk>, Owl Dev <public-owl-dev@w3.org>, sonic@tcs.inf.tu-dresden.de
Message-Id: <B37B7F6A-FA48-4940-9C2D-8743D8C16EAD@berkeleybop.org>

On May 4, 2010, at 7:23 AM, Phillip Lord wrote:

> Chris Mungall <cjm@berkeleybop.org> writes:
>>> It seems reasonable to me to assume that at the time you want to
>>> calculate a semantic similarity, then you have all the three terms  
>>> that
>>> you want -- the two that you wish to compare, and the (unknown,
>>> explicitly expressed in the ontology) term that is the LCS.
>>
>> With some knowledge bases that is a reasonable assumption; in other  
>> cases
>> there may be a limited amount of pre-composition or the pre-
>> composition may be fairly ad-hoc, and allowing class expressions in  
>> the LCS
>> results will give you something more specific and informative.
>
>
> Yes, but you don't need the results to be a class expression in this
> case. You just need the queries to support class expressions, which  
> is a
> different kettle of fish. This means that you can avoid the  
> nastiness of
> "I want LCS to support class expressions except for the ones that I
> don't want like A or B".

In some cases you do want CEs in the LCS for additional precision.  
Consider:

car = vehicle and hasPart exactly 4 wheel and hasPart some motor
motorbike = vehicle and hasPart exactly 2 wheel and hasPart some motor
bicycle = vehicle and hasPart exactly 2 wheel

named_lcs(car,motorbike) = vehicle
lcs(car,motorbike) = vehicle and hasPart some wheel and hasPart some  
motor

named_lcs(bicycle,motorbike) = vehicle
lcs(bicycle,motorbike) = vehicle and hasPart exactly 2 wheel

named_lcs(bicycle,car) = vehicle
lcs(bicycle,car) = vehicle and hasPart exactly some wheel

Of course, if intermediate classes such as "2 wheeled vehicle",  
"wheeled vehicle", "motorized vehicle" etc are declared in advance  
then returning named classes gives the equivalent level of precision.  
One strategy would be to enumerate all combinations of CEs, feed that  
to the reasoner, and then use standard LCS techniques. But (a) this is  
only feasible for certain subsets of OWL-DL and (b) this is presumably  
less efficient than techniques that integrate the LCS calculation with  
reasoning.

>>> I can see a very strong use case why you might want to allow the  
>>> query
>>> terms to not pre-exist, but why the LCS? What semantic similarity
>>> measures were you thinking of anyway? The information content based
>>> ones will, I think, require that the LCS pre-exist anyway.
>>
>> I don't think that need be the case. Calculating the IC requires  
>> finding the
>> cardinality of the extent of the LCS, and this can be done   
>> trivially using
>> any OWL reasoner. Of course, there is a closed world  assumption  
>> here but this
>> is built into any IC calculation (the well  known literature bias).
>
> Trivial but slow, as far as I can see. If your corpus is large, then
> have to query against all members of the corpus (ie the instances). In
> this case, it's worse. The "least" in LCS is defined by the corpus.  
> So,
> if there are a number of different LCSs which are sibs, then you  
> have to
> test the information content of them all.

If the lcs function can return a CE that uses intersection then there  
is a maximum of 1 LCS. If the LCS did contain n>1 CEs e1, ..., en then  
you could just create a CE <e1 and e2 and ... en> which is more  
specific. Proof from the following paper:

   title={{Computing least common subsumers in description logics}},
   author={Cohen, W.W. and Borgida, A. and Hirsh, H.},
   booktitle={Proceedings of the National Conference on Artificial  
Intelligence},
   pages={754--754},
   year={1992},

But I think you're right in that in practical terms this could be  
quite slow because current reasoner APIs don't AFAIK allow queries of  
the form "how many instances satisfy the CE <X>", instead it's  
necessary to query for the CE, and then count the size of the  
resulting set of objects which could be slow for large KBs.

> Phil
>
>

Received on Tuesday, 4 May 2010 16:07:12 UTC