Problems with typed vocabulary and rdf:type triples

Hello,

At the F2F, there was a lengthy discussion about the typed vocabulary and rdf:type triples. I was asked to describe these problems
in more detail and to give an overview of the current design decisions, so here it is.


Typing in RDF parsing
---------------------

OWL 1.1 DL implementations (such as Protégé and DL-based reasoners) typically work at the level of the structural specification, so
they often need to convert an RDF graph into objects of the OWL 1.1 (DL) structural specification. Please note that I am not talking
here about parsing RDF files into triples; rather, the "RDF parsing" problem I talk about in this e-mail is the problem of
transforming a set of RDF triples into objects of the OWL 1.1 structural specification.


The main problem in RDF parsing is as follows. Assume that you encounter in an RDF graph G the following triples:

(1)  <a owl:someValuesFrom b>
(2)  <a owl:onProperty c>

The way you translate this into the structural specification depends on the types of b and c: if b is a class and c is an object
property, then you should create an instance of ObjectSomeValuesFrom; if b is a data range and c is a data property, then you should
create an instance of DataSomeValuesFrom; otherwise, G does not represent a valid OWL (1.1 or 1.0) DL ontology.
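
To make the case distinction concrete, here is a minimal Python sketch. It is only an illustration: the type labels, the tuple
representation of structural objects, and the function name translate_some_values_from are assumptions made for this example and do
not come from the specification.

```python
from typing import Dict, Tuple

# Hypothetical type labels, chosen for this sketch only.
CLASS, DATA_RANGE = "Class", "DataRange"
OBJECT_PROPERTY, DATA_PROPERTY = "ObjectProperty", "DataProperty"


def translate_some_values_from(
    filler: str, prop: str, types: Dict[str, str]
) -> Tuple[str, str, str]:
    """Translate triples (1) and (2), given the known types of b and c."""
    filler_type = types.get(filler)
    prop_type = types.get(prop)
    if filler_type == CLASS and prop_type == OBJECT_PROPERTY:
        return ("ObjectSomeValuesFrom", prop, filler)
    if filler_type == DATA_RANGE and prop_type == DATA_PROPERTY:
        return ("DataSomeValuesFrom", prop, filler)
    # Any other combination, including unknown types, is not valid OWL DL.
    raise ValueError(f"cannot translate restriction on {prop} with filler {filler}")
```

The function cannot do anything useful unless the types dictionary has already been filled in from somewhere else.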


Note that these two triples by themselves do not specify the types of b and c. Hence, you need to find other triples in G to be able
to process this fragment. There are several problems with this.

1. Streamed parsing
===================

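OWL 1.1 DL applications would like to implement RDF parsing in *streaming mode*: as triples arrive, the parser should transform the
triples into the structural syntax, without keeping them in memory first. This is desirable in order to keep the memory consumption
low: you don't need to store the triples *and* the structural objects in memory at the same time. The typing triples for b and c,
however, can occur anywhere in G, possibly after (1) and (2); in that case the parser must keep the untyped triples in memory until
the types become known.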
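
A rough Python sketch of such a streaming translator follows. Everything in it is an assumption made for illustration: triples are
plain (subject, predicate, object) string tuples, the KINDS table is a simplified stand-in for the real typing vocabulary, and only
someValuesFrom restrictions are handled. Rescanning the buffer after every triple is wasteful, but it keeps the sketch short.

```python
from typing import Dict, Iterable, Iterator, Tuple

Triple = Tuple[str, str, str]

# Assumed, simplified mapping from rdf:type objects to entity kinds.
KINDS = {
    "owl:Class": "class",
    "owl:ObjectProperty": "object property",
    "owl:DatatypeProperty": "data property",
    "rdfs:Datatype": "data range",
}


def stream_restrictions(triples: Iterable[Triple]) -> Iterator[Tuple[str, str, str]]:
    """Yield structural objects as soon as their types are known."""
    types: Dict[str, str] = {}
    pending: Dict[str, Dict[str, str]] = {}  # restriction node -> collected parts
    for s, p, o in triples:
        if p == "rdf:type" and o in KINDS:
            types[s] = KINDS[o]
        elif p == "owl:someValuesFrom":
            pending.setdefault(s, {})["filler"] = o
        elif p == "owl:onProperty":
            pending.setdefault(s, {})["prop"] = o
        # Emit every buffered restriction that has just become resolvable.
        for node in list(pending):
            parts = pending[node]
            if "filler" not in parts or "prop" not in parts:
                continue
            kinds = (types.get(parts["filler"]), types.get(parts["prop"]))
            if kinds == ("class", "object property"):
                name = "ObjectSomeValuesFrom"
            elif kinds == ("data range", "data property"):
                name = "DataSomeValuesFrom"
            else:
                continue  # types not (yet) usable: the triples stay buffered
            del pending[node]
            yield (name, parts["prop"], parts["filler"])
    if pending:
        raise ValueError(f"untyped or incomplete restrictions: {sorted(pending)}")
```

If the typing triples arrive before (1) and (2), the pending buffer is emptied immediately; if they arrive last, every restriction
stays buffered until the end of the input.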


2. RDF parsing and imports
==========================

Assume that, in the above example, the typing triples are included in some imported graph G' and not in G directly. This
*hugely* complicates RDF parsing: when parsing G, one cannot work only with the triples from G, but needs to look at G' as well. Note
that there is no requirement in OWL 1.1 that imports should not be cyclic. Hence, you can't really parse your files in sequence: you
need to parse them "all at once".

Admittedly, this problem can be technically solved: you need to make two passes through G and *all imported ontologies*: in the
first pass you accumulate the typing triples, and in the second pass you actually generate the objects. I see, however, quite a few
problems with this.

a. This is inefficient: instead of going through each file only once, we now have to go through each file twice. It is unlikely that
this will improve the image of Semantic Web tools w.r.t. performance.

b. The problem is complicated if an RDF ontology imports an OWL 1.1 DL ontology in some other format. Now you need coordination
among several parsers for different formats.

c. All implementors at the F2F (i.e., Matthew Horridge, Michael Smith, and myself) unanimously agreed that doing this is a *major*
pain. It is easy to dismiss one developer as a whiner; however, if three developers are complaining about this, we should probably
take this seriously. Having an unnecessarily complex specification is likely going to lead to bugs and hassle for the users.
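
For reference, the two-pass approach over the imports closure could look roughly like the Python sketch below. It assumes that all
graphs are already available as lists of (subject, predicate, object) string tuples keyed by name, that imports are expressed with
owl:imports triples whose object is such a name, and that only someValuesFrom restrictions need translating. The names KINDS,
RESTRICTIONS, imports_closure and two_pass_parse are made up for this example.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Assumed, simplified mapping from rdf:type objects to entity kinds.
KINDS = {
    "owl:Class": "class",
    "owl:ObjectProperty": "object property",
    "owl:DatatypeProperty": "data property",
    "rdfs:Datatype": "data range",
}
RESTRICTIONS = {
    ("class", "object property"): "ObjectSomeValuesFrom",
    ("data range", "data property"): "DataSomeValuesFrom",
}


def imports_closure(root: str, graphs: Dict[str, List[Triple]]) -> List[str]:
    """Collect every graph reachable via owl:imports; cycles are visited once."""
    seen: List[str] = []
    stack = [root]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.append(name)
        stack.extend(o for _, p, o in graphs[name] if p == "owl:imports")
    return seen


def two_pass_parse(root: str, graphs: Dict[str, List[Triple]]) -> List[Tuple[str, str, str]]:
    closure = imports_closure(root, graphs)
    # Pass 1: accumulate the typing triples of the whole closure.
    types: Dict[str, str] = {}
    for name in closure:
        for s, p, o in graphs[name]:
            if p == "rdf:type" and o in KINDS:
                types[s] = KINDS[o]
    # Pass 2: generate the structural objects, graph by graph.
    objects: List[Tuple[str, str, str]] = []
    for name in closure:
        fillers: Dict[str, str] = {}
        props: Dict[str, str] = {}
        for s, p, o in graphs[name]:
            if p == "owl:someValuesFrom":
                fillers[s] = o
            elif p == "owl:onProperty":
                props[s] = o
        for node, filler in fillers.items():
            kinds = (types.get(filler), types.get(props.get(node, "")))
            if kinds not in RESTRICTIONS:
                raise ValueError(f"cannot type restriction {node} in {name}")
            objects.append((RESTRICTIONS[kinds], props[node], filler))
    return objects
```

Note that every graph in the closure is read twice, and that nothing can be emitted until the first pass over *all* graphs is done.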




Solutions
---------

I would now like to highlight possible solutions to these problems.

T1. Use typed vocabulary 
========================

We might specify the types of entities directly in the triples. Hence, instead of (1) and (2), we could use triples (3) and (4):

(3)  <a owl:someValuesFrom b>
(4)  <a owl:onObjectProperty c>

Now it is clear that c is an object property, so we do not need the typing triple.
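
As a small Python sketch of why this helps: with a typed predicate the translation needs no lookup of typing information at all.
The predicate owl:onObjectProperty is taken from triple (4); owl:onDataProperty is an assumed counterpart that I use here only for
illustration, and the tuple representation and the name translate_typed are likewise invented for this example.

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]

# owl:onObjectProperty comes from triple (4); owl:onDataProperty is an
# assumed counterpart, included only to show the symmetric case.
TYPED_PREDICATES = {
    "owl:onObjectProperty": "ObjectSomeValuesFrom",
    "owl:onDataProperty": "DataSomeValuesFrom",
}


def translate_typed(triples: Iterable[Triple]) -> List[Tuple[str, str, str]]:
    """Translate restrictions using only the predicates, with no typing triples."""
    fillers: Dict[str, str] = {}
    restrictions: Dict[str, Tuple[str, str]] = {}  # node -> (object name, property)
    objects: List[Tuple[str, str, str]] = []
    for s, p, o in triples:
        if p == "owl:someValuesFrom":
            fillers[s] = o
        elif p in TYPED_PREDICATES:
            restrictions[s] = (TYPED_PREDICATES[p], o)
        else:
            continue
        # Both halves of the restriction are known: emit and forget them.
        if s in fillers and s in restrictions:
            name, prop = restrictions.pop(s)
            objects.append((name, prop, fillers.pop(s)))
    return objects
```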

This solution is employed in OWL 1.1, but only if punning is used. Namely, if we are using c for both a data and an object property,
then we cannot assign a single type to c. For entities having exactly one type, the OWL 1.1 specification does not use the typed
vocabulary, for reasons of backwards compatibility.

Clearly, we should not immediately switch to the typed vocabulary, as this would wreak havoc on the existing ontologies and systems.
However, going forward, we might want to keep the typed vocabulary and hope to deprecate the untyped triples in the future.




T2. Declare types in the document where an entity is used
=========================================================

Even if we stick with the typed vocabulary to allow for punning and in hope of a migration path, we need in OWL 1.1 a way to handle
untyped statements such as (1) and (2).

The developers of OWL 1.1 DL tools would be *quite happy* if we required the following: if an entity e is used in some RDF graph G
in an axiom, then the graph G must contain an explicit typing triple for e (regardless of the imported ontologies). This would allow
us to parse each RDF graph by itself, without taking into account the imported RDF graphs.
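
Such a requirement would be easy to check for a single graph. The Python sketch below is a deliberately simplified illustration:
it only looks at entities used in someValuesFrom restrictions, it treats triples as (subject, predicate, object) string tuples, and
the set ENTITY_TYPES and the function name missing_typing_triples are assumptions of mine.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]

# Assumed, simplified set of rdf:type objects that count as typing an entity.
ENTITY_TYPES = {"owl:Class", "owl:ObjectProperty", "owl:DatatypeProperty", "rdfs:Datatype"}
# Predicates whose object is an entity that needs a type (restrictions only).
USES = {"owl:someValuesFrom", "owl:onProperty"}


def missing_typing_triples(graph: Iterable[Triple]) -> List[str]:
    """Return the entities used in the graph that the graph itself does not type."""
    triples = list(graph)
    typed = {s for s, p, o in triples if p == "rdf:type" and o in ENTITY_TYPES}
    used = {o for _, p, o in triples if p in USES}
    return sorted(used - typed)
```

A graph for which this returns an empty list can be parsed on its own, without opening any imported graph.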



I believe that this is actually compatible with OWL 1.0 DL. In particular, in the Semantics and Abstract Syntax document for OWL 1.0
(http://www.w3.org/TR/owl-semantics/mapping.html), Section 4.1 contains the following mapping:

classID   is mapped to   classID rdf:type owl:Class . classID rdf:type rdfs:Class . [optional]

Thus, if some classID occurs in some ontology O, then the translation of O into an RDF graph must contain the typing triple for
classID.


Now there is some confusion about what exactly this means. I always interpreted this as "if O is an ontology (not the imports
closure, but just a single ontology), then its conversion into an RDF graph (a single graph which is actually a single file,
regardless of the imports) must contain the typing triple". In other words, the translation is from one *ontology file* to one *RDF
graph file*.

Other people (notably Alan Ruttenberg) interpreted this as "Yes, the graph G needs to contain the typing triple; however, this
triple can be included in some of the imported graphs".

I asked Ian about this, and he said that this point was actually not specified precisely by the specification and that both
interpretations might be OK.



My proposal is to fix the OWL 1.0 specification to say that each typing triple should occur in the very RDF graph that is being
parsed.




T3. Allow typing triples in the imported ontologies
===================================================

Alan Ruttenberg is advocating that we should not replicate typing triples in each RDF file that uses an entity, but should keep them
in the RDF file where the entity is "declared". Well, the notion of "declared" is not quite clear, but OWL 1.1 already provides for
declaration axioms, so we might say that this means "declared in the OWL 1.1 sense".

The reason why Alan advocates this solution is that he says there should be no propagation of information from the imported to the
importing ontology. For example, you might have an ontology O' in which you have a property P which is declared as a data property.
You import O' into O, and then you make some statements involving P. Then, you decide that P should be an annotation property: you
can now just change the declaration in O'; since the typing triple for P is not repeated in O, everything works fine. If, in
contrast, O was required to contain the typing triple for P as well, you'd need to change this triple as well to make everything
work.



I personally doubt that, apart from this rather simple scenario, things would "work just fine". If you change some entity in such a
fundamental way that you change its type, you should probably go through all the ontologies that are using the changed entity and
make sure that nothing broke. Therefore, I believe that the overhead in RDF parsing and the general complications involved in this
solution are just not worth it.



B. Put typing triples at the beginning of a file
================================================

This solution is orthogonal to T2 and T3; regardless of which solution we pick, we could additionally apply B.

In order to increase the chances that we can parse ontologies in the streaming mode, we might include a note in the RDF
serialization that implementations should preferably put the typing triples at the beginning of each document. Then, a clever
implementation would usually have the typing information ready when it encounters the triples of the form (1) and (2). This is
clearly just a hint; a complete OWL 1.1 implementation should allow for typing triples anywhere in the document. However, if the
typing triples were indeed stored at the beginning of a document, then the implementation might selectively forget parts of the RDF
graph as parsing proceeds.
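
On the writing side, following this hint can be as simple as a stable sort before serialization. The Python sketch below assumes
triples are (subject, predicate, object) string tuples and, as a simplification, moves *all* rdf:type triples to the front; the
function name typing_triples_first is made up for this example.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]


def typing_triples_first(triples: Iterable[Triple]) -> List[Triple]:
    """Order triples so that rdf:type triples come first.

    sorted() is stable, so the relative order inside each group is kept.
    """
    return sorted(triples, key=lambda triple: triple[1] != "rdf:type")
```

A parser reading the output can then usually translate triples such as (1) and (2) the moment it sees them, but it still has to
cope with documents that do not follow the hint.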




Regards,

	Boris

Received on Saturday, 15 December 2007 19:47:21 UTC