Re: relational data as a bona fide member of the SM from Markus Krötzsch on 2011-11-10 (semantic-web@w3.org from November 2011)

From: Markus Krötzsch <markus.kroetzsch@cs.ox.ac.uk>
Date: Thu, 10 Nov 2011 10:59:41 +0000
To: Sampo Syreeni <decoy@iki.fi>
CC: Alexandre Riazanov <alexandre.riazanov@gmail.com>, Semantic Web List <semantic-web@w3.org>
Message-ID: <4EBBAE9D.3040007@cs.ox.ac.uk>
Dear Sampo: some quite interesting remarks that call for a reply; see
inline.

On 09/11/11 19:15, Sampo Syreeni wrote:
> On 2011-11-03, Markus Krötzsch wrote:
>
>> It is true that DL (and thus OWL DL) is conceptually based on
>> unary/binary relations but this was hardly the historical reason for
>> RDF being defined this way in the first place.
>>
>> However, somewhat ironically, OWL ontologies are a good example of
>> pieces of data that suffer a lot from forced triplification. [...]
>
> I don't believe this is a theoretical obstacle to either of "triplified"
> reifications of general n-ary data, or alternatively the n-ary
> representations of triples. That's because I believe there is a natural
> and well-defined isomorphism (perhaps a pair of functors) between the
> two models, which fully retains all of the logical aspects of the model.
>
> If and when so, you then just prove whatever you want to on either side,
> and it applies on the other by automation. After that, the question of
> which data model is the best reduces to questions about programmer
> productivity, maintainability of shared interfaces, and the efficiency
> of actual (physical as well) implementations. Not a logical problem, but
> a practical one.
>
> True, realizing and proving such an equivalence is a hard task, since it
> obviously requires someone to restrict what is done on the triple side
> to what can be done in its image on the n-ary, relational one. But
> really, it shouldn't be *that* hard, and once you've done the job, you
> suddenly have two rather different and possibly complementary theories
> which bear on the same problem: relational design principles like the
> theory of normalization on the one hand, and then the hard logical
> theory of description logics and the like on the other.
>
> Personally I'm reasonably sure that attacking the logical problems
> starting from both fronts at the same time will lead to more fruitful
> and easier to prove theorems in the long run. At the same time, having a
> single, well-settled isomorphism between the two models in place would
> grant a lot more leeway for API, storage, optimizer and like builders to
> find the optimum balance e.g. between how to access the traditional
> OLTP/OLAP-like databases, and the more involved, deductive ones. In my
> mind, this sort of thinking leads to a clearer separation of the levels
> of abstraction, the way my favourite relational model tried to do from
> the start, but at the same time extends to semi-structured data.

Let me first comment on this observation which I would summarise as "The 
information of any n-ary relational structure can be captured by a 
triple-based structure, and it should be no theoretical obstacle to 
translate between the two."

In principle, this is true, and it is what is already done: there is a
standard translation between the n-ary OWL data model and the
triple-based RDF data model [1]. This is (one possible version of) an
isomorphism of the kind you suggest. So this part is solved.

Unfortunately, this does not really solve the underlying problem. You 
are fine as long as you have an OWL ontology with n-ary statements. This 
is easy to translate into triples and these triples can be used in place 
of the original by tools. As you say, this is just an implementation 
issue and may actually add more freedom to tool design.
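
As a rough illustration of the easy direction, here is a minimal Python
sketch that decomposes one n-ary "these classes are pairwise disjoint"
statement into triples, using a blank node for the statement and an
RDF-style linked list for its members. It is only in the spirit of [1],
not the normative mapping; the helper names (fresh_blank,
encode_all_disjoint) and the example class names are made up for the
illustration.

```python
from itertools import count

_blank_ids = count()

def fresh_blank():
    """Return a new blank-node label such as '_:b0'."""
    return f"_:b{next(_blank_ids)}"

def encode_all_disjoint(classes):
    """Decompose one n-ary disjointness statement into a list of triples."""
    triples = []
    head = "rdf:nil"
    # Build the linked list back to front so each cell can point at its tail.
    for cls in reversed(classes):
        cell = fresh_blank()
        triples.append((cell, "rdf:first", cls))
        triples.append((cell, "rdf:rest", head))
        head = cell
    # One extra node stands for the statement itself.
    axiom = fresh_blank()
    triples.append((axiom, "rdf:type", "owl:AllDisjointClasses"))
    triples.append((axiom, "owl:members", head))
    return triples

triples = encode_all_disjoint([":Cat", ":Dog", ":Fish"])
for triple in triples:
    print(triple)
```

A single statement over three classes becomes eight triples, and
nothing in the triple syntax itself records that they belong together.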

But not every set of triples can be translated back into OWL. Since RDF 
is the main exchange format on the data web, you can find a lot of 
OWL-like RDF documents online that do not translate back into OWL 
axioms. To address this, it was necessary to develop an alternative, 
RDF-Based Semantics for interpreting OWL. This semantics is tolerant of
noise of all kinds, but there is no algorithm for finding all entailed
inferences (i.e., there cannot ever be one for principled reasons).
Moreover, the RDF-Based Semantics only partially agrees with the
DL-based "Direct Semantics", which is not so good for interoperability.

Summing up, OWL is based on n-ary axioms that are inspired by features 
in Description Logics, but since these axioms are decomposed and mixed 
up on the Web, it is often not possible to translate them back into 
axioms to which DL methods would be applicable. This also affects tool 
interoperability on an API level, since only tools that are based on 
triple decompositions of axioms can be sure to process any OWL document 
without losing information. If axioms were encoded as the n-ary 
statements that people originally entered when editing the ontologies in 
their editors, then many of the reasons for not being able to apply the 
Direct Semantics would vanish (not all, but many).
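
The reverse direction can be sketched in the same style. The Python
fragment below tries to reassemble n-ary disjointness statements from
an unordered bag of triples and reports those it cannot rebuild. It is
a deliberately simplified assumption of how such a parser might look
(the real reverse mapping in [1] has many more cases and conditions);
the function name and the sample data are hypothetical.

```python
def decode_all_disjoint(triples):
    """Reassemble n-ary disjointness statements from unordered triples.

    Returns (axioms, problems): the member tuples that could be rebuilt,
    and a message for each statement whose triples were incomplete.
    """
    # First pass: index every triple, since the pieces of one statement
    # may appear anywhere in the input.
    index = {}
    for s, p, o in triples:
        index[(s, p)] = o

    axioms, problems = [], []
    # Second pass: follow the list structure from each statement node.
    for s, p, o in triples:
        if (p, o) != ("rdf:type", "owl:AllDisjointClasses"):
            continue
        cell = index.get((s, "owl:members"))
        members, seen = [], set()
        while cell is not None and cell != "rdf:nil" and cell not in seen:
            seen.add(cell)
            first = index.get((cell, "rdf:first"))
            if first is None:
                cell = None
                break
            members.append(first)
            cell = index.get((cell, "rdf:rest"))
        if cell == "rdf:nil":
            axioms.append(tuple(members))
        else:
            problems.append(f"{s}: member list is missing or broken")
    return axioms, problems

complete = [
    ("_:l2", "rdf:rest", "rdf:nil"),
    ("_:a", "owl:members", "_:l1"),
    ("_:l1", "rdf:first", ":Cat"),
    ("_:l2", "rdf:first", ":Dog"),
    ("_:a", "rdf:type", "owl:AllDisjointClasses"),
    ("_:l1", "rdf:rest", "_:l2"),
]
# The same document with a single triple lost along the way.
broken = [t for t in complete if t != ("_:l1", "rdf:rest", "_:l2")]

print(decode_all_disjoint(complete))
print(decode_all_disjoint(broken))
```

Both inputs are perfectly valid RDF, but only the first one yields an
axiom; the second leaves the tool with triples it can store but not
interpret as a statement.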

>
>> OWL has a native (functional style) syntax that is quite easy to
>> parse, whereas its RDF serialisation requires multiple passes over the
>> data to group triples that belong to the same axioms (because the
>> triples that form a single OWL statement can be distributed over a
>> whole file, in random order).
>
> Extending the above analogy of mine, that might suggest a binary RDF
> serialization which groups and orders triples for more efficient
> (semi-)serial computation and communication. But it certainly does not
> affect the logical quality of the overall theory we're dealing with.
> Thus, decoupling of different levels of abstraction:
> storage/transmission/processing on the one hand, and the logical
> underpinnings on the other.

Yes, this would work if all stored data made sense on all levels of
abstraction. But on the Web, a transmission format that does 
not syntactically enforce that the data is meaningful on higher levels 
will always lead to data where this is not the case, effectively forcing 
all tools to work at the lowest level of representation and losing the 
hoped-for independence between processing and encoding.

I do not claim to have a ready solution for this, since the
interoperability issues with a relational model are not necessarily
smaller (e.g., how do you enforce that a relation has a constant arity
across all its uses on the Web?). Maybe the best way is to promote
greater awareness of data quality on the data producer side, which is
the mission of the Pedantic Web group [2].
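
To make the arity question concrete, here is a small Python sketch that
collects n-ary facts from several sources and reports relation names
used with more than one arity. Note that it only detects the
inconsistency after the fact; it does not enforce anything. The source
names, relation names and the function arity_conflicts are invented for
the example.

```python
from collections import defaultdict

def arity_conflicts(facts):
    """Map each relation used with more than one arity to its set of arities."""
    arities = defaultdict(set)
    for source, relation, args in facts:
        arities[relation].add(len(args))
    return {rel: sizes for rel, sizes in arities.items() if len(sizes) > 1}

facts = [
    ("site-a", "bornIn", ("alice", "oxford")),
    ("site-b", "bornIn", ("bob", "helsinki", "1975")),
    ("site-a", "knows", ("alice", "bob")),
]
print(arity_conflicts(facts))
```

Here "bornIn" is flagged because one source uses it with two arguments
and another with three, while "knows" is consistent.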

>
>> So one can actually say that OWL users, while preferring to model
>> information in a *semantic* world of binary relations, are not very
>> well served with a *syntax* that requires n-ary statements to be
> encoded in triples which do not have a reasonable meaning unless
>> they can be re-assembled appropriately.
>
> Precisely so. And this is again one thing the relational world learned
> long before RDF came along: semantics and syntax should be decoupled,
> but then once you start to implement stuff for real, the syntax must be
> advised by the semantics, unless we want implementations with unbounded
> buffers, unnecessary sorting/merging/joining and so on. This is all
> covered within the relational literature, even in the distributed DB
> plus distributed DBMS setting. Thus, what *I* think we need is a
> clearcut isomorphism from the triple/EAV model to the relational, n-ary
> one, and then just wholesale application of knowledge via that
> isomorphism in both ways.

For OWL, this isomorphism is [1], but only for a subset of RDF. For 
general RDB models, there are various efforts to achieve something 
similar, again for only a subset of RDF.

Such isomorphisms do not work well in all contexts, especially not in 
situations where new data is created: to create a new 4-ary relation in 
triples, e.g., one needs to create new *individual objects*, i.e., one 
has to add to the active domain of the database. This can be a problem 
(for one thing, if you do this recursively then you don't know if it 
will ever stop). We have proved recently that there are Semantic Web 
related tasks where ternary relations (triples) are not sufficient to 
compute inferences, even if the inferences are triples [3] (again, this 
is a principled result: no algorithm [of the general kind considered in 
the paper] that uses only triples without inventing new individuals can 
ever solve this problem).
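
The point about new individuals can be illustrated with a minimal
Python sketch: encoding a single 4-ary fact as triples requires a fresh
node to hold the tuple together, so the active domain grows by one
element per encoded fact. The positional predicates arg1..arg4, the
relation "supplies" and the helper names are assumptions made for the
illustration, not part of any standard vocabulary.

```python
from itertools import count

_ids = count()

def encode_tuple(relation, args):
    """Encode one n-ary fact as triples hanging off a fresh individual."""
    node = f"_:t{next(_ids)}"  # new individual, not present in the data
    triples = [(node, "rdf:type", relation)]
    for position, value in enumerate(args, start=1):
        triples.append((node, f"arg{position}", value))
    return triples

def active_domain(triples):
    """Individuals in subject or object position (type names excluded)."""
    domain = set()
    for s, p, o in triples:
        domain.add(s)
        if p != "rdf:type":
            domain.add(o)
    return domain

args = ("acme", "bolts", "oxford", "2011")
triples = encode_tuple("supplies", args)
# Individuals that exist only because of the encoding:
print(sorted(active_domain(triples) - set(args)))
```

An inference procedure that derives new 4-ary facts would have to mint
such a node for every derived fact, which is exactly where termination
becomes hard to guarantee.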

Markus


[1] http://www.w3.org/TR/owl2-mapping-to-rdf/
[2] http://pedantic-web.org/
[3] http://korrekt.org/page/Efficient_Rule-Based_Inferencing_for_OWL_EL


-- 
Dr. Markus Krötzsch
Department of Computer Science, University of Oxford
Room 306, Parks Road, OX1 3QD Oxford, United Kingdom
+44 (0)1865 283529               http://korrekt.org/
Received on Thursday, 10 November 2011 11:00:07 UTC