Can't live with TDL and untidy literals

A major problem with TDL [1] is that it requires RDF graphs to be untidy
on literals. This requirement would break most RDF applications that I
am aware of.

To illustrate this point, let me give the following two examples.

Example 1: Querying and API access
----------------------------------

Consider the following graph consisting of two RDF statements:

_1 --dc:Title--> "The Origin of Species"
_2 --my:book-->  "The Origin of Species"

Existing applications that assume RDF graphs to be tidy on literals can
safely conclude that the two literals in the above graph are identical.
In other words, the query

(X --dc:Title--> Z) & (Y --my:book--> Z)

will succeed and return a variable substitution:

{X=_1, Y=_2, Z="The Origin of Species"}

In contrast, if literals are considered untidy, such a conclusion cannot
be drawn safely without having access to the schemas that describe the
properties dc:Title and my:book. In fact, if the schema information for
dc:Title or my:book is missing, the two literals in the graph have to be
considered distinct. In such a case, one or both literals would be
"untyped", i.e. could potentially have a different interpretation, so
that their equality does not hold in all valid interpretations.

Consequently, the above query would (have to) fail and produce no
answer.
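
To make the difference concrete, here is a minimal Python sketch of the
query above, evaluated over a tidy and an untidy version of the graph.
The tuple representation of statements and the names LiteralOccurrence
and query are my own assumptions for illustration; they are not taken
from TDL [1] or from any existing RDF API.

```python
TITLE = "The Origin of Species"


def tidy_graph():
    # Tidy on literals: equal strings denote one and the same node,
    # so a plain Python string is enough to represent a literal.
    return [("_1", "dc:Title", TITLE), ("_2", "my:book", TITLE)]


class LiteralOccurrence:
    """One occurrence of a literal in an untidy graph (hypothetical).

    Equality is left as Python's default identity comparison: two
    occurrences with the same text are still two different nodes.
    """

    def __init__(self, text):
        self.text = text


def untidy_graph():
    # Untidy on literals: every occurrence is a separate node.
    return [
        ("_1", "dc:Title", LiteralOccurrence(TITLE)),
        ("_2", "my:book", LiteralOccurrence(TITLE)),
    ]


def query(graph, p1, p2):
    """Evaluate (X --p1--> Z) & (Y --p2--> Z) as a nested-loop join on Z."""
    results = []
    for x, pred_a, z1 in graph:
        if pred_a != p1:
            continue
        for y, pred_b, z2 in graph:
            if pred_b == p2 and z1 == z2:
                results.append({"X": x, "Y": y, "Z": z1})
    return results


tidy_answers = query(tidy_graph(), "dc:Title", "my:book")
untidy_answers = query(untidy_graph(), "dc:Title", "my:book")
```

With the tidy graph the join on Z finds one substitution; with the
untidy graph it finds none, because nothing in the graph alone
justifies treating the two occurrences as the same node.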

Similar issues arise for any kind of API access to RDF graphs. The
objects or data structures that represent literals in a programming
language cannot be safely compared without having type information
attached. In other words, the literals would have to carry along the
properties they are used with and/or the schema class(es) used as the
range of such properties.

That is, developers would have to make literals complex objects.
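
As a rough sketch of what such a complex literal object might look
like, consider the following Python fragment. The class TypedLiteral,
the function provably_equal, and the datatype name "xsd:string" are
hypothetical and serve only to illustrate the point; the comparison
rule is a deliberately conservative assumption, not something
prescribed by TDL [1].

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TypedLiteral:
    """A literal that carries its type information along (hypothetical)."""

    text: str
    # Datatype derived from the property's schema; None means that no
    # schema information was available for this occurrence.
    datatype: Optional[str] = None


def provably_equal(a: TypedLiteral, b: TypedLiteral) -> bool:
    """Return True only if equality holds regardless of interpretation."""
    if a.datatype is None or b.datatype is None:
        # An untyped literal could denote anything, so nothing is provable.
        return False
    return a.datatype == b.datatype and a.text == b.text


untyped_1 = TypedLiteral("The Origin of Species")
untyped_2 = TypedLiteral("The Origin of Species")
typed_1 = TypedLiteral("The Origin of Species", "xsd:string")
typed_2 = TypedLiteral("The Origin of Species", "xsd:string")
```

Here provably_equal(untyped_1, untyped_2) is False even though the
texts match, while provably_equal(typed_1, typed_2) is True. A plain
string comparison is no longer sufficient anywhere in the application.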

Example 2: Storage
------------------

Currently, the storage backends for RDF graphs can benefit substantially
from the fact that RDF graphs are tidy on literals. In other words, all
literals with the same textual content can be replaced by the same
integer ID, which is then stored as an element of an RDF statement in
the database. This feature facilitates compact storage of RDF graphs and
allows efficient query processing.

In contrast, having untidy literals would imply in the general case that
each occurrence of a literal needs to be stored using a different
integer ID. As a consequence, the database size explodes, and queries
become prohibitively expensive.
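
The effect on the literal table can be sketched in a few lines of
Python. The class LiteralTable and its intern method are assumptions
made for this illustration and do not describe any particular storage
backend; the sketch shows only the growth in the number of stored
literals, not the query cost.

```python
class LiteralTable:
    """Maps literal occurrences to integer IDs (hypothetical sketch)."""

    def __init__(self, tidy):
        self.tidy = tidy
        self.ids = {}    # text -> ID, consulted only when tidy
        self.texts = []  # ID -> text, one entry per stored literal

    def intern(self, text):
        # Tidy: reuse the ID already assigned to this text, if any.
        if self.tidy and text in self.ids:
            return self.ids[text]
        # Otherwise allocate a fresh ID for this occurrence.
        new_id = len(self.texts)
        self.texts.append(text)
        if self.tidy:
            self.ids[text] = new_id
        return new_id


tidy_table = LiteralTable(tidy=True)
untidy_table = LiteralTable(tidy=False)
for _ in range(1000):
    tidy_table.intern("The Origin of Species")
    untidy_table.intern("The Origin of Species")

tidy_rows = len(tidy_table.texts)
untidy_rows = len(untidy_table.texts)
```

After 1000 occurrences of the same literal, the tidy table holds a
single entry, whereas the untidy table holds 1000, one per occurrence.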

Final remark
------------

As a datatyping proposal, TDL introduces an original idiom for
representing datatypes that utilizes pairs of lexical tokens and data
values for representing typed data elements. The document [2] shows how
this idiom can be deployed *without* requiring RDF graphs to be untidy
on literals, in a way consistent with the current model theory draft
[3]. The corresponding idiom in [2] is called Idiom P (or S-P).

-- Sergey

[1] http://www-nrc.nokia.com/sw/TDL.html
[2] http://www-db.stanford.edu/~melnik/rdf/datatyping/
[3] http://www.w3.org/TR/rdf-mt/

Received on Friday, 25 January 2002 11:52:52 UTC