Re: Can't live with TDL and untidy literals from Patrick Stickler on 2002-01-28 (w3c-rdfcore-wg@w3.org from January 2002)

From: Patrick Stickler <patrick.stickler@nokia.com>
Date: Mon, 28 Jan 2002 12:51:54 +0200
To: ext Sergey Melnik <melnik@db.stanford.edu>, RDF Core <w3c-rdfcore-wg@w3.org>
Message-ID: <B87AF9EA.C6E6%patrick.stickler@nokia.com>
On 2002-01-25 19:22, "ext Sergey Melnik" <melnik@db.stanford.edu> wrote:

> A major problem with TDL [1] is that it requires RDF graphs to be untidy
> on literals. This requirement would break most RDF applications that I
> am aware of.

TDL itself doesn't, but I agree that the current TDL MT appears to.

> To illustrate this point, let me give the following two examples.
> 
> Example 1: Querying and API access
> ----------------------------------
> 
> Consider the following graph consisting of two RDF statements:
> 
> _1 --dc:Title--> "The Origin of Species"
> _2 --my:book-->  "The Origin of Species"
> 
> Existing applications that assume RDF graphs to be tidy on literals can
> safely conclude that the two literals in the above graph are identical.
> In other words, the query
> 
> (X --dc:Title--> Z) & (Y --my:book--> Z)
> 
> will succeed and return a variable substitution:
> 
> {X=_1, Y=_2, Z="The Origin of Species"}

But is your query based on nodes or labels? I would expect
that it is based on labels -- in which case, it is irrelevant
whether literals are tidy or untidy. In fact, you could
execute such kinds of queries even on a completely untidy
graph, since the scope of your query is a single triple
and tidiness only applies to sets of triples, not individual
triples.

I don't see how TDL, either with tidy or untidy literals,
would break your application.
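
The label-versus-node distinction can be sketched as follows. This is
a minimal Python illustration, not part of TDL or of any existing RDF
API: LiteralNode, its node_id field, lit() and join_on_label() are
hypothetical names, and the graph is deliberately untidy, so each
literal occurrence gets its own node.

```python
from dataclasses import dataclass
from itertools import count

_next_id = count(1)  # hypothetical source of node identities


@dataclass(frozen=True)
class LiteralNode:
    """One literal occurrence; two nodes may carry the same label."""
    node_id: int
    label: str


def lit(label: str) -> LiteralNode:
    """Create a fresh (untidy) literal node for every occurrence."""
    return LiteralNode(next(_next_id), label)


# The two statements from the example, stored untidily.
triples = [
    ("_1", "dc:Title", lit("The Origin of Species")),
    ("_2", "my:book", lit("The Origin of Species")),
]


def join_on_label(triples, p1, p2):
    """Answer (X --p1--> Z) & (Y --p2--> Z), comparing Z by label only."""
    return [
        {"X": s1, "Y": s2, "Z": o1.label}
        for (s1, q1, o1) in triples
        if q1 == p1
        for (s2, q2, o2) in triples
        if q2 == p2 and o1.label == o2.label
    ]


result = join_on_label(triples, "dc:Title", "my:book")
print(result)
```

The two literal nodes are distinct objects, yet the query still binds
Z, because only the labels are compared.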

> In contrast, if literals are considered untidy, such a conclusion cannot
> be drawn safely without having access to the schemas that describe the
> properties dc:Title and my:book. In fact, if the schema information for
> dc:Title or my:book is missing, the two literals in the graph have to be
> considered distinct.

I'm sorry, but exactly what information is being provided
by the schema?

> In such a case, one or both literals would be
> "untyped", i.e. could potentially have a different interpretation, so
> that their equality does not hold in all valid interpretations.

That is true regardless. If there is no typing expressed in the RDF
graph, then one must rely on the RDF-external application environment
to provide the interpretation.
 
> Consequently, the above query would (have to) fail and produce no
> answer.

I don't see that to be the case, if you are comparing labels
rather than nodes. If you are comparing nodes, then I would
wonder what the benefit of that is. If your graph is tidy, then
you can be assured that every string-unique literal is the
label of one and only one node -- but how does that have anything
to do with the semantics of your query?
 
> Similar issues arise for any kind of API access for RDF graphs. The
> objects or data structures that represent literals in a programming
> language cannot be safely compared without having type information
> attached. 

I agree with you there.

> In other words, the literals would have to carry along the
> properties they are used with and/or the schema class(es) used as the
> range of such properties.
> 
> That is, developers would have to make literals complex objects.

Ummm... well, you have three choices:

1. Define type locally.
2. Define type globally, by some schema.
3. Define type globally, by application environment.

You seem to be taking the third choice, and then demanding
that the interpretation imposed by the environment be
supported by all logical queries on an RDF graph. I don't
find that the least bit reasonable.

If you wish to reliably interchange knowledge between
disparate applications and environments, then all knowledge
must be explicitly defined, either globally or locally,
and thus while choice 3 may work for a single, tightly
controlled application environment, it cannot be the basis
for global interchange of knowledge.

What you may assert some literal means for your application
may not be known by some other application which attempts
to interpret your knowledge (possibly unbeknownst to you)
therefore typing external to RDF is inherently non-portable.

And, yes, local typing does produce complex objects. What else
would you expect?

> Example 2: Storage
> ------------------
> 
> Currently, the storage backends for RDF graphs can benefit substantially
> from the fact that RDF graphs are tidy on literals.

But are current storage backends presently based on tidy literal graphs?

Though, I fully agree that there is significant benefit to be had in
reduced graph real estate if literals are tidy. No argument there.

And, again, TDL has no problems with tidy literals.

> In other words, all
> literals with the same textual content can be replaced by the same
> integer ID, which is then stored as an element of an RDF statement in
> the database. This feature facilitates compact storage of RDF graphs and
> allows efficient query processing.

Wrong. Sorry. Nope. The *only* benefit is a compression of graph nodes.
If you have tidy literals, then queries *cannot* be based on node
equality, only on label equality, and such queries are within the context
of a given interpretation.

Thus, per my example illustration in

http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2002Jan/0314.html

the fact that both statements share the same object node with literal
label "30" does *not* mean that, e.g., Jenny has an age equal to
(foo:count,"30") or that Bob has a jersey number equal to
(xsd:integer,"30").

In other words, having tidy literal nodes does not mean that the node itself
denotes a member of a value space of some datatype! It is only a string
which may have meaning as a lexical form for an interpretation within
the context of a given datatype. That is all.

Thus, queries on literal tidy (or untidy) graphs should always be based
on literal label comparison, not on node comparison.
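
A rough sketch of the "30" case, assuming a tidy store that interns
each label under one integer ID. Everything here is illustrative: the
ex: property names, the intern_literal() and pairing() helpers, and
the assignment of xsd:integer to the age and foo:count to the jersey
number are assumptions made for the example, not taken from any
schema.

```python
# Tidy storage: every distinct literal label is interned exactly once.
label_to_id = {}
id_to_label = {}


def intern_literal(label):
    """Return the single integer ID shared by all occurrences of label."""
    if label not in label_to_id:
        new_id = len(label_to_id) + 1
        label_to_id[label] = new_id
        id_to_label[new_id] = label
    return label_to_id[label]


statements = [
    ("ex:Jenny", "ex:age", intern_literal("30")),
    ("ex:Bob", "ex:jerseyNumber", intern_literal("30")),
]

# Assumed datatype context for each property (illustrative only).
datatype_of_property = {
    "ex:age": "xsd:integer",
    "ex:jerseyNumber": "foo:count",
}


def pairing(statement):
    """Return the (datatype, lexical form) pairing for a statement's object."""
    _subject, prop, literal_id = statement
    return (datatype_of_property[prop], id_to_label[literal_id])


jenny, bob = statements
same_node = jenny[2] == bob[2]                  # shared tidy node
same_pairing = pairing(jenny) == pairing(bob)   # differs by datatype
print(same_node, same_pairing)  # prints: True False
```

Both statements point at the same stored node, but the typed pairings
differ, so node identity alone says nothing about value equality.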

 
> In contrast, having untidy literals would imply in the general case that
> each occurrence of a literal needs to be stored using a different
> integer ID. As a consequence, the database size explodes, and the
> queries become prohibitively expensive.

I agree. And this is one argument in favor of tidy literals, but no
problem for TDL.


> Final remark
> ------------
> 
> As a datatyping proposal, TDL introduces an original idiom for
> representing datatypes that utilizes pairs of lexical tokens and data
> values for representing typed data elements.

Actually, a TDL pairing is a lexical form (literal) and datatype
identity (URI) which uniquely denotes a datatype mapping, and the
datatype mapping consists of a member of the lexical space and
its corresponding member of the value space.
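
As a hedged illustration only, such a pairing could be modelled like
this in Python. The DATATYPES registry and the denote() function are
hypothetical names; only the xsd:integer URI is real, and the regular
expression is a simplified stand-in for its lexical space.

```python
import re

XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def _xsd_integer(lexical):
    """Map a lexical form to its value; reject forms outside the lexical space."""
    if not re.fullmatch(r"[+-]?[0-9]+", lexical):
        raise ValueError(f"not in the lexical space of xsd:integer: {lexical!r}")
    return int(lexical)


# Hypothetical registry: datatype URI -> lexical-to-value function.
DATATYPES = {XSD_INTEGER: _xsd_integer}


def denote(lexical, datatype_uri):
    """Return the datatype mapping (lexical form, value) denoted by a pairing."""
    return (lexical, DATATYPES[datatype_uri](lexical))


print(denote("30", XSD_INTEGER))
print(denote("030", XSD_INTEGER))
```

The pairing itself holds only a string and a URI; the value appears
only once the datatype's mapping is applied. Two different lexical
forms ("30" and "030") can map to the same value.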

Thus, references to values are strictly within the MT, not the
"fundamental" TDL model.

> The document [1] shows how
> this idiom can be deployed *without* requiring RDF graphs to be untidy
> on literals, in a way consistent with the current model theory draft
> [3]. The corresponding idiom in [1] is called Idiom P (or S-P).

OK, I think you are seeing what I have outlined in

http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2002Jan/0314.html

that TDL itself is not incompatible with tidy literals.


Also, neither of the TDL idioms states whether the graph must
be tidy or untidy for literals; in fact, like TDL, both are
agnostic about that.

Patrick


> -- Sergey
> 
> [1] http://www-nrc.nokia.com/sw/TDL.html
> [2] http://www-db.stanford.edu/~melnik/rdf/datatyping/
> [3] http://www.w3.org/TR/rdf-mt/
> 
> 

--
               
Patrick Stickler              Phone: +358 50 483 9453
Senior Research Scientist     Fax:   +358 7180 35409
Nokia Research Center         Email: patrick.stickler@nokia.com
Received on Monday, 28 January 2002 05:50:57 UTC