Re: Datatyping Summary from Jeremy Carroll on 2002-01-30 (w3c-rdfcore-wg@w3.org from January 2002)

From: Jeremy Carroll <jjc@hplb.hpl.hp.com>
Date: Wed, 30 Jan 2002 09:38:14 -0000
To: <w3c-rdfcore-wg@w3.org>
Message-ID: <CEECKEAMDAJDDEDGJNBEMEBGCAAA.jjc@hpl.hp.com>
[More on query also a substantial clarification of why I "can't live with"
S-B. It encourages unsafe type processing within the application layer].

Lets look at the example query first:

[[[
   _:f <dc:Title> "10" .
   <mary> <age> "10" .

Given a query:

   (?x <dc:Title> ?y) & (?z <age> ?y)

existing applications will return:

   ?x = _:f, ?y = "10", ?z = <mary>

]]]

RDF Query is of course, still an active research area, rather than one where
there is any stable deployed code base. (There is deployed code, but it is
in development).

Hence, discussion about query semantics would perhaps be better placed on
rdf-query, but ...


Under S-B (the relevant idiom here) and RDF M&S, the only possible meaning
of whether two literal nodes are equal is whether their labels are equal.

Suppose RDF M&S and S-B are read untidily then each distinct triple with a
literal as object has a distinct literal as object. There is no mechanism
for indicating that two literals are the same or different except by their
label.

Since the query is asking us to compare two literal nodes, under S-B or RDF
M&S there is only one possibility, compare their labels. Both the new
(untidy) model theory and TDL suggest a second possibility, that of
comparing the values associated with the literal nodes. Neither rule out the
old possibility, they simply permit a new possibility. It is deceptive to
suggest something is not backwardly compatible simply because it offers a
better alternative, while allowing the old deprecated approach.


===

In terms of the syntactic transformation mapping TDL into S-P [1] the data
in this example becomes:

   _:f <dc:Title> _:t .
   _:t <rdf:value> "10" .
   <mary> <age> _:a .
   _:a <rdf:value> "10" .

(I use this transformation for explanatory purposes, I would not implement
TDL like this).

Now, both _:t and _:a can take possible values including < "10", "10" >
(string reading) and < "10", 10 > (numeric reading).
Hence the query as originally formulated is now false. However, if we wish
to execute the query in a backwards compatibility mode, we would expect it
to compare literal nodes using their string labels, rather than their
values.

Hence, the query, in this S-P style, would be equivalent to:


   [ (?x <dc:Title> ?t) & (?t <rdf:value> ?y)
    & (?z <age> ?a) & (?a <rdf:value> ?y) ]  // checking literal label
equality
|
   [ (?x <dc:Title> ?y) & (?z <age> ?y) ]  // checking non-literal node
equality

This could all happen in response to dispatching node equality method on the
basis of the type of the nodes being compared, and on the presence or not of
a backwards compatibility flag.

====

Now Patrick has argued that comparing on values is correct.
Dan and Sergey argue that comparing on labels is correct (described by them
using tidiness).

TDL allows clarity about this distinction, and allows query researchers to
explore both possibilities.

====

This framework allows me to illustrate an aspect of my "can't live" issue
with S-B.

S-B allows range constraints, in this example perhaps:

<dc:title> <rdfs:range> <xsd:string.lex> .
<age> <rdfs:range> <xsd:integer.lex> .

I currently understand S-B as, within the RDF datatyping layer, insisting
that "10" is a string.
The two range constraints are used to:
- constrain the set of possible strings
- act as a hint to the application layer that:
   * type conversion is possible
   * type conversion is desirable.

Thus given the database and the schema the application processing will
correctly treat the title as a string, and the age as an integer. Good.

Now, the query also operates in the application layer.
This returns true.

Thus in the application layer we have the following facts being the case:
  The film has the title "10".
  mary has age 10.
  The age of mary is the title of the film.
   i.e. that 10 is "10"

There is a type clash here, and the combination is a logical error.

Thus, S-B maintains a theoretical purity by pushing all typing problems into
the application layer. Moreover it apparantly licenses the unwary
application developer into during contradictory conclusions.

So, S-B is seriously flawed in that it does not assist the application
developer to avoid logical errors associated with datatyping.

=====

Let's now explore this same error from the point of TDL.
TDL requires a clarification of the query, are we speaking about the values,
or are we speaking about the lexical forms.

If we are talking about the values, then the query is only true if we have
compatible type declarations for both literal nodes. Since we are not, we
cannot conclude that
  The age of mary is the title of the film.
hence we avoid the error.

If on the other hand we are talking about the lexicalizations we have:
   The age of mary as a string is the title of the film as a string.
   i.e. 10 as a string is "10" as a string

So TDL assists the application developer in being logically correct.



----

Brian, I would be very much obliged if you can condense this example to add
to your summary list of concrete "can't live with" issues on the proposals.
My title would be "S-B encourages logically errors in the application type
processing."

PS. Sergey attacked Patrick's list of "can't live with" issues as largely
advocacy of TDL; I do tend to feel that an aspect of "can't live with" is
comparative. Logical errors happen, no framework prevents all logical
errors. Choosing a framework that is logically unsafe when an alternative
safe framework is available is folly. However, modifications to S-B that
assisted the application to make logically safe type inferences would be a
substantial improvement that would address one of my major concerns.



Jeremy


[1]Jeremy Carroll: Re: Datatyping Summary
http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2002Jan/0369.html
Received on Wednesday, 30 January 2002 04:38:52 UTC