Re: datatyping and query in RDF from Patrick Stickler on 2002-01-30 (www-rdf-comments@w3.org from January to March 2002)

From: Patrick Stickler <patrick.stickler@nokia.com>
Date: Wed, 30 Jan 2002 09:42:01 +0200
To: ext Libby Miller <Libby.Miller@bristol.ac.uk>, RDF Comments <www-rdf-comments@w3.org>
CC: Jeremy Carroll <jjc@hplb.hpl.hp.com>, Brian McBride <bwm@hplb.hpl.hp.com>, ext Graham Klyne <Graham.Klyne@MIMEsweeper.com>
Message-ID: <B87D7069.CA11%patrick.stickler@nokia.com>
On 2002-01-29 21:45, "ext Libby Miller" <Libby.Miller@bristol.ac.uk> wrote:

Thanks very much for your comments and examples, Libby. I found
them very useful in further clarifying the issue regarding how
literals and datatyping interact in queries on the RDF graph.

Some comments/questions for you below...

> In my experience, usually you don't care what the type of a node is;
> sometimes you do, and then you can add the extra constraint.
> 
> This doesn't work in TDL 'global idiom' when the datatyping is only
> mentioned in rdfs:range. If it was in the database somewhere you could
> have
> 
> select ?x ?y ?z
> where
> (?x <dc:Title> ?y)
> (?z <age> ?y)
> 
> and the constraint on ?y from the range would be implicit and would
> happen somewhere in the application code.

Hmmm....  Why not just include the range constraints in the
query? After all, it's knowledge that's in the graph. E.g.

select ?x ?y ?z ?r
where
(?x <dc:Title> ?y)
(?z <age> ?y)
(<dc:Title> <rdfs:range> ?r)
(<age> <rdfs:range> ?r)

The range tests simply ensure that the two datatype
contexts, for <age> and <dc:Title>, have some common
intersection of type which would allow the literal to
have a common value interpretation between them -- i.e.
that ?y would be the same "thing" in both contexts.
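
To illustrate, here is a minimal sketch in Python of how the extra
range patterns change the result. The matcher, the sample triples
(doc1, bob, the xsd: names) and the function name match are all
hypothetical and not part of any existing query engine. It also
simplifies "common intersection of type" to exact equality of the
two ranges.

```python
# Hypothetical in-memory graph: subject, predicate, object as plain strings.
triples = [
    ("doc1", "dc:Title", "42"),
    ("bob", "age", "42"),
    ("dc:Title", "rdfs:range", "xsd:string"),
    ("age", "rdfs:range", "xsd:integer"),
]


def match(patterns, graph, binding=None):
    """Yield every variable binding that satisfies all patterns (naive join)."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in graph:
        candidate = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # A variable must keep the same value everywhere it appears.
                if candidate.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, graph, candidate)


# The original query: matches on the shared string "42".
plain = [("?x", "dc:Title", "?y"), ("?z", "age", "?y")]

# The same query with the range constraints added: both properties
# must share the same range ?r (exact equality, a simplification).
with_range = plain + [
    ("dc:Title", "rdfs:range", "?r"),
    ("age", "rdfs:range", "?r"),
]

plain_results = list(match(plain, triples))
range_results = list(match(with_range, triples))
```

With this sample data the plain query binds ?y to the string "42",
while the query with the range patterns returns nothing, because the
two ranges differ.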

This presumes, of course, that you are basing your query
on the values and not simply the string representation
of their lexical forms. Otherwise, why would you want an
integer value and a string value to be considered the
same thing? Are you then conducting queries on string
labels in triples rather than the values they represent?

Of course, I would expect that a query API would be
based on an abstraction of the "raw" RDF graph, which
takes datatype context into account, so that a query
such as above would not be based on string comparison
of literals, but on comparison of TDL pairings
(lexical form + datatype). In which case, the range
constraints in the query would be unnecessary, since
the query engine would be trying to bind TDLs to ?y
and not literal strings -- and thus two different
values would not "accidentally" be bound to the same
query variable.
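
A rough sketch of that idea in Python, assuming a TDL pairing can be
modelled as a simple (lexical form, datatype) tuple. The class name
TypedLiteral and the xsd: datatype names are illustrative choices
only, not anything defined by the TDL proposal.

```python
from typing import NamedTuple


class TypedLiteral(NamedTuple):
    """Hypothetical model of a TDL pairing: lexical form plus datatype."""
    lexical_form: str
    datatype: str


title = TypedLiteral("42", "xsd:string")
age = TypedLiteral("42", "xsd:integer")

# Comparing only the strings reports a match ...
same_string = title.lexical_form == age.lexical_form

# ... while comparing the pairings does not, so an engine that binds
# pairings to ?y would not bind these two to the same variable.
same_pairing = title == age
```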

Cf. my example near the end of

http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2002Jan/0365.html

> In that case I'd be against this idiom, because normally I don't rely on
> having the schema (or schema-like constraints) available when you make
> a query, and because I prefer to make explicit queries.

This sounds more like an argument over using local versus global
idioms. This issue would still remain in S, using the S-A global
idiom. Even though S has tidy literal nodes, you still need the
range knowledge in order to interpret values expressed in the S-A
idiom. The tidy literals do not themselves denote anything but
strings, which may or may not be lexical forms in some datatype
context.

Thus, you may think the results of your query, based only on
the literal values, are correct, but they may in fact be misleading
as the literals have different interpretations and thus ?y
would not correspond to the same value in each case, only
to the same string.
 
> Also I think that I would probably put the emphasis the opposite
> way to the way Dan C suggests in the 'duh' argument - that is, in the
> absence of typing info I'd make a match. Maybe that would be wrong.

That's an interesting way to look at it. It's sort of like saying
that, if you don't know what a literal's datatype context is, you
can at least do string comparisons between literals.

Though that only equates to reliable comparison of actual values if
global uniqueness is imposed on literals from the application
environment.
 
> The TDL 'local idiom' looks alright to me.
> I guess if there are many possible lexical representations of a given
> literal, then that might make querying more fiddly. At the moment
> tests like ?x > 5 are done by casting to Java datatypes, as Andy does
> with RDQL as well, so matching different lexical representations is
> avoided.

Exactly, and that's what one would be expected to do. Since we
cannot ensure that

(a) all lexical forms are canonical, or

(b) canonical lexical forms share all of the properties
    of the values they denote (sort order, etc.),

we must execute the mapping from lexical form to
application-internalized value in order to compare most values.
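
Both points can be sketched in a few lines of Python. The function
to_value and the two xsd: datatype names it handles are assumptions
made for illustration, not a description of how RDQL or any other
engine performs the cast.

```python
def to_value(lexical_form, datatype):
    """Map a lexical form to an application-internal value (hypothetical)."""
    if datatype == "xsd:integer":
        return int(lexical_form)
    if datatype == "xsd:string":
        return lexical_form
    raise ValueError("unsupported datatype: " + datatype)


# (a) Non-canonical lexical forms: different strings, same value.
strings_equal = "010" == "10"
values_equal = to_value("010", "xsd:integer") == to_value("10", "xsd:integer")

# (b) Lexical forms do not share the sort order of their values.
by_string = sorted(["9", "10"])
by_value = sorted(["9", "10"], key=lambda s: to_value(s, "xsd:integer"))
```

Here the strings "010" and "10" differ but map to the same integer,
and sorting the strings directly puts "10" before "9".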

The goal of datatyping in RDF, as I see it, is to make sure that
the information needed to execute that mapping is explicit, consistent,
and independent of application context.

Queries directly on the RDF graph (as opposed to some abstraction
above the graph) which include literals will always have to take
datatyping into account if consistently accurate results are to
be obtained.

Queries based solely on string comparison of literals will not
be reliable in a context of syndication of arbitrary knowledge
from many sources.

Cheers,

Patrick

--
               
Patrick Stickler              Phone: +358 50 483 9453
Senior Research Scientist     Fax:   +358 7180 35409
Nokia Research Center         Email: patrick.stickler@nokia.com
Received on Wednesday, 30 January 2002 02:40:55 UTC