datatypes and MT from Pat Hayes on 2001-11-02 (w3c-rdfcore-wg@w3.org from November 2001)

From: Pat Hayes <phayes@ai.uwf.edu>
Date: Thu, 1 Nov 2001 18:31:11 -0600
To: w3c-rdfcore-wg@w3.org, pfps@research.bell-labs.com
Message-Id: <p0510101eb8074f2b836b@[205.160.76.193]>
I'm sorry I dropped the ball on this issue just as it heated up. I've 
been (literally) laid low with flu since Friday and unable to do 
anything much except whimper and cough.

Let me try to first summarize the MT changes that I managed to 
extract from the pfps/ph interchange - I think these are all pretty 
much the same as what Peter got out of it, in all but stylistic 
details - and then use this extended MT to respond to a lot of the 
recent comments, some of which I think are based on 
misunderstandings. I will also try to show how *all* the various 
proposals for encoding datatyping information in RDF syntax can be 
incorporated into this MT extension in a uniform way, and so can a 
few others that haven't been made yet.

OK. The basic idea of the MT extension arises from the fact that 
while the same literal label might mean different things in different 
contexts, datatyping information seems to remove what would otherwise 
be an 'ambiguity' when one looks at particular occurrences of literal 
labels. For example, it seems perfectly clear, just speaking 
intuitively, that the following graph

aaa rdfs:range xsd:integer .
bbb rdfs:range xsd:string .
foo aaa "101001" .
baz bbb "101001" .

can be unambiguously interpreted as saying that the aaa-value of foo 
is the integer (89456 + 11545) and the bbb-value of baz is the 
character string "101001". (The double quotes in the RDF triples are 
not to be interpreted as actual quotation marks, of course, but only 
as literal-markers, a convention which I abhor but will stick with 
throughout this message.) The fact that there is no ambiguity in this 
graph, even though the same syntactic label is used with two 
different meanings, suggests that the datatyping information should 
not be thought of as a mapping from the literal labels - which is 
what one would get by extending the MT in a very straightforward way 
by simply including literal labels into the vocabulary of an 
interpretation - but rather as attached to the node itself. The 
datatyping extension to the MT therefore introduces a new kind of 
mapping, which is very similar to an interpretation mapping, but 
which applies to the nodes of the graph rather than to the literals 
which are used to label those nodes. We might call such a thing a 
'datatyping interpretation', but I think that might be confusing, so 
I will call it a 'typing'.  An interpretation is a mapping from a URI 
vocabulary to entities; a typing is a mapping from nodes of the graph 
to datatypes. (I'll tell you what a datatype is in a minute.)

The current MT does need to be altered in a tiny degree to make this 
possible. Currently it says that XL is a global mapping from literals 
to LV, and then it uses that mapping once, and never mentions it 
again. We need to change that so we define XL to be a mapping from 
*occurrences* (tokens, inscriptions, whatever) of literals to LV; or, 
less mysteriously, from *literal nodes* to LV. Then the model theory 
works exactly as before; this mapping is still 'global' (though that 
is now perhaps an unfortunate word to use) in the sense that it is 
independent of the interpretation.  (When we introduce a datatyping 
scheme, however, there is some connection between them, as we will 
see.)
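
To see why XL has to be keyed by node rather than by label, here is a 
minimal Python sketch. The class name LiteralNode and the node 
identifiers are hypothetical, chosen only for illustration; nothing 
here is part of the MT itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LiteralNode:
    """A literal node: a position in the graph plus the label written on it."""
    node_id: str   # hypothetical identifier for the node
    label: str     # the literal label, e.g. "101001"


node_a = LiteralNode("node-a", "101001")   # object of: foo aaa "101001"
node_b = LiteralNode("node-b", "101001")   # object of: baz bbb "101001"

# A mapping keyed by label can hold only one value per label, so the
# second assignment silently replaces the first.
xl_by_label = {}
xl_by_label[node_a.label] = 101001
xl_by_label[node_b.label] = "101001"

# A mapping keyed by node keeps both readings apart.
xl_by_node = {node_a: 101001, node_b: "101001"}
```

With the label-keyed dictionary only one entry survives, which is the 
'ambiguity' described above; the node-keyed one holds both values.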

Technicalities.
A datatype is a mapping from a lexical domain (a subset of literals) 
to a range of values (a set). A datatype scheme is a set DT plus a 
fixed mapping DTS from DT to datatypes.  It is convenient to define 
DTC(x) to be the range (not rdfs:Range, just range) of the datatype 
DTS(x). If nnn is the URI of a type, we will write XD(nnn) for the 
member of DT that the uri identifies. This mapping XD is similar in 
some ways to an interpretation mapping I, but unlike I, it is assumed 
that XD is computable (in some way), ie given the uri of a datatyping 
scheme, the machine can somehow access the actual DTS and DTC 
mappings associated with that datatyping scheme and apply them.

(DTS and  DT aren't *strictly* needed, in fact; one could have a 
single global mapping directly from urirefs to datatypes; but it is 
in the same spirit as the rest of the model theory so I will go on 
doing it this way. Think of the member of DT as the abstract 
datatype-thingie and DTS as the datatype version of the IEXT mapping 
in the MT.  DTC is analogous to ICEXT.)
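
A rough Python sketch of these definitions may help. It assumes just 
two datatypes, uses Python's int and str as stand-ins for the 
lexical-to-value mappings (they do not match the XML Schema lexical 
spaces exactly), and represents each range DTC(x) by a membership 
test rather than an actual set. The tokens "INT" and "STR" and the 
helper literal_value are hypothetical names.

```python
# DT: the abstract datatype tokens.
DT = {"INT", "STR"}

# DTS: fixed mapping from members of DT to datatypes, where a datatype
# is a mapping from lexical forms to values.
DTS = {"INT": int, "STR": str}

# Membership tests standing in for the value spaces.
_VALUE_SPACE = {
    "INT": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "STR": lambda v: isinstance(v, str),
}


def DTC(d):
    """Return a membership test for the range of the datatype DTS(d)."""
    return _VALUE_SPACE[d]


# XD: the computable mapping from datatype URIs to members of DT.
XD = {"xsd:integer": "INT", "xsd:string": "STR"}


def literal_value(uri, lexical):
    """Apply the datatype named by uri to a lexical form."""
    return DTS[XD[uri]](lexical)
```

For instance, literal_value("xsd:integer", "101001") applies int to 
the label, while literal_value("xsd:string", "101001") leaves it as a 
character string.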

Now, a datatyping of a graph is simply a mapping from the literal 
nodes of the graph to a datatype scheme, ie an assignment of a type 
to each literal node of the graph. Of course, a graph by itself might 
have any number of datatypings, just as a vocabulary can have any 
number of interpretations, but we want to make sure that if we have 
one of each then that fixes the interpretation of every part of a 
labelled graph unambiguously, by placing mutual constraints on how 
they act together. So, in this spirit, a typed interpretation is a 
pair <I,D> of an interpretation (of the uri vocabulary) and a 
datatyping D (of the literal nodes of the graph) which together satisfy 
the following constraints:

1. XL(n) = DTS(D(n))(label(n))
2. ICEXT(d) is a subset of DTC(d) for any d in (DT intersect IR)

(In 1 we are using two rather different kinds of function; label(n) 
is just a kind of syntactic selection function which refers to the 
label on the node n, while the other mappings are semantic. 
Mathematically these are all just functions, but intuitively they 
have different roles.)

What these mean, intuitively, is exactly what one would expect; that 
the value of a literal (in the interpretation) is understood to be 
determined by the datatyping of the node on which it occurs; and that 
datatypes are treated appropriately as rdfs classes, in the sense 
that the RDFS class of a given datatype in the interpretation is a 
subset of the set of things which actually have that datatype according 
to D.

Now, the only other semantic condition we need on an interpretation 
is that it 'understands' that the urirefs that denote datatypes are 
in fact interpreted to refer to those datatypes, which can be stated 
as the condition

3. I(nnn)= XD(nnn)

If I satisfies condition 3 for the datatype named nnn then we will 
say that I 'recognizes' that datatype, and it is natural to require 
that I recognizes every datatype which is mentioned in the graph (ie 
where nnn occurs in the graph as a node label), so we will add this 
as a third condition on a typed interpretation <I,D>.
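
The three conditions can be sketched as a checker in Python. This is 
only an illustration under simplifying assumptions: everything is 
finite, class extensions are given as Python sets keyed directly by 
the member of DT, every URI in XD is taken to be mentioned in the 
graph, and all names (is_typed_interpretation, the node ids n1 and 
n2, the tokens INT and STR) are hypothetical.

```python
# A toy datatype scheme (int and str are stand-ins, not real xsd types).
DTS = {"INT": int, "STR": str}
DTC = {"INT": lambda v: isinstance(v, int),
       "STR": lambda v: isinstance(v, str)}
XD = {"xsd:integer": "INT", "xsd:string": "STR"}

# label(n) for the two literal nodes of the example graph.
LABEL = {"n1": "101001", "n2": "101001"}


def is_typed_interpretation(I, ICEXT, XL, D):
    """Check conditions 1-3 for the pair <I,D> on the toy graph."""
    # 1. XL(n) = DTS(D(n))(label(n))
    cond1 = all(XL[n] == DTS[D[n]](LABEL[n]) for n in LABEL)
    # 2. ICEXT(d) is a subset of DTC(d) for each datatype d
    cond2 = all(DTC[d](v) for d in DTS for v in ICEXT.get(d, ()))
    # 3. I(nnn) = XD(nnn) for every datatype URI (all assumed mentioned)
    cond3 = all(I.get(uri) == XD[uri] for uri in XD)
    return cond1 and cond2 and cond3


I_good = {"xsd:integer": "INT", "xsd:string": "STR"}
ICEXT_good = {"INT": {101001}, "STR": {"101001"}}
XL_good = {"n1": 101001, "n2": "101001"}
D_good = {"n1": "INT", "n2": "STR"}
```

Calling is_typed_interpretation(I_good, ICEXT_good, XL_good, D_good) 
succeeds; changing D so that both nodes are typed as strings breaks 
condition 1, because XL(n1) no longer agrees with the datatyping.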

OK, that is all we require. We could sum all this up by saying that I 
has to 'respect'  D by agreeing to use the datatype URIs in the 
appropriate ways, and by agreeing to interpret the datatype mappings 
as rdf:properties in a consistent way (consistent with D, that is.) 
In return, D will undertake to guarantee that all literal occurrences 
are interpreted in a way that will be consistent with anything that 
is said in the RDF triples. As long as I and D promise to keep to 
their respective vows concerning each other, we can be sure that 
their marriage will be happy.

(A less anthropomorphic analogy might be to think of D as a kind of 
datatyping network between nodes along which datatyping information 
can flow, and which can carry different information to different 
nodes.)

This isn't really very different from extending the interpretation 
idea from urirefs to literals, but it acknowledges explicitly the 
ways that literal evaluation differs from uriref interpretations; the 
fact that once the datatyping has been determined, the value of the 
literal is *fixed* and is not subject to variation from one 
interpretation to another; but that also, in an odd way, the precise 
meaning of any given literal is hostage to the datatyping that is 
used to interpret it, and different occurrences might be interpreted 
differently. It also makes it clear how the rest of the RDF graph, 
while it may not change the datatyping mappings themselves (they are 
still 'global' to the graph), can provide information (ie constrain 
the satisfying interpretations) which forces a particular literal 
label to be interpreted according to one or another datatyping scheme.

[Aside. Here's how to say this without using the DT/DTS/DTC stuff. 
Just assume a global mapping XD from urirefs to datatypes, and 
define a datatyping as a mapping D from nodes to datatypes.  Then the 
first two conditions above can be stated:
1. XL(n)=D(n)(label(n))
2. ICEXT(I(x)) is a subset of {y: XD(x)(lit)=y & lit is a literal}
This shows how very simple it really is, but I still prefer the 
earlier way of stating it.]

How this works.

To illustrate this idea I will use the example graph given earlier. 
Since I want to refer to the actual nodes of the graph, however, I 
will use Ntriples++ notation to give the literal nodes labels, OK? 
Remember, this is the *same* graph; I've just given two of the nodes 
unique labels in the Ntriples, is all.

aaa rdfs:range xsd:integer .
bbb rdfs:range xsd:string .
foo aaa _:1:"101001" .
baz bbb _:2:"101001" .

Here is one datatyping of this graph:

DTS(D(_:1)) = (lambda (?x)(eval ?x))
DTS(D(_:2)) = (lambda (?x) 17)

This says that literals on node#1 are interpreted by LISP and 
anything on node#2 is interpreted to be the number 17. Fortunately, 
that datatyping scheme has no URI, but it is a scheme. There are 
infinitely many damn silly schemes, but most of them wouldn't make 
the graph come out to be true no matter what interpretation you were 
to combine them with.

Obviously the scheme we want is this:
DTS(D(_:1)) = DTS(XD(xsd:integer))
DTS(D(_:2)) = DTS(XD(xsd:string))

And the cute thing is that this is guaranteed by the semantic 
conditions, since the graph says enough to lock down this as the only 
possible datatyping that would yield a satisfying typed interpretation 
<I,D>. Let me just do this for the integer case. If  <I,D> is a typed 
interpretation satisfying this graph, then I must recognize 
xsd:integer, so from 3:
I(xsd:integer) = XD(xsd:integer)
and by 2,
ICEXT(I(xsd:integer)) must be a subset of DTC(I(xsd:integer)), ie of 
DTC(XD(xsd:integer)), ie the value space of xsd:integer. It follows 
from the ordinary rdfs:range semantic conditions that the only way 
that I could make the third triple true is if XL(_:1) is in the 
class DTC(XD(xsd:integer)), which requires (by the first semantic 
condition) that the datatyping D(_:1) on _:1 be the lexical-value 
mapping specified by xsd:integer, since
XL(_:1)= DTS(D(_:1))("101001").
(Actually what I said above isn't exactly true. The conditions do not 
guarantee that this datatyping is specified uniquely; but they do 
impose that whatever datatyping is used, it must *agree* with 
xsd:integer and xsd:string on all the literal nodes in the graph. I 
could invent a silly datatyping which was like xsd on all numerals 
less than a googolplex, but switched to hexadecimal beyond that, say. 
That would work. The point being that you wouldn't know the 
difference, so it doesn't matter.)
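
The range argument can be replayed on the example graph with a small 
Python sketch. It only compares the intended datatyping with the one 
that swaps the two types, so it illustrates the reasoning rather than 
proving anything about all possible datatypings. As before, int and 
str stand in for the xsd mappings, and the names satisfies_ranges, 
intended and swapped are hypothetical.

```python
DTS = {"INT": int, "STR": str}
DTC = {"INT": lambda v: isinstance(v, int),
       "STR": lambda v: isinstance(v, str)}
XD = {"xsd:integer": "INT", "xsd:string": "STR"}

# The example graph: range declarations plus the two literal triples.
RANGE = {"aaa": "xsd:integer", "bbb": "xsd:string"}
LITERAL_TRIPLES = [("foo", "aaa", "_:1"), ("baz", "bbb", "_:2")]
LABEL = {"_:1": "101001", "_:2": "101001"}


def satisfies_ranges(D):
    """True if every literal value lands in the value space of the
    datatype declared as the range of its property."""
    for _subject, prop, node in LITERAL_TRIPLES:
        value = DTS[D[node]](LABEL[node])        # condition 1: XL(node)
        if not DTC[XD[RANGE[prop]]](value):      # conditions 2 and 3
            return False
    return True


intended = {"_:1": "INT", "_:2": "STR"}
swapped = {"_:1": "STR", "_:2": "INT"}
```

Here satisfies_ranges(intended) holds, while satisfies_ranges(swapped) 
fails on the first literal triple, since the string "101001" is not 
in the integer value space.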

Alternative syntaxes

Now let me illustrate how a variety of alternative ideas for 
specifying datatyping information in RDF graphs can be seen as 
extensions to this, by adding different kinds of constraints on 
datatyped interpretations.

1. Explicit schema datatyping.

Assume that every occurrence of a literal is somehow explicitly 
labelled with a datatype in the graph itself, eg by redefining 
'literal labels' to be pairs of a literal and a uri indicating a 
datatype. Let me use label(n) and dtype(n) respectively for these two 
parts of the literal label at a node; then simply define a datatyping 
scheme in the obvious way, by requiring it to be defined by the dtype 
labels:
  4.1   D(n)=I(dtype(n))

Substituting 4.1 into 1 gives:

XL(n) = DTS(I(dtype(n)))(label(n))

and assuming that dtype(n) is indeed a datatype uriref, then 3 in turn gives:

XL(n) = DTS(XD(dtype(n)))(label(n))

showing that in this case we can eliminate D from the equations 
altogether, and simply use the single 'global fixed' mapping XD to 
determine the literal value of a literal independently of the 
interpretation I and hence independently of the rest of the RDF 
graph. (This is in fact what I had in the back of my mind when 
defining the original MT, by declaring XL to be a global, fixed 
mapping.) So if we were to impose this very strict local datatype 
labelling scheme, in effect incorporating the datatype into the 
literal label itself, then there really is no need for this extension 
to the model theory. However, what this does show is that this kind 
of strict labelling scheme does not contradict the more flexible 
scheme, so they could be used together without any risk of being 
mutually incompatible.
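
As a minimal sketch of this explicit scheme, assume a literal label 
is a (lexical form, datatype uri) pair; the value then depends only 
on XD and DTS, with no reference to I or D. The pair representation 
and the use of int and str as datatype mappings are assumptions made 
for illustration.

```python
DTS = {"INT": int, "STR": str}
XD = {"xsd:integer": "INT", "xsd:string": "STR"}


def XL(literal):
    """XL(n) = DTS(XD(dtype(n)))(label(n)) for an explicitly typed literal."""
    label, dtype = literal
    return DTS[XD[dtype]](label)


as_integer = XL(("101001", "xsd:integer"))
as_string = XL(("101001", "xsd:string"))
```

The same lexical form gives an integer in one case and a character 
string in the other, determined entirely by the label itself.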

2. Bnodes as literal values.

Let me put together here under one heading a variety of variations of 
the following theme: that occurrences of literals in value position 
in a triple should be understood as an abbreviation of a pair of 
triples with an 'intermediate' bnode, where the bnode denotes the 
triple value and the literal label itself is relegated to a minor 
role of somehow 'illustrating' how that value could be written in 
some notation. For example: rewrite

aaa bbb lit .

as
aaa bbb _:1 .
_:1 rdf:value lit .

Before analyzing this, I note that it has the advantage of putting a 
node that denotes the literal value in subject position, where other 
properties (such as rdf:type) can be asserted of it. This is indeed a 
major advantage - I think the *only* advantage - of this proposal, 
but we could render this moot simply by declaring that RDF shall 
allow literals in subject position. I will return to this point later.

The first of these triples seems easy to interpret: it means exactly 
what the original triple meant, in fact. It's the second triple that 
seems rather odd. Since _:1 is supposed to denote the literal value, 
it would seem to have the same value at both ends. We have provided 
now two nodes to refer to the same thing: one blank, but in subject 
position; the other with a literal label to say what it is. The 
first, blank, node can now safely be asserted to have an rdf:type 
which is a datatype, and if we make reasonable assumptions about 
interpretations being in accordance with the global XD mappings from 
datatype URIs, (similar to the conditions listed here) then we will 
be able to infer the datatypes of those blank nodes by normal rdfs 
inference.  But the only way to know what the first node actually 
*means* is to look at the label on the second node. So we now have an 
odd juxtaposition, where rdf:value has a literal label at one end 
which has no assigned datatype, and a datatype at the other but no 
literal there to use it on.

There are several ways to get around this. The simplest, in the 
current framework, is simply to declare that rdf:value is equality, 
ie that <x,y> is in IEXT(I(rdf:value)) iff x=y. This effectively 
forces I(_:1) to be the same as XL(lit), and then the <I,D> semantic 
machinery works in this case in exactly the same way that it works in 
the 'plain' case.  This however is not how the proponents of this 
kind of scheme usually think of it.

Another way is to insist that the literal label in the second triple 
is treated in a nonstandard way: rather than denoting a literal 
value, it is being mentioned rather than used; it simply indicates 
the literal itself. (Or, equivalently, all literals are treated as 
strings.) The intuitive meaning of rdf:value is then a kind of 
inversion of the denotation mapping itself: it assigns a literal 
label to the literal value that is the semantic value of that label 
in the given interpretation. This seems to be what Sergey has in 
mind, and it is also I think what Dan C. suggested a while back as 
the best way to handle literals.

Now, this seems to me to have a fatal flaw, which arises from the 
fact that the value spaces of two different datatypes might overlap. 
For example, suppose that there are datatypes xxd:octal and 
xxd:decimal, then the following would seem to be perfectly true:

_:1 rdf:type xxd:octal .
_:1 rdf:type xxd:decimal .
_:1 rdf:value "32" .
_:1 rdf:value "26" .

since indeed the number twenty-six is the value of both a decimal and 
an octal numeral, so that number, which is what I(_:1) should be in 
any satisfying interpretation, is in both class extensions, and <26, 
"26"> is in IEXT(I(rdf:value)) when 26 is of type xxd:decimal, and 
<26,"32"> is in it when 26 is of type xxd:octal, so *both* of those 
pairs have to be in it. The point being that in cases like this, it 
isn't enough to just attach a datatype to the *interpretation* of the 
literal, ie the literal value denoted by the blank node. You have to 
somehow get it attached *to the literal itself*, and by making the 
separation a syntactic separation, there is no way to do that. The 
only way to do that in this kind of scheme would be to impose a 
syntactic constraint on RDF graphs that required any node to be the 
subject of at most one rdf:value arc; and since literals only appear 
at the object ends of those arcs, the only function of the rdf:value 
edge in the graph is to attach a unique literal label to the blank 
node. It seems less trouble, and much clearer in meaning, to simply 
attach it directly to the 'blank' node and throw that edge away, and 
then we are back where we started. (Notice that if we did that, then 
this kind of pathological example becomes impossible, since we would 
have to attach two different literal labels to the 'blank' node. The 
syntactic separation of lexical and value spaces in the RDF graph 
simply creates new opportunities for confusion, which is what usually 
happens when one gets use and mention mixed up in this kind of way.)
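
The arithmetic behind the octal/decimal example can be checked with a 
short Python sketch. The datatypes xxd:octal and xxd:decimal are the 
invented ones from the example, modelled here with Python's int and 
an explicit base; the variable names are hypothetical.

```python
# Lexical-to-value mappings for the two invented datatypes.
DATATYPES = {
    "xxd:octal": lambda s: int(s, 8),
    "xxd:decimal": lambda s: int(s, 10),
}

# Both pairs belong in IEXT(I(rdf:value)) under the 'mention' reading.
rdf_value_ext = {
    (DATATYPES["xxd:octal"]("32"), "32"),
    (DATATYPES["xxd:decimal"]("26"), "26"),
}

# Starting from the value alone, both labels are attached to it...
labels_for_26 = sorted(lab for val, lab in rdf_value_ext if val == 26)

# ...and nothing in the graph says which datatype goes with which label.
readings = {(dt, lab): f(lab)
            for dt, f in DATATYPES.items()
            for lab in labels_for_26}
```

Of the four datatype/label pairings in readings, only two yield 
twenty-six; the graph as written gives no way to pick out those two.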

On balance, therefore, I would prefer to adopt the first 
interpretation of rdf:value as simply meaning equality. However, if 
we are going to introduce equality into RDF, let us do so properly. 
This is quite a significant change to the language, making it much 
more expressive. We ought to be able to make inferences which follow 
from the transitivity and substitutivity of equality, for example, and 
to be able to assert equalities between urirefs as well as literals 
and blank nodes.

On the other hand, we could also gain all the expressive advantages 
of this device for literals without going this far, by making one 
simple change.

3. Literals as subjects.

This proposal modifies the RDF syntax in a small way, by allowing 
literals to be subjects. The current MT goes through in exactly the 
same way, except of course the artificial restrictions in the closure 
rules for RDFS closures are removed. This would have several notable 
advantages for literals.

First, information about literal nodes - in particular, their data 
type - can be expressed directly, instead of resorting to subterfuges 
like the introduction of blank nodes. In fact, as pointed out above, 
you can think of this as what you would get just by taking the 
notational devices used in the 'blank-node' syntax proposals and 
conflating the blank node with the literal node that it is linked to. 
For example, the following graph written in bnode-style:

aaa bbb _:1 .
_:1 rdf:value "345" .
_:1 rdf:type xsd:integer .

would be boiled down into:

aaa bbb _:1:"345" .
_:1 rdf:type xsd:integer .

where I have been obliged to use Ntriples++ to indicate that the 
subject node of the second triple is the object node of the first one.

(Notice that the use of nodeIDs in this style of Ntriples++ notation 
is *exactly* parallel to its use in the 'bnode' graph; the only 
difference is that some of the nodes being identified are no longer 
blank. In fact, one could get the Ntriples++ for the second graph 
from the Ntriples for the first one simply by making sure that the 
rdf:value triples occur after the triples that generated them, and 
then replacing all strings of the form
A<white>. <white>?<newline><white>?A<white>rdf:value<white>
where A is a nodeID, with
A:
and leaving the rest alone. )
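
The string replacement just described can be written as a regular 
expression. This is a sketch in Python, assuming node IDs have the 
form _: followed by word characters and that each rdf:value triple 
sits on the line directly after the triple that generated it; the 
function name to_ntriples_pp is hypothetical.

```python
import re

# A<white>.<white>?<newline><white>?A<white>rdf:value<white>
# where the same node ID A appears on both lines (backreference \1).
PATTERN = re.compile(
    r"(_:\w+)[ \t]+\.[ \t]*\n[ \t]*\1[ \t]+rdf:value[ \t]+"
)


def to_ntriples_pp(ntriples):
    """Fold each rdf:value triple into the preceding triple as 'A:'."""
    return PATTERN.sub(r"\1:", ntriples)


bnode_style = (
    'aaa bbb _:1 .\n'
    '_:1 rdf:value "345" .\n'
    '_:1 rdf:type xsd:integer .\n'
)
folded = to_ntriples_pp(bnode_style)
```

Applied to the bnode-style graph above, this yields the two-line 
Ntriples++ form, with the rdf:type triple left alone.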

Second, this now gives us a very sweet way to characterize how to 
determine the datatype of any given literal node: Generate the rdfs 
closure of the graph, and see if it contains

_:node rdf:type <datatype> .

where _:node is the ID of the node and <datatype> is a datatype URI. 
If it does, then the literal on that node has to be interpreted in 
accordance with that datatype; if not, it doesn't. This works no 
matter how the datatyping information is provided; it could be said 
directly, or inferred from range information or even by 
subclass-transitivity inference; all those variations are absorbed in 
the details of the rdfs closure rules. (If it is provided by explicit 
node labelling then we would need to incorporate an extra closure 
rule to get it from the label into the graph explicitly.) The key 
point is, that if the graph somehow establishes membership in a class 
known to be a datatype, then that fixes it; if not, it is not fixed.
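
A small Python sketch of this procedure follows. It implements only 
two closure rules (range and subclass), not the full rdfs closure, 
and applies them to literal nodes as subjects, which is the change 
proposed here. The class ex:Age, the property ccc, and the function 
names are hypothetical additions for illustration.

```python
DATATYPES = {"xsd:integer", "xsd:string"}   # URIs known to be datatypes


def rdfs_closure(triples):
    """Close a set of triples under the range and subclass rules only."""
    closure = set(triples)
    while True:
        new = set()
        for subj, pred, obj in closure:
            if pred == "rdfs:range":
                # subj is a property: its objects get type obj
                new |= {(o, "rdf:type", obj)
                        for _s, p, o in closure if p == subj}
            elif pred == "rdfs:subClassOf":
                # members of subj are also members of obj
                new |= {(s, "rdf:type", obj)
                        for s, p, o in closure
                        if p == "rdf:type" and o == subj}
        if new <= closure:
            return closure
        closure |= new


def datatypes_of(node, triples):
    """Datatypes the closure assigns to a node (empty if none is fixed)."""
    return {o for s, p, o in rdfs_closure(triples)
            if s == node and p == "rdf:type" and o in DATATYPES}


graph = {
    ("aaa", "rdfs:range", "xsd:integer"),
    ("bbb", "rdfs:range", "xsd:string"),
    ("ccc", "rdfs:range", "ex:Age"),
    ("ex:Age", "rdfs:subClassOf", "xsd:integer"),
    ("foo", "aaa", "_:1"),     # _:1 carries the label "101001"
    ("baz", "bbb", "_:2"),     # _:2 carries the label "101001"
    ("qux", "ccc", "_:3"),     # typed via range plus subclass
    ("qux", "ddd", "_:4"),     # nothing fixes a datatype here
}
```

On this graph, datatypes_of("_:3", graph) finds xsd:integer by way of 
the subclass rule, while datatypes_of("_:4", graph) is empty, so that 
literal is left unfixed.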

OK, that's all for now. I have to go to bed, I'm bushed.

Pat

PS. How about having an rdfs class called rdfs:Datatype? Then the 
semantic rule on recognition could be relativised to membership in 
that class, and an rdfs graph would have an explicit internal note of 
the datatype/simple-class distinction.



-- 
---------------------------------------------------------------------
IHMC					(850)434 8903   home
40 South Alcaniz St.			(850)202 4416   office
Pensacola,  FL 32501			(850)202 4440   fax
phayes@ai.uwf.edu 
http://www.coginst.uwf.edu/~phayes
Received on Thursday, 1 November 2001 19:31:15 UTC