- From: Pat Hayes <phayes@ai.uwf.edu>
- Date: Thu, 1 Nov 2001 18:31:11 -0600
- To: w3c-rdfcore-wg@w3.org, pfps@research.bell-labs.com
I'm sorry I dropped the ball on this issue just as it heated up. I've
been (literally) laid low with flu since Friday and unable to do
anything much except whimper and cough.
Let me try to first summarize the MT changes that I managed to
extract from the pfps/ph interchange - I think these are all pretty
much the same as what Peter got out of it, in all but stylistic
details - and then use this extended MT to respond to a lot of the
recent comments, some of which I think are based on
misunderstandings. I will also try to show how *all* the various
proposals for encoding datatyping information in RDF syntax can be
incorporated into this MT extension in a uniform way, and so can a
few others that haven't been made yet.
OK. The basic idea of the MT extension arises from the fact that
while the same literal label might mean different things in different
contexts, datatyping information seems to remove what would otherwise
be an 'ambiguity' when one looks at particular occurrences of literal
labels. For example, it seems perfectly clear, just speaking
intuitively, that the following graph
aaa rdfs:range xsd:integer .
bbb rdfs:range xsd:string .
foo aaa "101001" .
baz bbb "101001" .
can be unambiguously interpreted as saying that the aaa-value of foo
is the integer (89456 + 11545) and the bbb-value of baz is the
character string "101001". (The double quotes in the RDF triples are
not to be interpreted as actual quotation marks, of course, but only
as literal-markers, a convention which I abhor but will stick with
throughout this message.) The fact that there is no ambiguity in this
graph, even though the same syntactic label is used with two
different meanings, suggests that the datatyping information should
not be thought of as a mapping from the literal labels - which is
what one would get by extending the MT in a very straightforward way
by simply including literal labels into the vocabulary of an
interpretation - but rather as attached to the node itself. The
datatyping extension to the MT therefore introduces a new kind of
mapping, which is very similar to an interpretation mapping, but
which applies to the nodes of the graph rather than to the literals
which are used to label those nodes. We might call such a thing a
'datatyping interpretation', but I think that might be confusing, so
I will call it a 'typing'. An interpretation is a mapping from a URI
vocabulary to entities; a typing is a mapping from nodes of the graph
to datatypes. (I'll tell you what a datatype is in a minute.)
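
The difference between keying on labels and keying on nodes can be sketched in a few lines of Python. This is only an illustration: the node names n1 and n2 and the use of Python's int and str as lexical-to-value mappings are assumptions, not part of the MT.

```python
# Two literal nodes carrying the same label.
labels = {"n1": "101001", "n2": "101001"}

# A mapping keyed on the label alone can give that label only one value,
# so it cannot tell the two occurrences apart.
by_label = {"101001": int("101001")}

# A typing is keyed on the node instead, so each occurrence gets its own
# lexical-to-value mapping (int and str stand in for real datatypes).
typing = {"n1": int, "n2": str}

def value(node):
    """Value of the literal occurring at `node` under the typing."""
    return typing[node](labels[node])

values = {node: value(node) for node in labels}
```

Under this typing the same label yields an integer at n1 and a character string at n2, which is the behaviour the example graph calls for.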
The current MT does need to be altered in a tiny degree to make this
possible. Currently it says that XL is a global mapping from literals
to LV, and then it uses that mapping once, and never mentions it
again. We need to change that so we define XL to be a mapping from
*occurrences* (tokens, inscriptions, whatever) of literals to LV; or,
less mysteriously, from *literal nodes* to LV. Then the model theory
works exactly as before; this mapping is still 'global' (though that
is now perhaps an unfortunate word to use) in the sense that it is
independent of the interpretation. (When we introduce a datatyping
scheme, however, there is some connection between them, as we will
see.)
Technicalities.
A datatype is a mapping from a lexical domain (a subset of literals)
to a range of values (a set). A datatype scheme is a set DT plus a
fixed mapping DTS from DT to datatypes. It is convenient to define
DTC(x) to be the range (not rdfs:Range, just range) of the datatype
DTS(x). If nnn is the URI of a type, we will write XD(nnn) for the
member of DT that the uri identifies. This mapping XD is similar in
some ways to an interpretation mapping I, but unlike I, it is assumed
that XD is computable (in some way), ie given the uri of a datatyping
scheme, the machine can somehow access the actual DTS and DTC
mappings associated with that datatyping scheme and apply them.
(DTS and DT aren't *strictly* needed, in fact; one could have a
single global mapping directly from urirefs to datatypes; but it is
in the same spirit as the rest of the model theory so I will go on
doing it this way. Think of the member of DT as the abstract
datatype-thingie and DTS as the datatype version of the IEXT mapping
in the MT. DTC is analogous to ICEXT.)
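
As a rough sketch of how DT, DTS, DTC and XD fit together, assuming a scheme with just two datatypes: the token names "integer" and "string", the helper literal_value, and the use of membership tests in place of (infinite) ranges are all illustrative assumptions.

```python
# Hypothetical datatype tokens: the members of DT.
DT = {"integer", "string"}

def lex_to_integer(lit):
    return int(lit)

def lex_to_string(lit):
    return lit

# DTS maps each member of DT to its datatype, i.e. a lexical-to-value mapping.
DTS = {"integer": lex_to_integer, "string": lex_to_string}

# DTC(x) is the range of DTS(x). The ranges are infinite, so they are
# represented here by membership tests rather than by explicit sets.
DTC = {
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "string": lambda v: isinstance(v, str),
}

# XD maps a datatype uriref to the member of DT that it identifies.
XD = {"xsd:integer": "integer", "xsd:string": "string"}

def literal_value(uriref, lit):
    """Apply the datatype named by `uriref` to the literal `lit`."""
    return DTS[XD[uriref]](lit)
```

Here XD is a plain lookup table, which is one simple way of making it "computable" in the sense described above.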
Now, a datatyping of a graph is simply a mapping from the literal
nodes of the graph to a datatype scheme, ie an assignment of a type
to each literal node of the graph. Of course, a graph by itself might
have any number of datatypings, just as a vocabulary can have any
number of interpretations, but we want to make sure that if we have
one of each then that fixes the interpretation of every part of a
labelled graph unambiguously, by placing mutual constraints on how
they act together. So, in this spirit, a typed interpretation is a
pair <I,D> of an interpretation (of the uri vocabulary) and a
datatyping D (of the literal nodes of the graph) which together satisfy
the following constraints:
1. XL(n) = DTS(D(n))(label(n))
2. ICEXT(d) is a subset of DTC(d) for any d in (DT intersect IR)
(In 1 we are using two rather different kinds of function; label(n)
is just a kind of syntactic selection function which refers to the
label on the node n, while the other mappings are semantic.
Mathematically these are all just functions, but intuitively they
have different roles.)
What these mean, intuitively, is exactly what one would expect; that
the value of a literal (in the interpretation) is understood to be
determined by the datatyping of the node on which it occurs; and that
datatypes are treated appropriately as rdfs classes, in the sense
that the RDFS class of a given datatype in the interpretation is a
subset of the set of things which actually have that datatype according
to D.
Now, the only other semantic condition we need on an interpretation
is that it 'understands' that the urirefs that denote datatypes are
in fact interpreted to refer to those datatypes, which can be stated
as the condition
3. I(nnn)= XD(nnn)
If I satisfies condition 3 for the datatype named nnn then we will
say that I 'recognizes' that datatype, and it is natural to require
that I recognizes every datatype which is mentioned in the graph (ie
where nnn occurs in the graph as a node label), so we will add this
as a third condition on a typed interpretation <I,D>.
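
For a small finite case, the three conditions can be checked mechanically. The following Python sketch assumes that I, ICEXT, D, XL and label are given as dictionaries and that the datatype scheme has only two members; the function name is_typed_interpretation and all of these encodings are hypothetical.

```python
# A two-member datatype scheme (an assumption made for illustration).
DTS = {"integer": int, "string": str}
DTC = {
    "integer": lambda v: isinstance(v, int),
    "string": lambda v: isinstance(v, str),
}
XD = {"xsd:integer": "integer", "xsd:string": "string"}

def is_typed_interpretation(I, ICEXT, D, XL, label, datatype_urirefs):
    """Check conditions 1-3 for the pair <I, D> on a finite graph."""
    # Condition 1: XL(n) = DTS(D(n))(label(n)) for every literal node n.
    cond1 = all(XL[n] == DTS[D[n]](label[n]) for n in D)
    # Condition 2: ICEXT(d) is a subset of DTC(d) for every datatype d.
    cond2 = all(DTC[d](v) for d in ICEXT if d in DTC for v in ICEXT[d])
    # Condition 3: I(nnn) = XD(nnn) for every datatype uriref in the graph.
    cond3 = all(I[u] == XD[u] for u in datatype_urirefs)
    return cond1 and cond2 and cond3

label = {"n1": "101001", "n2": "101001"}
I = {"xsd:integer": "integer", "xsd:string": "string"}
ICEXT = {"integer": {101001}, "string": {"101001"}}
D = {"n1": "integer", "n2": "string"}
urirefs = ["xsd:integer", "xsd:string"]

good = is_typed_interpretation(
    I, ICEXT, D, {"n1": 101001, "n2": "101001"}, label, urirefs)
# Giving n1 a value that its datatype does not produce violates condition 1.
bad = is_typed_interpretation(
    I, ICEXT, D, {"n1": 17, "n2": "101001"}, label, urirefs)
```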
OK, that is all we require. We could sum all this up by saying that I
has to 'respect' D by agreeing to use the datatype URIs in the
appropriate ways, and by agreeing to interpret the datatype mappings
as rdf:properties in a consistent way (consistent with D, that is.)
In return, D will undertake to guarantee that all literal occurrences
are interpreted in a way that will be consistent with anything that
is said in the RDF triples. As long as I and D promise to keep to
their respective vows concerning each other, we can be sure that
their marriage will be happy.
(A less anthropomorphic analogy might be to think of D as a kind of
datatyping network between nodes along which datatyping information
can flow, and which can carry different information to different
nodes.)
This isn't really very different from extending the interpretation
idea from urirefs to literals, but it acknowledges explicitly the
ways that literal evaluation differs from uriref interpretations; the
fact that once the datatyping has been determined, the value of the
literal is *fixed* and is not subject to variation from one
interpretation to another; but that also, in an odd way, the precise
meaning of any given literal is hostage to the datatyping that is
used to interpret it, and different occurrences might be interpreted
differently. It also makes it clear how the rest of the RDF graph,
while it may not change the datatyping mappings themselves (they are
still 'global' to the graph), can provide information (ie constrain
the satisfying interpretations) which forces a particular literal
label to be interpreted according to one or another datatyping scheme.
[Aside. Here's how to say this without using the DT/DTS/DTC stuff.
Just assume a global mapping XD from urirefs to datatypes, and
define a datatyping as a mapping D from nodes to datatypes. Then the
first two conditions above can be stated:
1. XL(n)=D(n)(label(n))
2. ICEXT(I(x)) is a subset of {y: XD(x)(lit)=y & lit is a literal}
This shows how very simple it really is, but I still prefer the
earlier way of stating it.]
How this works.
To illustrate this idea I will use the example graph given earlier.
Since I want to refer to the actual nodes of the graph, however, I
will use Ntriples++ notation to give the literal nodes labels, OK?
Remember, this is the *same* graph, I've just given two of the nodes
unique labels in the Ntriples, is all.
aaa rdfs:range xsd:integer .
bbb rdfs:range xsd:string .
foo aaa _:1:"101001" .
baz bbb _:2:"101001" .
Here is one datatyping of this graph:
DTS(D(_:1)) = (lambda (?x)(eval ?x))
DTS(D(_:2)) = (lambda (?x) 17)
This says that literals on node#1 are interpreted by LISP and
anything on node#2 is interpreted to be the number 17. Fortunately,
that datatyping scheme has no URI, but it is a scheme. There are
infinitely many damn silly schemes, but most of them wouldn't make
the graph come out to be true no matter what interpretation you were
to combine them with.
Obviously the scheme we want is this:
DTS(D(_:1)) = XD(xsd:integer)
DTS(D(_:2)) = XD(xsd:string)
And the cute thing is that this is guaranteed by the semantic
conditions, since the graph says enough to lock down this as the only
possible datatyping that would yield a satisfying typed interpretation
<I,D>. Let me just do this for the integer case. If <I,D> is a typed
interpretation satisfying this graph, then I must recognize
xsd:integer, so from 3:
I(xsd:integer) = XD(xsd:integer)
and by 2,
ICEXT(I(xsd:integer)) must be a subset of DTC(I(xsd:integer)), ie of
DTC(XD(xsd:integer)), ie the value space of xsd:integer. It follows
from the ordinary rdfs:range semantic conditions that the only way
that I could make the third triple true is if XL(_:1) is in the
class DTC(XD(xsd:integer)), which requires (by the first semantic
condition) that the datatyping D(_:1) on _:1 be the lexical-value
mapping specified by xsd:integer, since
XL(_:1)= DTS(D(_:1))("101001").
(Actually what I said above isn't exactly true. The conditions do not
guarantee that this datatyping is specified uniquely; but they do
impose that whatever datatyping is used, it must *agree* with
xsd:integer and xsd:string on all the literal nodes in the graph. I
could invent a silly datatyping which was like xsd on all numerals
less than a googolplex, but switched to hexadecimal, say. That would
work. The point being that you wouldn't know the difference, so it
doesn't matter.)
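
The argument can be replayed by brute force. This Python sketch enumerates every datatyping of the two literal nodes and keeps those compatible with the rdfs:range triples. It assumes the candidate datatypes are limited to integer and string, so it does not enumerate the 'silly' datatypings that merely agree with xsd on this graph; the dictionary encodings are hypothetical.

```python
from itertools import product

DTS = {"integer": int, "string": str}
DTC = {
    "integer": lambda v: isinstance(v, int),
    "string": lambda v: isinstance(v, str),
}
XD = {"xsd:integer": "integer", "xsd:string": "string"}

label = {"_:1": "101001", "_:2": "101001"}
ranges = {"aaa": "xsd:integer", "bbb": "xsd:string"}   # the rdfs:range triples
uses = [("foo", "aaa", "_:1"), ("baz", "bbb", "_:2")]  # triples with literal objects

def satisfies(D):
    """True if datatyping D can be part of a satisfying <I, D>."""
    for _, prop, node in uses:
        xl = DTS[D[node]](label[node])   # condition 1 fixes XL(node)
        required = XD[ranges[prop]]      # condition 3: I recognizes the range
        if not DTC[required](xl):        # range semantics plus condition 2
            return False
    return True

candidates = [dict(zip(label, choice))
              for choice in product(DTS, repeat=len(label))]
satisfying = [D for D in candidates if satisfies(D)]
```

Of the four candidate datatypings, only the one assigning integer to _:1 and string to _:2 survives.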
Alternative syntaxes
Now let me illustrate how a variety of alternative ideas for
specifying datatyping information in RDF graphs can be seen as
extensions to this, by adding different kinds of constraints on
datatyped interpretations.
1. Explicit schema datatyping.
Assume that every occurrence of a literal is somehow explicitly
labelled with a datatype in the graph itself, eg by redefining
'literal labels' to be pairs of a literal and a uri indicating a
datatype. Let me use label(n) and dtype(n) respectively for these two
parts of the literal label at a node; then simply define a datatyping
scheme in the obvious way, by requiring it to be defined by the dtype
labels:
4.1 D(n)=I(dtype(n))
Substituting 4.1 into 1 gives:
XL(n) = DTS(I(dtype(n)))(label(n))
and assuming that dtype(n) is indeed a datatype uriref, then 3 in turn gives:
XL(n)= DTS(XD(dtype(n)))(label(n))
showing that in this case we can eliminate D from the equations
altogether, and simply use the single 'global fixed' mapping XD to
determine the literal value of a literal independently of the
interpretation I and hence independently of the rest of the RDF
graph. (This is in fact what I had in the back of my mind when
defining the original MT, by declaring XL to be a global, fixed
mapping.) So if we were to impose this very strict local datatype
labelling scheme, in effect incorporating the datatype into the
literal label itself, then there really is no need for this extension
to the model theory. However, what this does show is that this kind
of strict labelling scheme does not contradict the more flexible
scheme, so they could be used together without any risk of being
mutually incompatible.
2. Bnodes as literal values.
Let me put together here under one heading a variety of variations of
the following theme: that occurrences of literals in value position
in a triple should be understood as an abbreviation of a pair of
triples with an 'intermediate' bnode, where the bnode denotes the
triple value and the literal label itself is relegated to a minor
role of somehow 'illustrating' how that value could be written in
some notation. For example: rewrite
aaa bbb lit .
as
aaa bbb _:1 .
_:1 rdf:value lit .
Before analyzing this, I note that it has the advantage of putting a
node that denotes the literal value in subject position, where other
properties (such as rdf:type) can be asserted of it. This is indeed a
major advantage - I think the *only* advantage - of this proposal,
but we could render this moot simply by declaring that RDF shall
allow literals in subject position. I will return to this point later.
The first of these triples seems easy to interpret: it means exactly
what the original triple meant, in fact. It's the second triple that
seems rather odd. Since _:1 is supposed to denote the literal value,
it would seem to have the same value at both ends. We have provided
now two nodes to refer to the same thing: one blank, but in subject
position; the other with a literal label to say what it is. The
first, blank, node can now safely be asserted to have an rdf:type
which is a datatype, and if we make reasonable assumptions about
interpretations being in accordance with the global XD mappings from
datatype URIs, (similar to the conditions listed here) then we will
be able to infer the datatypes of those blank nodes by normal rdfs
inference. But the only way to know what the first node actually
*means* is to look at the label on the second node. So we now have an
odd juxtaposition, where rdf:value has a literal label at one end
which has no assigned datatype, and a datatype at the other but no
literal there to use it on.
There are several ways to get around this. The simplest, in the
current framework, is simply to declare that rdf:value is equality,
ie that <x,y> is in IEXT(I(rdf:value)) iff x=y. This effectively
forces I(_:1) to be the same as XL(lit), and then the <I,D> semantic
machinery works in this case in exactly the same way that it works in
the 'plain' case. This however is not how the proponents of this
kind of scheme usually think of it.
Another way is to insist that the literal label in the second triple
is treated in a nonstandard way: rather than denoting a literal
value, it is being mentioned rather than used; it simply indicates
the literal itself. (Or, equivalently, all literals are treated as
strings.) The intuitive meaning of rdf:value is then a kind of
inversion of the denotation mapping itself: it assigns a literal
label to the literal value that is the semantic value of that label
in the given interpretation. This seems to be what Sergey has in
mind, and it is also I think what Dan C. suggested a while back as
the best way to handle literals.
Now, this seems to me to have a fatal flaw, which arises from the
fact that the value spaces of two different datatypes might overlap.
For example, suppose that there are datatypes xxd:octal and
xxd:decimal, then the following would seem to be perfectly true:
_:1 rdf:type xxd:octal
_:1 rdf:type xxd:decimal
_:1 rdf:value "32"
_:1 rdf:value "26"
since indeed the number twenty-six is the value of both a decimal and
an octal numeral, so that number, which is what I(_:1) should be in
any satisfying interpretation, is in both class extensions, and <26,
"26"> is in IEXT(I(rdf:value)) when 26 is of type xxd:decimal, and
<26,"32"> is in it when 26 is of type xxd:octal, so *both* of those
pairs have to be in it. The point being that in cases like this, it
isn't enough to just attach a datatype to the *interpretation* of the
literal, ie the literal value denoted by the blank node. You have to
somehow get it attached *to the literal itself*, and by making the
separation a syntactic separation, there is no way to do that. The
only way to do that in this kind of scheme would be to impose a
syntactic constraint on RDF graphs that required any node to be the
subject of at most one rdf:value arc; and since literals only appear
at the object ends of those arcs, the only function of the rdf:value
edge in the graph is to attach a unique literal label to the blank
node. It seems less trouble, and much clearer in meaning, to simply
attach it directly to the 'blank' node and throw that edge away, and
then we are back where we started. (Notice that if we did that, then
this kind of pathological example becomes impossible, since we would
have to attach two different literal labels to the 'blank' node. The
syntactic separation of lexical and value spaces in the RDF graph
simply creates new opportunities for confusion, which is what usually
happens when one gets use and mention mixed up in this kind of way.)
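
The overlap can be made concrete with a short Python sketch. The datatypes xxd:octal and xxd:decimal are the hypothetical ones from the example, modelled here with Python's int(lit, base).

```python
# Hypothetical datatypes xxd:octal and xxd:decimal as lexical-to-value mappings.
datatypes = {
    "xxd:octal": lambda lit: int(lit, 8),
    "xxd:decimal": lambda lit: int(lit, 10),
}

# The rdf:type assertions made about the single blank node _:1, and the
# literal labels attached to it by rdf:value.
types_of_node = ["xxd:octal", "xxd:decimal"]
value_labels = ["32", "26"]

# Every reading of every label under every asserted datatype.
readings = {
    (dtype, lit): datatypes[dtype](lit)
    for dtype in types_of_node
    for lit in value_labels
}
```

Two of the four readings give twenty-six, but nothing in the graph says which label belongs with which datatype; the other two pairings give twenty-two and thirty-two.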
On balance, therefore, I would prefer to adopt the first
interpretation of rdf:value as simply meaning equality. However, if
we are going to introduce equality into RDF, let us do so properly.
This is quite a significant change to the language, making it much
more expressive. We ought to be able to make inferences which follow
from the transitivity and substitutivity of equality, for example, and
to be able to assert equalities between urirefs as well as literals
and blank nodes.
On the other hand, we could also gain all the expressive advantages
of this device for literals without going this far, by making one
simple change.
3. Literals as subjects.
This proposal modifies the RDF syntax in a small way, by allowing
literals to be subjects. The current MT goes through in exactly the
same way, except of course the artificial restrictions in the closure
rules for RDFS closures are removed. This would have several notable
advantages for literals.
First, information about literal nodes - in particular, their data
type - can be expressed directly, instead of resorting to subterfuges
like the introduction of blank nodes. In fact, as pointed out above,
you can think of this as what you would get just by taking the
notational devices used in the 'blank-node' syntax proposals and
conflating the blank node with the literal node that it is linked to.
For example, the following graph written in bnode-style:
aaa bbb _:1 .
_:1 rdf:value "345" .
_:1 rdf:type xsd:integer .
would be boiled down into:
aaa bbb _:1:"345" .
_:1 rdf:type xsd:integer .
where I have been obliged to use Ntriples++ to indicate that the
subject node of the second triple is the object node of the first one.
(Notice that the use of nodeIDs in this style of Ntriples++ notation
is *exactly* parallel to its use in the 'bnode' graph; the only
difference is that some of the nodes being identified are no longer
blank. In fact, one could get the Ntriples++ for the second graph
from the Ntriples for the first one simply by making sure that the
rdf:value triples occur after the triples that generated them, and
then replacing all strings of the form
A<white>. <white>?<newline><white>?A<white>rdf:value<white>
where A is a nodeID, with
A:
and leaving the rest alone. )
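
The textual rewrite just described might be sketched with a regular expression in Python. It assumes nodeIDs have the form _:name, that whitespace means spaces or tabs, and that each rdf:value triple sits on the line directly after the triple that generated it; the function name to_ntriples_pp is hypothetical.

```python
import re

# A<white>.<white>?<newline><white>?A<white>rdf:value<white>, where A is a nodeID.
# The backreference \1 forces both occurrences of A to be the same nodeID.
PATTERN = re.compile(r"(_:\w+)[ \t]+\.[ \t]*\n[ \t]*\1[ \t]+rdf:value[ \t]+")

def to_ntriples_pp(text):
    """Replace each match with 'A:' and leave everything else alone."""
    return PATTERN.sub(r"\1:", text)

bnode_style = (
    'aaa bbb _:1 .\n'
    '_:1 rdf:value "345" .\n'
    '_:1 rdf:type xsd:integer .\n'
)
rewritten = to_ntriples_pp(bnode_style)
```

Applied to the bnode-style graph above, this yields the two-triple Ntriples++ form, with the rdf:type triple untouched.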
Second, this now gives us a very sweet way to characterize how to
determine the datatype of any given literal node: Generate the rdfs
closure of the graph, and see if it contains
_:node rdf:type <datatype> .
where _:node is the ID of the node and <datatype> is a datatype URI.
If it does, then the literal on that node has to be interpreted in
accordance with that datatype; if not, it doesn't. This works no
matter how the datatyping information is provided; it could be said
directly, or inferred from range information or even by
subclass-transitivity inference; all those variations are absorbed in
the details of the rdfs closure rules. (If it is provided by explicit
node labelling then we would need to incorporate an extra closure
rule to get it from the label into the graph explicitly.) The key
point is, that if the graph somehow establishes membership in a class
known to be a datatype, then that fixes it; if not, it is not fixed.
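
A minimal sketch of this lookup in Python, assuming a closure that implements only two rules (range typing and subclass propagation) rather than the full set of rdfs closure rules. The set DATATYPES, the property ccc, the class ex:small and the extra nodes _:3 and _:4 are hypothetical additions for illustration.

```python
RDF_TYPE, RANGE, SUBCLASS = "rdf:type", "rdfs:range", "rdfs:subClassOf"
DATATYPES = {"xsd:integer", "xsd:string"}   # urirefs known to name datatypes

def closure(triples):
    """Fixpoint of two RDFS rules: range typing and subclass propagation."""
    closed = set(triples)
    while True:
        new = set()
        for (s, p, o) in closed:
            for (s2, p2, o2) in closed:
                # If s has range o and (s2 s o2) holds, then o2 has type o.
                if p == RANGE and p2 == s:
                    new.add((o2, RDF_TYPE, o))
                # If s has type o and o is a subclass of o2, then s has type o2.
                if p == RDF_TYPE and p2 == SUBCLASS and s2 == o:
                    new.add((s, RDF_TYPE, o2))
        if new <= closed:
            return closed
        closed |= new

def datatypes_of(node, triples):
    """Datatypes that the closure establishes for the given literal node."""
    return {o for (s, p, o) in closure(triples)
            if s == node and p == RDF_TYPE and o in DATATYPES}

graph = {
    ("aaa", RANGE, "xsd:integer"),
    ("bbb", RANGE, "xsd:string"),
    ("foo", "aaa", "_:1"),
    ("baz", "bbb", "_:2"),
    ("foo", "ccc", "_:3"),                  # no datatype information at all
    ("_:4", RDF_TYPE, "ex:small"),          # literal node as subject
    ("ex:small", SUBCLASS, "xsd:integer"),
}
```

Nodes _:1, _:2 and _:4 each come out with exactly one datatype, while _:3 comes out with none, so its interpretation is not fixed.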
OK, that's all for now. I have to go to bed, I'm bushed.
Pat
PS. How about having an rdfs class called rdfs:Datatype? Then the
semantic rule on recognition could be relativised to membership in
that class, and an rdfs graph would have an explicit internal note of
the datatype/simple-class distinction.
--
---------------------------------------------------------------------
IHMC (850)434 8903 home
40 South Alcaniz St. (850)202 4416 office
Pensacola, FL 32501 (850)202 4440 fax
phayes@ai.uwf.edu
http://www.coginst.uwf.edu/~phayes
Received on Thursday, 1 November 2001 19:31:15 UTC