- From: Pat Hayes <phayes@ai.uwf.edu>
- Date: Thu, 1 Nov 2001 18:31:11 -0600
- To: w3c-rdfcore-wg@w3.org, pfps@research.bell-labs.com
I'm sorry I dropped the ball on this issue just as it heated up. I've been (literally) laid low with flu since Friday and unable to do anything much except whimper and cough.

Let me try to first summarize the MT changes that I managed to extract from the pfps/ph interchange - I think these are all pretty much the same as what Peter got out of it, in all but stylistic details - and then use this extended MT to respond to a lot of the recent comments, some of which I think are based on misunderstandings. I will also try to show how *all* the various proposals for encoding datatyping information in RDF syntax can be incorporated into this MT extension in a uniform way, and so can a few others that haven't been made yet.

OK. The basic idea of the MT extension arises from the fact that while the same literal label might mean different things in different contexts, datatyping information seems to remove what would otherwise be an 'ambiguity' when one looks at particular occurrences of literal labels. For example, it seems perfectly clear, just speaking intuitively, that the following graph

aaa rdfs:range xsd:integer .
bbb rdfs:range xsd:string .
foo aaa "101001" .
baz bbb "101001" .

can be unambiguously interpreted as saying that the aaa-value of foo is the integer (89456 + 11545) and the bbb-value of baz is the character string "101001". (The double quotes in the RDF triples are not to be interpreted as actual quotation marks, of course, but only as literal-markers, a convention which I abhor but will stick with throughout this message.)

The fact that there is no ambiguity in this graph, even though the same syntactic label is used with two different meanings, suggests that the datatyping information should not be thought of as a mapping from the literal labels - which is what one would get by extending the MT in a very straightforward way by simply including literal labels into the vocabulary of an interpretation - but rather as attached to the node itself.
The datatyping extension to the MT therefore introduces a new kind of mapping, which is very similar to an interpretation mapping, but which applies to the nodes of the graph rather than to the literals which are used to label those nodes. We might call such a thing a 'datatyping interpretation', but I think that might be confusing, so I will call it a 'typing'. An interpretation is a mapping from a URI vocabulary to entities; a typing is a mapping from nodes of the graph to datatypes. (I'll tell you what a datatype is in a minute.)

The current MT does need to be altered in a tiny degree to make this possible. Currently it says that XL is a global mapping from literals to LV, and then it uses that mapping once, and never mentions it again. We need to change that so we define XL to be a mapping from *occurrences* (tokens, inscriptions, whatever) of literals to LV; or, less mysteriously, from *literal nodes* to LV. Then the model theory works exactly as before; this mapping is still 'global' (though that is now perhaps an unfortunate word to use) in the sense that it is independent of the interpretation. (When we introduce a datatyping scheme, however, there is some connection between them, as we will see.)

Technicalities.

A datatype is a mapping from a lexical domain (a subset of literals) to a range of values (a set). A datatype scheme is a set DT plus a fixed mapping DTS from DT to datatypes. It is convenient to define DTC(x) to be the range (not rdfs:Range, just range) of the datatype DTS(x). If nnn is the URI of a type, we will write XD(nnn) for the member of DT that the uri identifies. This mapping XD is similar in some ways to an interpretation mapping I, but unlike I, it is assumed that XD is computable (in some way), ie given the uri of a datatyping scheme, the machine can somehow access the actual DTS and DTC mappings associated with that datatyping scheme and apply them.
(DTS and DT aren't *strictly* needed, in fact; one could have a single global mapping directly from urirefs to datatypes; but it is in the same spirit as the rest of the model theory so I will go on doing it this way. Think of the member of DT as the abstract datatype-thingie and DTS as the datatype version of the IEXT mapping in the MT. DTC is analogous to ICEXT.)

Now, a datatyping of a graph is simply a mapping from the literal nodes of the graph to a datatype scheme, ie an assignment of a type to each literal node of the graph. Of course, a graph by itself might have any number of datatypings, just as a vocabulary can have any number of interpretations, but we want to make sure that if we have one of each then that fixes the interpretation of every part of a labelled graph unambiguously, by placing mutual constraints on how they act together. So, in this spirit, a typed interpretation is a pair <I,D> of an interpretation (of the uri vocabulary) and a datatyping D (of the literal nodes of the graph) which together satisfy the following constraints:

1. XL(n) = DTS(D(n))(label(n))

2. ICEXT(d) is a subset of DTC(d) for any d in (DT intersect IR)

(In 1 we are using two rather different kinds of function; label(n) is just a kind of syntactic selection function which refers to the label on the node n, while the other mappings are semantic. Mathematically these are all just functions, but intuitively they have different roles.)

What these mean, intuitively, is exactly what one would expect: that the value of a literal (in the interpretation) is understood to be determined by the datatyping of the node on which it occurs; and that datatypes are treated appropriately as rdfs classes, in the sense that the RDFS class of a given datatype in the interpretation is a subset of the set of things which actually have that datatype according to D.
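As a rough illustration of condition 1 (a sketch only, not part of the MT), one can model each datatype as a function from lexical forms to values and a datatyping D as a table from literal nodes to members of DT. The node IDs `_:1` and `_:2`, the DT members `"integer"` and `"string"`, and the use of Python's `int` and `str` to stand in for the xsd:integer and xsd:string mappings are all assumptions made for the example.

```python
# Sketch of condition 1: XL(n) = DTS(D(n))(label(n)).
# Assumption: Python's int and str stand in for the lexical-to-value
# mappings of xsd:integer and xsd:string.

# DTS maps each member of DT to a datatype (a function on lexical forms).
DTS = {
    "integer": int,
    "string": str,
}

# label(n): the syntactic label on each literal node (hypothetical node IDs).
label = {"_:1": "101001", "_:2": "101001"}

# D: a datatyping, assigning a member of DT to each literal node.
D = {"_:1": "integer", "_:2": "string"}


def XL(node):
    """Value of a literal node under the datatyping D (condition 1)."""
    return DTS[D[node]](label[node])


print(XL("_:1") == 89456 + 11545)  # same label, read as an integer
print(XL("_:2") == "101001")       # same label, read as a string
```

Because XL is keyed on the node rather than on the label, the two occurrences of "101001" can take different values without any conflict.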
Now, the only other semantic condition we need on an interpretation is that it 'understands' that the urirefs that denote datatypes are in fact interpreted to refer to those datatypes, which can be stated as the condition

3. I(nnn) = XD(nnn)

If I satisfies condition 3 for the datatype named nnn then we will say that I 'recognizes' that datatype, and it is natural to require that I recognizes every datatype which is mentioned in the graph (ie where nnn occurs in the graph as a node label), so we will add this as a third condition on a typed interpretation <I,D>.

OK, that is all we require. We could sum all this up by saying that I has to 'respect' D by agreeing to use the datatype URIs in the appropriate ways, and by agreeing to interpret the datatype mappings as rdf:properties in a consistent way (consistent with D, that is.) In return, D will undertake to guarantee that all literal occurrences are interpreted in a way that will be consistent with anything that is said in the RDF triples. As long as I and D promise to keep to their respective vows concerning each other, we can be sure that their marriage will be happy. (A less anthropomorphic analogy might be to think of D as a kind of datatyping network between nodes along which datatyping information can flow, and which can carry different information to different nodes.)

This isn't really very different from extending the interpretation idea from urirefs to literals, but it acknowledges explicitly the ways that literal evaluation differs from uriref interpretations; the fact that once the datatyping has been determined, the value of the literal is *fixed* and is not subject to variation from one interpretation to another; but that also, in an odd way, the precise meaning of any given literal is hostage to the datatyping that is used to interpret it, and different occurrences might be interpreted differently.
It also makes it clear how the rest of the RDF graph, while it may not change the datatyping mappings themselves (they are still 'global' to the graph), can provide information (ie constrain the satisfying interpretations) which forces a particular literal label to be interpreted according to one or another datatyping scheme.

[Aside. Here's how to say this without using the DT/DTS/DTC stuff. Just assume a global mapping XD from urirefs to datatypes, and define a datatyping as a mapping D from nodes to datatypes. Then the first two conditions above can be stated:

1. XL(n) = D(n)(label(n))

2. ICEXT(I(x)) is a subset of {y: XD(x)(lit)=y & lit is a literal}

This shows how very simple it really is, but I still prefer the earlier way of stating it.]

How this works.

To illustrate this idea I will use the example graph given earlier. Since I want to refer to the actual nodes of the graph, however, I will use Ntriples++ notation to give the literal nodes labels, OK? Remember, this is the *same* graph, I've just given two of the nodes unique labels in the Ntriples, is all.

aaa rdfs:range xsd:integer .
bbb rdfs:range xsd:string .
foo aaa _:1:"101001" .
baz bbb _:2:"101001" .

Here is one datatyping of this graph:

DTS(D(_:1)) = (lambda (?x)(eval ?x))
DTS(D(_:2)) = (lambda (?x) 17)

This says that literals on node#1 are interpreted by LISP and anything on node#2 is interpreted to be the number 17. Fortunately, that datatyping scheme has no URI, but it is a scheme. There are infinitely many damn silly schemes, but most of them wouldn't make the graph come out to be true no matter what interpretation you were to combine them with. Obviously the scheme we want is this:

DTS(D(_:1)) = XD(xsd:integer)
DTS(D(_:2)) = XD(xsd:string)

And the cute thing is that this is guaranteed by the semantic conditions, since the graph says enough to lock down this as the only possible datatyping that would yield a satisfying typed interpretation <I,D>. Let me just do this for the integer case.
If <I,D> is a typed interpretation satisfying this graph, then I must recognize xsd:integer, so from 3:

I(xsd:integer) = XD(xsd:integer)

and by 2, ICEXT(I(xsd:integer)) must be a subset of DTC(I(xsd:integer)), ie of DTC(XD(xsd:integer)), ie the value space of xsd:integer. It follows from the ordinary rdfs:range semantic conditions that the only way that I could make the third triple true is if XL(_:1) is in the class DTC(XD(xsd:integer)), which requires (by the first semantic condition) that the datatyping D(_:1) on _:1 be the lexical-value mapping specified by xsd:integer, since XL(_:1) = DTS(D(_:1))("101001").

(Actually what I said above isn't exactly true. The conditions do not guarantee that this datatyping is specified uniquely; but they do impose that whatever datatyping is used, it must *agree* with xsd:integer and xsd:string on all the literal nodes in the graph. I could invent a silly datatyping which was like xsd on all numerals less than a googolplex, but switched to hexadecimal beyond that, say. That would work. The point being that you wouldn't know the difference, so it doesn't matter.)

Alternative syntaxes

Now let me illustrate how a variety of alternative ideas for specifying datatyping information in RDF graphs can be seen as extensions to this, by adding different kinds of constraints on datatyped interpretations.

1. Explicit schema datatyping.

Assume that every occurrence of a literal is somehow explicitly labelled with a datatype in the graph itself, eg by redefining 'literal labels' to be pairs of a literal and a uri indicating a datatype.
Let me use label(n) and dtype(n) respectively for these two parts of the literal label at a node; then simply define a datatyping scheme in the obvious way, by requiring it to be defined by the dtype labels:

4.1 D(n) = I(dtype(n))

Substituting 4.1 into 1 gives:

XL(n) = DTS(I(dtype(n)))(label(n))

and assuming that dtype(n) is indeed a datatype uriref, then 3 in turn gives:

XL(n) = DTS(XD(dtype(n)))(label(n))

showing that in this case we can eliminate D from the equations altogether, and simply use the single 'global fixed' mapping XD to determine the literal value of a literal independently of the interpretation I and hence independently of the rest of the RDF graph. (This is in fact what I had in the back of my mind when defining the original MT, by declaring XL to be a global, fixed mapping.) So if we were to impose this very strict local datatype labelling scheme, in effect incorporating the datatype into the literal label itself, then there really is no need for this extension to the model theory. However, what this does show is that this kind of strict labelling scheme does not contradict the more flexible scheme, so they could be used together without any risk of being mutually incompatible.

2. Bnodes as literal values.

Let me put together here under one heading a variety of variations of the following theme: that occurrences of literals in value position in a triple should be understood as an abbreviation of a pair of triples with an 'intermediate' bnode, where the bnode denotes the triple value and the literal label itself is relegated to a minor role of somehow 'illustrating' how that value could be written in some notation. For example: rewrite

aaa bbb lit .

as

aaa bbb _:1 .
_:1 rdf:value lit .

Before analyzing this, I note that it has the advantage of putting a node that denotes the literal value in subject position, where other properties (such as rdf:type) can be asserted of it.
This is indeed a major advantage - I think the *only* advantage - of this proposal, but we could render this moot simply by declaring that RDF shall allow literals in subject position. I will return to this point later.

The first of these triples seems easy to interpret: it means exactly what the original triple meant, in fact. It's the second triple that seems rather odd. Since _:1 is supposed to denote the literal value, it would seem to have the same value at both ends. We have provided now two nodes to refer to the same thing: one blank, but in subject position; the other with a literal label to say what it is. The first, blank, node can now safely be asserted to have an rdf:type which is a datatype, and if we make reasonable assumptions about interpretations being in accordance with the global XD mappings from datatype URIs (similar to the conditions listed here), then we will be able to infer the datatypes of those blank nodes by normal rdfs inference. But the only way to know what the first node actually *means* is to look at the label on the second node. So we now have an odd juxtaposition, where rdf:value has a literal label at one end which has no assigned datatype, and a datatype at the other but no literal there to use it on.

There are several ways to get around this. The simplest, in the current framework, is simply to declare that rdf:value is equality, ie that <x,y> is in IEXT(I(rdf:value)) iff x=y. This effectively forces I(_:1) to be the same as XL(lit), and then the <I,D> semantic machinery works in this case in exactly the same way that it works in the 'plain' case. This however is not how the proponents of this kind of scheme usually think of it.

Another way is to insist that the literal label in the second triple is treated in a nonstandard way: rather than denoting a literal value, it is being mentioned rather than used; it simply indicates the literal itself. (Or, equivalently, all literals are treated as strings.)
The intuitive meaning of rdf:value is then a kind of inversion of the denotation mapping itself: it assigns a literal label to the literal value that is the semantic value of that label in the given interpretation. This seems to be what Sergey has in mind, and it is also I think what Dan C. suggested a while back as the best way to handle literals.

Now, this seems to me to have a fatal flaw, which arises from the fact that the value spaces of two different datatypes might overlap. For example, suppose that there are datatypes xxd:octal and xxd:decimal, then the following would seem to be perfectly true:

_:1 rdf:type xxd:octal
_:1 rdf:type xxd:decimal
_:1 rdf:value "32"
_:1 rdf:value "26"

since indeed the number twenty-six is the value of both a decimal and an octal numeral, so that number, which is what I(_:1) should be in any satisfying interpretation, is in both class extensions, and <26,"26"> is in IEXT(I(rdf:value)) when 26 is of type xxd:decimal, and <26,"32"> is in it when 26 is of type xxd:octal, so *both* of those pairs have to be in it.

The point being that in cases like this, it isn't enough to just attach a datatype to the *interpretation* of the literal, ie the literal value denoted by the blank node. You have to somehow get it attached *to the literal itself*, and by making the separation a syntactic separation, there is no way to do that. The only way to do that in this kind of scheme would be to impose a syntactic constraint on RDF graphs that required any node to be the subject of at most one rdf:value arc; and since literals only appear at the object ends of those arcs, the only function of the rdf:value edge in the graph is to attach a unique literal label to the blank node. It seems less trouble, and much clearer in meaning, to simply attach it directly to the 'blank' node and throw that edge away, and then we are back where we started.
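The overlap in the octal/decimal example can be checked mechanically. In this rough sketch, Python's `int(s, 8)` and `int(s, 10)` stand in for the lexical-to-value mappings of the hypothetical xxd:octal and xxd:decimal; the variable names and the table layout are assumptions made for illustration only.

```python
# Assumption: int(s, 8) and int(s, 10) stand in for the lexical-to-value
# mappings of the hypothetical datatypes xxd:octal and xxd:decimal.
datatypes = {
    "xxd:octal": lambda s: int(s, 8),
    "xxd:decimal": lambda s: int(s, 10),
}

# What the bnode graph records: two types and two rdf:value labels on _:1,
# with nothing saying which label goes with which type.
types_of_node = ["xxd:octal", "xxd:decimal"]
values_of_node = ["32", "26"]

# Every (type, label) pairing the graph allows, with the value it yields.
readings = {
    (t, lit): datatypes[t](lit)
    for t in types_of_node
    for lit in values_of_node
}

for (t, lit), value in sorted(readings.items()):
    print(t, repr(lit), "->", value)
```

Only two of the four pairings yield twenty-six, and nothing in the graph itself picks those two out, which is one way of seeing why the datatype has to be tied to the literal rather than to the value.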
(Notice that if we did that, then this kind of pathological example becomes impossible, since we would have to attach two different literal labels to the 'blank' node. The syntactic separation of lexical and value spaces in the RDF graph simply creates new opportunities for confusion, which is what usually happens when one gets use and mention mixed up in this kind of way.)

On balance, therefore, I would prefer to adopt the first interpretation of rdf:value as simply meaning equality. However, if we are going to introduce equality into RDF, let us do so properly. This is quite a significant change to the language, making it much more expressive. We ought to be able to make inferences which follow from the transitivity and substitutivity of equality, for example, and to be able to assert equalities between urirefs as well as literals and blank nodes.

On the other hand, we could also gain all the expressive advantages of this device for literals without going this far, by making one simple change.

3. Literals as subjects.

This proposal modifies the RDF syntax in a small way, by allowing literals to be subjects. The current MT goes through in exactly the same way, except of course the artificial restrictions in the closure rules for RDFS closures are removed.

This would have several notable advantages for literals. First, information about literal nodes - in particular, their data type - can be expressed directly, instead of resorting to subterfuges like the introduction of blank nodes. In fact, as pointed out above, you can think of this as what you would get just by taking the notational devices used in the 'blank-node' syntax proposals and conflating the blank node with the literal node that it is linked to. For example, the following graph written in bnode-style:

aaa bbb _:1 .
_:1 rdf:value "345" .
_:1 rdf:type xsd:integer .

would be boiled down into:

aaa bbb _:1:"345" .
_:1 rdf:type xsd:integer .
where I have been obliged to use Ntriples++ to indicate that the subject node of the second triple is the object node of the first one. (Notice that the use of nodeIDs in this style of Ntriples++ notation is *exactly* parallel to its use in the 'bnode' graph; the only difference is that some of the nodes being identified are no longer blank. In fact, one could get the Ntriples++ for the second graph from the Ntriples for the first one simply by making sure that the rdf:value triples occur after the triples that generated them, and then replacing all strings of the form

A<white>. <white>?<newline><white>?A<white>rdf:value<white>

where A is a nodeID, with

A:

and leaving the rest alone.)

Second, this now gives us a very sweet way to characterize how to determine the datatype of any given literal node: Generate the rdfs closure of the graph, and see if it contains

_:node rdf:type <datatype> .

where _:node is the ID of the node and <datatype> is a datatype URI. If it does, then the literal on that node has to be interpreted in accordance with that datatype; if not, it doesn't. This works no matter how the datatyping information is provided; it could be said directly, or inferred from range information or even by subclass-transitivity inference; all those variations are absorbed in the details of the rdfs closure rules. (If it is provided by explicit node labelling then we would need to incorporate an extra closure rule to get it from the label into the graph explicitly.) The key point is that if the graph somehow establishes membership in a class known to be a datatype, then that fixes it; if not, it is not fixed.

OK, that's all for now. I have to go to bed, I'm bushed.

Pat

PS. How about having an rdfs class called rdfs:Datatype? Then the semantic rule on recognition could be relativised to membership in that class, and an rdfs graph would have an explicit internal note of the datatype/simple-class distinction.
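The closure test described under 'Literals as subjects' can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual RDFS closure: only two rules are modelled (the rdfs:range rule and subclass propagation of rdf:type), triples are plain Python tuples, literal nodes are represented by their node IDs, and the function names `closure` and `datatype_of` are made up for the example.

```python
# Sketch: find the datatype of a literal node by computing a small part of
# the RDFS closure. Only two rules are modelled (range and subClassOf);
# this is an illustrative assumption, not the full set of closure rules.

RANGE, TYPE, SUBCLASS = "rdfs:range", "rdf:type", "rdfs:subClassOf"


def closure(triples):
    """Apply the range and subclass rules until no new triples appear."""
    graph = set(triples)
    while True:
        new = set()
        for s, p, o in graph:
            for s2, p2, o2 in graph:
                # range rule: (s p o) and (p rdfs:range o2) give (o rdf:type o2)
                if p2 == RANGE and p == s2:
                    new.add((o, TYPE, o2))
                # subclass rule: (s rdf:type o) and (o rdfs:subClassOf o2)
                # give (s rdf:type o2)
                if p == TYPE and p2 == SUBCLASS and s2 == o:
                    new.add((s, TYPE, o2))
        if new <= graph:
            return graph
        graph |= new


def datatype_of(node, triples, datatypes):
    """Return the set of datatype URIs the closure assigns to the node."""
    return {o for s, p, o in closure(triples)
            if s == node and p == TYPE and o in datatypes}


graph = [
    ("aaa", RANGE, "xsd:integer"),
    ("bbb", RANGE, "xsd:string"),
    ("foo", "aaa", "_:1"),   # _:1 is the literal node labelled "101001"
    ("baz", "bbb", "_:2"),   # _:2 is the literal node labelled "101001"
]
known = {"xsd:integer", "xsd:string"}
print(datatype_of("_:1", graph, known))
print(datatype_of("_:2", graph, known))
```

An empty result corresponds to the case where the graph does not fix the datatype of the node.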
--
---------------------------------------------------------------------
IHMC                      (850)434 8903   home
40 South Alcaniz St.      (850)202 4416   office
Pensacola, FL 32501       (850)202 4440   fax
phayes@ai.uwf.edu
http://www.coginst.uwf.edu/~phayes
Received on Thursday, 1 November 2001 19:31:15 UTC