simple RDF datatype

Modifications by PatrickS: (in red)

RDF/XML examples added
Changed all occurrences of xsd:integer to xsd:integer
Commented out (~~strikethrough~~) sections/examples with rdfs:dlex (to avoid wasted work on XML examples for stuff that's likely going away...)
Commented out (~~strikethrough~~) examples using octal datatype or addressing localization issues (examples restricted to XML Schema datatypes)

Note: In some of the examples, where rdf:resource is used, I have made use of ENTITYs in a similar fashion to qnames (e.g. "&xsd;integer") but have not defined their expansions. Hopefully, their meaning will be clear.

This version of datatyping, a new variation on the one in

http://www.coginst.uwf.edu/users/phayes/simpledatatype.html

has the following features:

Pro:

a. Literals always denote themselves (so can be tidy)
b. It supports the S-B idiom (using rdfs:drange)
c. It allows the use of S-B, local typing using datatype triples, and range datatyping, in any combination
d. It avoids most datatype clashes and provides a technique for resolving the ones that do arise
e. Datatype class names denote the value space of the datatype.

Con:

f. It requires the use of rdfs:drange (or something other than rdfs:range, anyway)
g. Like all these simplified proposals, it doesn't provide any way to 'declare' datatype urirefs, and relies on automagical recognition.

I've taken out the doublet idiom. This could be put back if the WG wants it, however.

The new stuff is all contained in section 5 and the MT.

Pat Hayes 2/23/2002

-------------------------------------------------------

1. Literals

In RDF, urirefs and blank nodes are both considered to be referring expressions; they are used to denote resources. Literals however are best thought of simply as syntactic 'labels' which indicate a lexical form. These lexical forms can be used to restrict the references of other nodes by using datatype schemes, but this use is optional. If a literal is used as a referring expression, it always refers to itself - that is, to a character string - so that a triple of the form

Jenny ex:age "35" .

In RDF/XML:

   <rdf:Description rdf:about="#Jenny">
      <ex:age>35</ex:age>
   </rdf:Description>

states that the value of the property called ex:age on the subject Jenny is the two-character string '35'. Note that it does not say that the value is the number thirty-five. There is no way to modify the meaning of a literal node.

An example of such 'in-line' use of a literal to denote a string is provided by dc:title in the Dublin Core.

2. Datatypes

If the intended meaning of literals is understood by a set of users or applications, then the simple use case illustrated by the above example could be sufficient. This 'untyped' kind of usage is always available in RDF. However, RDF also provides ways to use datatypes to assert that a literal should be interpreted in a particular way.

A datatype is defined abstractly by two domains, one of lexical forms and one of values, and a mapping from lexical forms to values. We assume that a datatype is indicated by a URI, and that some external mechanism is able to access and make use of appropriate representations of the domains and map when supplied with the URI. The model theory is stated in terms of a global function L2V from datatypes to the lexical-to-value mapping of that datatype. In the examples below, urirefs which are being interpreted as datatype names will be indicated by the use of the color green.
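The abstract definition above can be sketched in Python. This is only an illustration under stated assumptions: the Datatype class, the REGISTRY dictionary (standing in for the unspecified "external mechanism") and the regular expression used for the integer lexical space are not part of this proposal, and L2V is keyed here by URI string rather than by the datatype itself.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class Datatype:
    """A datatype: a lexical space (given as a test) plus a lexical-to-value map."""
    uri: str
    is_lexical_form: Callable[[str], bool]
    to_value: Callable[[str], Any]


# Hypothetical registry standing in for the "external mechanism" keyed by URI.
XSD_INTEGER = Datatype(
    uri="http://www.w3.org/2001/XMLSchema#integer",
    # Assumed lexical space: optional sign followed by ASCII digits.
    is_lexical_form=lambda s: re.fullmatch(r"[+-]?[0-9]+", s) is not None,
    to_value=int,
)
REGISTRY: Dict[str, Datatype] = {XSD_INTEGER.uri: XSD_INTEGER}


def L2V(uri: str) -> Callable[[str], Any]:
    """Return the (partial) lexical-to-value map of the datatype named by uri."""
    datatype = REGISTRY[uri]

    def mapping(lexical_form: str) -> Any:
        # The map is undefined outside the lexical space of the datatype.
        if not datatype.is_lexical_form(lexical_form):
            raise ValueError(f"{lexical_form!r} is not a lexical form of {uri}")
        return datatype.to_value(lexical_form)

    return mapping
```

With this sketch, L2V(XSD_INTEGER.uri)("35") yields the number 35, while a string such as "HumptyDumpty" raises ValueError because the map is undefined there.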

3. Datatype triples

The simplest way to talk about the value of a literal under a datatype mapping is to provide a node to denote the value and link that node to the datatype, using the name of the datatype as the property. This is called a datatype triple. For example

Jimmy ex:age _:x .
_:x xsd:integer "35" .

In RDF/XML:

   <rdf:Description rdf:about="#Jimmy">
      <ex:age>
         <rdf:Description>
            <xsd:integer>35</xsd:integer>
         </rdf:Description>
      </ex:age>
   </rdf:Description>

says that Jimmy's age is the value of the literal under the datatype mapping xsd:integer, i.e. that Jimmy's age is the number 35. (Contrast this with the example in the previous section.) The datatype triple also, incidentally, asserts that the literal itself is in the lexical space of the datatype. For example,

_:x xsd:integer "HumptyDumpty" .

In RDF/XML:

   
   <rdf:Description rdf:about="#Jimmy">
      <ex:age>
         <rdf:Description>
            <xsd:integer>HumptyDumpty</xsd:integer>
         </rdf:Description>
      </ex:age>
   </rdf:Description>

would always be false, no matter what value is assigned to the bnode. This is the only way in which an RDF triple can be contradictory.

A datatype triple is true when the literal is a well-formed lexical form of the datatype, and the subject denotes the value of the lexical form under that datatype's lexical-to-value mapping. The intuitive reading might be "..can be described, according to this datatype mapping, by the character string..".

(This is 'backwards' from the usual way of thinking about a datatype mapping as applying to the lexical form and resulting in the value; the reason for this is simply the RDF syntactic convention that prohibits literals in subject position. Technically, the RDF datatype property is in fact the inverse of the datatype's lexical-to-value mapping; the lexical-to-value mapping goes 'from' the object of the triple 'to' the subject.)
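The truth condition for a datatype triple can be illustrated with a small Python sketch. The function names and the use of None to signal "undefined" are assumptions made for this example only; integer_l2v approximates the xsd:integer mapping with a simple regular expression.

```python
import re


def integer_l2v(lexical_form):
    """Assumed lexical-to-value map for an integer datatype; None when undefined."""
    if re.fullmatch(r"[+-]?[0-9]+", lexical_form) is None:
        return None
    return int(lexical_form)


def datatype_triple_is_true(subject_value, l2v, literal):
    """True iff literal is a well-formed lexical form and subject_value is its value.

    The property relates the value (subject) to the lexical form (object),
    i.e. it is the inverse of the lexical-to-value map l2v.
    """
    value = l2v(literal)
    return value is not None and value == subject_value
```

Under these assumptions, datatype_triple_is_true(35, integer_l2v, "35") holds, whereas no choice of subject value makes the triple with "HumptyDumpty" true.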

3.1 Datatype properties are a local constraint on literals.

The datatype triple is the most 'local' style of literal datatyping in RDF; the interpretation imposed on the subject node by the datatype property is entirely 'inside' the triple. This means for example that the same literal can be used simultaneously in two different such triples, imposing different interpretations on two different nodes.

For example, if ex:octalnumber were a datatype property, then as well as using the literal as a decimal to indicate Jimmy's age, we could also assert

Judy ex:age _:y .
_:y ex:octalnumber "35" .

to assert that Judy's age was 29, and both uses of the literal could be in the same RDF graph. Although the two bnodes _:x and _:y denote distinct values, the literal itself has the same meaning in both cases - the lexical form.

Similarly, two different literal representations of the same value could be specified using two different datatype triples which include the same subject:

_:y ex:USdecimal "12.25" .
_:y ex:germandecimal "12,25" .

Obviously, this only works when the literals do in fact map to the same value under the respective mappings.
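The following Python sketch illustrates both points: one literal interpreted under two different mappings, and two literals that map to one value. The lexical spaces chosen for ex:octalnumber, ex:USdecimal and ex:germandecimal are assumptions, since these datatypes are only hypothetical examples in this document.

```python
import re
from decimal import Decimal


def _checked(pattern, lexical_form):
    """Return lexical_form if it matches the assumed lexical space, else raise."""
    if re.fullmatch(pattern, lexical_form) is None:
        raise ValueError(f"{lexical_form!r} is outside the lexical space")
    return lexical_form


def xsd_integer(lexical_form):
    return int(_checked(r"[+-]?[0-9]+", lexical_form))


def ex_octalnumber(lexical_form):
    # Hypothetical datatype: digits 0-7 read in base 8.
    return int(_checked(r"[0-7]+", lexical_form), 8)


def ex_usdecimal(lexical_form):
    # Hypothetical datatype: '.' as the decimal separator.
    return Decimal(_checked(r"[0-9]+\.[0-9]+", lexical_form))


def ex_germandecimal(lexical_form):
    # Hypothetical datatype: ',' as the decimal separator.
    return Decimal(_checked(r"[0-9]+,[0-9]+", lexical_form).replace(",", "."))


# The same literal "35" constrains two different nodes in two different ways.
jimmy_age = xsd_integer("35")
judy_age = ex_octalnumber("35")

# Two different literals can describe one node only if the values coincide.
same_value = ex_usdecimal("12.25") == ex_germandecimal("12,25")
```

Here jimmy_age is 35, judy_age is 29, and same_value is True, so the two decimal triples can share a subject.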

3.2 Datatype properties have exact domains and ranges.

We make one additional assumption concerning the use of datatype properties: they have exact domains and ranges.

Normally in RDFS, an assertion about a range:

ppp rdfs:range ccc .

In RDF/XML:

   <rdf:Description rdf:about="#ppp">
      <rdfs:range rdf:resource="#ccc"/>
   </rdf:Description>

is understood to say that the precise range of ppp is a subset of the class ccc. This allows RDFS to combine multiple range assertions coherently and reflects the fact that the language has no way to express a 'lower bound' on the membership in a class. However, we will assume that for datatype properties, such an assertion is true only when ccc is the exact range of the property, no more and no less. This exact range is the lexical space of the datatype, so:

ppp rdfs:range ccc .

In RDF/XML:

   <rdf:Description rdf:about="#ppp">
      <rdfs:range rdf:resource="#ccc"/>
   </rdf:Description>

asserts that the class ccc is precisely the set of lexical forms that are acceptable to the datatype ppp.

4. Missing datatype information: rdfs:dlex

Sometimes one wishes to associate a literal with a value without specifying a particular datatype. RDFS provides a special property for this kind of underdetermined association, called rdfs:dlex (read: Datatype LEXical form). The triple

_:x rdfs:dlex "37" .

asserts simply that _:x is a value which can be represented by the character string under some possible datatype mapping. This does not in itself 'fix' the value, of course, but it can be used as a way of making the association between the value and a lexical form explicit, for later use or amplification. We will call this a lexical form triple. A useful way to think of the meaning of rdfs:dlex is: "..can be described by the character string.."

Notice that since rdfs:dlex is not a datatype, it can be used to link several different literals to the same node:

_:x rdfs:dlex "37" .
_:x rdfs:dlex "29" .

However, this should be done with caution, as this usage may conflict with the technique described next.

5. Attaching datatype constraints to a property: rdfs:drange.

It is often convenient to associate a datatype with the range of a property, so that every use of the property can be understood as asserting appropriate datatyping conditions about its object. RDFS provides the special property rdfs:drange for this purpose. (Read as datatype range; but do not confuse this with rdfs:range, which has quite a different meaning.)

There are two kinds of datatype conditions that one might wish to attach to a property, depending on whether the object of the property is a literal, or a value linked to a literal in a lexical form triple.

In the first case, the usual purpose of linking the datatype to the property is to state that the literal in the object position conforms to the lexical conditions of the datatype. For example, we might wish to 'restrict' the property ex:age so that it is used only when applied to numerals, so that

ex:age rdfs:drange xsd:integer .
Jenny ex:age "35" .

In RDF/XML:

   <rdf:Description rdf:about="&ex;age">
      <rdfs:drange rdf:resource="&xsd;integer"/>
   </rdf:Description>

   <rdf:Description rdf:about="#Jenny">
      <ex:age>35</ex:age>
   </rdf:Description>

has the same meaning as in section 1, but

ex:age rdfs:drange xsd:integer .
Jenny ex:age "HumptyDumpty" .

In RDF/XML:

   <rdf:Description rdf:about="&ex;age">
      <rdfs:drange rdf:resource="&xsd;integer"/>
   </rdf:Description>

   <rdf:Description rdf:about="#Jenny">
      <ex:age>HumptyDumpty</ex:age>
   </rdf:Description>

would be flagged as a datatype violation, by virtue of the association of the datatype with the property. (Note however that this does not assert that the rdfs:range of the property is the class xsd:integer; if it did, then any ex:age triple with a literal object would be false, even "35", because a literal always denotes a character string rather than a number.)

The usual intention in the second case, however, is to impose a similar condition on the lexical-to-value mapping used to interpret any lexical form triples containing the object, so that

ex:age rdfs:drange xsd:integer .
Jimmy ex:age _:x .
_:x rdfs:dlex "35" .

means that Jimmy's age is the number 35. Here, the datatype is 'projected' across the bnode to impose an interpretation on rdfs:dlex, in effect making the lexical form triple have the same content as a datatype triple.

Figure 1 (diagram of the effect of rdfs:drange): Datatype conditions imposed by rdfs:drange. The 'blunt' end of the lexical-to-value map always attaches to the literal.

Both of these datatyping restrictions are considered to be part of the meaning of rdfs:drange, and they comprise its total meaning. All it does is to associate datatype restrictions to other property names in these two ways. If the object of an rdfs:drange triple is not a datatype, then the triple is vacuous, and makes no assertion at all.
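The two conditions carried by rdfs:drange can be sketched as a simple check over a list of triples. This is a rough illustration, not a definitive implementation: the Literal wrapper, the DATATYPES table of recognised datatype urirefs and the function apply_drange are all assumptions introduced here, and treating an ill-formed rdfs:dlex literal as a violation is an interpretation of the second condition.

```python
import re
from typing import NamedTuple


class Literal(NamedTuple):
    """Hypothetical wrapper marking a term as a literal."""
    lexical_form: str


def integer_l2v(lexical_form):
    """Assumed integer lexical-to-value map; None when undefined."""
    if re.fullmatch(r"[+-]?[0-9]+", lexical_form) is None:
        return None
    return int(lexical_form)


# Assumed set of recognised datatype urirefs and their maps.
DATATYPES = {"xsd:integer": integer_l2v}


def apply_drange(graph):
    """Return (violations, values) implied by the rdfs:drange triples in graph."""
    # An rdfs:drange triple whose object is not a datatype is vacuous.
    dranges = [(s, o) for s, p, o in graph if p == "rdfs:drange" and o in DATATYPES]
    violations, values = [], {}
    for prop, datatype in dranges:
        l2v = DATATYPES[datatype]
        for s, p, o in graph:
            if p != prop:
                continue
            if isinstance(o, Literal):
                # First case: a literal object must be in the lexical space.
                if l2v(o.lexical_form) is None:
                    violations.append((s, p, o))
                continue
            # Second case: rdfs:dlex triples on the object are read with the datatype.
            for s2, p2, o2 in graph:
                if s2 == o and p2 == "rdfs:dlex" and isinstance(o2, Literal):
                    value = l2v(o2.lexical_form)
                    if value is None:
                        violations.append((s2, p2, o2))
                    else:
                        values.setdefault(o, set()).add(value)
    return violations, values
```

For the Jenny and Jimmy examples above, this returns no violations and assigns the value 35 to the blank node, while the "HumptyDumpty" triple is reported as a violation.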

In particular, a rdfs:drange assertion places no restrictions on the rdfs:range of the property. Although it would often be natural to consider the range of the property to be the lexical space of the datatype in the first case, and the value space of the datatype in the second, this should be asserted separately if the user wishes to make it explicit.

We note that this convention uses datatype urirefs both as properties and as class names. This is quite legal in RDF, and indeed there is a basic assumption which relates the two uses: the datatype class names the value space of the datatype, which is the domain of the datatype property (recall that properties are 'backwards' lexical-to-value maps); so the following is true for any datatype ddd:

ddd rdfs:domain ddd .

In RDF/XML:

   <rdf:Description rdf:about="#ddd">
      <rdfs:domain rdf:resource="#ddd"/>
   </rdf:Description>

To refer to the lexical domain, use rdfs:range applied to the datatype property. For example, the following two triples would restrict the rdfs:range of ex:age to be a subset of the lexical space of the datatype:

xsd:integer rdfs:range _:x .
ex:age rdfs:range _:x .

In RDF/XML:

   <rdf:Description rdf:about="&xsd;integer">
      <rdfs:range rdf:resource="#x"/>
   </rdf:Description>

   <rdf:Description rdf:about="&ex;age">
      <rdfs:range rdf:resource="#x"/>
   </rdf:Description>

and would therefore be suitable for use with the 'in-line' idiom used in section 1 above; while

ex:age rdfs:range xsd:integer .

In RDF/XML:

   <rdf:Description rdf:about="&ex;age">
      <rdfs:range rdf:resource="&xsd;integer"/>
   </rdf:Description>

asserts that the range of the property is restricted to the value space of the datatype, so would be suitable for use with the ~~lexical triple or~~ datatype triple idioms. However, to reiterate, the same rdfs:drange assertions would be appropriate in either case.

5.1 rdfs:drange is graph-wide in scope, so can produce clashes.

These extra datatype interpretations imposed on a property by rdfs:drange apply to any such usage of the property anywhere in the RDF graph, so an rdfs:drange assertion has a much wider 'scope' than a datatype triple, and therefore needs to be used with care. For example, if several different literals are linked to a single node, then long-range datatyping can produce a conflict:

ex:age rdfs:drange xsd:integer .

Jenny ex:age _:x .
_:x rdfs:dlex "37" .
_:x rdfs:dlex "29" .

The blank node here is required by the rdfs:drange assertion to have two distinct values at the same time. This situation is called a datatype clash, and is best avoided.

Similarly, if two different rdfs:drange assertions are made about the same property, then they both apply to it. E.g.

ex:age rdfs:drange xsd:integer .
ex:age rdfs:drange xsd:duration .

In RDF/XML:

   <rdf:Description rdf:about="&ex;age">
      <rdfs:drange rdf:resource="&xsd;integer"/>
   </rdf:Description>

   <rdf:Description rdf:about="&ex;age">
      <rdfs:drange rdf:resource="&xsd;duration"/>
   </rdf:Description>

If the relevant datatypes have disjoint lexical spaces, or if their lexical-to-value maps fail to give the same values to a lexical form, then any use of the property with a literal is likely to produce a datatype clash. This requires particular care when merging information from different graphs which may have been written with different, and incompatible, conventions about literal datatyping.
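A datatype clash on a single node can be detected with a sketch like the one below. The function has_clash and the two simplified maps are assumptions for illustration; xsd:string is used as the second datatype in place of xsd:duration only because its mapping is trivial to write down.

```python
import re


def integer_l2v(lexical_form):
    """Assumed integer lexical-to-value map; None when undefined."""
    if re.fullmatch(r"[+-]?[0-9]+", lexical_form) is None:
        return None
    return int(lexical_form)


def string_l2v(lexical_form):
    """Assumed string map: every lexical form denotes itself."""
    return lexical_form


def has_clash(l2v_maps, lexical_forms):
    """True if no single value satisfies every datatype constraint on a node.

    l2v_maps: the lexical-to-value maps projected onto the node by rdfs:drange.
    lexical_forms: the literals attached to the node by rdfs:dlex.
    """
    values = set()
    for l2v in l2v_maps:
        for lexical_form in lexical_forms:
            value = l2v(lexical_form)
            if value is None:
                return True  # outside the lexical space of one datatype
            values.add(value)
    return len(values) > 1
```

With one integer constraint and the literals "37" and "29" the node would need two values, so has_clash reports a clash; the literals "35" and "035" do not clash because both map to 35.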

5.2 Avoiding datatype clashes

~~Unless you are sure that the datatypes in use will not produce clashes, never use rdfs:dlex with two different literals on the same node.~~

One technique to resolve larger-range clashes is to re-label the properties. Suppose for example that an RDF graph contains

ex:age rdfs:drange xsd:integer .

In RDF/XML:

   <rdf:Description rdf:about="&ex;age">
      <rdfs:drange rdf:resource="&xsd;integer"/>
   </rdf:Description>

and we wish to add some information from another graph which uses a conflicting datatype convention:

ex:age rdfs:drange xsd:string .

In RDF/XML:

   <rdf:Description rdf:about="&ex;age">
      <rdfs:drange rdf:resource="&xsd;string"/>
   </rdf:Description>

To do so, introduce two new property names, say ex:age1 and ex:age2, transcribe all occurrences of ex:age from one graph into one of these and all occurrences from the other graph into the other, and then add:

ex:age1 rdfs:subPropertyOf ex:age .
ex:age2 rdfs:subPropertyOf ex:age .

In RDF/XML:

   <rdf:Description rdf:about="&ex;age1">
      <rdfs:subPropertyOf rdf:resource="&ex;age"/>
   </rdf:Description>

   <rdf:Description rdf:about="&ex;age2">
      <rdfs:subPropertyOf rdf:resource="&ex;age"/>
   </rdf:Description>

This gives

ex:age1 rdfs:drange xsd:integer .
ex:age2 rdfs:drange xsd:string .

In RDF/XML:

   <rdf:Description rdf:about="&ex;age1">
      <rdfs:drange rdf:resource="&xsd;integer"/>
   </rdf:Description>

   <rdf:Description rdf:about="&ex;age2">
      <rdfs:drange rdf:resource="&xsd;string"/>
   </rdf:Description>

which does not produce any datatype clashes, retains both particular ways of imposing meanings on literals - since these restrictions are associated with the particular property name - and still allows all RDFS conclusions using the original ex:age property to be drawn from the information in either of the graphs. This trick works because datatyping constraints are not inherited 'upwards' through subproperty relationships; similarly, a superclass of a datatype class need not itself be a datatype class.
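The re-labelling technique can be sketched as a transformation on two lists of triples. The functions relabel and merge_without_clash and the plain-string encoding of terms (literals written with their quotation marks) are assumptions made for this example.

```python
def relabel(graph, old, new):
    """Copy graph, replacing every occurrence of the term old by new."""
    return [tuple(new if term == old else term for term in triple) for triple in graph]


def merge_without_clash(graph_a, graph_b, prop, prop_a, prop_b):
    """Merge two graphs that use prop with conflicting rdfs:drange conventions."""
    merged = relabel(graph_a, prop, prop_a) + relabel(graph_b, prop, prop_b)
    # Keep RDFS conclusions about the original property available.
    merged.append((prop_a, "rdfs:subPropertyOf", prop))
    merged.append((prop_b, "rdfs:subPropertyOf", prop))
    return merged


graph_a = [("ex:age", "rdfs:drange", "xsd:integer"), ("Jenny", "ex:age", '"35"')]
graph_b = [("ex:age", "rdfs:drange", "xsd:string"), ("Judy", "ex:age", '"thirty"')]
merged = merge_without_clash(graph_a, graph_b, "ex:age", "ex:age1", "ex:age2")
```

In the merged graph, ex:age itself carries no rdfs:drange assertion, so each literal is constrained only by the convention of the graph it came from.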

6. Model theory

(We assume that the basic MT has tidy literal nodes and that I("lll") = lll for any literal under any interpretation I. We don't need to mention LV.)

Suppose I is an RDFS interpretation of a graph E. Then I is datatyped (with respect to a set D of datatypes) if the following is true for any datatype uriref ddd (with I(ddd) in D):

(1) IEXT(I(ddd)) = {<y,x> : y = L2V(I(ddd))(x) } ie the inverse of the datatype lexical-to-value map.

(2) ICEXT(I(ddd)) = {x : <x,y> in IEXT(I(ddd)) } ie the value space of the datatype.

(3) For any literal lll, if E contains

aaa rdfs:drange ddd .
bbb aaa "lll" .

In RDF/XML:

   <rdf:Description rdf:about="#aaa">
      <rdfs:drange rdf:resource="#ddd"/>
   </rdf:Description>

   <rdf:Description rdf:about="#bbb">
      <aaa>lll</aaa>
   </rdf:Description>

then L2V(I(ddd))(lll) is defined, ie lll is in the lexical space of I(ddd).

(4) For any literal lll, if E contains

aaa rdfs:drange ddd .
bbb aaa ccc .
ccc rdfs:dlex "lll" .

then I(ccc) = L2V(I(ddd))(lll) ie the 'dlex' is restricted to have the same meaning as the datatype property.

We can capture the content of the fourth condition by a special closure rule which inserts the appropriate datatyping triple, as in the first row of the following table of closure rules:

If the graph contains:	then add the triple:
aaa rdfs:drange ddd . bbb aaa ccc . ccc rdfs:dlex "lll" .	~~ccc ddd "lll" .~~
	ddd rdfs:domain ddd .

However, the meaning of the other semantic conditions cannot be fully captured by closures.
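The closure rules in the table can be sketched as a single pass over a set of triples. This is an illustration under stated assumptions: terms are plain strings, literals are written with their quotation marks, the set of recognised datatype urirefs is passed in explicitly, and the rdfs:domain triple is added only for datatypes that appear as the object of an rdfs:drange triple. Note also that the first rule's conclusion is struck out in the table, since rdfs:dlex may be dropped.

```python
def is_literal(term):
    """Assumed encoding: literals are strings that start with a quotation mark."""
    return term.startswith('"')


def close_under_drange(graph, datatypes):
    """Return graph extended with the triples licensed by the closure rules."""
    graph = set(graph)
    added = set()
    for aaa, p, ddd in graph:
        if p != "rdfs:drange" or ddd not in datatypes:
            continue
        # Second row: the datatype class is the domain of the datatype property.
        added.add((ddd, "rdfs:domain", ddd))
        for bbb, q, ccc in graph:
            if q != aaa or is_literal(ccc):
                continue
            for s, r, lll in graph:
                # First row: turn the lexical form triple into a datatype triple.
                if s == ccc and r == "rdfs:dlex" and is_literal(lll):
                    added.add((ccc, ddd, lll))
    return graph | added


example = {
    ("ex:age", "rdfs:drange", "xsd:integer"),
    ("Jimmy", "ex:age", "_:x"),
    ("_:x", "rdfs:dlex", '"35"'),
}
closed = close_under_drange(example, {"xsd:integer"})
```

The closed graph contains the three original triples plus the datatype triple on _:x and the rdfs:domain triple for xsd:integer.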