RDF Datatyping

Abstract

This document summarizes the proposals for RDF datatyping that are currently under consideration by the RDF Core Working Group (hereafter referred to as the WG).

Status of this Document

The document has no normative status and merely provides a reference for an ongoing discussion within the WG.

Contributors

This document includes contributions of almost all members of the WG, in particular those provided by

Many other WG members not listed above have helped to shape this document.

Table of Contents

(@@@ to be completed when the content and section numbers stabilize)

Scope

The RDF Core Working Group is not chartered to develop a separate data typing language that duplicates facilities provided by XML Schema data types (see [RDF Core WG Charter]).

Desiderata for RDF Datatyping

(@@@ in no particular order)

Deliverables of RDF Datatyping

Type System

The conceptual framework for datatyping presented in this document is based on the type system defined in the "XML Schema Part 2: Datatypes" [XSD]. This section explains how the relevant terms and concepts defined in [XSD] are expressed using the model-theoretic semantics for RDF defined in the "RDF Model Theory Working Draft" [RDF MT].

Datatype mapping

[XSD] defines a datatype as a 3-tuple, consisting of a) a set of distinct values, called its value space, b) a set of lexical representations, called its lexical space, and c) a set of facets that characterize properties of the value space, individual values or lexical terms. [XSD] implicitly assumes a fourth component, which we call datatype mapping, to be part of the datatype.

[Definition:]  A datatype mapping is a set of pairs whose first element belongs to the value space of the datatype, and the second element belongs to the lexical space of the datatype. A datatype mapping satisfies the following properties:

  1. Each element of the lexical space maps to exactly one element of the value space.
  2. Each element of the value space has at least one lexical representation.

(@@@ is the second condition necessary? Should we distinguish between partial and complete datatype mappings?)

Example
Datatype mapping for a datatype "boolean". Each element of the value space has two lexical representations.
Value space: {T, F}
Lexical space: {"0", "1", "true", "false"}
Datatype mapping: {<T, "true">, <T, "1">, <F, "0">, <F, "false">}
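
The two properties of a datatype mapping can be checked mechanically. The following Python sketch is illustrative only. It models the boolean example as a set of (value, lexical representation) pairs. The function name is_datatype_mapping and the strings "T" and "F" standing in for the two abstract values are assumptions made for this sketch and are not defined by [XSD].

```python
# Value space and lexical space of the example datatype "boolean".
# The strings "T" and "F" merely stand in for the two abstract values.
VALUE_SPACE = {"T", "F"}
LEXICAL_SPACE = {"0", "1", "true", "false"}

# Datatype mapping as a set of (value, lexical representation) pairs.
BOOLEAN_MAP = {("T", "true"), ("T", "1"), ("F", "0"), ("F", "false")}


def is_datatype_mapping(pairs, value_space, lexical_space):
    """Check the two properties required of a datatype mapping."""
    # Every pair must be drawn from value space x lexical space.
    if any(v not in value_space or lx not in lexical_space for v, lx in pairs):
        return False
    # 1. Each element of the lexical space maps to exactly one value.
    for token in lexical_space:
        if len({v for v, lx in pairs if lx == token}) != 1:
            return False
    # 2. Each element of the value space has at least one lexical representation.
    return all(any(v == value for v, _ in pairs) for value in value_space)
```

Adding the pair ("F", "1") to BOOLEAN_MAP violates the first property, because the lexical element "1" would then map to two values.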

Canonical datatype mapping

As specified in [XSD], a canonical lexical representation is a set of elements from the lexical space of a datatype such that there is a one-to-one mapping between elements in the canonical lexical representation and elements in the value space. This mapping is referred to as canonical datatype mapping.

[Definition:]   A canonical datatype mapping is a subset of a datatype mapping that establishes a one-to-one correspondence between elements in the canonical lexical representation and elements in the value space.

Example
A canonical datatype mapping for the datatype "boolean" of the previous example.
Canonical datatype mapping: {<T, "true">, <F, "false">}
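
As a rough illustration, the definition of a canonical datatype mapping can be expressed as a check over sets of pairs. The sketch below assumes the same (value, lexical representation) encoding as the boolean example; the name is_canonical_mapping is hypothetical.

```python
BOOLEAN_MAP = {("T", "true"), ("T", "1"), ("F", "0"), ("F", "false")}
BOOLEAN_CMAP = {("T", "true"), ("F", "false")}


def is_canonical_mapping(cmap, full_map):
    """True if cmap is a one-to-one subset of full_map that covers every value."""
    values = [v for v, _ in cmap]
    lexicals = [lx for _, lx in cmap]
    return (
        cmap <= full_map                             # subset of the datatype mapping
        and len(set(values)) == len(values)          # one canonical form per value
        and len(set(lexicals)) == len(lexicals)      # one value per canonical form
        and set(values) == {v for v, _ in full_map}  # every value is covered
    )
```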

Datatyping schemes

[Definition:]  A datatyping scheme is a convention for representing and using datatypes in RDF. A datatyping scheme describes how datatypes are represented in RDF graphs and interpreted using model-theoretic semantics.

[RDF MT] explains the fundamental model-theoretic concepts like interpretation, universe, extension etc. used for interpreting the semantics of RDF graphs. This document assumes familiarity with these basic concepts.

Facets

Specification and interpretation of datatype facets is out of scope of this document.

Datatyping scheme "S"

This section explains how datatyping is introduced in the so-called "S" scheme, one of the candidate suggestions that has been discussed by the WG.

In accordance with [RDF MT], the primary RDF syntax used in the "S" scheme is based on tidy graphs (a tidy graph is one in which no two nodes carry the same label). The interpretation of each literal is assumed to be fixed and determined by its content. (For example, the interpretation of literals could be defined as an identity mapping.)

Datatypes in a model-theoretic interpretation

A datatype mapping is considered to be a binary relational extension that exists in a model-theoretic interpretation. Both the value space of the datatype and its lexical space are subsets of the universe used in the interpretation.

Example
Datatype mapping of datatype boolean in a model-theoretic interpretation. The colored dots represent entities in the universe. The solid lines represent pairs in the datatype mapping.

(@@@ this document does not describe what the extensions of datatypes themselves are in a model-theoretic interpretation.)

Representation of datatype mappings

A datatype mapping can be "named" using RDF properties. In this document, such properties are referred to as datatype properties. To associate a datatype property with a certain datatype, the extension of the datatype property is defined to be the datatype mapping that belongs to the datatype.

Example
Extension of property xsd:boolean.map is defined explicitly as a datatype mapping. T and F are distinct elements in the universe; I is the interpretation function.
IEXT(I(xsd:boolean.map)) := {<T, I("true")>, <T, I("1")>, <F, I("0")>, <F, I("false")>}

Representation of value spaces and lexical spaces

Since value spaces and lexical spaces are subsets of the elements in the universe, they can be viewed as class extensions and can be referred to in RDF graphs by means of resources that identify classes.

Example
Class extensions of resources xsd:boolean.val and xsd:boolean.lex are defined explicitly.
CEXT(I(xsd:boolean.lex)) := {I("true"), I("1"), I("0"), I("false")}
CEXT(I(xsd:boolean.val)) := {T, F}
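
If both conditions of the datatype mapping definition hold, the two class extensions coincide with the projections of the property extension. The Python sketch below shows this for the boolean example. It assumes that literals are interpreted as themselves (the identity mapping) and uses the strings "T" and "F" as stand-ins for the two values; the function names are hypothetical.

```python
# Extension of xsd:boolean.map as (value, literal) pairs.
IEXT_BOOLEAN_MAP = {("T", "true"), ("T", "1"), ("F", "0"), ("F", "false")}


def value_space(mapping):
    """First components of the pairs: candidate extension of xsd:boolean.val."""
    return {v for v, _ in mapping}


def lexical_space(mapping):
    """Second components of the pairs: candidate extension of xsd:boolean.lex."""
    return {lx for _, lx in mapping}
```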

Representation of elements of value spaces and lexical spaces

In RDF graphs, literals can be used to refer to the elements of the lexical spaces of datatypes. Using a combination of datatype properties and literals, it is possible to refer to data values of datatypes.

Example
Datatype property xsd:boolean.map is used to constrain the interpretation of the blank nodes shown at the top of the figure. At the bottom of the figure, a possible valid interpretation is shown. Since the semantics of literals and that of property xsd:boolean.map are fixed, both blank nodes denote exactly one element of the universe (element T) in each valid interpretation of the graph. The literals true and 1 denote the corresponding elements of the lexical space of the datatype boolean.
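
The reasoning in this example can be sketched in Python as follows. The sketch assumes that literals are interpreted as themselves and that the extension of xsd:boolean.map is fixed as in the earlier example; the blank node labels _:a and _:b and the function name denotations are illustrative only.

```python
# Fixed extension of the datatype property, as (value, literal) pairs.
IEXT = {
    "xsd:boolean.map": {("T", "true"), ("T", "1"), ("F", "0"), ("F", "false")},
}

# Two blank nodes, each constrained by one datatype-property triple.
GRAPH = [
    ("_:a", "xsd:boolean.map", "true"),
    ("_:b", "xsd:boolean.map", "1"),
]


def denotations(graph, iext):
    """Map each blank-node subject to the set of values it may denote."""
    result = {}
    for subject, prop, literal in graph:
        candidates = {v for v, lx in iext[prop] if lx == literal}
        # Several triples on one node narrow the candidates by intersection.
        result[subject] = result.get(subject, candidates) & candidates
    return result
```

For the graph above, both blank nodes are left with the single candidate "T". A node constrained by both "true" and "0" would be left with no candidate, that is, the graph would have no valid interpretation.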

The elements of value spaces can be "named" explicitly using URI references. For example, a resource with URI reference like xsd:boolean.value.T could be used to denote the element T of the value space of xsd:boolean. This document does not suggest any such explicit identifiers for the elements of value spaces.

XML Schema datatypes

This section explains how built-in atomic XML Schema datatypes can be used in the datatyping scheme "S". Non-atomic types (like IDREFS) are out of scope of this document.

[XSD] specifies a unique URI reference for each built-in datatype that it defines. For example, the URI

http://www.w3.org/2001/XMLSchema#int

is used to address the datatype int. [XSD] suggests that the components of the datatypes (specifically, datatype facets) be addressed using URIs constructed by appending "." and the name of the component to the URI of the datatype.

This document proposes that the identifiers for the lexical spaces, value spaces, datatype mappings and canonical datatype mappings be constructed following the above principle. For example:

This document proposes a fixed interpretation of resources with URI references like the ones listed above that correspond to the following XSD datatypes:

Built-in primitive types
duration, dateTime, time, date, gYearMonth, gYear, gMonthDay, gDay, gMonth, boolean, base64Binary, hexBinary, float, double, anyURI, QName, NOTATION, string, decimal
Built-in derived types
integer, nonPositiveInteger, nonNegativeInteger, negativeInteger, positiveInteger, int, long, short, byte, unsignedLong, unsignedInt, unsignedByte, language, Name, NCName, NMTOKEN, ID, IDREF, ENTITY

(@@@ do we need all those XML-specific types like IDREF?)

In this document, the shortcut "xsd:" is used to abbreviate the namespace http://www.w3.org/2001/XMLSchema#.
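
The naming principle proposed in this section can be sketched as a small Python helper. The component suffixes lex, val, map and cmap are the ones used in the examples of this document; the function names component_uri and expand are hypothetical.

```python
XSD_NS = "http://www.w3.org/2001/XMLSchema#"

# Component suffixes as used in the examples of this document.
COMPONENTS = {
    "lex": "lexical space",
    "val": "value space",
    "map": "datatype mapping",
    "cmap": "canonical datatype mapping",
}


def component_uri(datatype, component):
    """Build a component URI by appending "." and the component name."""
    if component not in COMPONENTS:
        raise ValueError(f"unknown component: {component!r}")
    return f"{XSD_NS}{datatype}.{component}"


def expand(qname):
    """Expand the "xsd:" shortcut into a full URI reference."""
    prefix, _, local = qname.partition(":")
    if prefix != "xsd":
        raise ValueError(f"unknown prefix: {prefix!r}")
    return XSD_NS + local
```

Under these assumptions, component_uri("int", "map") and expand("xsd:int.map") yield the same URI reference.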

Definition of datatypes

[XSD] discusses several ways of defining datatypes (e.g., axiomatically, by enumeration, by restriction). This document only considers axiomatic definitions of datatypes. User-defined datatypes and dedicated vocabularies for datatype definition are out of scope of this document.

As illustrated in [XSD] (Sec. 3), the datatype mappings of the derived types can be arranged in a hierarchy. For example, type int is derived (by restriction) from long, which is derived from integer, which is derived from decimal, which is a primitive type.

[Definition:]   Datatype B is derived by restriction from datatype A, if and only if the datatype mapping of B is contained as a subset in the datatype mapping of A.

The built-in derived types of [XSD] satisfy the above definition. A new derived type can be obtained by restricting the lexical space of a datatype. In this case, the datatype mapping of the new type is a range-restricted datatype mapping of the source type.
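
Derivation by restriction can be illustrated on a finite sample of a datatype mapping. The Python sketch below is a simplification: the sample pairs, the regular expression standing in for the lexical space of integer, and the function names are assumptions made for illustration, not the complete definitions from [XSD].

```python
import re
from decimal import Decimal

# A finite sample of the (infinite) datatype mapping of decimal,
# as (value, lexical representation) pairs.
DECIMAL_MAP_SAMPLE = {
    (Decimal("1"), "1"),
    (Decimal("1"), "1.0"),
    (Decimal("1"), "+1"),
    (Decimal("2.5"), "2.5"),
    (Decimal("-3"), "-3"),
}


def restrict_lexical_space(mapping, allowed):
    """Keep only the pairs whose lexical representation satisfies `allowed`."""
    return {(v, lx) for v, lx in mapping if allowed(lx)}


def is_derived_by_restriction(map_b, map_a):
    """B is derived by restriction from A iff map(B) is a subset of map(A)."""
    return map_b <= map_a


# Simplified stand-in for the lexical space of integer: optional sign, digits.
INTEGER_LEXICAL = re.compile(r"[+-]?[0-9]+")
INTEGER_MAP_SAMPLE = restrict_lexical_space(
    DECIMAL_MAP_SAMPLE, lambda lx: INTEGER_LEXICAL.fullmatch(lx) is not None
)
```

Restricting the lexical space removes the pairs for "1.0" and "2.5", and the remaining pairs form a subset of the source mapping, as the definition requires.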

Relating datatypes and their components

Schema languages like [RDF Schema] can be used to relate different datatype components, e.g. datatype mappings and lexical spaces, to each other. Explicit relationships between datatype components help reduce the amount of built-in semantics that needs to be hard-coded into applications.

Example
Datatype mappings can be related to the corresponding value and lexical spaces using [RDF Schema] properties rdfs:domain and rdfs:range.
xsd:decimal.map rdfs:domain xsd:decimal.val
xsd:decimal.map rdfs:range  xsd:decimal.lex

Notice that according to [RDF MT], the properties rdfs:domain and rdfs:range only require the domain and the range of a property extension to be subsets of the corresponding class extensions. In other words, rdfs:domain and rdfs:range alone are not sufficient to define precisely the interpretations of xsd:decimal.lex and xsd:decimal.val given an axiomatic definition of xsd:decimal.map.

Example
Derived datatypes can be indicated as such explicitly using the property rdfs:subPropertyOf. The same approach can be used to specify that canonical datatype mappings are contained within their respective datatype mappings.
xsd:integer.map rdfs:subPropertyOf xsd:decimal.map
xsd:long.map    rdfs:subPropertyOf xsd:integer.map
xsd:int.map     rdfs:subPropertyOf xsd:long.map
xsd:int.cmap    rdfs:subPropertyOf xsd:int.map

Example
Inclusion hierarchies of value spaces and lexical spaces can be specified explicitly using the property rdfs:subClassOf.
xsd:integer.val rdfs:subClassOf xsd:decimal.val
xsd:long.val    rdfs:subClassOf xsd:integer.val
xsd:int.val     rdfs:subClassOf xsd:long.val

xsd:decimal.lex rdfs:subClassOf rdfs:Literal
xsd:integer.lex rdfs:subClassOf xsd:decimal.lex
xsd:long.lex    rdfs:subClassOf xsd:integer.lex
xsd:int.lex     rdfs:subClassOf xsd:long.lex
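
Because rdfs:subPropertyOf and rdfs:subClassOf are transitive, an application can derive the indirect relationships from the explicit statements instead of hard-coding them. The following Python sketch computes them for the rdfs:subPropertyOf statements listed earlier; the function name ancestors is hypothetical, and the same function applies unchanged to (subclass, superclass) pairs.

```python
# rdfs:subPropertyOf statements as (sub, super) pairs.
SUB_PROPERTY_OF = {
    ("xsd:integer.map", "xsd:decimal.map"),
    ("xsd:long.map", "xsd:integer.map"),
    ("xsd:int.map", "xsd:long.map"),
    ("xsd:int.cmap", "xsd:int.map"),
}


def ancestors(term, statements):
    """Return every term reachable from `term` by following (sub, super) pairs."""
    found = set()
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for sub, sup in statements:
            if sub == current and sup not in found:
                found.add(sup)
                frontier.append(sup)
    return found
```

For instance, ancestors("xsd:int.cmap", SUB_PROPERTY_OF) contains xsd:decimal.map, so every pair in the canonical mapping of int is also a pair in the datatype mapping of decimal.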

Discussion of selected datatypes

xsd:string

The datatype mapping of the datatype string is interpreted as an identity mapping.

Example
The datatype property xsd:string.map can be used to refer to literals as if they were in the subject position. The blank node in the left graph denotes the same entity as the literal "P1Y" in each valid interpretation. (@@@ this is another argument for allowing literals as subjects...)

xsd:base64Binary and xsd:hexBinary

The value space of xsd:base64Binary is identical to the value space of xsd:hexBinary: the set of finite-length sequences of binary octets. The value space of xsd:string, in contrast, is the set of finite-length sequences of characters. That is, xsd:base64Binary.val is disjoint from xsd:string.val.

Example
Datatype property xsd:base64Binary.map (or xsd:hexBinary.map) can be used to refer to arbitrary binary data in RDF graphs. Apparently, there is no mechanism in [XSD] that bridges the gap between sequences of octets and sequences of characters. In the figure, property octetsToChars fulfils this role. If xsd:string.map is used instead of octetsToChars, the graph has no valid interpretation.
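
The shared value space of the two binary datatypes can be illustrated in Python, where sequences of octets (bytes) and sequences of characters (str) are likewise distinct types. The sketch is an approximation: bytes.fromhex is more lenient than the lexical space of xsd:hexBinary (it ignores whitespace), and the octets_to_chars function with its explicit encoding parameter is only one hypothetical reading of the octetsToChars property, which this document does not define.

```python
import base64

# Two lexical representations of the same three octets.
HEX_LITERAL = "466F6F"
BASE64_LITERAL = "Rm9v"


def hex_binary_value(literal):
    """Map an xsd:hexBinary lexical form to a sequence of octets."""
    return bytes.fromhex(literal)


def base64_binary_value(literal):
    """Map an xsd:base64Binary lexical form to a sequence of octets."""
    return base64.b64decode(literal, validate=True)


def octets_to_chars(octets, encoding="utf-8"):
    """Hypothetical octetsToChars bridge: needs an explicit character encoding."""
    return octets.decode(encoding)
```

Both literals map to the same octet sequence, which is not equal to any character string until a character encoding is applied.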

Modeling styles

The "S" scheme supports two distinct ways of using typed values in RDF graphs, or two different "idioms". This document does not prescribe which of the idioms should be used in RDF applications. The following subsections illustrate and compare these two idioms.

Idiom A ("advanced")

In Idiom A, the elements of value spaces of datatypes are used for representing typed data elements.

Example
The rdfs:range of property exA:birthdate is defined as the value space of the datatype date. Jenny's birthdate is July 15, 2001.

Idiom B ("backward compatible")

In Idiom B, the elements of lexical spaces of datatypes are used for representing typed data elements.

Example
The rdfs:range of property exB:birthdate is defined as the lexical space of the datatype date. Jenny's birthdate is July 15, 2001.

Frequently, interpretations of literals belong to lexical spaces of several datatypes. For example, the interpretation I("10") of literal "10" is both an element of the lexical space of xsd:string and an element of the lexical space of xsd:integer.
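
The overlap of lexical spaces can be made concrete with simple membership tests. The predicates in the Python sketch below are simplified assumptions rather than the full [XSD] definitions (for instance, every literal is accepted as a string), and the function name lexical_spaces_of is hypothetical.

```python
import re

# Simplified membership tests for three lexical spaces.
LEXICAL_SPACES = {
    "xsd:string.lex": lambda literal: True,
    "xsd:integer.lex": lambda literal: re.fullmatch(r"[+-]?[0-9]+", literal) is not None,
    "xsd:boolean.lex": lambda literal: literal in {"0", "1", "true", "false"},
}


def lexical_spaces_of(literal):
    """Return the names of all lexical spaces that contain the literal."""
    return {name for name, contains in LEXICAL_SPACES.items() if contains(literal)}
```

Under these assumptions the literal "10" belongs to two of the lexical spaces, and the literal "1" to all three, so the literal alone does not determine a datatype.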

In Idiom B, schema information (e.g., specified using [RDF Schema]) provides a hint for the validation and usage of the literals. Use of Idiom B is akin to type handling in programming languages like Perl. In this perspective, literals correspond to scalars, which are typecast depending on the input/output type of operations (see [PL] for a detailed discussion).

Idiom A versus Idiom B

Many existing RDF applications use Idiom B. The major advantages of Idiom B are backward compatibility and compactness. The fact that Idiom B utilizes the elements of lexical spaces rather than the elements of value spaces is unimportant for most applications.

Use of Idiom A is advantageous for evolving applications, especially those that need to interoperate with other applications. Idiom A supports multiple lexical representations for a given data value in an RDF graph. This feature facilitates migration and parallel use of alternative lexical representations (e.g., encoding "2001-07-15" can supersede "July 15, 2001" without breaking the existing applications).

Another feature of Idiom A is enforcement of local typing, i.e., the typing information always accompanies the data instances in RDF graphs. This feature makes data instances more robust with respect to incompatible changes in schemas.
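
The parallel use of lexical representations in Idiom A can be sketched in Python. The sketch is illustrative only: the property ex:longDate.map and its mapping function are hypothetical, the xsd:date mapping is simplified to the plain YYYY-MM-DD form, and parsing of month names with "%B" assumes an English (default) locale.

```python
from datetime import date, datetime


def xsd_date_value(literal):
    """Simplified xsd:date mapping: accepts only the plain YYYY-MM-DD form."""
    return date.fromisoformat(literal)


def long_date_value(literal):
    """Hypothetical legacy mapping for forms such as "July 15, 2001"."""
    return datetime.strptime(literal, "%B %d, %Y").date()


# Idiom A: the birthdate node denotes a value; each triple names a mapping.
IDIOM_A = [
    ("_:d", "xsd:date.map", "2001-07-15"),
    ("_:d", "ex:longDate.map", "July 15, 2001"),
]
MAPPINGS = {"xsd:date.map": xsd_date_value, "ex:longDate.map": long_date_value}


def value_of(node, graph):
    """Return the single value denoted by `node`, or raise if the triples disagree."""
    values = {MAPPINGS[p](lit) for s, p, lit in graph if s == node}
    if len(values) != 1:
        raise ValueError(f"{node} has no consistent denotation: {values}")
    return values.pop()
```

Because each triple carries its own mapping, an application that understands only one of the two properties can still recover the value, which is the local typing feature described above.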

Open issues

@@@ Unit System (likely out of scope)

Measures like mass, duration, or monetary value are used in a variety of applications. Measures are quite similar to datatypes. In fact, [XSD] defines durations as datatypes, i.e., a lexical token like "P1Y" is mapped to a duration of one year using a datatype mapping. Many measures like mass, volume or monetary values are expressed using quantities of units like kilograms, gallons, or US dollars.

This section defines a unit system and illustrates its use in the "S" scheme. Just as lexical tokens are mapped to typed values using datatype mappings, numbers are mapped to values of measure using unit mappings.

[Definition:]   A unit type is a 3-tuple, consisting of a) a set of distinct values, called its measure space, b) a set of numbers, called its numeric space, and c) a one-to-one mapping between the numeric space and the measure space called its unit mapping.

(@@@ what organizations standardize measures? NIST, ISO, DIN? Should we define vocabulary for the Metric system? This could be a proper contribution of this document ;-)

Example
The interpretation shown in the figure describes two one-year-old toddlers, Robby and Jenny. Robby weighs 14 kg, whereas Jenny weighs only 10 kg. The interpretation contains 13 elements in the domain of discourse (2 toddlers, 2 masses, 1 duration, 3 real numbers, and 5 strings), which are labeled using identifiers in bold. The wide arrows represent extensions of relationships. The interpretation contains three unit mappings: inKg, inYears, and inMonth, and three datatype mappings: xsd:duration.map, xsd:decimal.map, and inOctal.

(@@@ what needs to be explained? What Ntriples, RDF/XML etc. examples need to be provided? Should the different idioms be illustrated using the above example?)

Graphs

N-Triples

_:Jenny age              _:1
_:1     xsd:duration.map "P1Y"

_:Jenny weight           _:2
_:2     inKg             _:3
_:3     inOctal          "14"

_:Robby age              _:4
_:4     inYears          _:5
_:5     xsd:decimal.map  "1"

_:Robby weight           _:6
_:6     inKg             _:7
_:7     xsd:decimal.map  "14"

RDF/XML

(@@@ is the encoding below correct, Dave?)
<rdf:Description>
  <age xsd:duration.map="P1Y"/>
  <weight rdf:parseType="Resource">
    <inKg inOctal="14"/>
  </weight>
</rdf:Description>

<rdf:Description>
  <age rdf:parseType="Resource">
    <inYears xsd:decimal.map="1"/>
  </age>
  <weight rdf:parseType="Resource">
    <inKg xsd:decimal.map="14"/>
  </weight>
</rdf:Description>

Datatyping scheme "P/P++"

(@@@ incomplete, removed for now)

Datatyping scheme "U"

(@@@ incomplete, removed for now)

References

[CWM]
Object Management Group. Common Warehouse Metamodel 1.0. Feb 2001. Available at: ftp://ftp.omg.org/pub/docs/ad/01-02-01.pdf
[UML]
Object Management Group. Unified Modeling Language 1.4. Sep 2001. Available at: ftp://ftp.omg.org/pub/docs/formal/01-09-67.pdf
[PL]
Dan Connolly. PL: how a PERL programmer might do datatypes in RDF. Available at: http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2001Dec/0003.html
[RDF Core WG Charter]
W3C RDF Core Working Group Charter. Mar 2001. Available at: http://www.w3.org/2001/sw/RDFCoreWGCharter
[RDF MT]
W3C RDF Model Theory Working Draft. Sep 2001. Available at: http://www.w3.org/TR/2001/WD-rdf-mt-20010925/
[RDF Schema]
W3C RDF Schema Recommendation. ? 200?. Available at: http://www.w3.org/?
[XSD]
World Wide Web Consortium. XML Schema Part 2: Datatypes. Available at: http://www.w3.org/TR/xmlschema-2/

Last modified: Mon Dec 10 12:35:33 PST 2001