The "Typed Data Literal" (TDL) Datatyping Scheme

Abstract

This document describes the "Typed Data Literal" (TDL) datatyping scheme (also known as "PD") which is one of the candidate proposals being discussed by the RDF Core Working Group.

Status of this Document

The document has no normative status and merely provides a reference for an ongoing discussion within the working group.

Authors

Jeremy Carroll
Patrick Stickler

Many working group members and members of the RDF Interest Group have helped to shape this document.

1 Introduction

2 Definition

2.1 Overview

2.2 An Introduction to the Model Theory for TDL

3 Representation of Typed Data Literals in RDF

3.1 The rdf:value+rdf:type Local Idiom

3.2 The rdfs:range Global Idiom

3.3 Compatibility Between Idioms

4 Satisfaction of Desiderata

Appendix A: The Model Theory for TDL

Appendix B: Union Datatypes

References

1 Introduction

This document describes the "Typed Data Literal" (TDL) datatyping scheme, which is one of several proposals under consideration by the RDF Core Working Group (hereafter referred to simply as WG) for achieving a total solution for datatyping based on the foundational RDF Datatyping Model [RDF DT] which is itself defined in terms of the RDF Model Theory [RDF MT].

The TDL scheme, also known as "PDU" or "PD", is a fusion of the idioms from two earlier schemes, "P" and "D" (or "DAML"), along with the conceptual model from "U" (omitting the URV-based idiom). When type information is omitted, the Model Theory for TDL captures the ambiguous typing of the Perl programming idiom [PL].

C.f.

The formal treatment of TDL is presented as a modification to the RDF Model Theory [RDF MT]. Datatyping is achieved during interpretation. Each occurrence of a literal Unicode string may have its own node in the graph and is interpreted according to the map(s) of the datatype(s) that TDL associates with that node. The graph may be ill-formed because of datatyping problems (e.g. "three" is not an integer).

The informal intent of TDL is to capture the normal programming paradigm in which the input syntax uses the lexical space of a datatype and the "meaning" lies in the value space of the datatype. However, for technical reasons (mainly that the typing in RDF MT is part of the model rather than the interpretation), the interpretation of each Unicode string node in the graph is given as a lexical-value pair within the Universe, which most of the time is treated as being the value component. For Unicode string nodes with no datatype information, or whose datatype is not supported, the lexical component of the pair is more significant.

As always, the intent of the Model Theory is to capture concepts such as entailment and consistency, not to indicate an approach to implementation. In particular, the existence of lexical-value pairs within the Universe of Interpretation is not intended to indicate a deep metaphysical belief in such things!

2 Definition

2.1 Overview

As defined in section 2 of [RDF DT], for any given member of a lexical space there exists a mapping to one and only one member of the value space, referred to as the datatype mapping. Likewise, for any given member of a canonical lexical space there exists a mapping to one and only one member of the value space, referred to as the canonical mapping. Because the unique and unambiguous identities of the lexical, canonical, and value spaces are inherent in the identity of the datatype itself, by the very definition of a datatype, we may uniquely and unambiguously denote a specific datatype mapping or canonical mapping, and hence a specific value, simply by pairing a lexical form (a member of the lexical space) with the identity of the datatype (which in the case of RDF is a URI Reference).

[Definition:] The pairing of a lexical form to a datatype identity is called a typed data literal (TDL).

If the lexical form is a member of a canonical lexical space, the TDL denotes both a datatype mapping and a canonical mapping. For the purpose of mapping a lexical form to a value, however, the canonical mapping is redundant, since the existence of a given canonical mapping implies the existence of a datatype mapping having the same lexical form and value.

Example

A TDL uniquely denotes a member of the value space of the datatype because there is a one-to-one correspondence between TDL pairings and datatype mappings:
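The correspondence can be sketched in a few lines of Python. This is an illustrative sketch only and not part of the proposal: the names `parse_integer`, `DATATYPE_MAPPINGS`, and `denote` are assumptions introduced here, and `parse_integer` is a toy approximation of the xsd:integer datatype mapping.

```python
from typing import Callable, Dict, Optional

XSD = "http://www.w3.org/2001/XMLSchema#"


def parse_integer(lexical: str) -> Optional[int]:
    """Toy stand-in for the xsd:integer datatype mapping.

    Returns None when the string is not in the (approximated) lexical space.
    """
    body = lexical[1:] if lexical[:1] in ("+", "-") else lexical
    if body.isascii() and body.isdigit():
        return int(lexical)
    return None


# Datatype identity (a URI Reference) -> datatype mapping (hypothetical registry).
DATATYPE_MAPPINGS: Dict[str, Callable[[str], Optional[int]]] = {
    XSD + "integer": parse_integer,
}


def denote(lexical: str, datatype_uri: str) -> int:
    """Return the value denoted by the TDL pairing (lexical, datatype_uri)."""
    value = DATATYPE_MAPPINGS[datatype_uri](lexical)
    if value is None:
        raise ValueError(f"{lexical!r} is not in the lexical space of {datatype_uri}")
    return value


# A TDL is just the pair of a lexical form and a datatype identity.
tdl = ("30", XSD + "integer")
value = denote(*tdl)  # the integer thirty
```

Because the datatype identity selects exactly one mapping and the mapping assigns exactly one value to each lexical form, the pair alone is enough to pick out the value.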

2.2 An Introduction to the Model Theory for TDL

TDL is formalized as changes to the existing RDF Model Theory.
This section gives a lightweight overview; the interested reader should read Appendix A for the full detail. XML Schema Union datatypes are omitted from this section; see Appendix B for how they are addressed.
Datatypes are viewed as in Patel-Schneider's work [SWOL]. That is, each datatype has four components: a URI, a lexical space, a value space, and a mapping.
An RDF interpretation is with respect to some set of datatypes, which corresponds to the supported datatypes in an RDF implementation. An implementation is free not to support datatyping, in which case the set of datatypes is empty.

Terminology

We modify the terminology of the Model Theory to differentiate between literals before datatyping and literals after datatyping. The modification is:

We use the term "Unicode node" to refer to a node in the graph labelled with a Unicode string.
We use the term "literal-value pair" to refer to a pair consisting of a Unicode string and a 'typed value'. The interesting literal-value pairs are the ones that belong to the mapping of some datatype.
We do not use terminology such as "literal node" or "literal value".
We refer to the set of datatypes used in an RDF interpretation as the "supported datatypes".

The Interpretation of Datatype Classes

In RDF, classes can be thought of (informally) as corresponding to sets. In this case rdf:type can be thought of as corresponding to set membership. The model theory of various proposed datatyping mechanisms can be contrasted as to which set a datatype then corresponds to.

In this proposal a datatype class corresponds to its map, a set of pairs of lexical strings and their corresponding values.

The Interpretation of Unicode Nodes

An interpretation maps each Unicode node to some literal-value pair. The Unicode string component is given by the label on the node. The type information is checked by requiring this pair to be a member of each class associated with this node (e.g. by a range constraint). As above, class membership of datatype classes refers to the map of the datatype. Note that for technical reasons the 'typed value' of the interpretation of untyped Unicode nodes is unrestricted, i.e. there is no default type.

The Interpretation of rdf:value

Following Graham Klyne's suggestion, rdf:value is simply equality.

The Interpretation of Asserted Triples

These changes to the model theory can be seen as changes in the interpretation of triples.
Triples with predicate rdf:value or rdf:type are treated specially: rdf:value is interpreted as equality, and rdf:type takes account of the supported datatypes, treating each essentially as the map of the datatype (i.e. <s, rdf:type, d> holds iff I(s) is a literal-value pair in the map of d).
For other triples the model theory is unchanged, although in the Universe of interpretation the old literal values are now represented as literal-value pairs, and hence the representation of triples with literal objects is slightly different.

Multiple types

A literal-value pair may belong to multiple types, in which case a legal RDF graph may show multiple type information for that literal-value pair, using the local idiom, the global idiom, or both. Sometimes the intersection of multiple types may be surprisingly small but not empty. For example, a binary integer type and a non-negative decimal integer type may have intersection { ("0",0), ("1",1) }; either of these two literal-value pairs would be legal, but a Unicode string "10" cannot be interpreted in the presence of such conflicting type information, despite being in both lexical spaces and despite the two value spaces being the same. (Contrast with S-B, which permits "10" in such a case.)
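The binary/decimal case can be checked mechanically. The following Python sketch is illustrative only: it models each datatype map as a finite set of (lexical, value) pairs, and the function names and the cut-off of 16 values are assumptions made here to keep the sets small.

```python
def binary_map(limit):
    """Finite slice of a binary integer datatype map, e.g. ("10", 2)."""
    return {(format(n, "b"), n) for n in range(limit)}


def decimal_map(limit):
    """Finite slice of a decimal integer datatype map (zero included), e.g. ("10", 10)."""
    return {(str(n), n) for n in range(limit)}


def typed_values(lexical, maps):
    """Values x such that the pair (lexical, x) belongs to every map in `maps`."""
    candidate_sets = [{x for (lex, x) in m if lex == lexical} for m in maps]
    return set.intersection(*candidate_sets)


maps = [binary_map(16), decimal_map(16)]

# Pairs that belong to both datatype classes.
shared_pairs = maps[0] & maps[1]

# "1" has an interpretation under both types; "10" has none, although the
# string is in both lexical spaces.
one = typed_values("1", maps)
ten = typed_values("10", maps)
```

The string "10" pairs with 2 under the binary map and with 10 under the decimal map, so no single literal-value pair satisfies both type constraints.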

Unsupported Datatypes

An RDF implementation only knows some datatypes, and in particular may not be aware of a datatype used in a particular RDF document. The model theory reflects this by having an interpretation with respect to some set of datatypes (the supported datatypes). In practice, documents with an unsupported datatype constrain the datatype (in that the lexical occurrences in the document must be in the lexical space of the datatype), whereas supported datatypes constrain the document (in that the document may be ill-formed if its Unicode nodes are labelled with strings that are not in the domain of the relevant datatypes). The model theory is monotone with respect to the set of supported datatypes: implementations supporting fewer datatypes will make correct inferences, but not all inferences (e.g. they will not infer a contradiction when datatyping is invalid).
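The effect of the supported set can be sketched as follows. This Python fragment is an assumption-laden illustration, not an implementation of the model theory: `detectable_errors` and the toy `parse_integer` mapping are names invented here, and datatypes are identified by qualified names for brevity.

```python
def parse_integer(lexical):
    """Toy stand-in for the xsd:integer mapping; None means 'not in the lexical space'."""
    body = lexical[1:] if lexical[:1] in ("+", "-") else lexical
    return int(lexical) if body.isascii() and body.isdigit() else None


def detectable_errors(typed_nodes, supported):
    """Return the (lexical, datatype) pairs that are ill-typed as far as
    an implementation with the given supported datatypes can tell."""
    return [
        (lexical, datatype)
        for lexical, datatype in typed_nodes
        if datatype in supported and supported[datatype](lexical) is None
    ]


nodes = [
    ("30", "xsd:integer"),
    ("three", "xsd:integer"),
    ("2002-01-15", "xsd:date"),  # never supported in this sketch
]

# No supported datatypes: nothing can be flagged.
none_supported = detectable_errors(nodes, {})

# Supporting xsd:integer exposes the ill-typed "three".
integer_supported = detectable_errors(nodes, {"xsd:integer": parse_integer})
```

Enlarging the supported set can only add detected problems, never remove them, which is the monotonicity described above.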

3 Representation of Typed Data Literals in RDF

A TDL may be defined in several ways in RDF, according to the particular idiom used. This proposal outlines two such idioms for defining TDL pairings, one for global (implicit) definitions and one for local (explicit) definitions. Each idiom is defined separately below.

Note: For the sake of brevity and clarity, qualified names are used in the examples provided in this section where normally URI References are required. The following namespace declarations are assumed in the examples:

   xmlns:rdf  ="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:rdfs ="http://www.w3.org/2000/01/rdf-schema#"
   xmlns:xsd  ="http://www.w3.org/2001/XMLSchema#"
   xmlns:ex   ="uuid:f82dad84-0a58-11d6-9542-0003931df47c/"

3.1 The rdf:value+rdf:type Local Idiom

The rdf:value+rdf:type idiom provides a means to explicitly associate a datatype with a literal value by the use of an anonymous node for which the properties rdf:value and rdf:type are defined. The property rdf:value takes the literal (lexical form) as its object and the property rdf:type takes the URI Reference of the datatype.

Example

Per the statements below, the lexical form "30" is explicitly declared to be a member of the lexical space of the datatype 'xsd:integer':

Model Theoretic Interpretation of Local Idiom

The interpretation of the blank node which is the subject of the rdf:value is constrained to be the same as the interpretation of the object of the rdf:value, the Unicode node. This is because rdf:value is interpreted as the identity. Moreover, this literal-value pair is required to be a mapping in the datatype by the interpretation of the rdf:type edge.

In the example, the "30" is interpreted as < "30", x > for some x. This same pair is also the interpretation of the blank node, which by the rdf:type constraint lies in the mapping of xsd:integer. Hence x is the integer 30.
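The reasoning for the local idiom can be traced with a small Python sketch. It is a simplified illustration under stated assumptions: the graph is a list of tuples, the subject `ex:Jenny`, the property `ex:age`, and the blank node label `_:a` are hypothetical, and `parse_integer` is a toy stand-in for the xsd:integer mapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    """A Unicode node, identified here only by its label."""
    lexical: str


def parse_integer(lexical):
    """Toy xsd:integer mapping; None means 'not in the lexical space'."""
    body = lexical[1:] if lexical[:1] in ("+", "-") else lexical
    return int(lexical) if body.isascii() and body.isdigit() else None


SUPPORTED = {"xsd:integer": parse_integer}

graph = [
    ("ex:Jenny", "ex:age", "_:a"),        # hypothetical statement
    ("_:a", "rdf:value", Literal("30")),
    ("_:a", "rdf:type", "xsd:integer"),
]


def interpret_local(graph, supported):
    """Map each rdf:value subject to the pair (lexical, x) forced by its
    supported rdf:type edges; x is None when nothing constrains it."""
    result = {}
    for node, pred, obj in graph:
        if pred != "rdf:value" or not isinstance(obj, Literal):
            continue
        x = None
        for s, p, datatype in graph:
            if s == node and p == "rdf:type" and datatype in supported:
                typed = supported[datatype](obj.lexical)
                if typed is None or (x is not None and typed != x):
                    raise ValueError(f"{obj.lexical!r} is ill-typed for {datatype}")
                x = typed
        # rdf:value is equality, so the blank node and the Unicode node
        # share this one literal-value pair.
        result[node] = (obj.lexical, x)
    return result


pairs = interpret_local(graph, SUPPORTED)
```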

3.2 The rdfs:range Global Idiom

The rdfs:range idiom utilizes the RDF Schema [RDF Schema] rdfs:range property to define an implicit intersection of one or more lexical data types, which may be used to imply or constrain the datatype(s) of a literal.

Example

Per the following RDF statements, the lexical form "30" is implied (or required) to be a member of the lexical space of the datatype 'xsd:integer':

Whether the rdfs:range statement constitutes a constraint on the allowed datatypes depends on whether there exists any local (explicit) type assignment. If there is no local typing for the literal value whatsoever, then rdfs:range can only serve as a global (implicit) type assignment. However, if the literal has one or more types defined locally, and any locally specified datatype is not compatible with all datatypes globally implied by rdfs:range for the property, one can treat such a case as a contradiction of a constraint on the expected or required datatype(s) for the property in question.

Model Theoretic Interpretation of Global Idiom

The normal rdfs:range mechanism constrains the type of the object of the relevant property. When the object is a Unicode node and the type is a supported datatype, this is the global datatyping mechanism.

In the example, the Unicode node will have interpretation < "30", x > for some x. Without the schema information, in RDF, any x is permitted. The rdfs:range is only relevant in RDFS. In RDFS, the range constraint applies, and all valid interpretations will have < "30", x > in the class extension of xsd:integer. This class extension is systematically understood as referring to the map, rather than the lexical or the value space, and thus x must be 30.
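The difference between the RDF and RDFS readings can be sketched in Python. As before, this is only an illustration under assumptions: the property `ex:age`, the subject `ex:Jenny`, the `rdfs` flag, and the toy `parse_integer` mapping are all introduced here and do not come from the proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    """A Unicode node, identified here only by its label."""
    lexical: str


def parse_integer(lexical):
    """Toy xsd:integer mapping; None means 'not in the lexical space'."""
    body = lexical[1:] if lexical[:1] in ("+", "-") else lexical
    return int(lexical) if body.isascii() and body.isdigit() else None


SUPPORTED = {"xsd:integer": parse_integer}

graph = [
    ("ex:age", "rdfs:range", "xsd:integer"),   # hypothetical schema statement
    ("ex:Jenny", "ex:age", Literal("30")),     # hypothetical instance statement
]


def interpret_global(graph, supported, rdfs=True):
    """Return the (lexical, x) pair for each Unicode node object.

    With rdfs=False the range statements are ignored, so x stays None
    (unconstrained). With rdfs=True every supported range datatype must
    contain the pair in its map.
    """
    ranges = {}
    if rdfs:
        for s, p, o in graph:
            if p == "rdfs:range":
                ranges.setdefault(s, []).append(o)
    result = []
    for s, p, o in graph:
        if not isinstance(o, Literal):
            continue
        x = None
        for datatype in ranges.get(p, []):
            if datatype not in supported:
                continue
            typed = supported[datatype](o.lexical)
            if typed is None or (x is not None and typed != x):
                raise ValueError(f"{o.lexical!r} violates the range {datatype}")
            x = typed
        result.append((o.lexical, x))
    return result


rdf_reading = interpret_global(graph, SUPPORTED, rdfs=False)
rdfs_reading = interpret_global(graph, SUPPORTED, rdfs=True)
```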

3.3 Compatibility Between Idioms

It is essential that both global (implicit) and local (explicit) idioms be able to coexist within the same knowledge base without undesired interactions. In fact, this is a prerequisite if a global idiom is to be used as a constraint on locally defined datatypes. The rdfs:range and rdf:value+rdf:type idioms are fully compatible and can cohabit the same knowledge base freely.

Example

Cohabitation of global and local idioms:

4 Satisfaction of Desiderata

The official desiderata for all proposed datatyping solutions are defined in [RDF Desiderada].

This section clarifies how each desideratum is satisfied by this proposal. The list of desiderata is taken verbatim from the aforementioned document. Clarifications are in italics.

The TDL proposal meets all of the defined desiderata.

Backward compatibility
- with existing RDF data
- with existing RDF code
- with existing RDF-based specifications like DAML+OIL or CC/PP
TDL is fully backwards compatible with all known systems and idioms insofar as it does not require modification to the present RDF graph model, does not require modification to the present XML serialization, adopts the idioms presently used by DAML+OIL, and (insofar as can be determined from the official materials) is compatible with the typing idioms employed by CC/PP.
The model theory explicitly covers the old case of supporting no datatypes, and behaves monotonically as new datatypes are added.
Inasmuch as existing practice allows user typing of untyped literals (as in the PL proposal [PL] and the Jena (v1.3) system), the model theory respects that practice: untyped literals can be understood as having any typed value.
Ability to use built-in primitive XML Schema datatypes

TDL allows the use of any descendant of the XML Schema type "anySimpleType", both the predefined types as well as all custom types. This does not mean that every application will support the interpretation or validation of values associated with those types, but that all values of such types can be denoted in RDF by a TDL pairing.
Ability to use non-XML-Schema datatypes

TDL allows the use of any lexical datatype, conforming to the definition given here and in reference documents to that end, and which has URI denotation. This does not mean that every application will support the interpretation or validation of values associated with those types, but that all values of such types can be denoted in RDF by a TDL pairing.
Ability to define datatypes using schema languages rather than relying on "built-in" data types.

This is considered to be addressed in #3 above as well as by the default interpretation of non-typed literals.
Ability to represent type information without an associated RDF schema

The TDL local/explicit idiom provides for the representation of TDL pairings, and thus the typing of literal values, without any need to reference an external schema to determine typing of literals.
Ability to reference type information in an associated RDF schema

The TDL global/implicit idiom provides for the representation of TDL pairings, and thus the typing of literal values, to be encoded in one or more external schemas to imply typing of literals and/or constraints on the typing of locally typed literals.
Co-existence of "global" and "local" typing mechanisms

The TDL idioms for global and local typing are fully compatible and may coexist freely in the same knowledge base without undesirable interaction.
Provide account of datatyping scheme semantics

The TDL proposal provides a full account of datatyping semantics.
Support for existing data typing idioms

This is considered to be addressed in #1 above.

Appendix A: The Model Theory for TDL

Datatypes are viewed as in Patel-Schneider's work [SWOL]. That is, each datatype d has four components:

u(d): the URI reference
L(d): the lexical space (a subset of the set of Unicode strings)
V(d): the value space
M(d): a subset of L(d) x V(d), such that there is at least one pair in M(d) for each string of L(d), and at least one pair in M(d) for each value in V(d).

Unlike previous work, the mapping is a relation rather than a function. This is specifically to accommodate XML Schema Union datatypes. A full discussion of these is found in the next appendix. For all other datatypes the mapping is a function. Each datatype is a resource and is found in the Universe of interpretation.
An RDF interpretation is with respect to some possibly empty set, DT, of datatypes. DT is a subset of IR, the set of resources.
We use a set IR of resources, a set U of Unicode strings, and a set VL of values. V(d) is a subset of VL for every d in DT. The Universe is IR union ( U x VL ).
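The conditions on the four components can be written down as simple checks. The Python sketch below is illustrative only: the `Datatype` class, the two helper functions, and both toy datatypes (including their `ex:` names) are assumptions introduced here, with finite sets standing in for L(d), V(d), and M(d).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Datatype:
    uri: str                  # u(d)
    lexical_space: frozenset  # L(d)
    value_space: frozenset    # V(d)
    mapping: frozenset        # M(d), a set of (lexical, value) pairs


def is_well_formed(d):
    """M(d) is a subset of L(d) x V(d) that covers every string and every value."""
    pairs_ok = all(lex in d.lexical_space and val in d.value_space
                   for lex, val in d.mapping)
    covers_lexical = d.lexical_space <= {lex for lex, _ in d.mapping}
    covers_values = d.value_space <= {val for _, val in d.mapping}
    return pairs_ok and covers_lexical and covers_values


def is_function(d):
    """True when no lexical form is related to more than one value."""
    return len({lex for lex, _ in d.mapping}) == len(d.mapping)


# Hypothetical ordinary datatype: the mapping is a function.
bit = Datatype(
    uri="ex:bit",
    lexical_space=frozenset({"0", "1"}),
    value_space=frozenset({0, 1}),
    mapping=frozenset({("0", 0), ("1", 1)}),
)

# Hypothetical union-like datatype: "10" is related to two values,
# so the mapping is a relation but not a function.
binary_or_decimal = Datatype(
    uri="ex:binaryOrDecimal",
    lexical_space=frozenset({"0", "1", "10"}),
    value_space=frozenset({0, 1, 2, 10}),
    mapping=frozenset({("0", 0), ("1", 1), ("10", 2), ("10", 10)}),
)
```

Both toy datatypes satisfy the coverage conditions on M(d); only the first has a functional mapping.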