This document describes the "Typed Data Literal" (TDL) datatyping scheme (also known as "PD") which is one of the candidate proposals being discussed by the RDF Core Working Group.
The document has no normative status and merely provides a reference for an ongoing discussion within the working group.
Many working group members and members of the RDF Interest Group
have helped to shape this document.
This document describes the "Typed Data Literal" (TDL) datatyping
scheme, which is one of several proposals under consideration by
the RDF Core Working Group (hereafter referred to simply as WG)
for achieving a total solution for datatyping based on the foundational
RDF Datatyping Model [RDF DT] which is itself
defined in terms of the RDF Model Theory
[RDF MT].
The TDL scheme, also known as "PDU" or "PD", is a fusion of the
idioms from two earlier schemes, "P" and "D" (or "DAML"), along with
the conceptual model from "U" (omitting the URV-based idiom).
When type information is omitted, the Model Theory for TDL captures
the ambiguous typing of the Perl programming idiom
[PL].
C.f.
The formal treatment of TDL is presented as a modification to the
RDF Model Theory [RDF MT]. Datatyping is
achieved during interpretation. Each occurrence of a literal
Unicode string may have its own node in the graph and is interpreted
according to the map(s) associated with the datatype(s) associated
by TDL with that node. The graph may be ill-formed because of
datatyping problems (e.g. "three" is not an integer). The informal
intent of TDL is to capture the normal programming paradigm that
the input syntax uses the lexical space of datatypes, and the
"meaning" is in the value space of the datatype. However, for
technical reasons (mainly that the typing in RDF MT is part of the
model rather than the interpretation), the interpretation of each
Unicode string node in the graph is given as a lexical-value pair
within the Universe, which most of the time is treated as being
the value component. For Unicode string nodes with no datatype information,
or whose datatype is not supported, the lexical component of the pair is
more significant.
As always, the intent of the Model Theory is
to capture concepts such as entailment and consistency, but not
to indicate an approach to implementation. In particular,
the existence of lexical-value pairs within the Universe of
Interpretation is not intended to indicate a deep metaphysical
belief in such things!
As defined in section 2 of [RDF DT], for any
given member of a lexical space there exists a mapping to one and
only one member of the value space, referred to as the datatype
mapping. Likewise, for any given member of a canonical lexical
space there exists a mapping to one and only one member of the value
space, referred to as the canonical mapping. Because the
unique and unambiguous identities of the lexical, canonical, and value
spaces are inherent in the identity of the datatype itself, by the
very definition of a datatype, we may uniquely and unambiguously
denote a specific datatype mapping or canonical mapping, and hence
a specific value, simply by the pairing of a lexical form (member
of the lexical space) with the identity of the datatype (which in
the case of RDF is a URI Reference).
[Definition:]
The pairing of a lexical form to a datatype identity is called a
typed data literal (TDL).
If the lexical form is a member of a canonical lexical space, the
TDL denotes both a datatype mapping and a canonical mapping.
For the purpose of mapping a lexical form to a value, however, the
canonical mapping is redundant, because the existence of
a given canonical mapping implies the existence of a datatype mapping
with the same lexical form and value.
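The pairing described above can be sketched in a few lines of Python. This is an illustration only, not part of the proposal: the class names `Datatype` and `TypedDataLiteral` and the helper `integer_mapping` are hypothetical, and the lexical space of xsd:integer is assumed to be an optional sign followed by decimal digits.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Datatype:
    uri: str                       # identity of the datatype (a URI Reference)
    mapping: Callable[[str], int]  # datatype mapping: lexical form -> value


@dataclass(frozen=True)
class TypedDataLiteral:
    lexical_form: str
    datatype: Datatype

    def value(self) -> int:
        # The pair (lexical form, datatype identity) denotes exactly one value.
        return self.datatype.mapping(self.lexical_form)


def integer_mapping(lexical_form: str) -> int:
    # Assumed lexical space: optional sign followed by decimal digits.
    if not re.fullmatch(r"[+-]?[0-9]+", lexical_form):
        raise ValueError(f"{lexical_form!r} is not in the lexical space")
    return int(lexical_form)


XSD_INTEGER = Datatype("http://www.w3.org/2001/XMLSchema#integer", integer_mapping)

thirty = TypedDataLiteral("30", XSD_INTEGER)   # canonical lexical form
padded = TypedDataLiteral("030", XSD_INTEGER)  # non-canonical form, same value
```

Both `thirty` and `padded` denote the same value, which is why a separate canonical mapping adds nothing when the only goal is to get from a lexical form to a value.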
TDL is formalized as changes to the existing RDF Model Theory.
In RDF, classes can be thought of (informally) as corresponding to sets.
In this case rdf:type can be thought of as corresponding to set membership.
The model theory of various proposed datatyping mechanisms
can be contrasted as to which set a datatype then corresponds to.
In this proposal a datatype class corresponds to its map, a set of pairs of lexical strings
and their corresponding values.
A TDL may be defined in several ways in RDF, according to the
particular idiom used. This proposal outlines two such idioms for
defining TDL pairings, one for global (implicit) definitions
and one for local (explicit) definitions. Each idiom is defined
separately below.
Note: For the sake of brevity and clarity, qualified names are used
in the examples provided in this section where normally URI References
are required. The following namespace declarations are assumed in
the examples:
The rdf:value+rdf:type idiom provides a means to explicitly associate
a datatype with a literal value by the use of an anonymous node
for which the properties rdf:value and rdf:type are defined. The
property rdf:value takes the literal (lexical form) as its object
and the property rdf:type takes the URI Reference of the datatype.
The interpretation of the blank node that is the subject of the rdf:value
property is constrained
to be the same as the interpretation of the object of rdf:value, the Unicode node.
This is because rdf:value is interpreted as the identity.
Moreover, this literal-value pair is required to be a member of the datatype's map
by the interpretation of the rdf:type edge.
In the example, "30" is interpreted as < "30", x > for some x; this same pair
is also the interpretation of the blank node, which by the rdf:type constraint
lies in the mapping of xsd:integer. Hence x is the integer 30.
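The reasoning for the local idiom can be traced with a small Python sketch. It is an informal illustration, not an implementation of the model theory: `xsd_integer_map` and `interpret_local` are hypothetical names, and a datatype map is modelled as a function from a lexical form to the set of values paired with it.

```python
import re


def xsd_integer_map(lexical):
    """Values paired with `lexical` in the (assumed) map of xsd:integer."""
    if re.fullmatch(r"[+-]?[0-9]+", lexical):
        return {int(lexical)}
    return set()


def interpret_local(lexical, datatype_map):
    """Pairs allowed for a blank node whose rdf:value is `lexical` and whose
    rdf:type is the datatype with map `datatype_map`."""
    # I(Unicode node) = (lexical, x) for some x.
    # rdf:value is the identity, so I(blank node) is that same pair.
    # rdf:type requires the pair to lie in the datatype's map, which fixes x.
    return {(lexical, x) for x in datatype_map(lexical)}
```

For the lexical form "30" the only surviving pair has x equal to the integer 30. For a string such as "three" the result is empty, meaning no interpretation satisfies the graph.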
The rdfs:range idiom utilizes the RDF Schema [RDF Schema] rdfs:range
property to define an implicit intersection of one or more lexical
data types, which may be used to imply or constrain the datatype(s)
of a literal.
Whether the rdfs:range statement constitutes a constraint on the
allowed datatypes depends on whether there exists any local (explicit)
type assignment. If there is no local typing for the literal value
whatsoever, then rdfs:range can only serve as a global (implicit)
type assignment. However, if the literal has one or more types
defined locally, and any locally specified datatype is not compatible
with all datatypes globally implied by rdfs:range for the property,
one can treat such a case as a contradiction of a constraint on the
expected or required datatype(s) for the property in question.
The normal rdfs:range mechanism constrains the type of the object of
the relevant property. When the object is a string and the type is a supported
datatype, this serves as the global datatyping mechanism.
In the example the Unicode node will have interpretation < "30", x >
for some x. Without the schema information, in RDF, any x is permitted.
The rdfs:range is only relevant in RDFS.
In RDFS, the range constraint applies,
and all valid interpretations will have < "30", x > as being in the
class extension of xsd:integer. This class extension
is systematically understood as referring to the map, rather than the
lexical or the value space, and thus x must be 30.
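A rough Python sketch of the global idiom follows. The property name `ex:age`, the dictionary-based schema, and the function names are assumptions made for illustration. The function returns `None` when no supported range applies, standing for "any x is permitted".

```python
import re

XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def integer_values(lexical):
    # Assumed lexical space of xsd:integer: optional sign, then digits.
    return {int(lexical)} if re.fullmatch(r"[+-]?[0-9]+", lexical) else set()


# Supported datatypes: URI -> map (lexical form -> set of values).
SUPPORTED = {XSD_INTEGER: integer_values}


def interpret_object(lexical, prop, ranges):
    """Return the permitted pairs (lexical, x), or None if x is unconstrained."""
    maps = [SUPPORTED[d] for d in ranges.get(prop, ()) if d in SUPPORTED]
    if not maps:
        return None  # no schema information: any x is permitted
    # Every declared range must contain the pair, so intersect the values.
    values = set.intersection(*(m(lexical) for m in maps))
    return {(lexical, x) for x in values}


schema = {"ex:age": [XSD_INTEGER]}  # hypothetical rdfs:range declaration
```

Calling `interpret_object("30", "ex:age", {})` leaves x unconstrained, while passing `schema` narrows the result to the single pair with x equal to 30.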
It is essential that both global (implicit) and local (explicit)
idioms be able to coexist within the same knowledge base without
undesired interactions -- and in fact, this is essential if
a global idiom is to be used as a constraint on locally defined
datatypes.
The rdfs:range and rdf:value+rdf:type idioms are fully compatible
and can cohabit the same knowledge base freely.
The official desiderata for all proposed datatyping
solutions are defined in [RDF Desiderada].
This section clarifies how each desideratum is satisfied
by this proposal. The list of desiderata is taken verbatim
from the aforementioned document. Each desideratum is followed by a clarification.
The TDL proposal meets all of the defined desiderata.
1. Backward compatibility: with existing RDF data, with existing RDF code, with existing RDF-based specifications like DAML+OIL or CC/PP
TDL is fully backwards compatible with all known systems
and idioms insofar as it does not require modification to the
present RDF graph model, does not require modification to the
present XML serialization, adopts the idioms presently used
by DAML+OIL, and (insofar as can be determined from the official
materials) is compatible with the typing idioms employed
by CC/PP.
2. Ability to use built-in primitive XML Schema datatypes
TDL allows the use of any descendant of the XML Schema
type "anySimpleType", both the predefined types and
all custom types. This does not mean that every application
will support the interpretation or validation of values
associated with those types, but that all values of such
types can be denoted in RDF by a TDL pairing.
3. Ability to use non-XML-Schema datatypes
TDL allows the use of any lexical datatype that conforms
to the definition given here and in the reference documents,
and that has a URI denotation.
This does not mean that every application
will support the interpretation or validation of values
associated with those types, but that all values of such
types can be denoted in RDF by a TDL pairing.
4. Ability to define datatypes using schema languages rather than relying
on "built-in" data types
This is considered to be addressed in #3 above as well as
by the default interpretation of non-typed literals.
5. Ability to represent type information without an associated RDF schema
The TDL local/explicit idiom provides for the representation of TDL
pairings, and thus the typing of literal values, without any need
to reference an external schema to determine typing of literals.
6. Ability to reference type information in an associated RDF schema
The TDL global/implicit idiom allows TDL pairings, and thus the
typing of literal values, to be encoded in
one or more external schemas to imply typing of literals and/or
constraints on the typing of locally typed literals.
7. Co-existence of "global" and "local" typing mechanisms
The TDL idioms for global and local typing are fully compatible and
may coexist freely in the same knowledge base without undesirable
interaction.
8. Provide account of datatyping scheme semantics
The TDL proposal provides a full account of datatyping semantics.
9. Support for existing data typing idioms
This is considered to be addressed in #1 above.
Datatypes are viewed as in Patel-Schneider's work [SWOL]. That is,
each datatype d has four components: a URI u(d), a lexical space, a value space V(d), and a mapping.
For each d in DT
XML Schema views the map associated with a union datatype
as a function, even when the various types in the
union have overlapping domain.
The ambiguity is resolved by considering the order of
the union.
The TDL datatyping for RDF does not respect this.
In TDL the map associated with an XML Schema Union datatype is the set theoretic
union of the maps of each of the subtypes of the union. Thus strings in the overlap
of the domains are generally ambiguous.
Thus, if all we know about a string is that it lies in a union type, then finitely many
interpretations of that string may be valid, rather like the unconstrained ambiguity
for untyped strings.
As an example, if we say that an age property has as its range the union of decimal integers
and binary integers, we cannot tell whether someone who is "100" is very old or a
pre-schooler (but they are one or the other).
If the document author wishes to avoid this ambiguity, then the subtype
should be specified, typically using the rdf:value+rdf:type local idiom.
This is preferred to the use of xsi:type recommended by XML Schema.
The motivation for this small departure from the XML Schema Datatype recommendation
is that RDF Model Theory is monotone
and hence does not accommodate the default mechanism inherent in XML Schema
Union datatypes.
There is no requirement to disambiguate the union, and the value can
be left as ambiguous.
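The age example can be made concrete with a short Python sketch. The names `decimal_map`, `binary_map`, and `union_map` are hypothetical. Each map returns the set of values paired with a lexical form, so the union type's map is simply the union of its members' results.

```python
import re


def decimal_map(lexical):
    # Assumed decimal integer type: one or more digits, read in base 10.
    return {int(lexical, 10)} if re.fullmatch(r"[0-9]+", lexical) else set()


def binary_map(lexical):
    # Assumed binary integer type: one or more of 0 and 1, read in base 2.
    return {int(lexical, 2)} if re.fullmatch(r"[01]+", lexical) else set()


def union_map(*member_maps):
    """Set-theoretic union of the member maps; the order of members is irrelevant."""
    def mapped(lexical):
        values = set()
        for member in member_maps:
            values |= member(lexical)
        return values
    return mapped


age_range = union_map(decimal_map, binary_map)
```

Under these assumptions `age_range("100")` yields both 100 and 4, while `age_range("42")` yields only 42 because "42" lies outside the binary lexical space. An XML Schema processor would instead pick the first member type that accepts the string.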
Table of Contents
1 Introduction
2 Definition
2.1 Overview
2.2 An Introduction to the Model Theory for TDL
This section gives a lightweight overview; the interested reader should
read Appendix A for the full detail. XML Schema Union datatypes are omitted
from this section; see Appendix B for how they are addressed.
Datatypes are viewed as in Patel-Schneider's work [SWOL]. That is
each datatype has four components, a URI, a lexical space, a value space,
and a mapping.
An RDF interpretation is with respect to some set of datatypes, which corresponds
to the supported datatypes in an RDF implementation. An implementation is free to
not support datatyping, in which case the set of datatypes is empty.
Terminology
We modify the terminology of the Model Theory to differentiate between
literals before datatyping and literals after datatyping. The modification
is:
The Interpretation of Datatype Classes
The Interpretation of Unicode Nodes
An interpretation maps each Unicode node to some literal-value pair. The
Unicode string component is given by the label on the node. The type information
is checked by requiring this pair to be a member of each class associated
with this node (e.g. by a range constraint). As above, class membership
of datatype classes refers to the map of the datatype. Note that for
technical reasons the 'typed value' of the interpretation of untyped Unicode
nodes is unrestricted, i.e. there is no default type.
The Interpretation of rdf:value
Following Graham Klyne's suggestion, rdf:value is simply equality.
The Interpretation of Asserted Triples
These changes to the model theory can be seen as changes in
the interpretation of triples.
Triples with predicate rdf:value or rdf:type are treated specially: rdf:value
is treated as equality, and rdf:type knows the supported datatypes and treats them essentially
as the map of the datatype (i.e. <s, rdf:type, d> holds iff I(s) is
a literal-value pair in the map of d).
For other triples the model theory is unchanged, although in the Universe
of interpretation the old literal values are now represented as literal-value
pairs, and hence the representation of triples with literal objects is slightly
different.
Multiple types
A literal-value pair may belong to multiple types, in which case a legal
RDF graph may show multiple type information for that literal-value pair,
using the local idiom, the global idiom, or both. Sometimes the intersection of
multiple types may be surprisingly small but not empty. For example, a binary
integer type and a non-negative decimal integer type may have intersection {
("0",0), ("1",1) }; either of these two literal-value pairs would be legal, but
a Unicode string "10" cannot be interpreted in the presence of such conflicting
type information, despite being in both lexical spaces and despite the two
value spaces being the same. (Contrast with S-B, which permits "10" in such
a case).
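The small intersection can be checked with a Python sketch. It assumes a binary integer type and a decimal integer type whose lexical space includes "0"; the function names are hypothetical. A pair is legal only if it lies in the map of every type attached to the node.

```python
import re


def binary_map(lexical):
    # Assumed binary integer type: digits 0 and 1, read in base 2.
    return {int(lexical, 2)} if re.fullmatch(r"[01]+", lexical) else set()


def decimal_map(lexical):
    # Assumed decimal integer type admitting zero: digits read in base 10.
    return {int(lexical, 10)} if re.fullmatch(r"[0-9]+", lexical) else set()


def pairs_in_all(lexical, *datatype_maps):
    """Literal-value pairs for `lexical` that belong to every given map.

    At least one map must be supplied.
    """
    values = set.intersection(*(m(lexical) for m in datatype_maps))
    return {(lexical, v) for v in values}
```

The string "10" is in both lexical spaces, yet it pairs with 2 in one map and with 10 in the other, so no single pair satisfies both types and `pairs_in_all("10", binary_map, decimal_map)` is empty.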
Unsupported Datatypes
An RDF implementation only knows some datatypes, and in particular may
not be aware of a datatype used in a particular RDF document. The model theory
reflects this by having an interpretation with respect to some set of datatypes
(the supported datatypes). In practice, documents with an unsupported
datatype constrain the datatype (in that the lexical occurrences in the document
must be in the lexical space of the datatype), whereas supported datatypes
constrain the document (in that the document may be ill-formed in that the
Unicode nodes are labelled with strings that are not in the domain of the
relevant datatypes). The model theory is monotone with respect to the set
of supported datatypes, meaning that implementations supporting fewer datatypes
will make correct inferences but not all inferences (e.g. they will not infer
a contradiction when datatyping is invalid).
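The monotonicity claim can be illustrated with a minimal Python sketch. The datatype names, the `contradictions` function, and the sample document are assumptions for illustration. An implementation only reports a contradiction for datatypes it supports.

```python
import re

INTEGER = "xsd:integer"


def integer_values(lexical):
    # Assumed lexical space of xsd:integer: optional sign, then digits.
    return {int(lexical)} if re.fullmatch(r"[+-]?[0-9]+", lexical) else set()


def contradictions(typed_literals, supported):
    """Typed literals that an implementation with `supported` maps can reject.

    `supported` maps a datatype name to its map (lexical form -> set of values).
    Literals of unsupported datatypes are never rejected.
    """
    return [(lexical, datatype)
            for lexical, datatype in typed_literals
            if datatype in supported and not supported[datatype](lexical)]


# Hypothetical document: two integer literals and one of an unknown type.
document = [("30", INTEGER), ("three", INTEGER), ("abc", "ex:unknownType")]
```

With no supported datatypes nothing is rejected. Once xsd:integer is supported, only ("three", xsd:integer) is flagged, so adding datatypes can add inferences but never retracts earlier ones.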
3 Representation of Typed Data Literals in RDF
xmlns:rdf ="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs ="http://www.w3.org/2000/01/rdf-schema#"
xmlns:xsd ="http://www.w3.org/2001/XMLSchema#"
xmlns:ex ="uuid:f82dad84-0a58-11d6-9542-0003931df47c/"
3.1 The rdf:value+rdf:type Local Idiom
Model Theoretic Interpretation of Local Idiom
3.2 The rdfs:range Global Idiom
Model Theoretic Interpretation of Global Idiom
3.3 Compatibility Between Idioms
4 Satisfaction of Desiderata
The model theory explicitly covers the old case of supporting no datatypes,
and behaves monotonically as new datatypes are added.
Inasmuch as existing practice allows user typing of untyped literals
(as in the PL proposal [PL] and the Jena (v1.3) system),
the model theory respects that, in that untyped literals can be understood
as having any typed value.
Appendix A: The Model Theory for TDL
Unlike previous work, the mapping is a relationship rather than a function.
This is specifically to accommodate XML Schema Union datatypes. A full discussion
of these is found in the next appendix. For all other datatypes the mapping
is a function. Each datatype is a resource and is found in the Universe of
interpretation.
An RDF interpretation is with respect to some possibly empty set, DT, of
datatypes. DT is a subset of IR, the set of resources.
We use a set IR of resources, the set U of Unicode strings, and a set VL
of values. V(d) is a subset of VL for every d in DT. The Universe is IR
union ( U x VL ).
Terminology
The Interpretation of Unicode Nodes
Each Unicode node is interpreted as a literal-value pair.
If E is labelled with u, then I(E) = (u,v) for some v in VL.
The Interpretation of Datatype URIs
If E is a node labelled with a uriref and the label of E is u(d) for some d in DT, then I(E) = d.
The Interpretation of Blank Nodes
The mapping A on blank nodes is unrestricted and a blank node can be interpreted
as any object in the Universe (including literal-value pairs).
The Interpretation of Asserted Triples
The function IEXT is modified as follows:
IEXT maps the set of properties IP into the powerset of ( Universe x Universe ).
IEXT(rdf:value) is the identity relation on the Universe.
IEXT(rdf:type) contains the pair ( (unicode-string, value), d )
if and only if (unicode-string, value) is in the map
associated with d.
Appendix B: Union Datatypes
References
Last Modified: $Date: 2002/01/25 11:46:28 $