The "Typed Data Literal" (TDL) Datatyping Scheme
Version
Version 2, January 31, 2002.
Abstract
This document describes the "Typed Data Literal" (TDL) datatyping scheme
(also known as "PD") which is one of the candidate proposals being discussed
by the RDF Core Working Group.
Status of this Document
The document has no normative status and merely provides a reference for
an ongoing discussion within the working group.
Authors
- Jeremy Carroll
- Patrick Stickler
Many working group members and members of RDF Interest have helped to shape
this document.
Special thanks to Dan Connolly for detecting
a serious flaw with version 1, and to Graham Klyne for helping fix it.
1 Introduction
This document describes the "Typed Data Literal" (TDL) datatyping scheme,
which is one of several proposals under consideration by the RDF Core Working
Group (hereafter referred to simply as WG) for achieving a total solution
for datatyping based on the foundational RDF Datatyping Model [RDF DT],
which is itself defined in terms of the RDF Model Theory [RDF MT].
The TDL scheme, also known as "PDU" or "PD", is a fusion of the idioms
from two earlier schemes "P" and "D" (or "DAML") along with the conceptual
model from "U" (omitting the URV based idiom). When type information is
omitted the Model Theory for TDL captures the ambiguous typing of the Perl
programming idiom [PL].
C.f.
The formal treatment of TDL is presented as a modification to the RDF Model
Theory [RDF MT]. Datatyping is achieved during interpretation.
Each occurrence of a literal Unicode string may have its own node in the
graph and is interpreted according to the map(s) associated with the datatype(s)
associated by TDL with that node. The graph may be ill-formed because of
datatyping problems (e.g. "three" is not an integer). The informal intent
of TDL is to capture the normal programming paradigm that the input syntax
uses the lexical space of datatypes, and the "meaning" is in the value
space of the datatype. However, for technical reasons (mainly that the
typing in RDF MT is part of the model rather than the interpretation),
the interpretation of each Unicode string node in the graph is given as
a lexical-value pair within the Universe, which most of the time is treated
as being the value component. For Unicode string nodes with no datatype
information, or whose datatype is not supported, the lexical component
of the pair is more significant. As always, the intent of the Model Theory
is to capture concepts such as entailment, consistency etc. but not to
indicate an approach to implementation. In particular, the existence of
lexical-value pairs within the Universe of Interpretation is not intended
to indicate a deep metaphysical belief in such things!
2 Definition
2.1 Overview
As defined in section 2 of [RDF DT], for any given
member of a lexical space there exists a mapping to one and only one member
of the value space, referred to as the datatype mapping. Likewise,
for any given member of a canonical lexical space there exists a mapping
to one and only one member of the value space, referred to as the canonical
mapping. Because the unique and unambiguous identity of the lexical,
canonical, and value spaces are inherent in the identity of the datatype
itself, by the very definition of a datatype, we may uniquely and unambiguously
denote a specific datatype mapping or canonical mapping, and hence a specific
value, simply by the pairing of a lexical form (member of the lexical space)
with the identity of the datatype (which in the case of RDF is a URI Reference).
[Definition:] The
pairing of a lexical form to a datatype identity is called a
typed data
literal (TDL).
If the lexical form is a member of a canonical lexical space, the TDL
denotes both a datatype mapping and a canonical mapping. For the purpose
of mapping a lexical form to a value, however, the canonical mapping
is redundant, since the existence of a given canonical mapping
implies the existence of a datatype mapping having the same pair of lexical
form and value members.
A TDL uniquely denotes a member of the value
space of the datatype because there is a one-to-one correspondence between
TDL pairings and datatype mappings:
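As a minimal sketch of this pairing (not part of the proposal itself), a datatype mapping can be modelled in Python as a function from lexical forms to values, and a TDL as a (lexical form, datatype URI) tuple. The names `DATATYPE_MAPPINGS` and `denote` are hypothetical, and Python's `int` and `str` only approximate the xsd:integer and xsd:string mappings.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"

# Hypothetical registry: datatype URI -> datatype mapping (lexical form -> value).
# int() only approximates the xsd:integer lexical space.
DATATYPE_MAPPINGS = {
    XSD + "integer": int,
    XSD + "string": str,
}

def denote(tdl):
    """Return the value denoted by a TDL, i.e. a (lexical form, datatype URI) pair."""
    lexical_form, datatype_uri = tdl
    mapping = DATATYPE_MAPPINGS[datatype_uri]
    return mapping(lexical_form)

# The same lexical form paired with different datatypes denotes different values.
thirty = denote(("30", XSD + "integer"))
text = denote(("30", XSD + "string"))
```

The pairing alone is enough to pick out the value: no context beyond the lexical form and the datatype identity is consulted.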
2.2 An Introduction to the Model Theory for TDL
TDL is formalized as changes to the existing RDF Model Theory.
This section gives a lightweight overview; the interested reader should
read Appendix A for the full detail. XML Schema Union
datatypes are omitted from this section; see Appendix B
for how they are addressed.
Datatypes are viewed as in Patel-Schneider's work [SWOL].
That is, each datatype has four components: a URI, a lexical space, a value
space, and a mapping.
An RDF interpretation is with respect to some set of datatypes, which
corresponds to the supported datatypes in an RDF implementation. An implementation
is free to not support datatyping, in which case the set of datatypes is
empty.
Terminology
For clarity we use disjoint terminology to differentiate
between literals in the graph syntax and their interpretation both in the
universe of the model theory and by an RDF application.
- We use the term "Unicode node" to refer to a node in the graph labelled
with a Unicode string.
- We use the term literal-value pair to refer to a pair consisting of a Unicode
string and a 'typed value'. The interesting literal-value pairs are ones
that belong to the mapping of some datatype.
- We do not use terminology such as "literal node" or "literal value".
- We refer to the set of datatypes used in an RDF interpretation as the "supported
datatypes".
The literal-value pairs occur in the model theory's
universe. The intent is that RDF applications may manipulate either or
both of the Unicode string or the typed value.
The Interpretation of Datatype Classes
In RDF, classes can be thought of (informally) as corresponding to sets.
In this case rdf:type can be thought of as corresponding to set membership.
The model theory of various proposed datatyping mechanisms can be contrasted
as to which set a datatype then corresponds to.
In this proposal a datatype class corresponds to its map, a set of pairs
of lexical strings and their corresponding values.
The Interpretation of Unicode Nodes
An interpretation maps each Unicode node to some
literal-value pair. The unicode string component is given by the label
on the node. The type information is checked by requiring this pair to
be a member of each class associated with this node (e.g. by a range constraint).
As above, class membership of datatype classes refers to the map of the
datatype. Note that for technical reasons the 'typed value' of the interpretation
of untyped Unicode nodes is unrestricted, i.e. there is no default type.
This interpretation of Unicode nodes as literal-value
pairs is existentially quantified, just like the interpretation of unlabelled
nodes. The first component of this interpretation is constrained to be
the Unicode string, but the second component is unconstrained. The existential
quantification happens at graph scope, like that of unlabelled nodes. This
has the effect that if there is any such literal-value pair which:
- satisfies all the triples
- in particular, satisfies the type constraints
then that pair is selected in the existential quantification.
Thus, if there is a type constraint on the unicode node as xsd:integer,
for example, then the only value that can be selected for the string "20"
is the pair < "20", 20 >.
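The selection described above can be sketched as a filter over candidate values. This is an illustration under stated assumptions, not an implementation strategy endorsed by the model theory: `integer_map` is a stand-in for the xsd:integer map, `VL` is a small finite stand-in for the set of values, and `interpretations` is a hypothetical helper name.

```python
def integer_map(lexical):
    """Values paired with `lexical` in a stand-in for the xsd:integer map."""
    try:
        return {int(lexical)}
    except ValueError:
        return set()

def interpretations(label, type_maps, candidate_values):
    """All literal-value pairs (label, v) that lie in every constraining map.

    With no type constraints every candidate value is allowed,
    mirroring the absence of a default type.
    """
    pairs = set()
    for v in candidate_values:
        if all(v in m(label) for m in type_maps):
            pairs.add((label, v))
    return pairs

# A small finite stand-in for the set of values.
VL = {20, 16, "20", 2.0}

typed = interpretations("20", [integer_map], VL)         # constrained to xsd:integer
untyped = interpretations("20", [], VL)                  # no type information
ill_formed = interpretations("three", [integer_map], VL) # no pair satisfies the constraint
```

An empty result corresponds to an ill-formed graph: no literal-value pair exists that satisfies the type constraint.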
The Interpretation of Unlabelled Nodes
A further change to the model theory is that unlabelled
nodes may be mapped to anything in the universe, including literal-value
pairs.
The Interpretation of rdf:value
Following Graham Klyne's suggestion, rdf:value is interpreted simply as equality.
The Interpretation of Asserted Triples
These changes to the model theory can be seen as changes in the interpretation
of triples.
Those with predicate rdf:value or rdf:type are both treated specially:
rdf:value as equality, and rdf:type knows the supported datatypes and treats
them essentially as the map of the datatype (i.e. <s, rdf:type, d> iff
I(s) is a literal-value pair in the map of d).
For other triples the model theory is unchanged, although in the Universe
of interpretation the old literal values are now represented as literal
value pairs, and hence the representation of triples with literal objects
is slightly different.
Multiple types
A literal-value pair may belong to multiple types, in which case a legal
RDF graph may show multiple type information for that literal-value pair,
using either or both of the local and global idioms. Sometimes the intersection of
multiple types may be surprisingly small but not empty. For example, a
binary integer type and a non-negative decimal integer type may have intersection
{ ("0",0), ("1",1) }; either of these two literal-value pairs would be legal,
but a Unicode string "10" cannot be interpreted in the presence of such
conflicting type information, despite being in both lexical spaces and
despite the two value spaces being the same. (Contrast with S-B, which
permits "10" in such a case).
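The intersection argument can be checked with two toy datatypes. Both maps below are assumptions made for illustration (neither is a real XML Schema type), and the decimal type is assumed to admit zero so that ("0",0) lies in both maps.

```python
def binary_map(lexical):
    """Stand-in binary integer datatype: strings of 0/1 read in base 2."""
    if lexical and set(lexical) <= {"0", "1"}:
        return {int(lexical, 2)}
    return set()

def decimal_map(lexical):
    """Stand-in non-negative decimal integer datatype."""
    if lexical.isascii() and lexical.isdigit():
        return {int(lexical)}
    return set()

def common_values(lexical, maps):
    """Values v such that (lexical, v) is in every one of the given maps."""
    value_sets = [m(lexical) for m in maps]
    return set.intersection(*value_sets) if value_sets else set()

both = [binary_map, decimal_map]
zero = common_values("0", both)   # ("0", 0) is in both maps
one = common_values("1", both)    # ("1", 1) is in both maps
ten = common_values("10", both)   # binary gives 2, decimal gives 10: no common pair
```

The string "10" is in both lexical spaces, but because class membership is tested on the pair rather than on the string, no single literal-value pair satisfies both constraints.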
Unsupported Datatypes
An RDF implementation only knows some datatypes, and in particular may
not be aware of a datatype used in a particular RDF document. The model
theory reflects this by having an interpretation with respect to some set
of datatypes (the supported datatypes). In practice, documents with
an unsupported datatype constrain the datatype (in that the lexical occurrences
in the document must be in the lexical space of the datatype), whereas
supported datatypes constrain the document (in that the document may be
ill-formed in that the unicode nodes are labelled with strings that are
not in the domain of the relevant datatypes). The model theory is monotone
with respect to the set of supported datatypes, meaning that implementations
supporting fewer datatypes will make correct inferences but not all inferences
(e.g. they will not infer a contradiction when datatyping is invalid).
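A small sketch of this behaviour, assuming a hypothetical checker `is_ill_formed` and a stand-in map for xsd:integer: the same document is flagged only by an implementation that supports the datatype involved.

```python
def integer_map(lexical):
    """Stand-in for the xsd:integer map."""
    try:
        return {int(lexical)}
    except ValueError:
        return set()

# Hypothetical table of datatypes whose maps are known to this sketch.
ALL_DATATYPES = {"xsd:integer": integer_map}

def is_ill_formed(typed_literals, supported):
    """True if some literal is outside the lexical space of a *supported* type.

    typed_literals: iterable of (unicode string, datatype name) pairs.
    Unsupported datatypes are never checked, so they cannot cause a contradiction.
    """
    for lexical, datatype in typed_literals:
        if datatype in supported and not ALL_DATATYPES[datatype](lexical):
            return True
    return False

# ex:unknown is an invented datatype name that no implementation here supports.
document = [("30", "xsd:integer"), ("three", "xsd:integer"), ("x", "ex:unknown")]

with_integer = is_ill_formed(document, supported={"xsd:integer"})
with_nothing = is_ill_formed(document, supported=set())
```

Adding a datatype to the supported set can only turn a previously accepted document into a rejected one, never the reverse, which is the monotonicity property stated above.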
3 Representation of Typed Data Literals in RDF
A TDL may be defined in several ways in RDF, according to the particular
idiom used. This proposal outlines two such idioms for defining TDL pairings,
one for global (implicit) definitions and one for local (explicit) definitions.
Each idiom is defined separately below.
Note: For the sake of brevity and clarity, qualified names are used
in the examples provided in this section where normally URI References
are required. The following namespace declarations are assumed in the examples:
xmlns:rdf ="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs ="http://www.w3.org/2000/01/rdf-schema#"
xmlns:xsd ="http://www.w3.org/2001/XMLSchema#"
xmlns:ex ="uuid:f82dad84-0a58-11d6-9542-0003931df47c/"
3.1 The rdf:value+rdf:type Local Idiom
The rdf:value+rdf:type idiom provides a means to explicitly associate a
datatype with a literal value by the use of an anonymous node for which
the properties rdf:value and rdf:type are defined. The property rdf:value
takes the literal (lexical form) as its object and the property rdf:type
takes the URI Reference of the datatype.
Per the statements below, the lexical form
"30" is explicitly declared to be a member of the lexical space of the
datatype 'xsd:integer':
Model Theoretic Interpretation of Local Idiom
The interpretation of the blank node which is subject of the rdf:value
is constrained to be the same as the interpretation of the object of the
rdf:value, the unicode node. This is because rdf:value is interpreted as
the identity. Moreover, this literal-value pair is required to be a mapping
in the datatype by the interpretation of the rdf:type edge.
In the example, the "30" is interpreted as < "30", x > for some x;
this same pair is also the interpretation of the blank node, which by the
rdf:type constraint lies in the mapping of xsd:integer. Hence x is the
integer 30.
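The reasoning above can be sketched in Python, with triples written as plain tuples. The subject ex:Jenny and the property ex:age are hypothetical, chosen only for illustration, as are the helper names; `integer_map` is a stand-in for the xsd:integer map.

```python
def integer_map(lexical):
    """Stand-in for the xsd:integer map."""
    try:
        return {int(lexical)}
    except ValueError:
        return set()

SUPPORTED = {"xsd:integer": integer_map}

# Hypothetical graph: ex:Jenny and ex:age are invented for illustration.
# "_:b" is the anonymous node; the plain string "30" is the Unicode node.
graph = [
    ("ex:Jenny", "ex:age", "_:b"),
    ("_:b", "rdf:value", "30"),
    ("_:b", "rdf:type", "xsd:integer"),
]

def resolve_local(graph, node):
    """Return the literal-value pairs the blank node may denote.

    rdf:value is identity, so the node denotes the same pair (label, x) as
    the Unicode node; rdf:type then requires that pair to be in the map.
    """
    label = next(o for s, p, o in graph if s == node and p == "rdf:value")
    types = [o for s, p, o in graph if s == node and p == "rdf:type"]
    values = None
    for t in types:
        if t in SUPPORTED:
            found = SUPPORTED[t](label)
            values = found if values is None else values & found
    # None means no supported type constrained x, so x is unrestricted.
    return None if values is None else {(label, v) for v in values}

pairs = resolve_local(graph, "_:b")
```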
3.2 The rdfs:range Global Idiom
The rdfs:range idiom utilizes the RDF Schema [RDF Schema] rdfs:range property
to define an implicit intersection of one or more lexical data types, which
may be used to imply or constrain the datatype(s) of a literal.
Per the following RDF statements, the lexical
form "30" is implied (or required) to be a member of the lexical space
of the datatype 'xsd:integer':
Whether the rdfs:range statement constitutes a constraint on the allowed
datatypes depends on whether there exists any local (explicit) type assignment.
If there is no local typing for the literal value whatsoever, then rdfs:range
can only serve as a global (implicit) type assignment. However, if the
literal has one or more types defined locally, and any locally specified
datatype is not compatible with all datatypes globally implied by rdfs:range
for the property, one can treat such a case as a contradiction of a constraint
on the expected or required datatype(s) for the property in question.
Model Theoretic Interpretation of Global Idiom
The normal rdfs:range mechanism constrains the type of the object of the
relevant property. When the object is a Unicode node and the type is a supported
datatype, this serves as the global datatyping mechanism.
In the example the Unicode node will have interpretation < "30",
x > for some x. Without the schema information, in RDF, any x is permitted.
The rdfs:range is only relevant in RDFS. In RDFS, the range constraint
applies, and all valid interpretations will have < "30", x > as being
in the class extension of xsd:integer. This class extension is systematically
understood as referring to the map, rather than the lexical or the value
space, and thus x must be 30.
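A sketch of the same argument in Python, with the schema and the statement written as tuples. The names ex:Jenny and ex:age are hypothetical, `integer_map` is a stand-in for the xsd:integer map, and the `use_rdfs` flag is an assumption used here to contrast plain RDF with RDFS interpretation.

```python
def integer_map(lexical):
    """Stand-in for the xsd:integer map."""
    try:
        return {int(lexical)}
    except ValueError:
        return set()

SUPPORTED = {"xsd:integer": integer_map}

schema = [("ex:age", "rdfs:range", "xsd:integer")]
# Hypothetical statement whose object is the Unicode node labelled "30".
statement = ("ex:Jenny", "ex:age", "30")

def resolve_global(schema, statement, use_rdfs=True):
    """Return the allowed pairs (label, x), or None if x is unrestricted."""
    _, prop, label = statement
    if not use_rdfs:
        return None  # plain RDF: rdfs:range is not interpreted, any x is permitted
    values = None
    for s, p, datatype in schema:
        if s == prop and p == "rdfs:range" and datatype in SUPPORTED:
            found = SUPPORTED[datatype](label)
            values = found if values is None else values & found
    return None if values is None else {(label, v) for v in values}

in_rdfs = resolve_global(schema, statement)
in_rdf = resolve_global(schema, statement, use_rdfs=False)
```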
3.3 Compatibility Between Idioms
It is essential that both global (implicit) and local (explicit) idioms
be able to coexist within the same knowledge base without undesired interactions
-- and in fact, this is essential if a global idiom is to be used as a
constraint on locally defined datatypes. The rdfs:range and rdf:value+rdf:type
idioms are fully compatible and can cohabit the same knowledge base freely.
Cohabitation of global and local idioms:
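As an illustrative sketch only, the graph below mixes both idioms: a range declaration on a property and locally typed anonymous nodes as its objects. All ex: names, including the datatype ex:binaryInteger, are hypothetical, and the two maps are stand-ins rather than real XML Schema types.

```python
def integer_map(lexical):
    """Stand-in for the xsd:integer map."""
    try:
        return {int(lexical)}
    except ValueError:
        return set()

def binary_map(lexical):
    """Stand-in for a hypothetical binary integer datatype."""
    if lexical and set(lexical) <= {"0", "1"}:
        return {int(lexical, 2)}
    return set()

SUPPORTED = {"xsd:integer": integer_map, "ex:binaryInteger": binary_map}

graph = [
    ("ex:age", "rdfs:range", "xsd:integer"),   # global idiom
    ("ex:Jenny", "ex:age", "_:a"),
    ("_:a", "rdf:value", "30"),                # local idiom, agrees with the range
    ("_:a", "rdf:type", "xsd:integer"),
    ("ex:Bob", "ex:age", "_:b"),
    ("_:b", "rdf:value", "10"),                # local idiom, conflicts with the range
    ("_:b", "rdf:type", "ex:binaryInteger"),
]

def types_for(graph, node):
    """Datatypes asserted locally on the node plus those implied by rdfs:range."""
    local = [o for s, p, o in graph if s == node and p == "rdf:type"]
    props = [p for s, p, o in graph if o == node]
    implied = [o for s, p, o in graph if p == "rdfs:range" and s in props]
    return local + implied

def values_for(graph, node):
    """Values x such that (label, x) is in the map of every supported type."""
    label = next(o for s, p, o in graph if s == node and p == "rdf:value")
    value_sets = [SUPPORTED[t](label) for t in types_for(graph, node) if t in SUPPORTED]
    return set.intersection(*value_sets) if value_sets else None

jenny = values_for(graph, "_:a")
bob = values_for(graph, "_:b")
```

In this sketch an empty result for a node signals that the local type contradicts the range constraint, while a non-empty result shows the two idioms agreeing on one value.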
4 Satisfaction of Desiderata
The official desiderata for all proposed datatyping solutions are defined
in [RDF Desiderada].
This section clarifies how each desideratum is satisfied by this proposal.
The list of desiderata is taken verbatim from the aforementioned document.
Clarifications are in italics.
The TDL proposal meets all of the defined desiderata.
Backward compatibility
TDL is fully backwards compatible with all known systems and idioms
insofar as it does not require modification to the present RDF graph model,
does not require modification to the present XML serialization, adopts
the idioms presently used by DAML+OIL, and (insofar as can be determined
from the official materials) is compatible with the typing idioms employed
by CC/PP.
The model theory explicitly covers the old case of supporting no
datatypes, and behaves monotonically as new datatypes are added.
Inasmuch as existing practice allows user typing of untyped literals
(as in the PL proposal [PL] and the Jena (v1.3) system), the model theory
respects that, in that untyped literals can be understood as having any
typed value.
Ability to use built-in primitive XML Schema datatypes
TDL allows the use of any descendant of the XML Schema type "anySimpleType",
both the predefined types as well as all custom types. This does not mean
that every application will support the interpretation or validation of
values associated with those types, but that all values of such types can
be denoted in RDF by a TDL pairing.
Ability to use non-XML-Schema datatypes
TDL allows the use of any lexical datatype, conforming to the definition
given here and in reference documents to that end, and which has URI denotation.
This does not mean that every application will support the interpretation
or validation of values associated with those types, but that all values
of such types can be denoted in RDF by a TDL pairing.
Ability to define datatypes using schema languages rather than relying
on "built-in" data types.
This is considered to be addressed under "Ability to use non-XML-Schema
datatypes" above (item 3), as well as by the default interpretation of non-typed literals.
Ability to represent type information without an associated RDF schema
The TDL local/explicit idiom provides for the representation of TDL
pairings, and thus the typing of literal values, without any need to reference
an external schema to determine typing of literals.
Ability to reference type information in an associated RDF schema
The TDL global/implicit idiom provides for the representation of
TDL pairings, and thus the typing of literal values, to be encoded in one
or more external schemas to imply typing of literals and/or constraints
on the typing of locally typed literals.
Co-existence of "global" and "local" typing mechanisms
The TDL idioms for global and local typing are fully compatible and
may coexist freely in the same knowledge base without undesirable interaction.
Provide account of datatyping scheme semantics
The TDL proposal provides a full account of datatyping semantics.
Support for existing data typing idioms
This is considered to be addressed under "Backward compatibility" above (item 1).
Appendix A: The Model Theory for TDL
Datatypes are viewed as in Patel-Schneider's work [SWOL].
That is, each datatype d has four components:
- u(d): the URI reference
- L(d): the lexical space (a subset of the set of Unicode strings)
- V(d): the value space
- M(d): a subset of L(d) x V(d), such that there is at least one pair in M(d) for
each string of L(d), and at least one pair in M(d) for each value in V(d).
Unlike previous work, the mapping is a relationship rather than a function.
This is specifically to accommodate XML Schema Union datatypes. A full discussion
of these is found in the next appendix. For all other datatypes the mapping
is a function. Each datatype is a resource and is found in the Universe
of interpretation.
An RDF interpretation is with respect to some possibly empty set, DT,
of datatypes. DT is a subset of IR, the set of resources.
We use a set IR of resources, a set U of Unicode strings and a
set VL of values. V(d) is a subset of VL for every d in DT. The Universe
is IR union ( U x VL ).
Terminology
- Unicode node: a node in the graph labelled with a Unicode string.
- literal-value pair: a pair in U x VL.
The Interpretation of Unicode Nodes
Unicode nodes are not permitted in ground graphs,
but are treated similarly to blank nodes.
Each Unicode node is existentially interpreted
as a literal-value pair, using the interpretation function A.
That is, if E is labelled with u, then A(E) = (u,v) for some v in VL.
The Interpretation of Datatype URIs
If E is a node labelled with a uriref and the label of E=u(d) for some
d in DT, then I(E) = d.
The Interpretation of Blank Nodes
The mapping A on blank nodes is unrestricted and a blank node can be interpreted
as any object in the Universe (including literal-value pairs).
The Interpretation of Asserted Triples
The function IEXT is modified as follows:
IEXT maps the set of properties IP into the powerset of ( Universe
x Universe ).
IEXT(rdf:value) is the identity relation on the Universe.
For each d in DT, IEXT(rdf:type) contains the pair ( (unicode-string, value), d )
if and only if (unicode-string, value) is in the map associated with d.
The Interpretation of Graphs
Because Unicode nodes are being treated existentially,
the set anon(E) of a graph E is redefined to be all the unlabelled nodes
and unicode nodes in E.
Given this, and given the modifications of A
above, the interpretation of a graph E is unchanged and is defined as:
If E is an RDF graph then
I(E) = true if [I+A'](E) = true for some mapping A' from anon(E) to IR,
otherwise I(E)= false
This corresponds to saying that
given a mapping IS from the vocabulary
to the universe
a graph is consistent with that mapping
if there is some mapping from the unlabelled
nodes to the universe
and some mapping of literal nodes to values
such that all the triples hold. (Including all
those that constrain the datatype of each literal node)
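The existential reading can be illustrated by a brute-force search over a finite universe. This is a sketch under strong assumptions and not an implementation approach suggested by the model theory: `IR` and `VL` are tiny finite stand-ins, URI references are taken to denote themselves, ordinary properties are assumed to impose no constraint (their extensions can be chosen freely), and all helper names are hypothetical.

```python
from itertools import product

def integer_map(lexical):
    """Stand-in for the xsd:integer map."""
    try:
        return {int(lexical)}
    except ValueError:
        return set()

DT = {"xsd:integer": integer_map}   # supported datatypes
IR = ["ex:Jenny"]                   # finite stand-in for the resources
VL = [30, 3, "30"]                  # finite stand-in for the values

def holds(triple, A):
    """Check one triple under the assignment A of anonymous nodes."""
    s, p, o = triple
    ds, do = A.get(s, s), A.get(o, o)
    if p == "rdf:value":
        return ds == do                      # identity on the Universe
    if p == "rdf:type" and o in DT:
        return isinstance(ds, tuple) and ds[1] in DT[o](ds[0])
    return True                              # other properties: unconstrained here

def satisfying_assignments(triples, unicode_nodes, blank_nodes):
    """All assignments of anon(E) (Unicode nodes plus blank nodes) satisfying the triples.

    unicode_nodes maps a node identifier to its label, so that each
    occurrence of a string can have its own node.
    """
    pairs = [(u, v) for u in sorted(set(unicode_nodes.values())) for v in VL]
    universe = IR + pairs
    names = list(unicode_nodes) + list(blank_nodes)
    choices = [[(unicode_nodes[n], v) for v in VL] for n in unicode_nodes]
    choices += [universe for _ in blank_nodes]
    found = []
    for combo in product(*choices):
        A = dict(zip(names, combo))
        if all(holds(t, A) for t in triples):
            found.append(A)
    return found

triples = [
    ("ex:Jenny", "ex:age", "_:b"),
    ("_:b", "rdf:value", "lit1"),
    ("_:b", "rdf:type", "xsd:integer"),
]
ok = satisfying_assignments(triples, {"lit1": "30"}, ["_:b"])
bad = satisfying_assignments(triples, {"lit1": "three"}, ["_:b"])
```

The graph is satisfied when at least one assignment is found; an empty list corresponds to a graph that is ill-formed with respect to the supported datatypes.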
Appendix B: Union Datatypes
XML Schema views the map associated with a union datatype as a function,
even when the various types in the union have overlapping domain. The ambiguity
is resolved by considering the order of the union.
The TDL datatyping for RDF does not respect this.
In TDL the map associated with an XML Schema Union datatype is the set
theoretic union of the maps of each of the subtypes of the union. Thus
strings in the overlap of the domains are generally ambiguous.
Thus, if all we know about a string is that it lies in a union type,
then finitely many interpretations of that string may be valid, rather
like the unconstrained ambiguity for untyped strings.
As an example, if we say that an age property has range being the union
of decimal integers or binary integers we cannot tell whether someone who
is "100" is very old or a pre-schooler. (But they are one or the other).
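The age example can be sketched with two toy member types whose maps are combined by set-theoretic union. The member maps and the helper `union_map` are assumptions made for illustration and do not reproduce XML Schema's ordered union semantics.

```python
def decimal_map(lexical):
    """Stand-in decimal integer datatype."""
    if lexical.isascii() and lexical.isdigit():
        return {int(lexical)}
    return set()

def binary_map(lexical):
    """Stand-in binary integer datatype."""
    if lexical and set(lexical) <= {"0", "1"}:
        return {int(lexical, 2)}
    return set()

def union_map(*member_maps):
    """Map of a union datatype as the set-theoretic union of its members' maps."""
    def mapping(lexical):
        values = set()
        for m in member_maps:
            values |= m(lexical)
        return values
    return mapping

age_type = union_map(decimal_map, binary_map)

ambiguous = age_type("100")    # decimal 100 or binary 100 (four)
unambiguous = age_type("29")   # only the decimal member accepts "29"
```

Because the result is a set rather than a single value, the union map is a relation and not a function, which is why the mapping component of a datatype is defined as a relationship in Appendix A.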
If the document author wishes to avoid this ambiguity then the subtype
should be specified, typically using the rdf:type+rdf:value local idiom.
This is preferred to the use of xsi:type recommended by XML Schema.
The motivation for this small departure from the XML Schema Datatype
recommendation is that RDF Model Theory is monotone and hence does not
accommodate the default mechanism inherent in XML Schema Union datatypes.
There is no requirement to disambiguate the union, and the value
can be left as ambiguous.
References
- [SWOL] Peter Patel-Schneider, The Semantic Web Ontology Language (SWOL), http://lists.w3.org/Archives/Public/www-webont-wg/2001Dec/att-0156/01-swol2.text
- [PL] Dan Connolly, PL: how a PERL programmer might do datatypes in RDF, http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2001Dec/0003.html
- [RDF Core WG Charter] W3C RDF Core Working Group Charter, Mar 2001, http://www.w3.org/2001/sw/RDFCoreWGCharter
- [RDF Desiderada] Graham Klyne, RDF datatyping desiderada, Jan 2002, http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2002Jan/0137.html
- [RDF MT] W3C RDF Model Theory Working Draft, Jan 2002, http://lists.w3.org/Archives/Public/www-archive/2002Jan/att-0007/01-RDF_Model_Theory.htm
- [RDF DT] W3C RDF Datatyping Working Draft, Sep 2001, http://www-nrc.nokia.com/sw/RDF_DT_Foundation.html
- [RDF Schema] W3C RDF Schema Candidate Recommendation, Mar 2000, http://www.w3.org/TR/2000/CR-rdf-schema-20000327/
- [XSD] World Wide Web Consortium, XML Schema Part 2: Datatypes, http://www.w3.org/TR/xmlschema-2/
Last Modified:
$Date: 2002/01/25 11:46:28 $