RDF Datatyping
Abstract
This document summarizes the proposals for RDF datatyping that are
currently being considered by the RDF Core Working Group (hereafter
referred to as the WG).
Status of this Document
This document has no normative status; it merely provides a reference
for an ongoing discussion within the WG.
Contributors
This document includes contributions of almost all members of the
WG, in particular those provided by
- Jeremy Carroll
- Dan Connolly
- Pat Hayes
- Graham Klyne
- Frank Manola
- Sergey Melnik (volunteered editor of this document)
- Patrick Stickler
Many other WG members not listed above have helped to shape this
document.
Table of Contents
(@@@ to be completed when the content and section numbers stabilize)
Scope
The RDF Core Working Group is not chartered to develop a separate data
typing language that duplicates facilities provided by XML Schema data
types (see RDF Core WG Charter).
Desiderata for RDF Datatyping
(@@@ in no particular order)
- Backward compatibility
- with existing RDF data
- with existing RDF code
- with existing RDF-based specifications like DAML+OIL or CC/PP.
- Ability to use XML Schema datatypes (@@@ what about non-atomic and user-defined types?)
- Ability to use non-XML-Schema datatypes (custom or user-defined datatypes, or those from major components external to RDF, like SQL or UML datatypes).
- Ability to define datatypes using schema languages rather than built-ins.
- Ability to represent type information without an associated schema.
- Ability to represent type information in an associated schema.
Deliverables of RDF Datatyping
- Develop a framework for using datatypes in RDF.
- Provide guidelines for using XML Schema Datatypes in RDF.
Type System
The conceptual framework for datatyping presented in this document is
based on the type system defined in the "XML Schema Part 2: Datatypes"
[XSD]. This section explains how the relevant
terms and concepts defined in [XSD] are expressed
using the model-theoretic semantics for RDF defined in the "RDF Model
Theory Working Draft" [RDF MT].
Datatype mapping
[XSD] defines a datatype as a 3-tuple,
consisting of a) a set of distinct values, called its value
space, b) a set of lexical representations, called its lexical
space, and c) a set of facets that characterize properties
of the value space, individual values or lexical terms. [XSD] implicitly assumes a fourth component, which we
call datatype mapping, to be part of the datatype.
[Definition:] A
datatype mapping is a set of pairs whose first element belongs
to the value space of the datatype, and the second element belongs to the lexical
space of the datatype.
A datatype mapping satisfies the following properties:
- Each element of the lexical space maps to exactly one element of the value space.
- Each element of the value space has at least one lexical representation.
(@@@ is the second condition necessary? Should we distinguish between
partial and complete datatype mappings?)
Datatype mapping for a datatype "boolean". Each element of the value
space has two lexical representations.
Value space: {T, F}
Lexical space: {"0", "1", "true", "false"}
Datatype mapping: {<T, "true">, <T, "1">, <F, "0">, <F, "false">}
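The two properties of a datatype mapping can be checked mechanically for a finite datatype such as the "boolean" example above. The following is a minimal sketch only: the names T, F and is_datatype_mapping are illustrative stand-ins and are not defined by [XSD] or by this document.

```python
# Toy model of the "boolean" datatype from the example above.
# T and F are stand-ins for the two abstract truth values.
T, F = "T", "F"

value_space = {T, F}
lexical_space = {"0", "1", "true", "false"}
# Pairs are <value, lexical form>, as in the definition.
datatype_mapping = {(T, "true"), (T, "1"), (F, "0"), (F, "false")}


def is_datatype_mapping(mapping, values, lexicals):
    """Check the two properties listed for a datatype mapping."""
    # Every pair must draw its elements from the right spaces.
    if not all(v in values and s in lexicals for v, s in mapping):
        return False
    # Property 1: each lexical form maps to exactly one value.
    for s in lexicals:
        if len({v for v, t in mapping if t == s}) != 1:
            return False
    # Property 2: each value has at least one lexical representation.
    return all(any(v == w for w, _ in mapping) for v in values)


print(is_datatype_mapping(datatype_mapping, value_space, lexical_space))
```

Adding the pair <F, "1"> to the mapping would violate the first property, because the lexical form "1" would then map to two values.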
Canonical datatype mapping
As specified in [XSD], a canonical lexical
representation is a set of elements from the lexical space of a datatype
such that there is a one-to-one mapping
between elements in the canonical lexical representation and elements in
the value space. This mapping is referred to as canonical datatype
mapping.
[Definition:]
A canonical datatype mapping is a subset of a
datatype mapping that establishes a one-to-one correspondence between
elements in the canonical lexical representation and elements in the
value space.
A canonical datatype mapping for the datatype "boolean" of the previous example.
Canonical datatype mapping: {<T, "true">, <F, "false">}
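For a finite datatype, the definition of a canonical datatype mapping can be tested directly: the candidate must be a subset of the datatype mapping, must cover the whole value space, and must pair each value with exactly one lexical form. The sketch below assumes the toy "boolean" datatype; the function name is_canonical is illustrative only.

```python
T, F = "T", "F"

value_space = {T, F}
datatype_mapping = {(T, "true"), (T, "1"), (F, "0"), (F, "false")}
canonical_mapping = {(T, "true"), (F, "false")}


def is_canonical(cmap, mapping, values):
    """Check that cmap is a one-to-one sub-mapping covering all values."""
    if not cmap <= mapping:
        return False
    cvalues = [v for v, _ in cmap]
    clexicals = [s for _, s in cmap]
    covers_values = set(cvalues) == set(values)
    # One-to-one: no value and no lexical form occurs in two pairs.
    one_to_one = (len(cvalues) == len(set(cvalues))
                  and len(clexicals) == len(set(clexicals)))
    return covers_values and one_to_one


print(is_canonical(canonical_mapping, datatype_mapping, value_space))
```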
Datatyping schemes
[Definition:] A
datatyping scheme is a convention for representing and using datatypes in RDF.
A datatyping scheme describes how
- value spaces of datatypes,
- lexical spaces of datatypes,
- datatype mappings,
- canonical datatype mappings,
- datatypes themselves,
- individual elements of value spaces, and
- individual elements of lexical spaces
are represented in RDF graphs and interpreted using model-theoretic semantics.
[RDF MT] explains the fundamental model-theoretic concepts like
interpretation, universe, extension etc. used for interpreting the semantics of RDF graphs.
This document assumes familiarity with these basic concepts.
Facets
The specification and interpretation of datatype facets are out of scope of
this document.
Datatyping scheme "S"
This section explains how datatyping is introduced in the so-called "S" scheme,
one of the candidate proposals that have been discussed by the WG.
In accordance with [RDF MT], the primary RDF syntax used in the "S" scheme
is based on tidy graphs (a tidy graph is one in which no two nodes carry the same label).
The interpretation of each literal is assumed fixed and determined by
its content. (For example, the interpretation of literals could be defined as an identity
mapping.)
Datatypes in a model-theoretic interpretation
A datatype mapping is considered to be a binary
relational extension that exists in a model-theoretic interpretation. Both
the value space of the datatype and its lexical space are subsets of
the universe used in the interpretation.
Datatype mapping of datatype boolean in a model-theoretic
interpretation. The colored dots represent entities in the universe. The solid lines
represent pairs in the datatype mapping.
(@@@ this document does not describe what the extensions of datatypes themselves are in
a model-theoretic interpretation.)
Representation of datatype mappings
A datatype mapping can be "named" using
RDF properties. In this document, such properties are referred to as
datatype properties. To associate a datatype property with a certain datatype,
the extension of the datatype property is defined to be the datatype mapping that belongs
to the datatype.
Extension of property xsd:boolean.map is defined explicitly as a datatype mapping.
T and F are distinct elements in the universe, I is the interpretation function.
IEXT(I(xsd:boolean.map)) := {<T, I("true")>, <T, I("1")>, <F, I("0")>, <F, I("false")>}
Representation of value spaces and lexical spaces
Since value spaces and lexical spaces are subsets of the elements in the universe, they
can be viewed as class extensions and can be referred to in RDF graphs by means of
resources that identify classes.
Class extensions of resources xsd:boolean.val and xsd:boolean.lex are defined explicitly.
CEXT(I(xsd:boolean.lex)) := {I("true"), I("1"), I("0"), I("false")}
CEXT(I(xsd:boolean.val)) := {T, F}
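The explicit definitions above can be mirrored in a few lines of code. The sketch below is an illustration only: it assumes the identity interpretation of literals mentioned earlier as one possibility, and it models IEXT and CEXT as plain dictionaries, which is a simplification of the model theory in [RDF MT].

```python
T, F = "T", "F"


def I(literal):
    # Assumption: literals are interpreted by the identity mapping.
    return literal


# Property extension: a set of <subject, object> pairs.
IEXT = {
    "xsd:boolean.map": {
        (T, I("true")), (T, I("1")), (F, I("0")), (F, I("false")),
    },
}

# Class extensions: sets of elements of the universe.
CEXT = {
    "xsd:boolean.lex": {I("true"), I("1"), I("0"), I("false")},
    "xsd:boolean.val": {T, F},
}

# For this datatype the class extensions coincide with the two
# projections of the property extension.
subjects = {s for s, _ in IEXT["xsd:boolean.map"]}
objects = {o for _, o in IEXT["xsd:boolean.map"]}
print(subjects == CEXT["xsd:boolean.val"])
print(objects == CEXT["xsd:boolean.lex"])
```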
Representation of elements of value spaces and lexical spaces
In RDF graphs, literals can be used to refer to the elements of the
lexical spaces of datatypes.
Using a combination of datatype properties and literals, it is possible to
refer to data values of datatypes.
Datatype property xsd:boolean.map is used to constrain the interpretation of the
blank nodes shown at the top of the figure.
At the bottom of the figure, a possible valid interpretation is shown.
Since the semantics of literals and that of property xsd:boolean.map are fixed,
both blank nodes denote one and the same
element of the universe (element T) in each valid interpretation of the graph.
The literals true and 1 denote the corresponding elements of the lexical space of
the datatype boolean.
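The way a datatype property constrains the denotation of a blank node can be sketched as a lookup in the property extension. The graph, the blank node labels _:a and _:b, and the helper candidates below are assumptions made for illustration; the figure itself is not reproduced here.

```python
T, F = "T", "F"

# Fixed extension of the datatype property (identity interpretation
# of literals is assumed).
IEXT = {"xsd:boolean.map": {(T, "true"), (T, "1"), (F, "0"), (F, "false")}}

# Two blank nodes, each linked to a literal by the datatype property.
graph = [
    ("_:a", "xsd:boolean.map", "true"),
    ("_:b", "xsd:boolean.map", "1"),
]


def candidates(node, graph, iext):
    """Return the elements the blank node may denote, given its arcs."""
    result = None
    for s, p, o in graph:
        if s != node:
            continue
        allowed = {v for v, lex in iext[p] if lex == o}
        result = allowed if result is None else result & allowed
    return result


print(candidates("_:a", graph, IEXT))
print(candidates("_:b", graph, IEXT))
```

Both calls yield the single candidate T. A blank node that carried arcs to both "true" and "0" would have no candidate at all, i.e. the graph would have no valid interpretation.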
The elements of value spaces can be "named" explicitly using
URI references. For example, a resource with URI reference
like xsd:boolean.value.T could be used to denote the
element T of the value space of xsd:boolean.
This document does not suggest any such explicit identifiers for
the elements of value spaces.
XML Schema datatypes
This section explains how built-in atomic XML Schema datatypes can be used
in the datatyping scheme "S". Non-atomic types (like IDREFS) are out of scope of
this document.
[XSD] specifies a unique
URI reference for each built-in datatype that it defines.
For example, the URI
- http://www.w3.org/2001/XMLSchema#int
is used to address the datatype int. [XSD] suggests that the
components of the datatypes (specifically, datatype facets) be addressed using
URIs constructed by appending "." and the name of the component to the URI
of the datatype.
This document proposes that the identifiers for the lexical spaces, value spaces, datatype mappings
and canonical datatype mappings be constructed following the above principle.
For example:
- http://www.w3.org/2001/XMLSchema#int.lex can be used to denote
the lexical space of the datatype int.
- http://www.w3.org/2001/XMLSchema#int.val can be used to denote
the value space of the datatype int.
- http://www.w3.org/2001/XMLSchema#int.map can be used to denote
the datatype mapping of the datatype int.
- http://www.w3.org/2001/XMLSchema#int.cmap can be used to denote
the canonical datatype mapping of the datatype int.
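The naming principle above amounts to simple string concatenation. A minimal sketch follows; the function name component_uris and the constant names are illustrative and not part of any specification.

```python
XSD_NS = "http://www.w3.org/2001/XMLSchema#"

# Suffixes proposed in this document for the datatype components.
COMPONENT_SUFFIXES = ("lex", "val", "map", "cmap")


def component_uris(datatype_name):
    """Build the component URIs by appending "." and the component name."""
    base = XSD_NS + datatype_name
    return {suffix: f"{base}.{suffix}" for suffix in COMPONENT_SUFFIXES}


for suffix, uri in component_uris("int").items():
    print(suffix, uri)
```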
This document proposes a fixed interpretation of resources with URI references
like the ones listed above that correspond to the following
XSD datatypes:
- Built-in primitive types
- duration, dateTime, time, date, gYearMonth, gYear, gMonthDay, gDay, gMonth,
boolean, base64Binary, hexBinary, float, double, anyURI, QName, NOTATION,
string, decimal
- Built-in derived types
- integer, nonPositiveInteger, nonNegativeInteger, negativeInteger, positiveInteger,
int, long, short, byte, unsignedLong, unsignedInt, unsignedByte,
language, Name, NCName, NMTOKEN, ID, IDREF, ENTITY
(@@@ do we need all those XML-specific types like IDREF?)
In this document, the shortcut "xsd:" is used to abbreviate the
namespace http://www.w3.org/2001/XMLSchema#.
Definition of datatypes
[XSD] discusses several
ways of defining datatypes (e.g., axiomatically, by enumeration, by restriction).
This document only considers axiomatic definitions of datatypes.
User-defined datatypes and dedicated vocabularies for datatype definition
are out of scope of this document.
As illustrated in [XSD] (Sec. 3),
the datatype mappings of the derived types can be arranged in a hierarchy.
For example, type int is derived (by restriction) from long,
which is derived from integer, which is derived from decimal, which is a primitive type.
[Definition:]
Datatype B is derived by restriction from datatype A,
if and only if the datatype mapping of B
is contained as a subset in the datatype mapping of A.
The built-in derived types of [XSD] satisfy the above definition.
A new derived type can be obtained by restricting the lexical space of a datatype.
In this case, the datatype mapping of the new type is a range-restricted
datatype mapping of the source type.
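The definition of derivation by restriction can be illustrated on a small finite fragment. The sketch below is an assumption-laden toy: the source mapping covers only the integers -3 to 3, and the function names are illustrative. The real mappings of the built-in types are infinite and cannot be enumerated this way.

```python
# Finite fragment of an integer-like datatype mapping, as pairs
# <value, lexical form>. Non-negative values also have a "+" form.
source_mapping = (
    {(n, str(n)) for n in range(-3, 4)}
    | {(n, f"+{n}") for n in range(0, 4)}
)


def restrict_lexical_space(mapping, allowed_lexicals):
    """Keep only the pairs whose lexical form is still allowed."""
    return {(v, s) for v, s in mapping if s in allowed_lexicals}


def is_derived_by_restriction(b_mapping, a_mapping):
    """B is derived from A iff B's mapping is a subset of A's mapping."""
    return b_mapping <= a_mapping


# Restrict the lexical space to forms without a leading minus sign.
non_negative_forms = {s for _, s in source_mapping if not s.startswith("-")}
derived_mapping = restrict_lexical_space(source_mapping, non_negative_forms)

print(is_derived_by_restriction(derived_mapping, source_mapping))
print(sorted({v for v, _ in derived_mapping}))
```

Restricting the lexical space here also shrinks the value space, since the negative values lose all of their lexical representations.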
Relating datatypes and their components
Schema languages like [RDF Schema] can be used
to relate different datatype components, e.g. datatype mappings and lexical spaces,
to each other. Explicit relationships between
datatype components help reduce the amount of built-in semantics that
needs to be hard-coded into applications.
Datatype mappings can be related to the corresponding value and lexical
spaces using
[RDF Schema] properties
rdfs:domain and
rdfs:range.
xsd:decimal.map rdfs:domain xsd:decimal.val
xsd:decimal.map rdfs:range xsd:decimal.lex
Notice that according to [RDF MT],
the properties
rdfs:domain and rdfs:range each define only
a subset relationship: the domain (respectively, the range) of a property
extension is contained in the corresponding class extension. In other words,
rdfs:domain and rdfs:range alone
are not sufficient to define precisely the
interpretations of xsd:decimal.lex and xsd:decimal.val given
an axiomatic definition of xsd:decimal.map.
Derived datatypes can be indicated as such explicitly using
the property rdfs:subPropertyOf. The same approach can be used
to specify that canonical datatype mappings are contained
within their respective datatype mappings.
xsd:integer.map rdfs:subPropertyOf xsd:decimal.map
xsd:long.map rdfs:subPropertyOf xsd:integer.map
xsd:int.map rdfs:subPropertyOf xsd:long.map
xsd:int.cmap rdfs:subPropertyOf xsd:int.map
Inclusion hierarchies of value spaces and lexical spaces
can be specified explicitly using
the property rdfs:subClassOf.
xsd:integer.val rdfs:subClassOf xsd:decimal.val
xsd:long.val rdfs:subClassOf xsd:integer.val
xsd:int.val rdfs:subClassOf xsd:long.val
xsd:decimal.lex rdfs:subClassOf rdfs:Literal
xsd:integer.lex rdfs:subClassOf xsd:decimal.lex
xsd:long.lex rdfs:subClassOf xsd:integer.lex
xsd:int.lex rdfs:subClassOf xsd:long.lex
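An application that reads the statements above can recover the full inclusion hierarchy by following rdfs:subPropertyOf and rdfs:subClassOf transitively. The sketch below is one possible way to do so; the triple list is copied from the examples, and the helper name ancestors is an assumption.

```python
triples = [
    ("xsd:integer.map", "rdfs:subPropertyOf", "xsd:decimal.map"),
    ("xsd:long.map", "rdfs:subPropertyOf", "xsd:integer.map"),
    ("xsd:int.map", "rdfs:subPropertyOf", "xsd:long.map"),
    ("xsd:int.cmap", "rdfs:subPropertyOf", "xsd:int.map"),
    ("xsd:decimal.lex", "rdfs:subClassOf", "rdfs:Literal"),
    ("xsd:integer.lex", "rdfs:subClassOf", "xsd:decimal.lex"),
    ("xsd:long.lex", "rdfs:subClassOf", "xsd:integer.lex"),
    ("xsd:int.lex", "rdfs:subClassOf", "xsd:long.lex"),
]


def ancestors(term, predicate, triples):
    """Collect everything reachable from term via the given predicate."""
    found = set()
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if s == current and p == predicate and o not in found:
                found.add(o)
                frontier.append(o)
    return found


print(sorted(ancestors("xsd:int.cmap", "rdfs:subPropertyOf", triples)))
print(sorted(ancestors("xsd:int.lex", "rdfs:subClassOf", triples)))
```

With this closure, an application can conclude, for instance, that every pair in xsd:int.cmap is also a pair in xsd:decimal.map without hard-coding that fact.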
Discussion of selected datatypes
xsd:string
The datatype mapping of the datatype string is interpreted as an identity mapping.
The datatype property xsd:string.map can be used to refer to literals as if they
were in the subject position. The blank node in the left graph
denotes the same entity as the literal "P1Y" in each valid interpretation.
(@@@ this is another argument for allowing literals as subjects...)
xsd:base64Binary and xsd:hexBinary
The value space of xsd:base64Binary is equivalent to
the value space of xsd:hexBinary and is the set of
finite-length sequences of binary octets.
The value space of xsd:string is the set of finite-length sequences of characters.
Consequently, xsd:base64Binary.val is disjoint from xsd:string.val.
Datatype property
xsd:base64Binary.map (or
xsd:hexBinary.map) can be used to refer to arbitrary
binary data in RDF graphs.
Apparently, there is no mechanism in
[XSD] that bridges
the gap between sequences of octets and sequences of characters.
In the figure, property
octetsToChars fulfils this role.
If
xsd:string.map is used instead of
octetsToChars, the graph
has no valid interpretation.
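The distinction between octets and characters can be illustrated with a short sketch. It shows two lexical forms, one hexBinary and one base64Binary, that denote the same octet sequence, and it models octetsToChars as a character decoding step. The choice of UTF-8 is an assumption made here for illustration; the property in the figure does not name an encoding.

```python
import base64

# Two lexical forms for the same sequence of five octets.
hex_form = "48656C6C6F"
b64_form = "SGVsbG8="

octets_from_hex = bytes.fromhex(hex_form)
octets_from_b64 = base64.b64decode(b64_form)

# Both datatype mappings lead to the same element of the value space.
print(octets_from_hex == octets_from_b64)


def octets_to_chars(octets, encoding="utf-8"):
    """Hypothetical stand-in for octetsToChars: decode octets to characters."""
    return octets.decode(encoding)


# Only after this step does a sequence of characters result.
print(octets_to_chars(octets_from_hex))
```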
Modeling styles
The "S" scheme supports two distinct ways of using typed values in RDF graphs, or
two different "idioms". This document does not prescribe which of the idioms
should be used in RDF applications. The following subsections illustrate and compare
these two idioms.
Idiom A ("advanced")
In Idiom A, the elements of value spaces of datatypes are used for
representing typed data elements.
The rdfs:range of property
exA:birthdate is defined as the value space of the datatype date.
Jenny's birthdate is July 15, 2001.
Idiom B ("backward compatible")
In Idiom B, the elements of lexical spaces of datatypes are used for
representing typed data elements.
The rdfs:range of property
exB:birthdate is defined as the lexical space of the datatype date.
Jenny's birthdate is July 15, 2001.
Frequently, the interpretation of a literal belongs to the lexical spaces of several datatypes.
For example, the interpretation I("10") of literal "10" is both an element of
the lexical space of xsd:string and an element of the lexical space of
xsd:integer.
In Idiom B, schema information (e.g., specified using [RDF Schema])
provides a hint for the validation and usage of the literals.
Use of Idiom B is akin to type handling in programming languages like Perl.
In this perspective, literals correspond to scalars, which
are typecast depending on the input/output type of operations
(see [PL] for a detailed discussion).
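The role of schema information in Idiom B can be sketched as follows. The property names exB:age and exB:name, the schema dictionary, and the simplified pattern for the lexical space of xsd:integer are assumptions made for illustration only.

```python
import re

# Simplified lexical space of xsd:integer: optional sign, then digits.
INTEGER_LEXICAL = re.compile(r"[+-]?[0-9]+")


def in_lexical_space(literal, datatype):
    """Test membership of a literal in a datatype's lexical space."""
    if datatype == "xsd:string":
        return True
    if datatype == "xsd:integer":
        return INTEGER_LEXICAL.fullmatch(literal) is not None
    raise ValueError(f"unsupported datatype: {datatype}")


# Hypothetical schema: the datatype whose lexical space is the range
# of each property.
schema = {"exB:age": "xsd:integer", "exB:name": "xsd:string"}


def interpret(prop, literal):
    """Use the schema as a hint for validating and converting a literal."""
    datatype = schema[prop]
    if not in_lexical_space(literal, datatype):
        raise ValueError(f"{literal!r} is not a valid {datatype}")
    return int(literal) if datatype == "xsd:integer" else literal


print(in_lexical_space("10", "xsd:string"), in_lexical_space("10", "xsd:integer"))
print(repr(interpret("exB:age", "10")), repr(interpret("exB:name", "10")))
```

The same literal "10" is treated as a number or as a string depending solely on the property it is used with, much like a scalar that is typecast by the operation applied to it.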
Idiom A versus Idiom B
Many existing RDF applications deploy Idiom B. The major advantages of Idiom B
are backward compatibility and compactness. The fact that Idiom B utilizes
the elements of lexical spaces rather than the elements of value spaces is
unimportant for most applications.
Use of Idiom A is advantageous for evolving applications, especially those
that need to interoperate with other applications. Idiom A supports
multiple lexical representations for a given data value
in an RDF graph. This feature facilitates migration and parallel use
of alternative lexical representations (e.g., encoding "2001-07-15" can supersede
"July 15, 2001" without breaking the existing applications).
Another feature of Idiom A is enforcement of local typing, i.e.,
the typing information always accompanies the data instances in RDF graphs.
This feature makes data instances more robust with respect to incompatible changes
in schemas.
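The migration scenario mentioned above can be sketched with two datatype mappings that lead to the same element of the value space. Both functions below are illustrative stand-ins: iso_date_map approximates the mapping of the datatype date without time zones, and long_date_map is a hypothetical mapping for forms like "July 15, 2001" that is not defined by [XSD].

```python
from datetime import date

MONTHS = {
    "January": 1, "February": 2, "March": 3, "April": 4,
    "May": 5, "June": 6, "July": 7, "August": 8,
    "September": 9, "October": 10, "November": 11, "December": 12,
}


def iso_date_map(lexical):
    """Stand-in for a date mapping over forms like "2001-07-15"."""
    return date.fromisoformat(lexical)


def long_date_map(lexical):
    """Hypothetical mapping over forms like "July 15, 2001"."""
    month_name, rest = lexical.split(" ", 1)
    day_text, year_text = rest.split(", ")
    return date(int(year_text), MONTHS[month_name], int(day_text))


# In Idiom A, both lexical forms can be attached to the same value node.
new_style = iso_date_map("2001-07-15")
old_style = long_date_map("July 15, 2001")
print(new_style == old_style)
```

An application that only understands one of the two mappings can still process the graph, because the typed value itself is unchanged.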
Open issues
- How does xsd:int relate to xsd:int.val, xsd:int.map etc.?
Is xsd:int--rdfs:subClassOf-->xsd:long meaningful? What does it mean?
- Need to explain how union types fit into the "S" scheme.
@@@ Unit System (likely out of scope)
Measures like mass, duration, or monetary value are used in a variety of applications.
Measures are quite similar to datatypes. In fact, [XSD]
defines durations as datatypes, i.e., a lexical token like "P1Y" is mapped
to a duration of one year using a datatype mapping.
Many measures like mass, volume or monetary values are expressed
using quantities of units like kilograms, gallons, or US dollars.
This section defines a unit system and illustrates its use in the "S" scheme.
Just like lexical tokens
are mapped to typed values using datatype mappings, numbers are mapped
to values of measure using unit mappings.
[Definition:]
A unit type is a 3-tuple, consisting of a) a set of distinct values,
called its measure space, b) a set of numbers, called its numeric space, and
c) a one-to-one mapping between the numeric space and the measure space called its
unit mapping.
(@@@ what organizations standardize measures? NIST, ISO, DIN? Should we define
vocabulary for the Metric system? This could be a proper contribution of this document ;-)
The interpretation shown in the figure describes
two one-year-old toddlers, Robby and Jenny. Robby weighs 14 kg,
whereas Jenny weighs only 10 kg. The interpretation contains 13
elements in the domain of discourse (2 toddlers, 2 masses, 1 duration,
3 real numbers, and 5 strings), which are labeled using identifiers in
bold. The wide arrows represent extensions of relationships.
The interpretation contains three unit mappings: inKg, inYears, and inMonth, and three
datatype mappings: xsd:duration.map, xsd:decimal.map, and inOctal.
(@@@ what needs to be explained? What Ntriples, RDF/XML etc. examples need to be provided?
Should the different idioms be illustrated using the above example?)
Graphs
N-Triples
_:Jenny age _:1
_:1 xsd:duration.map "P1Y"
_:Jenny weight _:2
_:2 inKg _:3
_:3 inOctal "14"
_:Robby age _:4
_:4 inYears _:5
_:5 xsd:decimal.map "1"
_:Robby weight _:6
_:6 inKg _:7
_:7 xsd:decimal.map "14"
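The chain from a resource to a measure, to a number, and finally to a literal can be followed mechanically. The sketch below is restricted to Robby's triples and handles only the path that goes through a unit mapping and then xsd:decimal.map. The dictionaries, the unit labels, and the helper measure are assumptions introduced for illustration.

```python
from decimal import Decimal

# Robby's statements from the listing above, as (subject, property, object).
triples = [
    ("_:Robby", "age", "_:4"),
    ("_:4", "inYears", "_:5"),
    ("_:5", "xsd:decimal.map", "1"),
    ("_:Robby", "weight", "_:6"),
    ("_:6", "inKg", "_:7"),
    ("_:7", "xsd:decimal.map", "14"),
]

# Datatype mappings turn literals into numbers; unit mappings relate
# numbers to measures. The unit labels are illustrative only.
DATATYPE_MAPPINGS = {"xsd:decimal.map": Decimal}
UNIT_MAPPINGS = {"inKg": "kg", "inYears": "years"}


def outgoing(subject, triples):
    """Return the (property, object) pairs leaving a node."""
    return [(p, o) for s, p, o in triples if s == subject]


def measure(subject, prop, triples):
    """Follow subject -> measure -> number -> literal; return (number, unit)."""
    measure_node = dict(outgoing(subject, triples))[prop]
    for unit_prop, number_node in outgoing(measure_node, triples):
        if unit_prop not in UNIT_MAPPINGS:
            continue
        for dt_prop, literal in outgoing(number_node, triples):
            if dt_prop in DATATYPE_MAPPINGS:
                number = DATATYPE_MAPPINGS[dt_prop](literal)
                return number, UNIT_MAPPINGS[unit_prop]
    return None


print(measure("_:Robby", "weight", triples))
print(measure("_:Robby", "age", triples))
```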
RDF/XML
(@@@ is the encoding below correct, Dave?)
<rdf:Description>
  <age xsd:duration.map="P1Y"/>
  <weight rdf:parseType="Resource">
    <inKg inOctal="14"/>
  </weight>
</rdf:Description>
<rdf:Description>
  <age rdf:parseType="Resource">
    <inYears xsd:decimal.map="1"/>
  </age>
  <weight rdf:parseType="Resource">
    <inKg xsd:decimal.map="14"/>
  </weight>
</rdf:Description>
Datatyping scheme "P/P++"
(@@@ incomplete, removed for now)
Datatyping scheme "U"
(@@@ incomplete, removed for now)
References
Last modified: Mon Dec 10 12:35:33 PST 2001