ACTION-33: Datatype-related issues

Hi,

This is a brief note intended to get people thinking about how we should handle data typing for CSV files.

# Background

How are datatypes handled elsewhere?

## XML Schema Datatypes

The datatypes defined within XML Schema [1] are now fairly common across W3C specs, including use in XPath and RDF.

They make a distinction between the ‘lexical space’ for a datatype (the characters in a file that represent the value, eg the string ‘2014-10-18’) and the ‘value space’ for a datatype (the semantic value, eg the date 18th October 2014).
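As a rough illustration of that distinction, here is a minimal Python sketch. Using Python's own `date` and `Decimal` types to stand in for the XML Schema value spaces is an assumption made purely for illustration, and the variable names are hypothetical.

```python
from datetime import date
from decimal import Decimal

# Lexical space: the characters as they appear in the file.
lexical = "2014-10-18"

# Value space: the date those characters denote.
value = date.fromisoformat(lexical)

# Several different lexical forms can denote one and the same value.
same_decimal = Decimal("1.0") == Decimal("1.00") == Decimal("1")
```

The point is only that comparison and validation happen on values, while the file itself only ever contains lexical forms.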

They include a large range of standard names for datatypes, some of which are regarded as ‘primitive’ and others ‘derived’. Some of the datatypes are really only relevant in an XML world (eg NOTATION).

They also define ‘facets’ of datatypes (eg min/max length) which are used to create derived datatypes from either primitive datatypes or other derived datatypes. New datatypes can be created by applying these facets, by creating a ‘union’ of other datatypes, or by creating a list type (whose items are always separated by whitespace in the lexical representation). XML elements can also have nil values, marked with xsi:nil (attributes can’t).
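The three derivation mechanisms can be sketched in Python as functions that wrap a parser and return a new one. This is only a sketch of the idea, not how any XML Schema processor is implemented; `with_length`, `list_of` and `union_of` are hypothetical names, and treating a datatype as "a function from lexical string to value that raises `ValueError`" is my assumption.

```python
def with_length(parse, min_length=0, max_length=None):
    """Derive a type by applying min/max length facets to the lexical form."""
    def derived(lexical):
        if len(lexical) < min_length:
            raise ValueError(f"{lexical!r} is shorter than {min_length}")
        if max_length is not None and len(lexical) > max_length:
            raise ValueError(f"{lexical!r} is longer than {max_length}")
        return parse(lexical)
    return derived


def list_of(parse):
    """Derive a list type whose items are separated by whitespace."""
    def derived(lexical):
        return [parse(item) for item in lexical.split()]
    return derived


def union_of(*parsers):
    """Derive a union type: the first member type that accepts the value wins."""
    def derived(lexical):
        for parse in parsers:
            try:
                return parse(lexical)
            except ValueError:
                continue
        raise ValueError(f"{lexical!r} matches no member type")
    return derived


# Hypothetical derived types built from the helpers above.
country_code = with_length(str, min_length=2, max_length=2)
integers = list_of(int)
int_or_code = union_of(int, country_code)
```

For example, `integers("1 2  3")` gives a list of three integers, and `int_or_code("GB")` falls through the integer member type to the two-character code.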

## RDF Datatypes

RDF subsets XML Schema datatypes (excluding those which are XML-specific) but also defines a couple of its own [2], namely rdf:HTML and rdf:XMLLiteral. It also defines lists through rdf:List and nil values through rdf:nil.

## JSON Datatypes

JSON itself has strings, numbers, booleans, arrays, objects and null values. JSON-schema defines additional formats that strings might be tested against [3], namely date-time, email, hostname, ipv4, ipv6 and uri.
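A small Python sketch shows why those string formats are needed: after parsing, a date-time is still just a string, and interpreting it is a separate step. The sample document and its keys are made up for illustration, and the `strptime` pattern is only one way to read this particular timestamp.

```python
import json
from datetime import datetime

# A made-up document exercising each of JSON's built-in kinds of value.
doc = json.loads(
    '{"name": "x", "count": 3, "ratio": 1.5, "ok": true, '
    '"tags": ["a"], "missing": null, "when": "2014-10-18T19:59:15Z"}'
)

# Record which Python type each JSON value was parsed into.
kinds = {key: type(val).__name__ for key, val in doc.items()}

# "when" came back as a plain string; turning it into a date-time
# needs knowledge of the format that JSON itself does not carry.
when = datetime.strptime(doc["when"], "%Y-%m-%dT%H:%M:%SZ")
```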

JSON Table Schema [4] additionally defines date, time, datetime, binary, geopoint and geojson. It separately defines formats, which serve both as lexical-to-value mappings for datatypes like dates and as restrictions on the syntax of types (eg to support markdown). It also supports constraints on values such as their min/max length.

## Current CSV Metadata Spec

The current CSV metadata spec merges the relevant XML Schema types as used by RDF with some of the terms used within JSON Table Schema to define the semantic type for the values in a column [5]. It also has a method of further constraining those types using the XML Schema facets (but not for defining or reusing types defined elsewhere). The ‘format’ provides both further syntactic validation and a way of defining a mapping from lexical to semantic values.

# Particular Issues

1. How should we define mappings between lexical and semantic values? CSV files are much more likely to be read by humans directly than JSON, XML or RDF, so it makes sense to, for example, specify that a CSV file contains dates in the form ‘DD/MM/YYYY’ rather than force all columns containing dates to use the syntax ‘YYYY-MM-DD’. How should those formats be specified?

2. How should we specify the inclusion of HTML/XML/JSON/markdown etc within CSV values? Should we define these datatypes and if so how? Should it be extensible to other formats based eg on media type?

3. Even with the non-XML subset of XML Schema datatypes, the list of datatypes is quite long and some are very obscure (eg xs:nonNegativeInteger). What should the subset be?

4. It would be nice if CSV schemas could reuse datatypes defined in other CSV schemas (this is particularly useful for codes). How should we do that?

5. We’ve identified a requirement for list types. Should we do what XML Schema does and only allow space-separated values? Or enable any separator to be used?

6. There’s certainly the need for union types (eg either the value ‘XX’ or ‘N/A’ or a number). How should those be specified?

7. There’s also a need for enumerated values (eg big/medium/small). Is the best way of doing that through a regular expression, a list of values, or using a reference to a list in another CSV file?
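
To make issues 1, 5 and 7 a bit more concrete, here is a hedged Python sketch of what a processor might have to do under each option. None of this is proposed syntax for the metadata spec: the function names, the `strptime` pattern standing in for ‘DD/MM/YYYY’, and the inline code list are all assumptions for illustration.

```python
import csv
import io
import re
from datetime import datetime


def parse_date(lexical, pattern="%d/%m/%Y"):
    """Issue 1: map a 'DD/MM/YYYY' lexical form onto a date value."""
    return datetime.strptime(lexical, pattern).date()


def parse_list(lexical, parse_item, separator=" "):
    """Issue 5: a list type whose separator is configurable."""
    return [parse_item(item) for item in lexical.split(separator)]


# Issue 7, option (a): an explicit list of allowed values.
SIZES = {"big", "medium", "small"}


def valid_by_list(lexical):
    return lexical in SIZES


# Issue 7, option (b): a regular expression.
SIZE_PATTERN = re.compile(r"big|medium|small")


def valid_by_regex(lexical):
    return SIZE_PATTERN.fullmatch(lexical) is not None


# Issue 7, option (c): values taken from a column of another CSV file
# (held inline here so the sketch is self-contained).
CODES_CSV = "size\nbig\nmedium\nsmall\n"


def load_codes(text, column):
    return {row[column] for row in csv.DictReader(io.StringIO(text))}
```

All three enumeration options accept the same values here; where they differ is in how easily the list can be shared, extended, or documented.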

Others...?

Cheers,

Jeni

[1] http://www.w3.org/TR/xmlschema-2/
[2] http://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/#section-Datatypes
[3] http://json-schema.org/latest/json-schema-validation.html#anchor104
[4] http://dataprotocols.org/json-table-schema/#field-types
[5] http://w3c.github.io/csvw/metadata/#datatypes
--  
Jeni Tennison
http://www.jenitennison.com/

Received on Saturday, 18 October 2014 19:59:15 UTC