Re: ACTION-33: Datatype-related issues

Thanks Jeni,

some remarks below,


On 18 Oct 2014, at 21:58 , Jeni Tennison <jeni@jenitennison.com> wrote:

> Hi,
> 
> This is a brief note intended to get people thinking about how we should handle data typing for CSV files.
> 
> # Background
> 
> How are datatypes handled elsewhere?
> 
> ## XML Schema Datatypes
> 
> The datatypes defined within XML Schema [1] are now fairly common across W3C specs, including use in XPath and RDF.
> 
> They make a distinction between the ‘lexical space’ for a datatype (the characters in a file that represent the value, eg the string ‘2014-10-18’) and the ‘value space’ for a datatype (the semantic value, eg the date 18th October 2014).
> 
> They include a large range of standard names for datatypes, some of which are regarded as ‘primitive’ and others ‘derived’. Some of the datatypes are really only relevant in an XML world (eg NOTATION).
> 
> They also define ‘facets’ of datatypes which are used to create derived datatypes from either primitive datatypes or from other derived datatypes (eg min/max length). New datatypes can be created through applying these facets, by creating a ‘union’ between other datatypes, or by creating a list type (whose items are always separated by whitespace in the lexical representation). XML elements can also have xml:nil values (attributes can’t).
> 
> ## RDF Datatypes
> 
> RDF subsets XML Schema datatypes (excluding those which are XML-specific) but also defines a couple of its own [2], namely rdf:HTML and rdf:XMLLiteral. It also defines lists through rdf:List and nil values through rdf:nil.
> 
> ## JSON Datatypes
> 
> JSON itself has strings, numbers, booleans, arrays, objects and null values. JSON-schema defines additional formats that strings might be tested against [3], namely date-time, email, hostname, ipv4, ipv6 and uri.
> 
> JSON Table Schema [4] defines additionally date, time, datetime,  binary, geopoint, and geojson, and separately defines formats which are both lexical-to-value-mappings for datatypes like dates and restrictions on the syntax of types eg to support markdown. Separately, it supports constraints on values such as their min/max length.
> 
> ## Current CSV Metadata Spec
> 
> The current CSV metadata spec merges the relevant XML Schema types as used by RDF with some of the terms used within JSON Table Schema to define the semantic type for the values in a column [5]. It also has a method of further constraining those types using the XML Schema facets (but not for defining or reusing types defined elsewhere). The ‘format’ both provides for further syntactic validation and a way of defining a mapping from lexical to semantic values.
> 
> # Particular Issues
> 
> 1. How should we define mappings between lexical and semantic values? CSV files are much more likely to be read by humans directly than JSON, XML or RDF, so it makes sense to, for example, specify that a CSV file contains dates in the form ‘DD/MM/YYYY’ rather than force all columns containing dates to use the syntax ‘YYYY-MM-DD’. How should those formats be specified?
> 

We certainly cannot expect users to always use the ISO format. I have tried to find out whether there is a standard way of expressing date formats, but I did not find any; we may have to define our own. For JavaScript, I found the fairly complete 'moment.js' tool, which has a formatting description at:

http://momentjs.com/docs/#/displaying/format/

For python, there is:

https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior

Of course, they are not fully identical (otherwise life would be way too easy :-). I am sure that other languages like Java have similar definitions.
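
To make the lexical-to-value mapping concrete with the Python variant, here is a minimal sketch. The helper name parse_cell and the PATTERNS table are hypothetical (nothing like them is in any spec); the sketch simply assumes that a metadata pattern such as 'DD/MM/YYYY' would be translated into the corresponding strptime directives.

```python
from datetime import date, datetime

# Hypothetical table: a human-readable pattern, as it might appear in the
# metadata, mapped to the equivalent Python strptime directives.
PATTERNS = {
    "DD/MM/YYYY": "%d/%m/%Y",
    "YYYY-MM-DD": "%Y-%m-%d",
}

def parse_cell(text: str, pattern: str) -> date:
    """Map the lexical form of a cell to a date value.

    Raises ValueError if the text does not match the pattern.
    """
    return datetime.strptime(text.strip(), PATTERNS[pattern]).date()

# Two different lexical forms, one and the same value.
assert parse_cell("18/10/2014", "DD/MM/YYYY") == parse_cell("2014-10-18", "YYYY-MM-DD")
```

A moment.js-based processor would need its own table, which is exactly why a common pattern syntax in the metadata may be needed.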


> 2. How should we specify the inclusion of HTML/XML/JSON/markdown etc within CSV values? Should we define these datatypes and if so how? Should it be extensible to other formats based eg on media type?
> 

We had lots of discussions in defining the HTML and XML datatypes properly in the current version of RDF 1.1:

http://www.w3.org/TR/rdf11-concepts/#section-html

the issue is the proper definition of the value space and, mainly, when two data items would be considered equal. But we may face the same issues with JSON or markdown (I have the feeling that there are subtle differences among markdown implementations, for example...). This may be hairy.

That being said, we may not need to define these things in such a precise way, ie, the issue of equality may not be relevant. In which case a media type may be enough... (Although: is there a separate media type for an HTML or XML *fragment*? Ie, not a whole XML or HTML file?)

> 3. Even with the non-XML subset of XML Schema datatypes, the list of datatypes is quite long and some are very obscure (eg xs:nonNegativeInteger). What should the subset be?
> 

Actually, I do not find the datatype itself very obscure (non negative integer is a pretty clear concept) except for its name. But, thanks to the @context mechanism, we can choose any name we want, can't we?


> 4. It would be nice if CSV schemas could reuse datatypes defined in other CSV schemas (this is particularly useful for codes). How should we do that?
> 

I am not sure what you mean. Can the schema part of one metadata file refer to the schema part of another metadata file?

> 5. We’ve identified a requirement for list types. Should we do what XML Schema does and only allow space separated values? Or enable any separator to be used?
> 

Let us not overcomplicate our lives... Let us use the space.
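
As a rough sketch of what the space-only option would mean for a processor (parse_list is a hypothetical helper, and integer items are assumed purely for illustration):

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")

def parse_list(cell: str, parse_item: Callable[[str], T]) -> List[T]:
    # str.split() with no argument splits on runs of whitespace and ignores
    # leading/trailing whitespace, similar to XML Schema list types.
    return [parse_item(item) for item in cell.split()]

print(parse_list("1 2  3", int))  # prints [1, 2, 3]
```

The obvious limitation is that the items themselves cannot contain whitespace.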


> 6. There’s certainly the need for union types (eg either the value ‘XX’ or ‘N/A’ or a number). How should those be specified?
> 
> 7. There’s also a need for enumerated values (eg big/medium/small). Is the best way of doing that through a regular expression, a list of values, or using a reference to a list in another CSV file?
> 

I am not sure about either #6 or #7. But my slightly higher-level fear concerns the CSV->JSON/RDF/XML mapping: because the fundamental model of that mapping is that it directly reuses the metadata, I am a bit afraid that the mapping will become more and more complicated. We should keep that aspect in mind (ie, keeping it simple) as an important factor in any decision we take...

Ivan


> Others...?
> 
> Cheers,
> 
> Jeni
> 
> [1] http://www.w3.org/TR/xmlschema-2/
> [2] http://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/#section-Datatypes
> [3] http://json-schema.org/latest/json-schema-validation.html#anchor104
> [4] http://dataprotocols.org/json-table-schema/#field-types
> [5] http://w3c.github.io/csvw/metadata/#datatypes
> --  
> Jeni Tennison
> http://www.jenitennison.com/
> 


----
Ivan Herman, W3C 
Digital Publishing Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
GPG: 0x343F1A3D
WebID: http://www.ivan-herman.net/foaf#me

Received on Monday, 20 October 2014 09:46:53 UTC