Request for comments: Breaking down the datatype mapping problem (ISSUE-69)

I've tried to break down the datatype mapping question – ISSUE-69 – into a number of sub-problems. These questions affect both DM and R2RML. Eric and I have been discussing this back and forth, but the discussion would benefit from some wider input.

Best,
Richard


== Should mapped literals be in canonical form? ==

So, when a DOUBLE is mapped to a literal, is it ok to produce "100.0"^^xsd:double or does it have to be the canonical "1.0E2"^^xsd:double?

Pro:
* Increases the probability that merging data from different sources Just Works

Con:
* Makes cheap&cheerful implementations harder
* Most implementations will get it wrong anyway
* Canonical forms in XSD have changed in the past, undermining the merging argument
* No other RDF-related spec requires generation of canonical forms, undermining the merging argument
* C14N is the job of the query layer and D-Entailment, and not the job of data producers
* Even if merging works, it only works for equality tests, not for ordering etc.

IMO the implementation cost is significant, and the benefits won't actually materialize, so I'm clearly against it. No other spec does this!
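
To give a feel for the implementation cost, here is a minimal Python sketch of canonicalizing a double. It assumes the XSD 1.0 rules as I understand them (one digit before the decimal point, at least one after, upper-case "E", no leading "+" or zeros in the exponent). The function name and the handling of negative zero are assumptions made for illustration, not something any spec draft prescribes.

```python
import math
from decimal import Decimal


def canonical_xsd_double(value: float) -> str:
    """Return an XSD 1.0 style canonical lexical form for a Python float."""
    if math.isnan(value):
        return "NaN"
    if math.isinf(value):
        return "INF" if value > 0 else "-INF"
    if value == 0:
        # Assumption: the sign of negative zero is preserved.
        return "-0.0E0" if math.copysign(1.0, value) < 0 else "0.0E0"

    # repr() yields the shortest digit string that round-trips the float.
    sign, digits, exponent = Decimal(repr(value)).as_tuple()
    # Exponent once the decimal point sits right after the first digit.
    sci_exponent = len(digits) - 1 + exponent
    text = "".join(str(d) for d in digits).rstrip("0")
    mantissa = text[0] + "." + (text[1:] or "0")
    return ("-" if sign else "") + mantissa + "E" + str(sci_exponent)


print(canonical_xsd_double(100.0))    # 1.0E2
print(canonical_xsd_double(123.456))  # 1.23456E2
```

Even this short version has to deal with NaN, the infinities, signed zero and shortest round-trip digits, which is the kind of detail a cheap&cheerful implementation would skip.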


== Should data be C14N'd before used in IRI generation? ==

This is about the case when, say, a TIMESTAMP is used in a primary key, or in an R2RML rr:template, so its value becomes part of an IRI.

Pro:
* C14N is required to guarantee that the same IRIs are generated in any implementation
* IRI comparison is required to be character-by-character

Con:
* Makes cheap&cheerful implementations harder
* Most implementations will get it wrong anyway
* In R2RML one can format the string using an R2RML view

I find the Pro arguments quite compelling here.
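
The following Python sketch illustrates the problem. Two implementations that render the same TIMESTAMP with different lexical forms end up with different IRIs, while canonicalizing first makes them agree. The template, the `{ts}` placeholder and the function names are hypothetical, and the percent-encoding (everything outside the unreserved characters) is an assumption about what an implementation would do.

```python
from datetime import datetime
from urllib.parse import quote

TEMPLATE = "http://example.com/event/{ts}"  # hypothetical template


def iri_for(lexical: str) -> str:
    # Percent-encode everything except letters, digits and "_.-~".
    return TEMPLATE.replace("{ts}", quote(lexical, safe=""))


def canonical_timestamp(lexical: str) -> str:
    # Assumption: the input has no time zone and no fractional seconds.
    return datetime.fromisoformat(lexical).isoformat()


sql_form = "2011-11-15 20:52:11"   # space separator
xsd_form = "2011-11-15T20:52:11"   # "T" separator

print(iri_for(sql_form))  # http://example.com/event/2011-11-15%2020%3A52%3A11
print(iri_for(xsd_form))  # http://example.com/event/2011-11-15T20%3A52%3A11

# After canonicalization both inputs yield the same IRI.
print(iri_for(canonical_timestamp(sql_form)) == iri_for(canonical_timestamp(xsd_form)))  # True
```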


== How should datatype overrides be handled? ==

This is about the case when the natural datatype mapping is overridden using an explicit rr:datatype property in R2RML.

The spec currently says that the canonical string value from SQL is directly used, without mapping to an XSD form. So we would get TRUE for a boolean, and a value with a space inside for a TIMESTAMP. This is probably the wrong way of doing it, and the spec should be changed.

But OTOH this is probably no big deal because I expect datatype overrides to be rare, and when they are used, it's probably with an R2RML view that customizes the desired lexical form.

I'd still prefer to change this, although it would be a normative, behaviour-modifying change, and we should be careful with those after LC :-(
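
A small Python sketch of the difference, under stated assumptions: the SQL string forms "TRUE" and "2011-11-15 20:52:11" are taken as given, the override datatypes under http://example.com/ are made up, and `xsd_lexical` is a hypothetical helper that covers only these two SQL types.

```python
def typed_literal(lexical: str, datatype: str) -> str:
    """Render a typed literal in a Turtle-like notation (no escaping)."""
    return f'"{lexical}"^^<{datatype}>'


def xsd_lexical(sql_type: str, sql_string: str) -> str:
    """Hypothetical conversion of a SQL string form to an XSD-style form."""
    if sql_type == "BOOLEAN":
        return sql_string.lower()              # TRUE -> true
    if sql_type == "TIMESTAMP":
        return sql_string.replace(" ", "T", 1)  # space -> T
    return sql_string


FLAG = "http://example.com/flag"  # made-up override datatypes
WHEN = "http://example.com/when"

# SQL string used directly, as described above:
print(typed_literal("TRUE", FLAG))
print(typed_literal("2011-11-15 20:52:11", WHEN))

# Converted to the XSD form first:
print(typed_literal(xsd_lexical("BOOLEAN", "TRUE"), FLAG))
print(typed_literal(xsd_lexical("TIMESTAMP", "2011-11-15 20:52:11"), WHEN))
```

The first pair carries "TRUE" and a value with a space; the second pair carries "true" and a value with a "T".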


== Should canonical forms be SQL canonical or XSD canonical? ==

See earlier message for the differences:
http://lists.w3.org/Archives/Public/public-rdb2rdf-wg/2011Nov/0047.html

Pro SQL:
* Likely to be easier (free) to implement
* Can easily be specified precisely
* Doesn't throw away time zones

Pro XSD:
* RDF people generally prefer to see XSD forms
* Other RDF-producing apps, if they canonicalize at all, will canonicalize to XSD

The pro-XSD case is somewhat reasonable when generating literals. The answer really doesn't matter when generating IRIs. IRIs are supposed to be opaque. Sure, some people will violate web architecture and hack the IRI apart to get the dateTime value that was used to generate the IRI. But they can't expect that we canonicalize IRIs for them. So, the precise form of the IRI doesn't matter as long as all R2RML/DM implementations generate the same one.
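
To make the "throws away time zones" point concrete, here is a hedged Python sketch. It assumes that XSD canonicalization of a zoned dateTime means normalizing to UTC with a trailing "Z", and that the input carries an explicit offset; the function name is made up for illustration.

```python
from datetime import datetime, timezone


def xsd_canonical_datetime(lexical: str) -> str:
    """Normalize a timestamp with an explicit offset to UTC ("Z")."""
    utc = datetime.fromisoformat(lexical).astimezone(timezone.utc)
    return utc.isoformat().replace("+00:00", "Z")


# Two different local times denoting the same instant.
print(xsd_canonical_datetime("2011-11-15T21:52:11+01:00"))  # 2011-11-15T20:52:11Z
print(xsd_canonical_datetime("2011-11-15T15:52:11-05:00"))  # 2011-11-15T20:52:11Z
```

Both inputs collapse to the same string, which is good for comparison but loses the original offset.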


== Should unknown vendor-specific types be mapped to plain literals? ==

Of course, implementations would be free and encouraged to map known vendor-specific types to appropriate XSD types.

The alternative is to just leave them undefined.

Pro:
* Users are well-served by getting the value as a plain literal; it allows them to work with the data in queries
* Given the many many vendor-specific types, all implementations are likely to define fallback behaviour for unknown types. So why not normatively specify the fallback behaviour?

Con:
* If implementation A has special handling for some vendor type X, and implementation B doesn't, then both will produce different literals for the type

I don't find the con argument compelling – if we leave it undefined, then surely they will produce different literals too.
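
A rough Python sketch of the fallback idea and of the Con scenario. The mapping table is deliberately tiny, MONEY stands in for an arbitrary vendor type, and its mapping to xsd:decimal is purely hypothetical. Literals are rendered without any escaping.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"

# Deliberately tiny table of standard SQL types (illustration only).
KNOWN_TYPES = {
    "INTEGER": XSD + "integer",
    "BOOLEAN": XSD + "boolean",
    "DOUBLE PRECISION": XSD + "double",
}


def to_literal(sql_type, string_value, vendor_types=None):
    """Typed literal for known types, plain literal as the fallback."""
    datatype = KNOWN_TYPES.get(sql_type) or (vendor_types or {}).get(sql_type)
    if datatype is None:
        return f'"{string_value}"'  # unknown type: plain literal
    return f'"{string_value}"^^<{datatype}>'


# Implementation A knows the vendor type, implementation B does not.
impl_a = to_literal("MONEY", "12.50", vendor_types={"MONEY": XSD + "decimal"})
impl_b = to_literal("MONEY", "12.50")
print(impl_a)  # "12.50"^^<http://www.w3.org/2001/XMLSchema#decimal>
print(impl_b)  # "12.50"
```

The two outputs differ, as the Con argument says, but implementation B's behaviour is at least predictable.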


== Should “recipes” for converting the SQL forms to XSD be presented? ==

Stuff like: “For DATETIME, do CAST(xxx AS VARCHAR) and then replace the 11th character with a ‘T’.” R2RML currently does it. The DM did a bit in LC, but the latest ED removes all of that.

Pro:
* If we don't do it, average developers are more likely to take shortcuts, leading to interop problems
* Working SQL expressions + lots of examples seems to be useful for implementers and likely to improve interop
* Provides a clear and precise specification of behaviour

Con:
* Takes up space
* We might get it wrong
* They won't work in every SQL engine, or might produce different results in some

I come out clearly on the pro side here. Expecting implementers to do the right thing by just saying, “map any DATETIME value to the equivalent canonical xsd:dateTime value” is … cruel. There are enough developers out there who don't have a clue how time zones work or what scientific notation for floating point numbers is or whether SQL floats include -0 and NaN and ±Inf. Expecting them to read up on all that stuff as part of implementing R2RML is not realistic. The spec should provide plenty of guidance and examples to make them aware of edge cases and so on. If we don't do that, lots of implementers will take shortcuts and not properly implement the literal mapping.
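
As an illustration of how small such a recipe can be, here is the DATETIME example from above as a Python sketch. It assumes the string is what CAST(xxx AS VARCHAR) returned, in the form YYYY-MM-DD hh:mm:ss with a four-digit year; the function name is made up, and the guard shows one edge case a recipe would have to spell out.

```python
def datetime_recipe(sql_string: str) -> str:
    """Replace the 11th character (index 10) of the string form with "T"."""
    if len(sql_string) < 11 or sql_string[10] != " ":
        # Years with more or fewer than four digits, or an engine that
        # formats differently, break the fixed-offset assumption.
        raise ValueError(f"unexpected DATETIME string form: {sql_string!r}")
    return sql_string[:10] + "T" + sql_string[11:]


print(datetime_recipe("2011-11-15 20:52:11"))    # 2011-11-15T20:52:11
print(datetime_recipe("2011-11-15 20:52:11.5"))  # 2011-11-15T20:52:11.5
```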


== Should the “recipes” be normative? ==

Pro:
* Provides a clear and precise specification of behaviour
* If we write them anyway, then why not make them normative

Con:
* They won't work in every SQL engine, or might produce different results in some
* We might get them wrong

Received on Tuesday, 15 November 2011 20:52:11 UTC