Re: Request for comments: Breaking down the datatype mapping problem (ISSUE-69)

Dear Richard, 

I would like to take the liberty of casting a vote on some of the issues you raised in your previous email.

Let me first introduce myself. I am one of the developers of Quest [1], a new triple store/reasoner whose peculiarity is that it handles mappings in a "virtual" way: we avoid materializing the triples while still taking RDFS and OWL 2 QL inferences into account. All of this is done by means of query rewriting.

Right now we focus on use case 1 from http://www.w3.org/TR/rdb2rdf-ucr/ but we also plan to support use cases 2 and 3. Both the DM and R2RML are relevant for us, and we will have an implementation early next year, which is why I want to comment on the handling of datatypes. In our case, pre-processing and manipulation of the data can easily complicate things.

I hope these votes help in reaching a decision.

Best regards,
Mariano

[1] http://obda.inf.unibz.it/protege-plugin/quest/quest.html

> So, when a DOUBLE is mapped to a literal, is it ok to produce "100.0"^^xsd:double or does it have to be the canonical "1.0E+2"?
> 
> Pro:
> * Increases the probability that merging data from different sources Just Works
> 
> Con:
> * Makes cheap&cheerful implementations harder
> * Most implementations will get it wrong anyways
> * Canonical forms in XSD have changed in the past, undermining the merging argument
> * No other RDF-related spec requires generation of canonical forms, undermining the merging argument
> * C14N is the job of the query layer and D-Entailment, and not the job of data producers
> * Even if merging works, then it only works for equality tests, not for ordering etc
> 
> IMO the implementation cost is significant, and the benefits won't actually materialize, so I'm clearly against it. No other spec does this!
> 

Here I would strongly support not requiring canonical forms. In our case this would greatly complicate query rewriting and could kill performance.
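
To illustrate the cost on our side: in a virtual setting the canonicalization cannot be done once at load time; it has to be pushed into every rewritten SQL query. A rough sketch of what that means, PostgreSQL-flavoured and with made-up table/column names (other engines spell the formatting differently, and even this output would still need post-processing to match the XSD canonical lexical form):

  -- Without c14n: the rewritten query can project the DOUBLE column directly.
  SELECT price FROM products;

  -- With c14n: every projection of a DOUBLE has to be wrapped in formatting
  -- logic, repeated in every rewritten query and opaque to the optimizer.
  SELECT to_char(price, '9.9999999999999999EEEE') FROM products;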

> 
> == Should data be C14N'd before used in IRI generation? ==
> 
> This is about the case when, say, a TIMESTAMP is used in a primary key, or in an R2RML rr:template, so its value becomes part of an IRI.
> 
> Pro:
> * C14N is required to guarantee that the same IRIs are generated in any implementation
> * IRI comparison is required to be character-by-character
> 
> Con:
> * Makes cheap&cheerful implementations harder
> * Most implementations will get it wrong anyways
> * In R2RML one can format the string using an R2RML view
> 
> I find the Pro arguments quite compelling here.

Again, here I would vote against C14N, due to performance. Users can employ R2RML views to deal with this kind of issue themselves.
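
For instance, a user who needs stable IRIs from a TIMESTAMP key can already put the formatting into an R2RML view. A rough sketch (PostgreSQL's to_char, invented table and column names; the logical table would then expose created_at_iri to the rr:template):

  -- Sketch of an R2RML view (rr:sqlQuery) that fixes the lexical form of a
  -- TIMESTAMP column before it is used in an IRI template.
  SELECT order_id,
         to_char(created_at, 'YYYY-MM-DD"T"HH24:MI:SS') AS created_at_iri
    FROM orders;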

> == Should unknown vendor-specific types be mapped to plain literals? ==
> 
> Of course, implementations would be free and encouraged to map known vendor-specific types to appropriate XSD types.
> 
> The alternative is to just leave them undefined.
> 
> Pro:
> * Users are well-served by getting the value as a plain literal; it allows them to work with the data in queries
> * Given the many many vendor-specific types, all implementations are likely to define fallback behaviour for unknown types. So why not normatively specify the fallback behaviour?
> 
> Con:
> * If implementation A has special handling for some vendor type X, and implementation B doesn't, then both will produce different literals for the type
> 
> I find the con argument not compelling – if we leave it undefined, then surely they will produce different literals too.

Here my vote is for defining a fallback behavior in the spec.
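
The fallback we would expect is essentially the engine's own string form of the value, emitted as a plain literal. A rough sketch (invented names; for some exotic types the cast would have to be vendor-specific):

  -- Sketch of the fallback for an unknown vendor-specific type: take whatever
  -- string form the engine gives us and emit it as a plain literal.
  SELECT id,
         CAST(geom AS VARCHAR(4000)) AS geom_lexical
    FROM spatial_data;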

> 
> == Should “recipes” for converting the SQL forms to XSD be presented? ==
> 
> Stuff like: “For DATETIME, do CAST(xxx AS VARCHAR) and then replace the 11th character with a ‘T’.” R2RML currently does it. The DM did a bit in LC, but the latest ED removes all of that.
> 
> Pro:
> * If we don't do it, average developers are more likely to take shortcuts, leading to interop problems
> * Working SQL expressions + lots of examples seems to be useful for implementers and likely to improve interop
> * Provides a clear and precise specification of behaviour
> 
> Con:
> * Takes up space
> * We might get it wrong
> * They won't work in every SQL engine, or might produce different results in some
> 

Please do provide recipes.

> I come out clearly on the pro side here. Expecting implementers to do the right thing by just saying, “map any DATETIME value to the equivalent canonical xsd:dateTime value” is … cruel. There are enough developers out there who don't have a clue how time zones work or what scientific notation for floating point numbers is or whether SQL floats include -0 and NaN and ±Inf. Expecting them to read up on all that stuff as part of implementing R2RML is not realistic. The spec should provide plenty of guidance and examples to make them aware of edge cases and so on. If we don't do that, lots of implementers will take shortcuts and not properly implement the literal mapping.

Indeed! I strongly support these arguments. Not having recipes would greatly complicate the implementation effort, and since we are not experts in the topic, we would be prone to errors.
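
Just to make concrete the kind of recipe that would help, the DATETIME one quoted above could be spelled out roughly as follows. This is only a standard-SQL-flavoured sketch with made-up names; as the Con points say, some engines will need a different spelling:

  -- Sketch of the DATETIME recipe: cast to a string and replace the 11th
  -- character (the separator between date and time) with a 'T' to obtain an
  -- xsd:dateTime lexical form.
  SELECT SUBSTRING(CAST(created_at AS VARCHAR(30)) FROM 1 FOR 10)
         || 'T' ||
         SUBSTRING(CAST(created_at AS VARCHAR(30)) FROM 12) AS created_at_lex
    FROM orders;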

> == Should the “recipes” be normative? ==
> 
> Pro:
> * Provides a clear and precise specification of behaviour
> * If we write them anyway, then why not make them normative
> 
> Con:
> * They won't work in every SQL engine, or might produce different results in some
> * We might get them wrong

Could the recipes be made normative for some engines, leaving the other engines to be handled by the implementer?
