Re: Request for comments: Breaking down the datatype mapping problem (ISSUE-69)

Mariano,

On 16 Nov 2011, at 14:15, Mariano Rodriguez wrote:
> Let me first introduce myself. I am one of the developers of Quest [1], a new triple store/reasoner whose peculiarity is that it deals with mappings in a "virtual" way: we avoid actually materializing the triples, while still taking into account RDFS and OWL 2 QL inferences. Everything is done by means of query rewriting.

Very cool! Welcome to the list!

> Right now we focus on use case 1 from http://www.w3.org/TR/rdb2rdf-ucr/ but we also plan to support use cases 2 and 3. Both DM and R2RML are relevant for us and we will have an implementation early next year, which is why I want to comment on the handling of datatypes. In our case, pre-processing and manipulating the data can easily complicate things.

I understand.

> I hope the votes help in reaching a decision.

They do, thanks for sharing! I answered one question inline below, at the very end of your message.

Please keep us updated on your progress with implementing the DM and R2RML :-)

Thanks again,
Richard



> 
> Best regards,
> Mariano
> 
> [1] http://obda.inf.unibz.it/protege-plugin/quest/quest.html
> 
>> So, when a DOUBLE is mapped to a literal, is it ok to produce "100.0"^^xsd:double or does it have to be the canonical "1.0E2"?
>> 
>> Pro:
>> * Increases the probability that merging data from different sources Just Works
>> 
>> Con:
>> * Makes cheap&cheerful implementations harder
>> * Most implementations will get it wrong anyways
>> * Canonical forms in XSD have changed in the past, undermining the merging argument
>> * No other RDF-related spec requires generation of canonical forms, undermining the merging argument
>> * C14N is the job of the query layer and D-Entailment, and not the job of data producers
>> * Even if merging works, then it only works for equality tests, not for ordering etc
>> 
>> IMO the implementation cost is significant, and the benefits won't actually materialize, so I'm clearly against it. No other spec does this!
>> 
> 
> Here I would strongly support not using canonical forms. In our case they would greatly complicate query rewriting and
> could kill performance.
> 
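
To make the cost concrete, here is a minimal Python sketch of what canonicalizing a DOUBLE involves. It assumes the XSD rules as I read them (one digit before the decimal point, at least one after, an upper-case "E", no "+" sign or leading zeros in the exponent, and "0.0E0" for zero); the function name is made up for illustration and this is not a reference implementation.

```python
import math
from decimal import Decimal


def canonical_xsd_double(lexical: str) -> str:
    """Sketch: turn any accepted lexical form into an XSD-style canonical one."""
    value = float(lexical)  # Python's float() accepts "INF", "-INF" and "NaN" too
    if math.isnan(value):
        return "NaN"
    if math.isinf(value):
        return "INF" if value > 0 else "-INF"
    # copysign distinguishes -0.0 from 0.0, which compare equal
    sign = "-" if math.copysign(1.0, value) < 0 else ""
    if value == 0:
        return sign + "0.0E0"
    # repr() yields the shortest digit string that round-trips the float
    d = Decimal(repr(abs(value)))
    digits = "".join(str(n) for n in d.as_tuple().digits).rstrip("0")
    mantissa = digits[0] + "." + (digits[1:] or "0")
    # adjusted() is the exponent of the most significant digit
    return f"{sign}{mantissa}E{d.adjusted()}"


print(canonical_xsd_double("100.0"))
print(canonical_xsd_double("1.0E+2"))
```

Both calls should print the same string, which is the whole point of C14N. In a virtual setup this logic would have to be pushed into the rewritten SQL, which is where it gets expensive.
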
>> 
>> == Should data be C14N'd before used in IRI generation? ==
>> 
>> This is about the case when, say, a TIMESTAMP is used in a primary key, or in an R2RML rr:template, so its value becomes part of an IRI.
>> 
>> Pro:
>> * C14N is required to guarantee that the same IRIs are generated in any implementation
>> * IRI comparison is required to be character-by-character
>> 
>> Con:
>> * Makes cheap&cheerful implementations harder
>> * Most implementations will get it wrong anyways
>> * In R2RML one can format the string using an R2RML view
>> 
>> I find the Pro arguments quite compelling here.
> 
> Again, here I would vote against C14N because of performance. Users can use R2RML views to deal with this kind of issue.
> 
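
For illustration, a rough Python sketch of why the Pro side worries about IRIs. The template, the example.com base and the function names are all hypothetical, and percent-encoding everything outside the unreserved set is my assumption about what an "IRI-safe" value looks like. Time zones are ignored entirely.

```python
from datetime import datetime
from urllib.parse import quote

# Hypothetical template, in the spirit of an rr:template
TEMPLATE = "http://example.com/event/{ts}"


def canonical_timestamp(sql_text: str) -> str:
    """Normalize a 'YYYY-MM-DD HH:MM:SS[.fff]' string (no time zone handling)."""
    dt = datetime.fromisoformat(sql_text)
    text = dt.strftime("%Y-%m-%dT%H:%M:%S")
    if dt.microsecond:
        # Keep fractional seconds only when non-zero, without trailing zeros
        text += ("." + f"{dt.microsecond:06d}").rstrip("0")
    return text


def iri_for(sql_text: str, canonicalize: bool) -> str:
    value = canonical_timestamp(sql_text) if canonicalize else sql_text
    # safe="" percent-encodes everything except unreserved characters
    return TEMPLATE.format(ts=quote(value, safe=""))


# Two engines rendering the same TIMESTAMP slightly differently
a = "2011-11-16 14:15:00"
b = "2011-11-16 14:15:00.000"
print(iri_for(a, False) == iri_for(b, False))
print(iri_for(a, True) == iri_for(b, True))
```

Without canonicalization the two renderings give different IRIs under character-by-character comparison; with it they coincide.
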
>> == Should unknown vendor-specific types be mapped to plain literals? ==
>> 
>> Of course, implementations would be free and encouraged to map known vendor-specific types to appropriate XSD types.
>> 
>> The alternative is to just leave them undefined.
>> 
>> Pro:
>> * Users are well-served by getting the value as a plain literal; it allows them to work with the data in queries
>> * Given the many many vendor-specific types, all implementations are likely to define fallback behaviour for unknown types. So why not normatively specify the fallback behaviour?
>> 
>> Con:
>> * If implementation A has special handling for some vendor type X, and implementation B doesn't, then both will produce different literals for the type
>> 
>> I find the con argument not compelling – if we leave it undefined, then surely they will produce different literals too.
> 
> Here the vote is for defining a fallback behavior in the spec.
> 
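
A fallback of this kind could look roughly like the Python sketch below. The type table is a small illustrative subset that I made up for the example, not the mapping from either spec, and the N-Triples-style output is only there to make the result visible.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"

# Illustrative subset only; a real implementation would know many more types
KNOWN_TYPES = {
    "INTEGER": XSD + "integer",
    "DOUBLE": XSD + "double",
    "BOOLEAN": XSD + "boolean",
    "DATE": XSD + "date",
}


def to_literal(sql_type: str, text: str) -> str:
    """Render a value as a typed literal, or as a plain literal for unknown types."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    datatype = KNOWN_TYPES.get(sql_type.upper())
    if datatype is None:
        # Fallback: unknown (e.g. vendor-specific) type, emit a plain literal
        return f'"{escaped}"'
    return f'"{escaped}"^^<{datatype}>'


print(to_literal("INTEGER", "42"))
print(to_literal("INET", "192.168.0.1"))
```

Here "INET" stands in for any type the implementation has no special handling for; the user still gets the value and can work with it in queries.
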
>> 
>> == Should “recipes” for converting the SQL forms to XSD be presented? ==
>> 
>> Stuff like: “For DATETIME, do CAST(xxx AS VARCHAR) and then replace the 11th character with a ‘T’.” R2RML currently does it. The DM did a bit in LC, but the latest ED removes all of that.
>> 
>> Pro:
>> * If we don't do it, average developers are more likely to take shortcuts, leading to interop problems
>> * Working SQL expressions + lots of examples seems to be useful for implementers and likely to improve interop
>> * Provides a clear and precise specification of behaviour
>> 
>> Con:
>> * Takes up space
>> * We might get it wrong
>> * They won't work in every SQL engine, or might produce different results in some
>> 
> 
> Please do provide recipes.
> 
>> I come out clearly on the pro side here. Expecting implementers to do the right thing by just saying, “map any DATETIME value to the equivalent canonical xsd:dateTime value” is … cruel. There are enough developers out there who don't have a clue how time zones work or what scientific notation for floating point numbers is or whether SQL floats include -0 and NaN and ±Inf. Expecting them to read up on all that stuff as part of implementing R2RML is not realistic. The spec should provide plenty of guidance and examples to make them aware of edge cases and so on. If we don't do that, lots of implementers will take shortcuts and not properly implement the literal mapping.
> 
> Indeed! I strongly support these arguments. Not having recipes would greatly complicate the implementation effort, and since we are not experts in the topic, we are prone to errors.
> 
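
As a concrete example, the DATETIME recipe quoted above can be sketched in a few lines of Python. This assumes the CAST yields a string of the form 'YYYY-MM-DD HH:MM:SS' with a single space as the 11th character, which not every SQL engine guarantees; the function name is hypothetical.

```python
def datetime_recipe(varchar_value: str) -> str:
    """Apply the 'replace the 11th character with a T' recipe."""
    # The 11th character is at index 10 (Python indexes from 0)
    if len(varchar_value) < 11 or varchar_value[10] != " ":
        raise ValueError(f"unexpected DATETIME rendering: {varchar_value!r}")
    return varchar_value[:10] + "T" + varchar_value[11:]


print(datetime_recipe("2011-11-16 22:26:22"))
```

The guard is there because of the "won't work in every SQL engine" concern: it is better to fail loudly than to silently emit something that is not a valid xsd:dateTime.
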
>> == Should the “recipes” be normative? ==
>> 
>> Pro:
>> * Provides a clear and precise specification of behaviour
>> * If we write them anyway, then why not make them normative
>> 
>> Con:
>> * They won't work in every SQL engine, or might produce different results in some
>> * We might get them wrong
> 
> Could it be made normative for some engines, letting other engines be handled by the implementor?

If the recipes are normative, that just means that any implementation has to behave *as if* it were using the recipes. Whether it actually uses them, or uses some other mechanism that produces the same results, is entirely up to the implementer.
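
A small Python sketch of what "as if" means in practice, using the DATETIME example. All names and sample values here are made up for illustration, and the input format 'YYYY-MM-DD HH:MM:SS' is an assumption.

```python
from datetime import datetime


def via_recipe(value: str) -> str:
    """Reference behaviour: replace the 11th character with 'T'."""
    return value[:10] + "T" + value[11:]


def via_parsing(value: str) -> str:
    """A different mechanism: parse the value, then serialize it again."""
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S").isoformat()


SAMPLES = ["2011-11-16 22:26:22", "1999-12-31 23:59:59", "2000-02-29 00:00:00"]


def behaves_as_if(candidate, reference, samples) -> bool:
    """True if the candidate matches the reference on every sample input."""
    return all(candidate(s) == reference(s) for s in samples)


print(behaves_as_if(via_parsing, via_recipe, SAMPLES))
```

An implementation built like via_parsing never runs the recipe, yet it would be conforming as long as its output is indistinguishable from the recipe's. A finite sample list can of course only suggest that, not prove it.
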

The difference between “normative” and “informative” text in a spec is that the normative stuff defines what an implementation has to do in order to be called conforming, while the informative material is just additional explanation around that. So the informative material can be ignored without changing the “meaning” of the spec. Most importantly, should the normative and informative text ever disagree, then the normative text has precedence. Normative text has to be very precise, and therefore it's not always easy to read. Then we add informative material (like examples, notes, diagrams) to help understanding the difficult parts.

Received on Wednesday, 16 November 2011 22:26:22 UTC