Re: Request for comments: Breaking down the datatype mapping problem (ISSUE-69) from Richard Cyganiak on 2011-11-21 (public-rdb2rdf-wg@w3.org from November 2011)

From: Richard Cyganiak <richard@cyganiak.de>
Date: Mon, 21 Nov 2011 14:48:53 +0000
To: Joerg Unbehauen <unbehauen@informatik.uni-leipzig.de>
Cc: public-rdb2rdf-wg@w3.org
Message-Id: <476F89FF-E8FF-4CCC-8CF9-B3AF1D52C4C9@cyganiak.de>
Jörg, thanks for the comments. I'm still hoping to get a reaction from Eric, and then we'll maybe know where to take this next.

Regarding your last comment on informative versus normative, see the end of this message here:
http://lists.w3.org/Archives/Public/public-rdb2rdf-wg/2011Nov/0050.html

Thanks again,
Richard


On 17 Nov 2011, at 17:01, Joerg Unbehauen wrote:

> Hi,
> 
> While going through the issue, I got the impression that the first question to answer should be whether the spec uses SQL or XSD canonical form.
> 
> Using SQL as the canonical form (in the sense of Richard's mail) would lower the implementation effort a lot and, together with (informative) recipes, invalidate some of the arguments against applying a mapping to literals and IRIs.
> 
> While I prefer a spec that requires as little implementation as possible, consistent IRIs are important and worth some effort.
> 
> See below for more opinions:
> 
> On 11/15/2011 09:51 PM, Richard Cyganiak wrote:
>> I've tried to break down the datatype mapping question – ISSUE-69 – into a number of sub-problems. These questions affect DM and R2RML. Eric and I have been discussing back and forth, but the discussion would quite benefit from some wider input.
>> 
>> Best,
>> Richard
>> 
>> 
>> == Should mapped literals be in canonical form? ==
>> 
>> So, when a DOUBLE is mapped to a literal, is it ok to produce "100.0"^^xsd:double or does it have to be the canonical "1.0E2"?
>> 
>> Pro:
>> * Increases the probability that merging data from different sources Just Works
>> 
>> Con:
>> * Makes cheap&cheerful implementations harder
>> * Most implementations will get it wrong anyways
>> * Canonical forms in XSD have changed in the past, undermining the merging argument
>> * No other RDF-related spec requires generation of canonical forms, undermining the merging argument
>> * C14N is the job of the query layer and D-Entailment, and not the job of data producers
>> * Even if merging works, then it only works for equality tests, not for ordering etc
>> 
>> IMO the implementation cost is significant, and the benefits won't actually materialize, so I'm clearly against it. No other spec does this!
>> 
> Same for me, I'd prefer to go with as little implementation as necessary. But see the rest.
>> 
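To make the canonical-form question concrete, here is a minimal Python sketch of what producing a canonical xsd:double lexical form might involve. The function name `canonical_xsd_double` is hypothetical, and using Python's shortest round-trip digits (`repr`) for the mantissa is an assumption of this sketch, not something the thread or the spec prescribes. As I read XSD 1.0, the exponent carries no "+" sign, so 100 comes out as 1.0E2.

```python
import math
from decimal import Decimal


def canonical_xsd_double(value: float) -> str:
    """Return an XSD 1.0-style canonical lexical form for a Python float."""
    if math.isnan(value):
        return "NaN"
    if math.isinf(value):
        return "INF" if value > 0 else "-INF"
    if value == 0.0:
        # Assumption: negative zero keeps its sign in the output.
        return "-0.0E0" if math.copysign(1.0, value) < 0 else "0.0E0"
    # Assumption: take the shortest round-trip digits that repr() yields.
    sign, digits, exponent = Decimal(repr(value)).as_tuple()
    # Power of ten of the first significant digit.
    sci_exponent = exponent + len(digits) - 1
    text = "".join(str(d) for d in digits).rstrip("0")
    # Exactly one non-zero digit before the point, at least one after it.
    mantissa = text[0] + "." + (text[1:] or "0")
    return ("-" if sign else "") + mantissa + "E" + str(sci_exponent)


print(canonical_xsd_double(100.0))
print(canonical_xsd_double(0.5))
```

Even this small sketch needs special cases for NaN, the infinities and zero, which is roughly the implementation cost the Con list above is pointing at.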
>> == Should data be C14N'd before used in IRI generation? ==
>> 
>> This is about the case when, say, a TIMESTAMP is used in a primary key, or in an R2RML rr:template, so its value becomes part of an IRI.
>> 
>> Pro:
>> * C14N is required to guarantee that the same IRIs are generated in any implementation
>> * IRI comparison is required to be character-by-character
>> 
>> Con:
>> * Makes cheap&cheerful implementations harder
>> * Most implementations will get it wrong anyways
>> * In R2RML one can format the string using an R2RML view
>> 
>> I find the Pro arguments quite compelling here.
> +1 from me too. Although it is more to implement, consistent IRIs are important.
> But on the other hand, if you implement C14N here, then why not apply it to literals as well? Sure, it could have an impact on performance ...
> But in the case of IRIs, consistency is important.
>> 
>> 
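The IRI consistency concern can be illustrated with a short Python sketch. The helper `expand_template`, the `example.com` template and the column name `created` are all hypothetical; the percent-encoding via `urllib.parse.quote` only approximates the IRI-safe encoding that R2RML describes for template values.

```python
from urllib.parse import quote


def expand_template(template: str, column: str, value: str) -> str:
    """Substitute one {column} placeholder with the percent-encoded value."""
    # safe="" also encodes "/" so the value cannot add path segments.
    return template.replace("{" + column + "}", quote(value, safe=""))


TEMPLATE = "http://example.com/event/{created}"

# The same instant, once in SQL-style and once in XSD-style lexical form.
sql_form = expand_template(TEMPLATE, "created", "2011-11-21 14:48:53")
xsd_form = expand_template(TEMPLATE, "created", "2011-11-21T14:48:53")

print(sql_form)
print(xsd_form)
print(sql_form == xsd_form)
```

Because IRI comparison is character-by-character, two implementations that disagree only on the separator between date and time end up minting different IRIs for the same row.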
>> == How should datatype overrides be handled? ==
>> 
>> This is about the case when the natural datatype mapping is overridden using an explicit rr:datatype property in R2RML.
>> 
>> The spec currently says that the canonical string value from SQL is directly used, without mapping to an XSD form. So we would get TRUE for a boolean, and a value with a space inside for a TIMESTAMP. This is probably the wrong way of doing it, and the spec should be changed.
>> 
>> But OTOH this is probably no big deal because I expect datatype overrides to be rare, and when they are used then it's probably with an R2RML view that customizes the desired lexical form.
>> 
>> I'd still prefer to change this, although it would be a normative, behaviour-modifying change, and we should be careful with those after LC :-(
>> 
> Sorry, no real opinion on this.
>> 
>> == Should canonical forms be SQL canonical or XSD canonical? ==
>> 
>> See earlier message for the differences:
>> http://lists.w3.org/Archives/Public/public-rdb2rdf-wg/2011Nov/0047.html
>> 
>> Pro SQL:
>> * Likely to be easier (free) to implement
>> * Can easily be specified precisely
>> * Doesn't throw away time zones
>> 
>> Pro XSD:
>> * RDF people generally prefer to see XSD forms
>> * Other RDF-producing apps, if they canonicalize at all, will canonicalize to XSD
>> 
>> The pro-XSD case is somewhat reasonable when generating literals. The answer really doesn't matter when generating IRIs. IRIs are supposed to be opaque. Sure, some people will violate web architecture and hack the IRI apart to get the dateTime value that was used to generate the IRI. But they can't expect that we canonicalize IRIs for them. So, the precise form of the IRI doesn't matter as long as all R2RML/DM implementations generate the same one.
> To my mind, choosing SQL canonical form could provide a good balance between implementation effort and interoperability.
> But then it should be applied to both IRIs and literals.
> 
>> 
>> 
>> == Should unknown vendor-specific types be mapped to plain literals? ==
>> 
>> Of course, implementations would be free and encouraged to map known vendor-specific types to appropriate XSD types.
>> 
>> The alternative is to just leave them undefined.
>> 
>> Pro:
>> * Users are well-served by getting the value as a plain literal; it allows them to work with the data in queries
>> * Given the many many vendor-specific types, all implementations are likely to define fallback behaviour for unknown types. So why not normatively specify the fallback behaviour?
>> 
>> Con:
>> * If implementation A has special handling for some vendor type X, and implementation B doesn't, then both will produce different literals for the type
>> 
>> I find the con argument not compelling – if we leave it undefined, then surely they will produce different literals too.
>> 
> Same for me, a fallback should be required.
>> 
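A minimal sketch of what such a fallback could look like, in Python. The type table `KNOWN_TYPES` and the function `to_literal` are hypothetical and cover only a few types for illustration; the output is an N-Triples-style literal, with only backslashes and double quotes escaped.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"

# Hypothetical, deliberately tiny table of SQL types with a known XSD mapping.
KNOWN_TYPES = {
    "INTEGER": XSD + "integer",
    "DOUBLE": XSD + "double",
    "BOOLEAN": XSD + "boolean",
}


def to_literal(sql_type: str, lexical: str) -> str:
    """Return a typed literal for known types, a plain literal otherwise."""
    escaped = lexical.replace("\\", "\\\\").replace('"', '\\"')
    datatype = KNOWN_TYPES.get(sql_type.upper())
    if datatype is None:
        # Fallback for unknown (e.g. vendor-specific) types: plain literal.
        return '"' + escaped + '"'
    return '"' + escaped + '"^^<' + datatype + ">"


print(to_literal("integer", "42"))
print(to_literal("GEOMETRY", "POINT(1 2)"))
```

An implementation that knows a vendor type would simply add it to its own table, which is where the divergence named in the Con argument comes from.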
>> == Should “recipes” for converting the SQL forms to XSD be presented? ==
>> 
>> Stuff like: “For DATETIME, do CAST(xxx AS VARCHAR) and then replace the 11th character with a “T”.” R2RML currently does it. The DM did a bit in LC, but the latest ED removes all of that.
>> 
>> Pro:
>> * If we don't do it, average developers are more likely to take shortcuts, leading to interop problems
>> * Working SQL expressions + lots of examples seems to be useful for implementers and likely to improve interop
>> * Provides a clear and precise specification of behaviour
>> 
>> Con:
>> * Takes up space
>> * We might get it wrong
>> * They won't work in every SQL engine, or might produce different results in some
>> 
>> I come out clearly on the pro side here. Expecting implementers to do the right thing by just saying, “map any DATETIME value to the equivalent canonical xsd:dateTime value” is … cruel. There are enough developers out there who don't have a clue how time zones work or what scientific notation for floating point numbers is or whether SQL floats include -0 and NaN and ±Inf. Expecting them to read up on all that stuff as part of implementing R2RML is not realistic. The spec should provide plenty of guidance and examples to make them aware of edge cases and so on. If we don't do that, lots of implementers will take shortcuts and not properly implement the literal mapping.
> 
> +1 for recipes.
>> 
>> 
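The DATETIME recipe quoted above is small enough to sketch in Python. The function name `sql_timestamp_to_xsd` is hypothetical, and the sketch assumes the input is already the string produced by casting the value to VARCHAR, in the form `YYYY-MM-DD HH:MM:SS` with optional trailing fraction. It does not touch time zones or other edge cases.

```python
def sql_timestamp_to_xsd(sql_text: str) -> str:
    """Replace the 11th character (the space) with "T"."""
    # Assumption: the date part is exactly 10 characters, as in YYYY-MM-DD.
    if len(sql_text) < 19 or sql_text[10] != " ":
        raise ValueError("unexpected timestamp format: " + repr(sql_text))
    return sql_text[:10] + "T" + sql_text[11:]


print(sql_timestamp_to_xsd("2011-11-21 14:48:53"))
```

The format check is there because, as the Con list notes, not every SQL engine produces the same string for the cast.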
>> == Should the “recipes” be normative? ==
>> 
>> Pro:
>> * Provides a clear and precise specification of behaviour
>> * If we write them anyway, then why not make them normative
>> 
>> Con:
>> * They won't work in every SQL engine, or might produce different results in some
>> * We might get them wrong
>> 
> I'd stick to informative.
> For example: MySQL (or at least some versions) does not currently allow a cast to VARCHAR, so the recipe cannot be applied 1:1. All implementations targeting MySQL would then be non-conforming, which would not be right in my eyes.
> 
> 
> Best Regards,
> 
> 
> Joerg
> 
>
Received on Monday, 21 November 2011 14:49:33 UTC