Re: DM and R2RML should use same datatype mapping from Richard Cyganiak on 2011-11-02 (public-rdb2rdf-comments@w3.org from November 2011)

From: Richard Cyganiak <richard@cyganiak.de>
Date: Wed, 2 Nov 2011 14:52:09 +0000
To: Eric Prud'hommeaux <eric@w3.org>
Cc: public-rdb2rdf-comments@w3.org
Message-Id: <6048BDA2-EC3C-46AA-82CA-645B32880562@cyganiak.de>
On 31 Oct 2011, at 03:39, Eric Prud'hommeaux wrote:
>>> A tool which uses e.g. floats or ints to manipulate the graph defined by R2RML would have to qualify its conformance by the version of the database to which it was connected (e.g. "offers R2RML for MySQL 5.01, but not Oracle 11G").
>> 
>> Neither floats nor ints are sufficient to represent xsd:decimal even if we consider only xsd:decimals restricted to 18 digits.
> 
> True, and that does raise the bar for implementation. However, floating point and integer types are very commonly used in SQL and can be very simply implemented.

Sure, but that is orthogonal to the question of how DECIMAL is treated.

>> Any programming language these days has some sort of arbitrary-precision decimal type in a readily available library. That is sufficient for conformance with any SQL 2008 conforming implementation of DECIMAL, regardless of how many digits it uses.
>> 
>>> General compatibility with R2RML over any database can only be preserved if you don't use native types at any step of the e.g. query answering process.
>> 
>> I have no idea what you're trying to say here.
> 
> As you point out above, one needs to use arbitrary-precision decimals and not native datatypes to implement the arbitrary precision required by R2RML.

Well, not necessarily. One needs to use only as much precision as supported in the SQL datatypes (or the actual values) used in the input database. This may require knowledge of the input schema (including its type definitions), but there are APIs for that in any decent database abstraction layer.
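
To illustrate "only as much precision as the input schema declares", here is a minimal sketch in Python. The function name `read_decimal` and the `precision` argument are hypothetical; the sketch assumes the declared precision of the DECIMAL column has already been obtained from the database's schema metadata, and that the driver hands the value over as a string.

```python
from decimal import Context, Decimal, Inexact


def read_decimal(raw: str, precision: int) -> Decimal:
    """Parse a DECIMAL value using exactly the precision declared for its column.

    `precision` is assumed to come from the schema metadata (hypothetical here).
    """
    # Trap Inexact so that a value wider than the declared precision raises
    # an error instead of being silently rounded.
    ctx = Context(prec=precision, traps=[Inexact])
    # Unlike the plain Decimal() constructor, create_decimal applies the
    # context's precision to the conversion.
    return ctx.create_decimal(raw)


# A 22-digit value read from a column declared wide enough for it.
value = read_decimal("12345678901234567890.12", 22)
```

Nothing in the sketch depends on a fixed upper bound such as 18 digits; the bound is whatever the input database declares.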

> Some programs, e.g. Jena, use efficient native types for integers and arbitrary-precision only for decimals.

Again, I'm pretty sure you are wrong. Jena uses native types only for xsd:int, and Java's BigInteger for xsd:integer.

>>> Applying the unbounded precision support to DM would mean that FeDeRate would no longer be an implementation (it uses Jena to parse and execute queries which I believe uses java native types)
>> 
>> You may want to check that again. Jena uses BigDecimal to represent xsd:decimal.
> 
> The query
>  ASK {FILTER (20000000000000000000/2=10000000000000000000)}
> at <http://sparql.org/sparql.html> indicates that ARC supports up to, but no more than, 18 digit integers.

I suppose you mean ARQ? No, it indicates some weird bug in that implementation.

ASK {FILTER (200000/2=100000)} => true
ASK {FILTER (20000000/2=10000000)} => true
ASK {FILTER (2000000000/2=1000000000)} => true
ASK {FILTER (200000000000/2=100000000000)} => true
ASK {FILTER (20000000000000/2=10000000000000)} => true
ASK {FILTER (2000000000000000/2=1000000000000000)} => true
ASK {FILTER (200000000000000000/2=100000000000000000)} => true
ASK {FILTER (20000000000000000000/2=10000000000000000000)} => ***false***
ASK {FILTER (2000000000000000000000/2=1000000000000000000000)} => true
ASK {FILTER (200000000000000000000000/2=100000000000000000000000)} => true
ASK {FILTER (20000000000000000000000000/2=10000000000000000000000000)} => true

You might want to report that to the vendor.
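
For reference, a small sketch that re-runs the eleven comparisons above with exact arithmetic, and also notes which row is the first whose dividend no longer fits a signed 64-bit integer. This is only an arithmetic observation, not a diagnosis of the implementation; the variable names are made up for the example.

```python
from fractions import Fraction

INT64_MAX = 2**63 - 1  # largest signed 64-bit integer

# The rows above use dividends 2*10^5, 2*10^7, ..., 2*10^25.
exponents = range(5, 27, 2)


def halves_correctly(k: int) -> bool:
    """Check 2*10^k / 2 = 10^k using exact rational arithmetic."""
    return Fraction(2 * 10**k, 2) == 10**k


results = {k: halves_correctly(k) for k in exponents}

# First row whose dividend exceeds the signed 64-bit range.
first_beyond_int64 = next(k for k in exponents if 2 * 10**k > INT64_MAX)
```

With exact arithmetic every row holds. The first dividend beyond the signed 64-bit range is 2*10^19, which is the row that came back false; the larger rows came back true, so a plain 64-bit limit would not explain the result.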

>>> and SWObjects would have an even harder time as it is intended to connect multiple databases with potentially different maximum precisions.
>> 
>> I don't understand the problem. When you query the DB you get back some value. Then you stuff that value into a BigDecimal.
>> 
>> I don't understand how knowing that you're never going to see a decimal longer than 18 digits simplifies an implementation. It's not like it's particularly hard to write arbitrary-precision code.
> 
> True, but do the use cases motivate raising the bar to that extent? Can we motivate Jena abandoning native integers?

Jena already supports arbitrary-precision integers and decimals.

Again, I don't see how arbitrary restrictions of the maximum supported precision simplify an implementation. Implementing SPARQL over a graph that contains the XSD equivalent of DECIMALs and BIGINTs is impossible with only native types; you *need* higher-precision decimal and integer arithmetic *anyway*. Restricting DECIMALs to an arbitrary 18 digits and BIGINTs to 13 doesn't change the implementation cost – most likely it makes implementations more complex because they need to check the sizes and add error handling for anything larger.

>> As far as I can see, the text in R2RML works fine, is easy to implement, easy to test, and meets user expectations. I have seen no evidence yet that changing the text would benefit users or implementers, and I have seen no argument being made why R2RML and DM should differ. As far as I can tell, you're trying to solve an imaginary problem.
> 
> I don't foresee many implementations of arbitrary precision for integers and floats and I don't see much motivation for that.

No one suggested requiring *arbitrary* precision. The request is to require support for all the SQL 2008 built-in types, which requires conforming implementations to *match* the implementation-defined yet *limited* sizes of the datatypes in the input database. Arbitrary precision is just a simple way of implementing that for some datatypes.

Any conforming implementation of xsd:integer and xsd:decimal, and any real-world implementation of DECIMAL and BIGINT, *has* to support more digits than fit into a Java int, long or double already.
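
A quick sketch of where the native types run out, written in Python with the 64-bit bounds spelled out by hand. The helper names are hypothetical. Python's `float` is an IEEE 754 double on the usual platforms, so it stands in for a Java double here.

```python
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1  # range of a signed 64-bit long


def fits_int64(n: int) -> bool:
    """True if n can be stored in a signed 64-bit integer."""
    return INT64_MIN <= n <= INT64_MAX


def survives_double(n: int) -> bool:
    """True if n round-trips through a 64-bit double without changing.

    Intended for values well inside the double range; float(n) raises
    OverflowError for integers beyond it.
    """
    return int(float(n)) == n


largest_18_digit = 10**18 - 1  # fits a long, but not a double's 53-bit mantissa
smallest_20_digit = 10**19     # already too large for a signed long
```

So an 18-digit value is already damaged by a double, and a 20-digit value no longer fits a long, before DECIMALs with fractional digits even come into play.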

(R2RML does not support arbitrary precision for floating-point types. It maps them all to xsd:double. This loses precision in certain cases; for example, the default DOUBLE in Oracle is 128-bit, while xsd:double is 64-bit. I don't particularly like this, but it's what ISO/IEC 9075-14:2008 does, and the type mapping in R2RML follows that spec [with the exception of not trying to find the smallest possible type, and not handling INTERVAL].)

> Further, it makes more sense to define the lexical values in terms of the XSD canonical types rather than via a recipe which some popular databases (e.g. MySQL) don't support. 

I don't understand what you're trying to say here.

Let me ask again: What's the benefit for users or implementers in having two different value mappings? And what is the benefit for users or implementers in writing the spec in a way that rejects, truncates or excludes long DECIMALs, BIGINTs or DATETIMEs?

Best,
Richard
Received on Wednesday, 2 November 2011 14:52:44 UTC