Re: DM and R2RML should use same datatype mapping

* Richard Cyganiak <richard@cyganiak.de> [2011-11-03 11:05+0000]
> On 2 Nov 2011, at 23:52, Eric Prud'hommeaux wrote:
> > I used to think it was a good idea to follow the SQL spec 'cause, well, it existed. I've come to believe that it's unnecessarily complex without being helpful.
> >  MySQL doesn't parse "VARYING" (I haven't tried MS SQL yet, 10 mins remaining on download and probably 1hr of screwing around before i know the equivalent of "mysql -u root" or "sudo su postgres -c psql").
> >  We don't know how consistent most databases are if they do implement casting to variable-width strings.
> >  We don't know how closely SQL's casting produces canonical XSD lexical forms.
> 
> Well, I like MySQL and everything, but the fact that one DB doesn't support it doesn't mean that we shouldn't normatively reference a relevant standard.
> 
> >  Folks will mostly be executing some driver-specific (mysql_fetch_fields(result)) or generic (SQLGetData(...)) values without pushing the canonicalization into the SQL statement (SELECT CAST(Employee.salary AS CHARACTER VARYING(16)) FROM ... ).
> 
> These drivers generally have a function that returns a string representation of the value, and that string representation tends to be the result of that expression, or related to the result of that expression in some obvious and predictable way. The R2RML spec has examples that are intended to allow an implementer to verify whether their specific DB access mechanism returns results in the right form.
> 
> > I think we can more specify usefully consistent behavior by saying "produce the xsd canonical form" than by providing the SQL recipe.
> 
> I doubt that. In practice, most implementers will probably go by the examples that are provided in the spec, and by whatever test cases we provide.
> 
> Good databases *do* support the CAST operation or have an equivalent to that, and I think referring to something that is already implemented in the input database (*and* well-specified) is much more useful than a reference to scattered definitions in the XSD specs.
> 
> I have not checked whether asking for XSD canonical form would lead to different results from those specced in R2RML, but if it does, I would again say that requiring the version that can be produced directly from the DB is better than requiring another transformation that doesn't really have any benefit.

Aha, this is perhaps the crux of our different strategies. I believe that the canonical forms are what *everyone* should produce. By specifying the exact lexical form of the terms (as both of us are attempting), we avoid a proliferation of homologous terms in the RDF graphs which come from RDBs. For example, in the RDB-derived graphs, Bob's birth datetime is the same terms as the datetime at which his mother gave birth, not just computationally equivalent (1995-03-18T01:23:45Z vs. 1995-03-18T03:23:45+0200). I think we'd be good RDF datizens if we used the same form that we expect other producers of RDF data to use.


> > I think an argument against this would be if there were some question about how the value spaces mapped, but I don't see that as an issue because, apart from range extremes and precision, ints are ints and doubles are doubles.
> 
> This may be the case for ints. It is less so for doubles — there are enough questions about E notation, ±Inf, NaN, required leading or trailing zeroes and so on, to make this a rather complex transformation for anyone who isn't a datatype geek like the two of us. And when we're talking about dateTimes with time zones, I think that very few people have a solid understanding of what the value space of that even is.

Are we confident that we can provide recipies for these notations that produce valid (by XSD's metric) forms? We of course don't want to encourage the production of poorly-formed literals.


> >> Would this table say anything besides what's already in the XSD spec? If not, then I'd rather just add an informative note pointing to that document, as it does a better job at explaining the relevant issues than we can do in the appropriate amount of space.
> > 
> > Well, there's the question about DATETIMEs and DATEs which have a much larger data space in XML than they do in SQL.
> 
> Ok – as long as the SQL value space is covered by the XSD types, this doesn't need to be discussed in the specs I think.
> 
> >> How about this:
> >> 
> >> [[
> >> Note: This mapping to XML Schema datatypes tends to rely on arbitrary-precision types such as xsd:decimal, xsd:integer and xsd:dateTime, since the precision of many SQL datatypes is not fixed in [SQL2], but implementation-defined.
> >> 
> >> In practice, implementation considerations may require setting upper limits to the supported precision of these types. The XML Schema specification allows such “partial implementations”, and defines some minimum requirements and other constraints [XSD2-11].
> >> ]]
> > 
> > I didn't reallize they were “partial implementations”
> 
> The section heading in XSD is “Partial Implementation of Infinite Datatypes”. No formal term is defined except “minimally conforming” processors.
> 
> > but I certainly want to be consistent with XSD's language. I prefer explicitly spelling out an expected supported data space with extension guidance,
> 
> Nice way of putting it. As you know, I prefer the model where the entire datatype space of SQL 2008 is mapped, with some guidance for limited implementations. And it's very handy that XSD 1.1 already provides such guidance.

Tx! felt right at the time.

> > but if folks prefer a less-defined demarcation, I think this text is a good start.
> 
> Well, let's see what others think in the call.
> 
> Best,
> Richard
> 
> 
> > 
> > 
> >> The [XSD2-11] link would go to http://www.w3.org/TR/xmlschema11-2/#partial-implementation . This section only exists in XML Schema 1.1, which isn't REC yet, so depending on their timeline we may have to link both to XSD 1.0 (normatively) and XSD 1.1 (informatively).
> >> 
> >>> and examples of the canonical forms?
> >> 
> >> R2RML already has plenty of examples for the conversion process, downwards from here:
> >> http://www.w3.org/2001/sw/rdb2rdf/r2rml/#datatype-table
> >> 
> >> (Looking at this now I feel that the table at the end of 10.1, and the one in 10.2, ought to be combined into one.)
> > 
> > I believe that everything can be condensed into one table.
> > 
> > 
> >> Best,
> >> Richard
> >> 
> >> 
> >>> 
> >>> 
> >>>> Best,
> >>>> Richard
> >>>> 
> >>>> [1] http://www.w3.org/TR/xmlschema11-2/#partial-implementation
> >>>> 
> >>>> 
> >>>> On 31 Oct 2011, at 03:39, Eric Prud'hommeaux wrote:
> >>>> 
> >>>>> * Richard Cyganiak <richard@cyganiak.de> [2011-10-31 00:09+0000]
> >>>>>> On 30 Oct 2011, at 23:11, Eric Prud'hommeaux wrote:
> >>>>>>> A tool which uses e.g. floats or ints to manipulate the graph defined by R2RML would have to qualify its conformance by the version of the database to which it was connected (e.g. "offers R2RML for MySQL 5.01, but not Oracle 11G").
> >>>>>> 
> >>>>>> Neither floats nor ints are sufficient to represent xsd:decimal even if we consider only xsd:decimals restricted to 18 digits.
> >>>>> 
> >>>>> True, and that does raise the bar for implementation. However, floating point and integer types are very commonly used in SQL and can be very simply implemented.
> >>>>> 
> >>>>> 
> >>>>>> Any programming language these days has some sort of arbitrary-precision decimal type in a readily available library. That is sufficient for conformance with any SQL 2008 conforming implementation of DECIMAL, regardless of how many digits it uses.
> >>>>>> 
> >>>>>>> General compatibility with R2RML over any database can only be preserved if you don't use native types at any step of the e.g. query answering process.
> >>>>>> 
> >>>>>> I have no idea what you're trying to say here.
> >>>>> 
> >>>>> As you point out above, one needs to use arbitrary-precision decimals and not native datatypes to implement the arbitrary precision required by R2RML.
> >>>>> Some programs, e.g. Jena, use efficient native types for integers and arbitrary-precision only for decimals.
> >>>>> 
> >>>>> 
> >>>>>>> Applying the unbounded precision support to DM would mean that FeDeRate would no longer be an implementation (it uses Jena to parse and execute queries which I believe uses java native types)
> >>>>>> 
> >>>>>> You may want to check that again. Jena uses BigDecimal to represent xsd:decimal.
> >>>>> 
> >>>>> The query
> >>>>> ASK {FILTER (20000000000000000000/2=10000000000000000000)}
> >>>>> at <http://sparql.org/sparql.html> indicates that ARC supports up to, but no more than, 18 digit integers.
> >>>>> 
> >>>>> 
> >>>>>>> and SWObjects would have an even harder time as it is intended to connect multiple databases with potentially different maximum precisions.
> >>>>>> 
> >>>>>> I don't understand the problem. When you query the DB you get back some value. Then you stuff that value into a BigDecimal.
> >>>>>> 
> >>>>>> I don't understand how knowing that you're never going to see a decimal longer than 18 digits simplifies an implementation. It's not like it's particularly hard to write arbitrary-precision code.
> >>>>> 
> >>>>> True, but do the use cases motivate raising the bar to that extent? Can we motivate Jena abandoning native integers?
> >>>>> 
> >>>>> 
> >>>>>> As far as I can see, the text in R2RML works fine, is easy to implement, easy to test, and meets user expectations. I have seen no evidence yet that changing the text would benefit users or implementers, and I have seen no argument being made why R2RML and DM should differ. As far as I can tell, you're trying to solve an imaginary problem.
> >>>>> 
> >>>>> I don't foresee many implementations of arbitrary precision for integers and floats and I don't see much motivation for that. Further, it makes more sense to define the lexical values in terms of the XSD canonical types rather than via a recipe which some popular databases (e.g. MySQL) don't support. 
> >>>>> 
> >>>>> 
> >>>>>> Best,
> >>>>>> Richard
> >>>>> 
> >>>>> -- 
> >>>>> -ericP
> >>>>> 
> >>>> 
> >>> 
> >>> -- 
> >>> -ericP
> >> 
> > 
> > -- 
> > -ericP
> > 
> 

-- 
-ericP

Received on Thursday, 3 November 2011 13:13:13 UTC