Re: Encouraging canonical serializations of datatypes in RDF from Andy Seaborne on 2012-08-01 (public-rdf-comments@w3.org from August 2012)

From: Andy Seaborne <andy.seaborne@epimorphics.com>
Date: Wed, 01 Aug 2012 10:07:07 +0100
To: public-rdf-comments@w3.org
Message-ID: <5018F1BB.7050702@epimorphics.com>
Most of the uses of RDF in the apps I'm involved in at the moment are 
all in the same time zone.

Changing the data as it goes through the system, and hence breaking the 
display aspect of the data, is a complete non-starter.  I want to know 
what timezone the dateTime started as.

Display is more important than comparison.

And, from time spent doing support, the first user expectation is that 
stuff that comes out looks like what went in.  Not changing the date 
part sometimes.

"2012-12-31T22:00:00-05:00"^^xsd:dateTime

is the same time point as

"2013-01-01T03:00:00Z"^^xsd:dateTime

It's a different year.
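
To make the distinction concrete, here is a minimal Python sketch (not part of any RDF toolkit; `parse_xsd_datetime` is a hypothetical helper that handles only the timezone-qualified lexical forms shown above). It illustrates that the two literals are equal as time points, differ as strings and in displayed year, and that rewriting to "Z" discards the original offset.

```python
from datetime import datetime, timezone


def parse_xsd_datetime(lexical: str) -> datetime:
    """Parse a timezone-qualified xsd:dateTime lexical form (simplified)."""
    # datetime.fromisoformat only accepts a trailing "Z" from Python 3.11 on,
    # so rewrite it as an explicit offset for older versions.
    if lexical.endswith("Z"):
        lexical = lexical[:-1] + "+00:00"
    return datetime.fromisoformat(lexical)


original = "2012-12-31T22:00:00-05:00"
normalized = "2013-01-01T03:00:00Z"

a = parse_xsd_datetime(original)
b = parse_xsd_datetime(normalized)

# Timezone-aware datetimes compare by the instant they denote.
same_instant = a == b

# A plain string comparison sees two different literals.
same_string = original == normalized

# The calendar year a reader sees depends on the lexical form.
years = (a.year, b.year)

# Rewriting to UTC keeps the instant but drops the original -05:00 offset.
as_utc = a.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

print(same_instant, same_string, years, as_utc)
```

Once a literal has been rewritten this way, the offset it started with cannot be recovered from the stored value alone, which is the display concern raised above.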

 Andy

On 01/08/12 03:57, Peter F. Patel-Schneider wrote:
>
> On 07/31/2012 10:48 PM, David Booth wrote:
>> On Tue, 2012-07-31 at 16:24 -0400, Peter F. Patel-Schneider wrote:
>>> On 07/31/2012 03:59 PM, David Booth wrote:
>>>> Hi Peter,
>>>>
>>>> On Tue, 2012-07-31 at 15:36 -0400, Peter F. Patel-Schneider wrote:
>>>>> Hmm.
>>>>>
>>>>> Your two examples have different canonical forms in XML.   I do not
>>>>> believe
>>>>> that going beyond XML canonicalization is a good idea.
>>>> What downside do you see?
>>> If RDF goes beyond XML canonicalization it is doing something to XML
>>> datatypes
>>> that is not part of the XML specification.   This appears to be
>>> driving a
>>> further wedge between RDF and XML data.
>> I guess I'm not following what you mean.  For example, the
>> xsd:dateTimeStamp datatype already requires a timezoneFrag to be
>> specified, and one permissible timezoneFrag is "Z" (meaning UTC).  If
>> RDF canonicalization suggested that the timezoneFrag always be "Z", what
>> wedge would that drive between RDF and XML data?
>
> It would say that as far as RDF is concerned, XML data that doesn't use
> Z is somehow second class.
>>
>>> [...]
>>>
>>>>> In any case, I don't see the point here.  If equality-unique
>>>>> canonical forms
>>>>> are only encouraged, then applications will still have to do
>>>>> datatype-aware
>>>>> comparisons.
>>>> Only if they need to handle all possible data serializations.   If 90%
>>>> of the available datasets use the canonical forms then many apps will
>>>> not need to do datatype-aware comparisons, though the ones that need to
>>>> cover 100% will.
>>> If even 99.99% of available datasets use the canonical forms then all
>>> apps
>>> should still be prepared for non-canonical forms.  To do otherwise is
>>> to be
>>> wrong.
>> It would be wrong for *some* apps, but by no means all.  You can't paint
>> all apps with the same brush.  For example, if there are 100 datasets
>> available, and 100 apps, and 90 of the datasets use the canonical forms,
>> and 40 of the apps only need the datasets that use the canonical forms,
>> then that substantially lowers the implementation barrier for those 40
>> apps.
> As long as these apps only use the 90, and stay away from the 10. This
> appears to break one of the prime motivations of RDF, that all data can
> be used by anyone.
>>
>>> That is not to say that being wrong is not useful on occasion, but I
>>> don't see that there is any good to be had here in the WG suggesting
>>> canonical
>>> forms be used exclusively.
>> I just described some substantial good.  I'm not suggesting that
>> canonicalization be used *exclusively*, but merely that it be
>> *encouraged*, because it does significantly simplify processing when it
>> can be used.
>
> I don't see the "significantly" here at all.
>>
>>>> I think it is important to keep the RDF entry barrier as low as
>>>> possible
>>>> whenever possible, in order to support scruffy apps that are good
>>>> enough
>>>> for many purposes, even if they don't handle every case.
>>>>
>>>> David
>>>>
>>> It is important that apps should do the right thing.  For example,
>>> should apps
>>> ignore character encoding?  How hard is doing datatype-aware
>>> processing of
>>> literals, compared with all the rest of the stuff that is required to
>>> handle RDF?
>> It depends entirely on the application.  In the case of xsd:dateTime,
>> for example, it means the literal must be completely parsed into its
>> year, month, day, hour, minute, seconds and timezone offset, and then
>> datetime arithmetic -- which is *not* simple -- must be used to properly
>> add the timezone offset in order to compare two values.  All this
>> instead of a simple string comparison!  But the worst part is that the
>> application has to *understand* the different datatypes, and this means
>> that the code either has to special-case every datatype, or it has to
>> implement some kind of general datatype-handling framework.  Suddenly,
>> an app that could have been a one-off, three-line Perl script blows up
>> into something that requires significantly more development effort.
>>
>> The RDF model is so simple.  It would be nice if it could be processed
>> very simply whenever possible.  "Make the simple cases simple", etc.
>
> The simplicity of the RDF model is, in my mind, tied up with its
> uniformity. Your proposal severely breaks that uniformity, which is a
> major lossage.
>>
>>> peter
>>>
>>> PS:  Yes, I do use text processors to handle RDF, and quite often, even
>>> analysing the 2011 Billion Triple Challenge triples using sed and grep.
>>> However, I check to ensure that the right thing happens.
>> Right, that's exactly the kind of simplified processing that I think we
>> should facilitate as often as possible.
>>
>>
> Sure, as long as it is only in one-off hacks, controlled by experts, who
> can adjust the processing according to the peculiarities of the input.
> As soon as direct expert control goes away, then the app needs to be
> able to consume all RDF, which I see as counter to your proposal.
>
> peter
>
>
>
>
Received on Wednesday, 1 August 2012 09:07:51 UTC