Re: Encouraging canonical serializations of datatypes in RDF from David Booth on 2012-08-01 (public-rdf-comments@w3.org from August 2012)

From: David Booth <david@dbooth.org>
Date: Tue, 31 Jul 2012 22:48:53 -0400
To: "Peter F. Patel-Schneider" <pfpschneider@gmail.com>
Cc: public-rdf-comments <public-rdf-comments@w3.org>
Message-ID: <1343789333.2725.80515.camel@dbooth-laptop>
On Tue, 2012-07-31 at 16:24 -0400, Peter F. Patel-Schneider wrote:
> On 07/31/2012 03:59 PM, David Booth wrote:
> > Hi Peter,
> >
> > On Tue, 2012-07-31 at 15:36 -0400, Peter F. Patel-Schneider wrote:
> >> Hmm.
> >>
> >> Your two examples have different canonical forms in XML.   I do not believe
> >> that going beyond XML canonicalization is a good idea.
> > What downside do you see?
> 
> If RDF goes beyond XML canonicalization it is doing something to XML datatypes 
> that is not part of the XML specification.   This appears to be driving a 
> further wedge between RDF and XML data.

I guess I'm not following what you mean.  For example, the
xsd:dateTimeStamp datatype already requires a timezoneFrag to be
specified, and one permissible timezoneFrag is "Z" (meaning UTC).  If
RDF canonicalization suggested that the timezoneFrag always be "Z", what
wedge would that drive between RDF and XML data?
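
To make the suggestion concrete, here is a rough Python sketch of what
"always use Z" could look like for a single literal.  The function name
to_utc_z is made up for illustration, and the sketch leans on Python's
datetime.fromisoformat(), so it only covers the common lexical forms
(years 0001-9999, at most microsecond precision); it is not a complete
XSD implementation.

```python
from datetime import datetime, timezone


def to_utc_z(lexical: str) -> str:
    """Rewrite a dateTimeStamp lexical form so its timezone is "Z".

    Hypothetical helper for illustration only.
    """
    # fromisoformat() before Python 3.11 rejects a trailing "Z",
    # so spell the UTC offset out explicitly before parsing.
    if lexical.endswith("Z"):
        lexical = lexical[:-1] + "+00:00"
    value = datetime.fromisoformat(lexical)
    if value.tzinfo is None:
        raise ValueError("a dateTimeStamp must carry a timezone")
    utc = value.astimezone(timezone.utc)
    text = utc.strftime("%Y-%m-%dT%H:%M:%S")
    if utc.microsecond:
        # Keep fractional seconds, dropping trailing zeros.
        text += ("." + f"{utc.microsecond:06d}").rstrip("0")
    return text + "Z"


print(to_utc_z("2012-07-31T22:48:53-04:00"))
```

Both the input and the output are legal lexical forms of the same
value; only the choice of timezoneFrag changes.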

> 
> [...]
> 
> >
> >> In any case, I don't see the point here.  If equality-unique canonical forms
> >> are only encouraged, then applications will still have to do datatype-aware
> >> comparisons.
> > Only if they need to handle all possible data serializations.   If 90%
> > of the available datasets use the canonical forms then many apps will
> > not need to do datatype-aware comparisons, though the ones that need to
> > cover 100% will.
> 
> If even 99.99% of available datasets use the canonical forms then all apps 
> should still be prepared for non-canonical forms.  To do otherwise is to be 
> wrong.  

It would be wrong for *some* apps, but by no means all.  You can't paint
all apps with the same brush.  For example, if there are 100 datasets
available, and 100 apps, and 90 of the datasets use the canonical forms,
and 40 of the apps only need the datasets that use the canonical forms,
then that substantially lowers the implementation barrier for those 40
apps.

> That is not to say that being wrong is not useful on occasion, but I 
> don't see that there is any good to be had here in the WG suggesting canonical 
> forms be used exclusively.

I just described some substantial good.  I'm not suggesting that
canonicalization be used *exclusively*, but merely that it be
*encouraged*, because it does significantly simplify processing when it
can be used.

> >
> > I think it is important to keep the RDF entry barrier as low as possible
> > whenever possible, in order to support scruffy apps that are good enough
> > for many purposes, even if they don't handle every case.
> >
> > David
> >
> It is important that apps should do the right thing.  For example, should apps 
> ignore character encoding?  How hard is doing datatype-aware processing of 
> literals, compared with all the rest of the stuff that is required to handle RDF?

It depends entirely on the application.  In the case of xsd:dateTime,
for example, it means the literal must be completely parsed into its
year, month, day, hour, minute, seconds and timezone offset, and then
datetime arithmetic -- which is *not* simple -- must be used to properly
add the timezone offset in order to compare two values.  All this
instead of a simple string comparison!  But the worst part is that the
application has to *understand* the different datatypes, and this means
that the code either has to special-case every datatype, or it has to
implement some kind of general datatype-handling framework.  Suddenly,
an app that could have been a one-off, three-line Perl script blows up
into something that requires significantly more development effort.
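
A small Python sketch of the difference, under the assumption that the
literals stay within what datetime.fromisoformat() accepts (the helper
names parse_datetime and same_instant are invented here, not taken from
any RDF library).  The two literals below denote the same instant, yet
a plain string comparison says they differ; only the datatype-aware
path, which has to parse and apply the offset, gets it right.

```python
from datetime import datetime


def parse_datetime(lexical: str) -> datetime:
    """Parse a timezoned dateTime literal (illustrative helper)."""
    # fromisoformat() before Python 3.11 rejects a trailing "Z".
    if lexical.endswith("Z"):
        lexical = lexical[:-1] + "+00:00"
    return datetime.fromisoformat(lexical)


def same_instant(a: str, b: str) -> bool:
    """Datatype-aware comparison: parse both, then compare the values."""
    # Timezone-aware datetimes compare by the instant they denote.
    return parse_datetime(a) == parse_datetime(b)


local = "2012-07-31T22:48:53-04:00"
utc = "2012-08-01T02:48:53Z"

strings_equal = (local == utc)           # what a grep-style tool sees
values_equal = same_instant(local, utc)  # what the datatype says

print(strings_equal, values_equal)
```

And this handles only one datatype; every further datatype would need
its own parsing and comparison rules.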

The RDF model is so simple.  It would be nice if it could be processed
very simply whenever possible.  "Make the simple cases simple", etc.

> 
> peter
> 
> PS:  Yes, I do use text processors to handle RDF, and quite often, even 
> analysing the 2011 Billion Triple Challenge triples using sed and grep.   
> However, I check to ensure that the right thing happens.

Right, that's exactly the kind of simplified processing that I think we
should facilitate as often as possible.


-- 
David Booth, Ph.D.
http://dbooth.org/

Opinions expressed herein are those of the author and do not necessarily
reflect those of his employer.
Received on Wednesday, 1 August 2012 02:49:23 UTC