Re: Encouraging canonical serializations of datatypes in RDF

Hi Gavin,

On Tue, 2012-07-31 at 12:57 -0700, Gavin Carothers wrote:
> On Tue, Jul 31, 2012 at 11:31 AM, David Booth <david@dbooth.org> wrote:
[ . . . ]
> > A particular case in point: xsd:dateTime.
> >
> >   "2012-07-31T17:16:00+01:00"^^xsd:dateTime
> >
> > represents the same point in time as
> >
> >   "2012-07-31T16:16:00Z"^^xsd:dateTime
> 
> No, it doesn't. This is a common misunderstanding regarding date
> times. The time zone is NOT a meaningless value. xsd:dateTime happily
> gets this right in the timezoneCanonicalFragmentMap
> http://www.w3.org/TR/xmlschema11-2/#f-tzCanFragMap

Can you explain?  I just tested the above example using the Perl
DateTime::Format::XSD library (to be sure I hadn't made a silly typo),
and it says that they represent the exact same point in time.  If you
think that library is wrong, I'd like to know why.
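
The same check can be sketched in Python. This is only an illustration, not the Perl test mentioned above: the helper names parse_xsd_datetime and to_utc_lexical are hypothetical, only the simple lexical forms shown in this thread are handled, and the "Z" output is an assumed normalization rather than a canonical form defined by any spec.

```python
from datetime import datetime, timezone


def parse_xsd_datetime(lexical: str) -> datetime:
    """Parse a timezoned xsd:dateTime lexical form (simple cases only)."""
    # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11,
    # so rewrite it as an explicit zero offset first.
    if lexical.endswith("Z"):
        lexical = lexical[:-1] + "+00:00"
    value = datetime.fromisoformat(lexical)
    if value.tzinfo is None:
        raise ValueError("no timezone: cannot be mapped to a single instant")
    return value


def to_utc_lexical(lexical: str) -> str:
    """Rewrite a timezoned lexical form with a "Z" timezone (assumed form)."""
    utc_value = parse_xsd_datetime(lexical).astimezone(timezone.utc)
    # isoformat() renders UTC as "+00:00"; swap that suffix for "Z".
    return utc_value.isoformat().replace("+00:00", "Z")


a = "2012-07-31T17:16:00+01:00"
b = "2012-07-31T16:16:00Z"

print(a == b)                                          # lexical forms differ
print(parse_xsd_datetime(a) == parse_xsd_datetime(b))  # instants compared
print(to_utc_lexical(a) == to_utc_lexical(b))          # normalized strings
```

Python compares timezone-aware datetimes by the instant they denote, so the second comparison ignores the offset the value was written with. The third comparison shows the point of the proposal: once both literals are rewritten with "Z", plain string comparison is enough.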

> 
> >
> > but the strings are not the same.  This could be avoided by encouraging
> > a canonical serialization such as dateTimeStamp
> > http://www.w3.org/TR/xmlschema11-2/#dateTimeStamp
> > in which the timezoneFrag is required to be "Z".  (I've just filed a
> > bugzilla report on XML Datatypes to ask for such a canonicalization
> > https://www.w3.org/Bugs/Public/show_bug.cgi?id=18452
> > because there doesn't seem to be one defined currently.)
> >
> > How forcefully such canonicalization should be encouraged is a matter
> > for debate.  I do not think it should be a "MUST".  "SHOULD" would be
> > fine, as there are good reasons why someone may want to generate
> > non-canonical literals.  But it may also be good enough to just put an
> > editorial note in the spec saying that "RDF generators are encouraged to
> > generate literals in a standard, canonical form that allows simple
> > string comparison to test for equality and greater-than/less-than when
> > possible".
> 
> I would object to either MUST or SHOULD. In many systems preserving the
> original lexical form is an important feature. 

I agree that preserving the lexical form is important for many
applications, and those should not perform canonicalization.  The
RFC2119 definition of "SHOULD" specifically allows deviation for good
reason:
http://www.ietf.org/rfc/rfc2119.txt 
[[
3. SHOULD   This word, or the adjective "RECOMMENDED", mean that there
   may exist valid reasons in particular circumstances to ignore a
   particular item, but the full implications must be understood and
   carefully weighed before choosing a different course.
]]

Given this definition, why do you think "SHOULD" would be too strong?


> RDF does this well
> today and clearly defines lexical space as separate from value space.
> 
> The current working group direction is to try to specify a canonical
> serialization of both a single triple and possibly of a graph as a
> specific form of N-Triples.

Excellent!  I was not aware of this, but I strongly support the idea.

> Canonicalization doesn't stop with just
> datatypes. 

Agreed.  Datatypes just seemed like the most obvious place to start.

> This should serve the use cases that require
> canonicalization well. If there is a specific use case the current WG
> direction won't serve please send it along.

Okay.


-- 
David Booth, Ph.D.
http://dbooth.org/

Opinions expressed herein are those of the author and do not necessarily
reflect those of his employer.

Received on Wednesday, 1 August 2012 19:13:04 UTC