Re: Datatypes, abstract syntax / data model. from Patrick Stickler on 2002-09-16 (w3c-rdfcore-wg@w3.org from September 2002)

From: Patrick Stickler <patrick.stickler@nokia.com>
Date: Mon, 16 Sep 2002 09:53:07 +0300
To: "ext Jan Grant" <Jan.Grant@bristol.ac.uk>, "RDFCore Working Group" <w3c-rdfcore-wg@w3.org>
Message-ID: <005601c25d4d$b68ffd70$864416ac@NOE.Nokia.com>
A few comments/questions:

1. If the datatype+lexicalform (non-canonical) label unambiguously
denotes the value in question, why is a canonical form necessary?
What information does it add? What benefit does it add? (we are
talking here about the abstract graph, so issues of compression
are not relevant). 

It seems that the apparent benefit of canonical lexical forms is
to allow applications that don't grok the datatype more consistently
determine equality and not get tripped up on string unequal synonyms.

However, this benefit is an illusion. Unless the non-canonical lexical
forms are actually mapped to canonical lexical forms, such benefit
cannot be realized, and in order to map the non-canonical lexical forms
to canonical forms, *some* application must understand the datatype,
and therefore, it could just provide the value, or provide a comparision
service for the application that doesn't grok the datatype. Thus, I
see this as (a) unnecessary and (b) useless in actual practice.

2. What right does RDF have to assert a canonical lexical space for
a datatype which does not have one, and what if two applications
disagree about what that canonical lexical space is? Who says what
the canonical form of <foo:blarg>"blech" is if the creator of
<foo:blarg> does not say? Even if I agreed with the benefit of
having canonical forms in the abstact graph, I do not see how we
can justify any such assertions for arbitrary datatypes.

3. You mention that the originally specified lexical form may include
spurious details. How do you know? Is it RDF's right to say what
features are spurious details? They may be, but they may not be.

That's up to the original author to know.

Having worked for many years doing data mining
and automated conversion, I have come to embrace a principle that has
turned out to be golden on countless occasions and that is: never
discard anything you don't have to or over-normalize the data, because
what may appear to be irrelevant now may turn out to be the key to
understanding the original intent later. Thus, I am very suspect of
operations such as being proposed, even if "abstract", especially
since it is unecessary and does not really provide any true benefit
in practice (see #1 above).

The original author chose to use a particular lexical form, and unless
it is necessary to convert that (even abstractly) to some other form
in order to consistently and unambiguously capture the originally
intended meaning (and it's not necessary), we should leave it as it is.

Cheers,

Patrick

[Patrick Stickler, Nokia/Finland, (+358 50) 483 9453, patrick.stickler@nokia.com]


----- Original Message ----- 
From: "ext Jan Grant" <Jan.Grant@bristol.ac.uk>
To: "RDFCore Working Group" <w3c-rdfcore-wg@w3.org>
Sent: 13 September, 2002 18:23
Subject: Datatypes, abstract syntax / data model.


> 
> My vote was for Jos' suggestion, which was: a datatyped literal is
> held (in the abstract syntax) to be the pair
> 
> ( datatype uri, clex )
> 
> ..where clex is the canonical lexical form for the datatype; secondly,
> to make the assertion that such a canonical form exists.
> 
> 
> Patrick S asked, "what if there is no such form?" Well, I choose to
> accept the axiom of choice, which (I think) covers this*. Anyhoo -
> that's the sum total of the proposal as I heard it from Jos, and seemed
> a happy middle-ground. This is why:
> 
> 
> >From an implementation point of view, using "values" and sneaky DB
> primitive types is acceptable. This works, because the DB primitive type
> might be viewed as simply another non-canonical lexical form.
> 
> >From a roundtripping point of view, this is acceptable, since we are
> basically trying to roundtrip from abstract syntax (or data model) to
> abstract syntax, not preserve "spurious" details about the lexical
> forms. If we don't know anything about the lexical form, then no harm is
> done.
> 
> Also, note that no canonicalisation is _required_ from an
> implementation, and that's OK too. If an implementation doesn't know
> about a datatype, there's absolutely nothing it can do better than that
> (the open-world view); if it happens to know about closed-world XSD
> stuff then it may be able to offer more conclusions (Jenny and John have
> the same age).
> 
> jan
> 
> PS. In many ways, this is just another way of spelling out why Jeremy and
> Patrick differ in opinion, but can see each others' point of view - I
> think.
> 
> Some of the reason for this differing of opinion seems to be the subtle
> semantics attached to words like "abstract", "lexical" and "syntax". I
> dunno what to do about that.
> 
> 
> * that is, for any datatype, there exists a canonicalisation function
> which selects from the set of equivalence sets of lexical forms one
> preferred form.
> 
> -- 
> jan grant, ILRT, University of Bristol. http://www.ilrt.bris.ac.uk/
> Tel +44(0)117 9287088 Fax +44 (0)117 9287112 http://ioctl.org/jan/
> User interface? I hardly know 'er!
>
Received on Monday, 16 September 2002 02:53:10 UTC