Re: Datatype question from Geoff Chappell on 2002-06-25 (www-rdf-interest@w3.org from June 2002)

From: Geoff Chappell <geoff@sover.net>
Date: Tue, 25 Jun 2002 06:22:18 -0400
To: "Patrick Stickler" <patrick.stickler@nokia.com>, "RDF Interest" <www-rdf-interest@w3.org>
Message-ID: <05e201c21c32$2f9a7770$825ec6d1@goat1>
----- Original Message -----
From: "Patrick Stickler" <patrick.stickler@nokia.com>
To: "ext Geoff Chappell" <geoff@sover.net>; "RDF Interest"
<www-rdf-interest@w3.org>
Sent: Tuesday, June 25, 2002 3:10 AM
Subject: Re: Datatype question


>
> On 2002-06-24 17:07, "ext Geoff Chappell" <geoff@sover.net> wrote:
>
> >
> > ----- Original Message -----
> > From: "Patrick Stickler" <patrick.stickler@nokia.com>
> > To: "ext Geoff Chappell" <geoff@sover.net>; "RDF Interest"
> > <www-rdf-interest@w3.org>
> > Sent: Monday, June 24, 2002 9:03 AM
> > Subject: Re: Datatype question
> >
> >
> > [.......]
> >>> I can see the value of the untidy literal approach to datatyping. I do
> >>> think, though, there is a practical implementation advantage to tidy
> >>> literals (which admittedly may not outweigh the cost of keeping them).
> >>
> >> There is no such practical implementation advantage to tidy literals.
> >> You can, in your implementation, employ tidy literal nodes in the triples
> >> store, so long as you preserve the semantic untidiness. There are
> >> numerous ways to optimize storage of untidy literals. So no worries there.
> >>
> >
> > It's not the storage I'm concerned about (because as you say that's easy
> > to deal with). Saying "There is no such practical implementation
> > advantage to tidy literals" is equivalent to saying there is no practical
> > implementation advantage to using anything but bnodes/existential
> > variables to identify resources - i.e. it almost completely disassociates
> > the identity of the denoted object from its label and relies upon
> > additional information to establish identity.
> >
> > It's one thing to say that multiple names may refer to the same
> > object, it's something else entirely to say that the same name can refer
> > to multiple objects.
>
>
> Exactly. That's the point. Literals (IMO) are contextual labels. They
> are interpreted within the context of some datatype.
>
> Literals are not global constants. That is what URIrefs are for.
>

Well they can be if they denote themselves; they only can't if we're using
them as names to refer to other things. Don't misunderstand me, I can see
the appeal of using them as names because that's clearly what we do in
everyday use. But (IMO) it is a significant change to RDF - systems designed
to use unambiguous names will have to be redesigned to deal with ambiguous
ones. And that's all being driven by the desire to support the inline idiom
(age x "10"), right? Insert another node and literals are just strings again
instead of names. My original post was just exploring whether this one
special, albeit common, case could be dealt with in other ways.
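
For illustration, the two shapes being compared here can be sketched in Python with triples as plain tuples. This is only a rough sketch: the property names lexicalForm and datatype and the helper age_lexical_form are made up for the example and are not taken from any RDF draft.

```python
# Triples as plain (subject, predicate, object) tuples.
# All names below are hypothetical and only serve the illustration.

# Inline idiom: the literal itself is the object of the property.
inline = [("x", "age", "10")]

# Extra-node form: a blank node stands for the value, and the literal
# hangs off it as a plain string next to an explicit datatype.
extra_node = [
    ("x", "age", "_:v1"),
    ("_:v1", "lexicalForm", "10"),
    ("_:v1", "datatype", "xsd:integer"),
]


def age_lexical_form(triples, subject):
    """Return the string recorded for subject's age under either encoding."""
    objects = [o for s, p, o in triples if s == subject and p == "age"]
    if not objects:
        return None
    obj = objects[0]
    if obj.startswith("_:"):
        # Blank node: follow it to the string it carries.
        for s, p, o in triples:
            if s == obj and p == "lexicalForm":
                return o
        return None
    # Inline idiom: the object is the string itself.
    return obj
```

In the extra-node form the string "10" is only ever a string, and the thing with an age-like identity is the blank node.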


> Do you really expect the literal "1984" in all the following cases to
> refer to the same value?
>
>    ABook title "1984" .        ("1984")
>    OurTown population "1984" . (1984, decimal encoding)
>    Widget productCode "1984" . (6532, hexadecimal encoding)
>    Bob yearOfBirth "1984" .    (calendar year 1984)
>
> i.e.
>
>    title rdfs:range xsd:string .
>    population rdfs:range xsd:integer .
>    productCode rdfs:range xyz:hexInt .
>    yearOfBirth rdfs:range xsd:gYear .
>
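
As a rough sketch of the example above, the same string "1984" can be run through a different lexical-to-value mapping depending on the range of the property. The two tables and the value_of helper are hypothetical, xyz:hexInt is the made-up datatype from the example, and the gYear mapping is heavily simplified.

```python
# Hypothetical lexical-to-value mappings, keyed by datatype name.
# xyz:hexInt is the example's own invented datatype; the gYear
# mapping is a simplification that just tags the year number.
LEXICAL_TO_VALUE = {
    "xsd:string": lambda s: s,
    "xsd:integer": lambda s: int(s, 10),
    "xyz:hexInt": lambda s: int(s, 16),
    "xsd:gYear": lambda s: ("gYear", int(s, 10)),
}

# The rdfs:range statements from the example, as a lookup table.
RANGE = {
    "title": "xsd:string",
    "population": "xsd:integer",
    "productCode": "xyz:hexInt",
    "yearOfBirth": "xsd:gYear",
}


def value_of(prop, literal):
    """Interpret a literal in the context of the property's range datatype."""
    return LEXICAL_TO_VALUE[RANGE[prop]](literal)


values = {prop: value_of(prop, "1984") for prop in RANGE}
```

One string, four different values: the interpretation depends entirely on the datatype context supplied by the property.
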
> > It's the cost of "preserv[ing] the semantic untidiness"
> > that I'm concerned about
>
> Well, as Jeremy Carroll has so
> accurately pointed out: there's untidiness in there somewhere.
> I.e., those literals *do* mean different things. They are not
> global constants.
>
> > because in many implementations it results in
> > cross-product behavior followed by functional equality testing to
> > winnow the values.
> >
> > I'm sure it's not insurmountable but I think it's fair to say there
> > will be a measurable cost.
>
> I'm not sure I fully follow what you mean here.
>

Sorry, I was rushing when I wrote it. My only point was that queries with
multiple conditions are more efficient if those conditions have common
bindings - e.g.  I'd rather be waiting for my system to process "{?a ?b ?c}
and {?c ?d ?e}"  than "{?a ?b ?c} and {?d ?e ?f} and
somefunc(?c)=somefunc(?d)". I realize that even in a tidy world we'd often
end up with this pattern - i.e. some types of values would require
case-insensitive comparisons, canonicalized comparisons, etc. to be useful. Of
course not all literals are representations of datatype values. Some are
just strings. For example a description is just a description. Changes made
to literals to support datatyping also of course have an effect upon these
"plain" literals.

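The difference between the two query shapes can be sketched roughly as follows. This is a toy illustration over lists of tuples, not how any particular RDF store works; the function names and sample data are invented for the example.

```python
from collections import defaultdict


def join_on_shared_node(left, right):
    """{?a ?b ?c} and {?c ?d ?e}: index the right side by subject once,
    then probe the index with each left-hand object."""
    by_subject = defaultdict(list)
    for triple in right:
        by_subject[triple[0]].append(triple)
    return [(l, r) for l in left for r in by_subject.get(l[2], [])]


def join_on_function(left, right, func):
    """{?a ?b ?c} and {?d ?e ?f} with func(?c) == func(?d): with no shared
    binding, this naive version tests every left/right pair."""
    return [(l, r) for l in left for r in right if func(l[2]) == func(r[0])]


# Invented sample data: two spellings that differ only by case.
left = [("alice", "knows", "Bob")]
right = [("Bob", "age", "10"), ("bob", "age", "11"), ("carol", "age", "12")]

exact = join_on_shared_node(left, right)
folded = join_on_function(left, right, str.lower)
```

The first join does work proportional to the two inputs plus the matches. The second, as written, does work proportional to their product, although an engine that knew the function in advance could index on its output instead.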

> The requirements/cost for comparing datatyped values will be the same
> whether literals are tidy or untidy. This is because (a) lexical
> forms are not constrained to be canonical. I.e. the integer value
> 5 can be represented by an infinite number of lexical forms
> ("5", "05", "5.0", "5.0000...", etc.) and thus, simple string
> comparison will never ensure that two values are not actually
> equivalent, even if they are not string equal. And (b) value
> comparisons must be made externally to RDF by applications which
> have full knowledge of the datatypes in question. RDF does not
> define datatypes. Datatypes are fully opaque at the RDF level.
> All RDF can provide is an association between a lexical form
> (literal) and the datatype according to which it should be
> interpreted. And the definition of what a datatype is tells
> us that for any given lexical form, for a particular datatype,
> that lexical form maps to one and only one value.
>
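
A small sketch of point (a), using Python's decimal.Decimal as a stand-in for a decimal-like datatype's lexical-to-value mapping. That choice is an assumption made for the illustration and is not something RDF or XML Schema prescribes.

```python
from decimal import Decimal

# Several lexical forms that a decimal-like datatype maps to the value five.
forms = ["5", "05", "5.0", "5.0000"]

# Compared as strings, the forms are all distinct.
string_equal = len(set(forms)) == 1

# Compared as values, after applying the (assumed) lexical-to-value
# mapping, they collapse to a single value.
value_equal = len({Decimal(f) for f in forms}) == 1
```

Here string_equal is False and value_equal is True, so string identity alone cannot settle whether two literals denote the same value.
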
> What untidy literals do give us is an explicit denotation
> of that single value to which a given lexical form maps
> (even though we can't know exactly which value that is at the RDF
> level). And thus, in conjunction with an external application
> with full knowledge of datatypes, equivalence relations between
> value denotations can be determined and expressed in terms of
> the RDF graph. And since the same lexical form can map to different
> values according to different datatypes (per the examples above)
> that denotation cannot be tidy, if RDF is to accommodate the
> semantic untidiness which is inherent in literals.
>
> Now, we *could* say that insofar as RDF is concerned, all those
> property values are just strings -- and leave it up to applications
> to deduce what is meant by them. But that forgoes the ability
> to define equivalence relations between actual values, since
> the values may not have any denotation in the graph (as would
> be the case with the inline idiom).
>
> I'd rather have my RDF be explicit about the meaning of the literal
> rather than leaving it up to some external application to guess what
> is meant (and possibly get it wrong). Of course, this view is
> not necessarily shared by everyone.
>
> As an aside, the WG should be publishing some detailed documents
> about datatyping very soon, so rather than simply re-iterate
> what is already explained in detail in the WD, and also that
> which is currently under debate by the WG, I'll ask you to
> wait just a bit for all the gory details.

Yeah, I know I'm jumping the gun. Thanks for taking the time to reply
anyway.

>
> Cheers,
>
> Patrick
>
> --
>
> Patrick Stickler              Phone: +358 50 483 9453
> Senior Research Scientist     Fax:   +358 7180 35409
> Nokia Research Center         Email: patrick.stickler@nokia.com
>
Received on Tuesday, 25 June 2002 05:53:19 UTC