Re: Input sought on datatyping tradeoff from Karsten Otto on 2002-07-12 (www-rdf-comments@w3.org from July to September 2002)

From: Karsten Otto <otto@math.fu-berlin.de>
Date: Fri, 12 Jul 2002 16:41:37 +0200 (CEST)
To: Brian McBride <bwm@hplb.hpl.hp.com>
cc: www-rdf-comments@w3.org
Message-ID: <Pine.LNX.4.44.0207121637370.20174-100000@hobbes.inf.fu-berlin.de>
Hello!

Short answer: I vote in favor of Test D. When trying to find information
on the Semantic Web, I prefer to get exact results, not some fuzzy
almost-match. However, I am also aware of the importance of Test A1.

I believe the enforcement of canonical data representation provides
a solution here. Special cases like Test A1 will work, and search
engines do not have to "know" thousands of possible data types for
simple comparison. This may not look pretty in the triple dumps,
but then RDF is primarily intended for machine processing.

Long answer follows...


> Let's explain the basic ideas behind our approach to datatyping.  The aim
> is to define how datatype values, e.g. integers, dates etc should be
> represented in RDF.  We are building on the XML Schema datatypes
> specification.

To answer this question, I want to distinguish between smart applications
(agents etc.) that understand certain vocabularies, that is, have intrinsic
knowledge about them, and generic applications (search engines etc.) which
do not, and therefore have to work solely with the available triples.

I assume a scope of generic applications, especially search engines, for
this discussion. I believe they are crucial for the success of the
Semantic Web, because they will enable people (and agents for that matter)
to find the information they are looking for. It does not help if there
are scattered pieces of information which are formally correct according
to some style, but cannot be found and used because they do not match
the style of the request.

> It is important in getting the semantics correct that we distinguish
> between a datatype value, e.g. the integer 10 and a lexical representation
> of the value, e.g. the string "10".

In that light, my answer also depends on the restriction to canonical
representation styles for literals, that is, if two values are equal,
their string representation must be equal too.
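
A minimal sketch of what such a canonical style could look like for
decimals, in Python. The function name canonical_decimal and the chosen
style (no sign for positive values, no leading or trailing zeros) are
assumptions made for illustration only; this is not the canonical form
defined by XML Schema, and Python's Decimal parser accepts some forms
(such as exponents) that xsd:decimal does not.

```python
from decimal import Decimal


def canonical_decimal(lexical: str) -> str:
    """Map a decimal lexical form to one agreed-upon string (assumed style)."""
    value = Decimal(lexical.strip())
    if value == 0:
        return "0"  # collapses "-0", "0.00" and friends
    # normalize() drops trailing zeros; "f" avoids exponent notation.
    return format(value.normalize(), "f")


# Equal values end up with equal strings, so a generic application can
# compare them without knowing anything about decimals.
print(canonical_decimal("010") == canonical_decimal("+10.00"))
```

With every producer emitting only the output of such a function, string
equality and value equality coincide for that type.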

>Test A:
>
>   <Jenny> <ageInYears> "10" .
>   <John>  <ageInYears> "10" .
>
>Should an RDF processor conclude that the value of the ageInYears
>properties for Jenny and John are the same?

Yes, but only with a canonical representation style. In that case the
use of the same property means the same range type, whatever it is,
which in turn means the same canonical representation style. Thus,
as the literals have the same representation, they also denote the
same value.

No, if we do not have a canonical representation style. In that case,
there is insufficient data. We cannot determine the type or actual
value of the literal. Remember that <ageInYears> looks like
<vhfgdj> to a generic application, and its values might as well be
encoded in binary for example.

>Test A2:
>
>   <Jenny> <ageInYears> "10" .
>   <Jenny> <testScore>  "10" .
>
>Should an RDF processor conclude that the value of Jenny's ageInYears
>property is the same as the value of Jenny's testScore property?

No, in any case. Again we have insufficient data. <ageInYears> and
<testScore> could very well have completely different range types.
We cannot determine if the given literals have the same canonical
representation style, and thus cannot compare them.

Note that we could do that if we knew the range types. If both were
for example known to be decimals, we could compare them and find them
to be equal.
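
The reasoning for Test A and Test A2 can be sketched as a three-valued
comparison, assuming all literals are already in canonical form. The
function same_value, the tuple layout and the ranges dictionary are
hypothetical names for illustration; returning None for two different
known range types is my own simplification, since that case is not
settled above.

```python
from typing import Dict, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, property, literal)


def same_value(a: Triple, b: Triple, ranges: Dict[str, str]) -> Optional[bool]:
    """True/False when comparable, None when the data is insufficient."""
    _, prop_a, lit_a = a
    _, prop_b, lit_b = b
    if prop_a == prop_b:
        # Same property, hence the same (possibly unknown) range type
        # and the same canonical style: compare the strings.
        return lit_a == lit_b
    type_a, type_b = ranges.get(prop_a), ranges.get(prop_b)
    if type_a is None or type_b is None or type_a != type_b:
        return None  # cannot tell whether the styles match
    return lit_a == lit_b


jenny_age = ("Jenny", "ageInYears", "10")
john_age = ("John", "ageInYears", "10")
jenny_score = ("Jenny", "testScore", "10")

print(same_value(jenny_age, john_age, {}))     # Test A
print(same_value(jenny_age, jenny_score, {}))  # Test A2, ranges unknown
```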

>Test A3:
>
>   <Jenny> <ageInYears>   "10" .
>   <Film>  <title>        "10" .
>
>Should an RDF processor conclude that the value of Jenny's age property is
>the same as the value of the Film's title property?  If the value the
><ageInYears> property is an integer, and the value of the <title> property
>is a string, they are not the same thing and are thus not equal.

No, again we have insufficient data. Note that a smart application could
do this however. Assume it knows the meaning and type of <ageInYears>
and <title>. Then it could "typecast" the <title> value to an <ageInYears>
value (decimals in this case) and see if they match.
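
A small sketch of such a "typecast" in a smart application, which is
assumed to know that <ageInYears> values are decimals and <title> values
are strings. The function name title_matches_age is hypothetical, and
Python's Decimal is used as a stand-in for a real xsd:decimal parser.

```python
from decimal import Decimal, InvalidOperation


def title_matches_age(title_literal: str, age_literal: str) -> bool:
    """Try to read the title as a decimal and compare it with the age."""
    try:
        return Decimal(title_literal) == Decimal(age_literal)
    except InvalidOperation:
        # The title is not a valid decimal, so it cannot match an age.
        return False


print(title_matches_age("10", "10"))
print(title_matches_age("Ten", "10"))
```

A generic application cannot do this, because it does not know which
conversion, if any, makes sense between the two properties.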

>Now for a different kind of test.  How do the values of the two idioms relate?
>
>Test D:
>
>   <Jenny>      <ageInYears> "10" .
>   <ageInYears> rdfs:range xsd:decimal .
>
>   <John>  <ageInYears>   _:a .
>   _:a     xsdr:decimal   "10" .
>
>Should an RDF processor conclude that Jenny and John have the same
>age?  [Note: in this example the range constraint is expressed using
>rdfs:range.  We may have to introduce a special datatyping range property,
>but that is an independent detail for now.]
>

Yes, we know the literals have the same type. Besides, if we have two
ways of expressing type, of course they should be treated as equivalent.

If we have canonical representation styles, we can compare the literal
strings and conclude the value is the same. Otherwise, we have to know
the given data type xsdr:decimal, that is, we must know how to parse
its various representations, and compare the resulting value as
appropriate for this type. This is easy for simple types, but just
think of some xsdr:date...
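
The two comparison strategies for Test D can be sketched as follows. The
flat tuple representation, the helper names typed_literal and same_age,
and the mapping from an xsdr: property to the xsd: type of the same local
name are all assumptions made for illustration. Literals and node names
share one string type here, which a real processor would not do.

```python
from decimal import Decimal

TEST_D = [
    ("Jenny", "ageInYears", "10"),
    ("ageInYears", "rdfs:range", "xsd:decimal"),
    ("John", "ageInYears", "_:a"),
    ("_:a", "xsdr:decimal", "10"),
]


def typed_literal(triples, subject, prop):
    """Return (datatype, lexical form) for subject's prop, or None."""
    obj = next((o for s, p, o in triples if s == subject and p == prop), None)
    if obj is None:
        return None
    if obj.startswith("_:"):
        # Second idiom: the blank node holds the literal under an
        # xsdr: property that names the datatype.
        for s, p, o in triples:
            if s == obj and p.startswith("xsdr:"):
                return "xsd:" + p[len("xsdr:"):], o
        return None
    # First idiom: the datatype comes from the property's declared range.
    datatype = next(
        (o for s, p, o in triples if s == prop and p == "rdfs:range"), None
    )
    return None if datatype is None else (datatype, obj)


def same_age(triples, a, b, canonical=True):
    """True/False if comparable, None if the data is insufficient."""
    ta = typed_literal(triples, a, "ageInYears")
    tb = typed_literal(triples, b, "ageInYears")
    if ta is None or tb is None or ta[0] != tb[0]:
        return None
    if canonical:
        return ta[1] == tb[1]  # plain string comparison suffices
    if ta[0] == "xsd:decimal":
        return Decimal(ta[1]) == Decimal(tb[1])  # type-specific parsing
    return None  # a type this application does not know how to parse


print(same_age(TEST_D, "Jenny", "John"))
```

Note that only the non-canonical branch needs per-type code; each
additional datatype would require another parser there.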

Conclusion:

I vote for strong data typing, while enforcing the use of canonical
representation styles. That way, search engines can provide exact
and type-safe query results. At the same time, they can safely compare
values by their string representation, without having to really
understand a given type. If we have an extensible type system,
this ability is crucial for handling a potentially very large number
of obscure and complex data types.

Cheers,
Karsten
Received on Friday, 12 July 2002 10:41:41 UTC