Re: Input sought on datatyping tradeoff from David Booth on 2002-07-24 (www-rdf-interest@w3.org from July 2002)

From: David Booth <dbooth@w3.org>
Date: Tue, 23 Jul 2002 23:46:53 -0400
To: www-rdf-comments@w3.org
Cc: www-rdf-interest@w3.org
Message-Id: <5.1.0.14.0.20020723162307.03b08ec8@localhost>

Brian McBride <bwm@hplb.hpl.hp.com> writes:
>If we choose the untidy option, the value of the object of the statement 
>is unknown from this statement alone; a range constraint is required to 
>determine the value from the literal string:
>
>         <jenny> <ageInYears> "10" .
>         <ageInYears> <rdfs:range> <xsd:decimal> .
>
>With a range constraint, we can know that the object of the property is 
>the integer 10.

I have three comments: the first is a simple answer to the A versus D 
question; the second is on the "tidy" versus "untidy" alternatives; and the 
third is on data types in general, which has direct bearing on the question 
at hand.

1. It is clearly more important for test A to be true than test D, in order 
for RDF to be Web scalable, as explained below.  However, it should be 
possible for both test A and test D to be true, if you apply different 
equality operators to tests A and D.  Test A could check for literal string 
equality; test D could check for integer equality.  No contradiction and no 
unnecessary restriction.  Further explanation is below.
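
As a minimal sketch of the two-operator idea, assuming Python and the 
hypothetical function names equal_as_strings and equal_as_integers (neither 
is part of any RDF specification), the two kinds of equality might look 
like this.  The literals are illustrative only and are not the actual 
statements of test A or test D:

```python
def equal_as_strings(a: str, b: str) -> bool:
    """Literal string equality: needs no datatype information at all."""
    return a == b


def equal_as_integers(a: str, b: str) -> bool:
    """Integer equality: parse both lexical forms, then compare the values."""
    return int(a) == int(b)


# The same pair of literals can give different answers under each operator.
print(equal_as_strings("10", "10"))    # True
print(equal_as_strings("10", "010"))   # False
print(equal_as_integers("10", "010"))  # True
```

Because each question names its own operator, the two answers do not 
contradict each other.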

2. It seems to me that the "untidy" option would be unscalable and 
incompatible with Web architectural principles.  Allowing "anyone to say 
anything about anything" means that people should be able to make 
statements about UNtyped data as well as typed data.  It would be very bad 
if the RDF processor were to throw up its hands and say "sorry, I don't 
know if they're equal" even if the application that's asking doesn't care 
at all about datatypes.

(Perhaps the statements didn't even originate as RDF.  They may have 
originated in non-RDF XML, which was transformed through some kind of 
mapping to generate RDF.  There's a LOT of XML and other structured data 
around whose semantics should be usable once the data is mapped to RDF.  In 
fact, most RDF probably won't originate as RDF.)

If I don't have data type information, I should still be able to make 
*some* kind of inferences and comparisons with my data.  I should not be 
dead in the water.  If I *do* have (complete) data type information, and I 
wish to use it, then I should be able to make *additional* inferences about 
my data.

Remember, I may be comparing data from wildly different sources across the 
Web -- some having very complete type information, and some having little 
or no type information.  For Web-level scalability, complete type 
information must not be required.  It is imperative that I still be able to 
make sensible literal comparisons in the absence of more sophisticated type 
information.

Furthermore, requiring <ageInYears> to have a datatype (beyond string) is 
almost like limiting it to have only a single datatype or interpretation, 
and I don't think RDF should have this limitation.  (In other words I think 
RDF should allow something to simultaneously have more than one type, like 
multiple inheritance: HP-LaserJet-3100 is-a-kind-of Printer, but also 
HP-LaserJet-3100 is-a-kind-of FaxMachine.  But please correct me if you 
think I'm wrong here!)  Would the requirement for a sensible equality 
comparison be that there exist at-least-one data type for <ageInYears>?  Or 
would the requirement be that there exist one-and-only-one data type for 
<ageInYears>?  If more than one data type is permitted for <ageInYears>, 
then how do we know which one should be used in the comparison?  (See point 
3 below for more on this.)

A typed comparison is very different from an untyped string comparison.  It 
involves transforming the original (string) representation into a 
type+value pair and then comparing both the types and the values.  This 
transformation is important and should be explicitly represented.  It seems 
like the "untidy" option would gloss over this transformation and require 
it to be built into any interpretation of the data, rather than being an 
explicit overlay that one may optionally apply.
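
One way to picture the transformation as an explicit, optional overlay is 
the following Python sketch.  The names TypedValue, PARSERS, interpret and 
typed_equal are hypothetical, and the two datatypes in the table are 
assumptions chosen for illustration:

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class TypedValue:
    """The result of the explicit string -> (type, value) transformation."""
    datatype: str
    value: Any


# Hypothetical lexical-to-value mappings, one per datatype name.
PARSERS: Dict[str, Callable[[str], Any]] = {
    "string": str,
    "integer": int,
}


def interpret(lexical: str, datatype: str) -> TypedValue:
    """Apply the transformation explicitly; nothing here happens implicitly."""
    return TypedValue(datatype, PARSERS[datatype](lexical))


def typed_equal(a: TypedValue, b: TypedValue) -> bool:
    """A typed comparison checks both the types and the values."""
    return a.datatype == b.datatype and a.value == b.value


print(typed_equal(interpret("10", "integer"), interpret("010", "integer")))  # True
print(typed_equal(interpret("10", "integer"), interpret("10", "string")))    # False
```

The raw strings stay available for untyped comparison; interpret() is only 
called by an application that chooses to apply a datatype.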

Bottom line: Data types should NOT be required.  They should provide 
additional benefit if used.  The "untidy" option would require datatypes in 
order to do any sensible processing, which is not Web scalable, and is 
therefore a BAD option.

3.  Bill de hÓra <dehora@eircom.net> writes in 
http://lists.w3.org/Archives/Public/www-rdf-interest/2002Jul/0059.html :

> > Test A:
> >
> >    <Jenny> <ageInYears> "10" .
> >    <John>  <ageInYears> "10" .
> >
> > Should an RDF processor conclude that the value of the ageInYears
> > properties for Jenny and John are the same?
>
>[The processor should ask:] what does ageInYears say the answer to this
>question is? [The answer is] it depends on the semantics of
>the RDF property, period.
>
>. . .
>
>I would certainly add these to your test case:
>
>    <John>  <ageInYears> "ten" .
>    <John>  <ageInYears> "Ten" .
>
>which should make clear the point about why properties need to be
>deferred to for questions such as this, *unless* literals are given
>types.

+1 to Bill's comments, except that I think it is perfectly reasonable (and 
natural) for multiple kinds of comparison to exist (for different data 
types) and for a string literal comparison to be already known to a 
processor on boot-up, without loading any rule sets.

A string comparison is not the same as an integer comparison, which is not 
the same as a myPrivateDataType comparison, just as a string is not the 
same as an integer, which is not the same as a myPrivateDataType value.

If you want to compare two things as anything other than literal strings, 
to see if they are equal, you not only need to know the data types of the 
things that you wish to compare, but you also need to know what kind of 
COMPARISON you wish to make.  I.e., you need to know the data type of the 
comparison operator that you wish to use.  If a thing is only permitted to 
have one data type (an unwise restriction, in my opinion), and the data 
types of the things that you wish to compare happen to be the same, then 
the processor can easily guess which comparison operator to use.  However, 
if things can have more than one data type, and/or you wish to compare 
things of different types, then you need to know the data type of the 
comparison you wish to make, and seek a coercion from the things' initial 
types to the input type of the comparison operator.  In other words, the 
choice of comparison operator depends on the application, or the kind of 
question that you are trying to ask about the data -- not only on the data 
itself.

For example, if we ask an RDF processor whether an ageInYears of "10" and a 
filmTitle of "10" are equal as strings, the answer should be yes.  If we 
ask the processor whether they are equal as integers, the answer should be 
no (assuming a filmTitle has no defined coercion to type integer).  And if 
we ask the processor whether they are equal as values of myPrivateDataType, 
then the answer should depend on: (a) whether ageInYears and filmTitle both 
have coercions to myPrivateDataType; and (b) the equality rules for 
myPrivateDataType.
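
The example above can be sketched in Python as follows.  The COERCIONS 
table and the equal_as function are hypothetical names, and the particular 
coercions listed are assumptions that mirror the example (no coercion from 
filmTitle to integer, and none at all to myPrivateDataType).  Plain value 
equality stands in for the equality rules of each type:

```python
from typing import Callable, Dict, Tuple

# Hypothetical coercion table: (property, target datatype) -> parser.
# There is deliberately no ("filmTitle", "integer") entry.
COERCIONS: Dict[Tuple[str, str], Callable[[str], object]] = {
    ("ageInYears", "string"): str,
    ("filmTitle", "string"): str,
    ("ageInYears", "integer"): int,
}


def equal_as(datatype: str, left: Tuple[str, str], right: Tuple[str, str]) -> bool:
    """Compare two (property, lexical form) pairs under the named datatype.

    Returns False when either side has no coercion to that datatype.
    """
    values = []
    for prop, lexical in (left, right):
        coerce = COERCIONS.get((prop, datatype))
        if coerce is None:
            return False  # no defined coercion, so not equal as this type
        values.append(coerce(lexical))
    return values[0] == values[1]


age = ("ageInYears", "10")
title = ("filmTitle", "10")
print(equal_as("string", age, title))             # True
print(equal_as("integer", age, title))            # False
print(equal_as("myPrivateDataType", age, title))  # False
```

The datatype argument is supplied by the caller, which reflects the point 
that the choice of comparison belongs to the question being asked and not 
only to the data.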


-- 
David Booth
W3C Fellow / Hewlett-Packard
Telephone: +1.617.253.1273
Received on Tuesday, 23 July 2002 23:45:54 UTC