Re: Datatypes, syntax and equality

On Thu, 11 Jul 2002, Brian McBride wrote:
>
> The RDFCore WG is producing a proposal for how XML Schema datatypes should
> be used in RDF.  We would like some guidance on a particular tradeoff we
> have to make.
>
> The WG requests that you send your considered answers to
> www-rdf-comment@w3.org.  Please can we have all responses by 26th July
> 2002.  Questions and discussion should take place on this list.
>
> INTRODUCTION TO DATATYPES
> =========================
>
> Let's explain the basic ideas behind our approach to datatyping.  The aim
> is to define how datatype values, e.g. integers, dates etc should be
> represented in RDF.  We are building on the XML Schema datatypes
> specification.
>
> It is important in getting the semantics correct that we distinguish
> between a datatype value, e.g. the integer 10 and a lexical representation
> of the value, e.g. the string "10".
>
> We are proposing two principal idioms for representing datatyped
> information.  The first looks like this:
>
>    <Jenny> <age>          _:a .
>    _:a     <xsdr:decimal> "10" .
>
> This can be written in RDF/XML like this.
>
>    <rdf:Description rdf:about="Jenny">
>      <foo:age xsdr:decimal="10"/>
>    </rdf:Description>
>
> Here the b-node _:a denotes the integer 10 which can be represented in
> decimal form as the string "10".
>
> This idiom treats an XML schema datatype as a mapping from a value to a
> lexical representation of the value; this mapping is represented in RDF by
> a property.
>
> We believe this idiom to be quite straightforward, but not sufficient on
> its own because it is common practise to write things like:
>
>    <jenny> <age> "10" .
>
> where the author of this fragment of RDF means to represent the fact that
> Jenny's age is the number 10.  This is the second idiom, which is where we
> need some guidance.

Where in this statement about Jenny does it say that Jenny's age is the
*number* 10?  All it is saying is that Jenny's age is "10".

In the RDF model theory it is stated:

"An RDF literal has three parts (a bit, a character string, and a language
tag), but we will treat them simply as character strings, since the other parts
of the literal play no role in the model theory."

It is stated elsewhere that the character string is a fully normalized Unicode
string.  In other words, in RDF the statement

   <rdf:Description rdf:about="Jenny">
     <foo:age>10</foo:age>
   </rdf:Description>

is currently equivalent to:

   <rdf:Description rdf:about="Jenny">
     <foo:age xsd:string="10"/>
   </rdf:Description>

> SOME TEST CASES
> ===============
>
> It is here that we need some advice, because we have a choice to make in
> the way we define the formal semantics.
>
> A few simple test cases:
>
> Test A:
>
>    <Jenny> <ageInYears> "10" .
>    <John>  <ageInYears> "10" .
>
> Should an RDF processor conclude that the value of the ageInYears
> properties for Jenny and John are the same?

Yes.

> There are variations on this test which should be considered before answering.
>
> Test A2:
>
>    <Jenny> <ageInYears> "10" .
>    <Jenny> <testScore>  "10" .
>
> Should an RDF processor conclude that the value of Jenny's ageInYears
> property is the same as the value of Jenny's testScore property?

Yes.

> Test A3:
>
>    <Jenny> <ageInYears>   "10" .
>    <Film>  <title>        "10" .
>
> Should an RDF processor conclude that the value of Jenny's age property is
> the same as the value of the Film's title property?  If the value the
> <ageInYears> property is an integer, and the value of the <title> property
> is a string, they are not the same thing and are thus not equal.
>
> The answer must be the same for all three of these A tests.

One source of inspiration on this problem is to look at modern programming
languages.  There are very few programming languages where "10" and 10 would be
silently and automatically converted to each other, and virtually every modern
programming language would have answered "Yes" to all of the A tests, and would
have generated an error diagnostic for the D test below.  In Java, the
expression

  Jenny.ageInYears.equals(Film.title)

would return true.  In C/C++, the expression

  strcmp(Jenny->ageInYears, Film->title)

would return 0.  Programmers don't seem to have any problem with this.

To the extent that programming languages are an indication of how to resolve
this issue, they suggest that the type of a literal should not be implicitly
determined by its context.
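
A rough sketch of this point in Python (the language choice and the record
names below are assumptions made for illustration, not part of the proposal).
Equality depends only on the values themselves, not on the property they are
attached to, and a string is never silently treated as a number:

```python
# Hypothetical records standing in for the resources in tests A, A2 and A3.
jenny = {"ageInYears": "10", "testScore": "10"}
john = {"ageInYears": "10"}
film = {"title": "10"}

# Tests A, A2 and A3: every literal is the string "10", so all compare equal.
test_a = jenny["ageInYears"] == john["ageInYears"]
test_a2 = jenny["ageInYears"] == jenny["testScore"]
test_a3 = jenny["ageInYears"] == film["title"]

# The string "10" and the number 10 are different values; there is no
# implicit conversion between them.
string_equals_number = "10" == 10

print(test_a, test_a2, test_a3, string_equals_number)
```

This prints "True True True False".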

> These test cases only relates to the situation where there are no range
> constraints on the properties.
>
> Now for a different kind of test.  How do the values of the two idioms relate?
>
> Test D:
>
>    <Jenny>      <ageInYears> "10" .
>    <ageInYears> rdfs:range xsd:decimal .
>
>    <John>  <ageInYears>   _:a .
>    _:a     xsdr:decimal   "10" .
>
> Should an RDF processor conclude that Jenny and John have the same
> age?  [Note: in this example the range constraint is expressed using
> rdfs:range.  We may have to introduce a special datatyping range property,
> but that is an independent detail for now.]

If one accepts that strong typing is a good idea, then from the first two
statements, one can infer that "10" has two types: xsd:string and xsd:decimal.
These two must necessarily be disjoint (see note below).  Therefore if both
statements are asserted, then the annotation is logically inconsistent.

In modern typed programming languages, if one asserts that a variable has a
numeric type, then setting the variable to a string will result in an error
diagnostic.

Therefore my suggested answer here is neither yes nor no.  An RDF processor
should never even reach this stage because the annotation would be
inconsistent.  If it did reach this stage, then the processor could conclude
whatever it wanted, because from an inconsistent theory every statement
follows, along with its negation.
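
The inconsistency argument can be sketched as a small check in Python.  This
is only an illustration under stated assumptions: the tuple encoding of
triples, the function names, and the DISJOINT table are all hypothetical, and
the disjointness of xsd:string and xsd:decimal is taken as given.

```python
# Assumption: the two datatypes are declared disjoint, as argued in the text.
DISJOINT = {frozenset({"xsd:string", "xsd:decimal"})}

def literal_types(triples):
    """Collect the types asserted for each literal under the tidy reading."""
    ranges = {s: o for s, p, o in triples if p == "rdfs:range"}
    types = {}
    for s, p, o in triples:
        if p == "rdfs:range" or not o.startswith('"'):
            continue
        # Tidy reading: a literal always denotes a string.
        types.setdefault(o, set()).add("xsd:string")
        # A range constraint on the property asserts a second type.
        if p in ranges:
            types[o].add(ranges[p])
    return types

def is_consistent(triples):
    """Return False if some literal is given two types declared disjoint."""
    for asserted in literal_types(triples).values():
        for pair in DISJOINT:
            if pair <= asserted:
                return False
    return True

# Test A, and the first two statements of Test D.
test_a = [
    ("Jenny", "ageInYears", '"10"'),
    ("John", "ageInYears", '"10"'),
]
test_d = [
    ("Jenny", "ageInYears", '"10"'),
    ("ageInYears", "rdfs:range", "xsd:decimal"),
]

print(is_consistent(test_a), is_consistent(test_d))
```

This prints "True False": Test A is consistent, while the first two statements
of Test D are rejected before any question about equality arises.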

Note: I am not absolutely certain that in XML Schema xsd:string and xsd:decimal
are disjoint.  However, it is a reasonable assumption.  If one allowed them to
overlap, then all kinds of inconsistencies would result.  For example, as
strings "010" and "10" are not equal to each other, but both are acceptable
decimal numbers, and as decimal numbers they are equal.  Still other examples
could be constructed using different bases (octal, hexadecimal, etc.).
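
The "010" versus "10" example can be checked directly.  The following Python
sketch uses the standard decimal module as a stand-in for xsd:decimal; that
correspondence is an assumption made for illustration only.

```python
from decimal import Decimal

# As strings, the two lexical forms are different values.
different_as_strings = "010" != "10"

# As decimal numbers, both lexical forms denote the same value.
equal_as_decimals = Decimal("010") == Decimal("10")

# In another base the same lexical form denotes a different number:
# "10" read as octal is 8, and read as hexadecimal is 16.
as_octal = int("10", 8)
as_hex = int("10", 16)

print(different_as_strings, equal_as_decimals, as_octal, as_hex)
```

This prints "True True 8 16".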

> It is not possible to have the answers to Tests A and Test D both be
> yes.  Either the A's can be yes or D can be yes, but not both.  We have to
> decide which of these is the most important to have.
>
>
> WHY THESE TEST CASES MATTER
> ===========================
>
> The formal semantics can define the meaning of a literal in one of two
> ways, given:
>
>    <Jenny> <ageInYears> "10" .
>
>    tidy) the <ageInYears> property takes a value which is a numeral, i.e. a
> string
>
>    untidy) the <ageInYears> property takes a value which is some datatype
> value whose string  representation is "10", but without further
> information, such as
> a range constraint, we can't tell exactly what the value is, e.g. the
> string might be in octal.
>
> If we choose the tidy option, the object of the statement is always a
> string, which means that in:
>
>    <Jenny> <ageInYears> "10" .
>    <Film>  <title>      "10" .
>
> the values of the two properties are the same; they are both the STRING "10".
>
> If we choose the untidy option, the value of the object of the statement is
> unknown from this statement alone; a range constraint is required to
> determine the value from the literal string:
>
>    <jenny>      <ageInYears> "10" .
>    <ageInYears> <rdfs:range> <xsd:decimal> .
>
> With a range constraint, we can know that the object of the property is the
> integer 10.

Using a range constraint as a means of determining the type of a literal
is difficult to make monotonic.  It may be possible, but it will necessarily
be complicated to do so.  It is a good idea to keep it simple (tidy).
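
To illustrate the monotonicity problem, here is a Python sketch of a
deliberately naive untidy interpreter.  It is hypothetical and is not the
scheme the WG proposes: it falls back to the string itself when no range
constraint is known, which the untidy option as described avoids by leaving
the value unknown.  The function names are invented for the example.

```python
from decimal import Decimal

def naive_untidy_value(prop, literal, ranges):
    """Pick the value of a literal from the range of its property.

    With no range constraint this falls back to the string itself, and that
    fallback is the step that breaks monotonicity.
    """
    if ranges.get(prop) == "xsd:decimal":
        return Decimal(literal)
    return literal

def same_value(ranges):
    """Test A3: do Jenny's ageInYears and the Film's title have one value?"""
    age = naive_untidy_value("ageInYears", "10", ranges)
    title = naive_untidy_value("title", "10", ranges)
    return age == title

# Before any range constraint is asserted, the values are judged equal.
before = same_value({})

# Adding one statement retracts that earlier conclusion.
after = same_value({"ageInYears": "xsd:decimal"})

print(before, after)
```

This prints "True False".  A conclusion drawn from the smaller graph no longer
holds in the larger one, so a monotonic version must decline to conclude
anything until the range is known, which is where the complication comes from.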

It is also a good idea to be compatible with other standards when possible.
One of these is RDF itself, as formally defined by its model theory.

> CONCLUSION
> ==========
>
> To end then, please send a message to www-rdf-comments@w3.org (by 26 July
> 2002) indicating whether you believe its more important to have the answer
> to test cases A be yes, or test case D be yes:
>
>    Test A:
>
>    <Jenny> <ageInYears> "10" .
>    <John>  <ageInYears> "10" .
>
> Test D:
>
>    <Jenny>      <ageInYears> "10" .
>    <ageInYears> <rdfs:range> <xsdr:decimal> .
>
>    <John>  <ageInYears>      _:a .
>    _:a     <xsdr:decimal>   "10" .
>
>
> We would also like to know the reasons for this preference.

My reasons for preferring A to D are:

1. Simplicity.  IMHO, choice D is much more complicated.

2. Enforces strong datatyping.  IMHO, choice D weakens datatyping in general
and is incompatible with strong datatyping in particular.

3. Compatibility with another WWW standard: XML Schema.

4. Compatibility with RDF as it is currently formally defined in the RDF Model
Theory document.

5. Compatibility with modern typed programming languages.

Ken Baclawski
College of Computer Science
Northeastern University

Received on Friday, 26 July 2002 17:49:37 UTC