Re: datatyping discussion

>Folks, this is a current snapshot of the datatyping discussion taking
>place on various lists.

Nice survey!!  Wish I had read it before I wrote 
http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2001Oct/0446.html

I have some quibbles, but I think they are important for the conclusion.

>This posting reviews several kinds of suggested approaches, lists some
>proposed criteria for comparing them, and concludes with a short
>scratch-the-surface discussion. A list of major relevant references is
>given in the end (please send further pointers to me if you have
>some).
>
>1. SUGGESTED APPROACHES
>=======================
>
>All suggested approaches can be roughly divided into two groups,
>"typed instances" and "schema-based typing" (also called weak and
>strong typing [OL]). In the former approach, the typing information is
>attached directly to the data values, whereas in the latter the typing
>information is provided in some (typically external) schema or rule
>set.

Quibble Q1: part of the merit (to me) of the suggestions S3 and S4 is 
that they do *not* require an external schema or rule set, but that 
all the relevant information can be asserted in conventional RDFS 
(with a slight extension in S3 to introduce rdf:value). This is 
important for the reason given in Q3.

>Examples and references to some concrete suggestions follow
>below.
>
>Typed instances:
>---------------
>
>(S1) encode literals as URIs, merge literals and resources [PS]
>
>     Examples: x:urn:int:5, x:urn:data:a%20space
>
>(S2) make literals composite, e.g. a pair <resource, unicode string>
>[SM1,SM2]
>
>     Examples: <(http://www.w3.org/2001/XMLSchema-datatypes, integer),
>"5">
>
>(S3) use bNodes [M&S,TBL]
>
>     Examples: John_Smith weight [units Pounds, rdf:value "10"], or
>               John_Smith weight [pounds [decimal "10"]]

Q2. I think it is misleading to characterize S3 under this heading. 
It only looks that way if you write it in N3. If you wrote this out 
in Ntriples, the typing information is clearly asserted in the RDF 
graph, so this is better called 'schema-based'.
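
To make this concrete, here is a minimal sketch (Python, purely
illustrative) of the first S3 example once the N3 brackets are
expanded into explicit triples. The ex: prefix, the blank node label
_:w1 and the function name are assumptions made for the illustration;
they do not appear in the examples above.

```python
# Expand the N3 shorthand  John_Smith weight [units Pounds, rdf:value "10"]
# into explicit triples. The "ex:" prefix and the blank node label "_:w1"
# are illustrative assumptions, not names from the original examples.

def expand_s3(subject, prop, units, value, bnode="_:w1"):
    """Return the three triples hidden by the N3 bracket syntax."""
    return [
        (subject, prop, bnode),              # the property now points at a bNode
        (bnode, "ex:units", units),          # unit/typing information is an ordinary triple
        (bnode, "rdf:value", f'"{value}"'),  # the literal itself hangs off rdf:value
    ]

triples = expand_s3("ex:John_Smith", "ex:weight", "ex:Pounds", "10")
for s, p, o in triples:
    print(s, p, o, ".")
```

Every piece of typing information ends up as a triple in the graph,
which is why the 'typed instance' label seems to me to be an artifact
of the N3 surface syntax.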

(A general meta-quibble here is that this strong/weak classification, 
though traditional, is only meaningful relative to a particular 
syntax, and is therefore rather shallow, IMHO. I think a much more 
meaningful criterion for us is whether or not the datatyping 
information is available to an RDF inference engine. On that basis, 
S1 and S2 would be classified together (under 'not') and S3 and S4 
classed together (under 'available'))

>Schema-based or rule-based typing:
>---------------------------------
>
>(S4) type of property values is defined in a schema [PH,PPS1], possibly
>      by a set of rules [TBL]
>
>     Examples: (weight rdfs:range
>http://www.w3.org/2001/XMLSchema-datatypes#integer), or
>
>               (John_Smith weight "160 1/8") goes together with a rule
>like
>                'if X is a person living in the US and (X weight Y),
>                 then Y is a "pieces-of-eight" number that gives weight
>in pounds'
>
>2. CRITERIA
>===========
>
>Below is a non-exhaustive list of several criteria that can be used
>for deciding on the suggested approaches. I picked the criteria that
>affect applications critically.
>
>(C1) backward compatibility wrt existing data and applications
>
>(C2) comparing values for custom or unknown datatypes
>      (Is myint:05==myint:5? Given _x1 decimal "5" and _x2 decimal "5",
>is _x1==_x2?)

Q3. This seems meaningless since RDF has no way to express identity. 
So these questions cannot even be posed in RDF; and if some external 
application could infer them, it would have no way to assert its 
conclusions in RDF. So I propose that this criterion simply be 
ignored for now.

>(C3) is typing information self-contained or requires external schema
>[DC2]

I would like to know what 'external' means here. This seems like a 
three- or four-way distinction. Typing information can be:
   - self-contained or not (ie potentially arising from any general inference)
   - external (to RDF) or internal.
These seem orthogonal dimensions to me.

>(C4) are multiple type assignments allowed? (e.g. US dollar, decimal)

Better, what happens when they occur? Eg suppose two 
sources/documents/whatever supply different such information; which 
part of the RDF machinery complains?  (the lexical analyzer, the 
parser, an inference engine, or some other datatyping module?)

>(C5) compactness (verbosity of serialization, storage efficiency in
>databases, elegant APIs)
>
>3. DISCUSSION
>=============
>
>The discussion can be found at
>http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/ ;)
>
>Seriously, it looks like encoding of data types using bNodes (S3) is
>still our best bet.

I prefer S4. It scores high on all of C1, C3, C4, C5; and C2 should 
be ignored. I don't think that S3 does score on C1, by the way; that 
usage is *incompatible* with the M&S examples. That seems to me to be 
a major score against it.

>It is backward compatible for obvious reasons

Seems obviously not. Am I missing something? Eg a simple usage of a 
literal with no datatyping information:

aaa bbb "56788" .

is unchanged in S4, but would be illegal in S3 and must be rewritten as

aaa bbb _:1 .
_:1 rdf:value "56788" .
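
The cost to existing data can be sketched mechanically. The following
is a minimal Python illustration, assuming triples are held as plain
tuples of strings and that a literal is any object beginning with a
double quote; the function name to_s3 and the bNode numbering scheme
are hypothetical.

```python
import itertools

def to_s3(triples):
    """Rewrite every triple with a plain literal object into S3 form."""
    counter = itertools.count(1)          # source of fresh bNode labels
    out = []
    for s, p, o in triples:
        # Assumption: objects starting with a double quote are literals.
        if o.startswith('"') and p != "rdf:value":
            b = f"_:{next(counter)}"
            out.append((s, p, b))             # property now points at a bNode
            out.append((b, "rdf:value", o))   # literal moves under rdf:value
        else:
            out.append((s, p, o))             # non-literal triples are untouched
    return out

s3 = to_s3([("aaa", "bbb", '"56788"')])
for s, p, o in s3:
    print(s, p, o, ".")
```

Under S3 every legacy triple with a literal object would have to go
through a rewrite of this kind; under S4 the input is already legal.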

>(C1), uses self-contained typing (C3)

Why is it self-contained? Like S4, it imposes typing by making RDF 
assertions (using rdf:type). But any RDF assertion might be a 
consequence of some other assertions, eg about rdfs:range; so S3 and 
S4 seem to me to be indistinguishable on C3.
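
The point can be sketched as follows: a generic range rule derives
the same rdf:type assertion that S3 would otherwise write down
explicitly. This is a minimal Python sketch, not a real inference
engine; the ex: and xsd: names and the function name infer_types are
assumptions for illustration, and the rule is the usual reading of
rdfs:range (if P has range C and (S P O) holds, then O has type C).

```python
def infer_types(triples):
    """Derive rdf:type triples from rdfs:range assertions."""
    # Collect every (property, class) pair asserted via rdfs:range.
    ranges = [(s, o) for s, p, o in triples if p == "rdfs:range"]
    inferred = set()
    for s, p, o in triples:
        for prop, cls in ranges:
            if p == prop:
                # The object of the property gets the range class as its type.
                inferred.add((o, "rdf:type", cls))
    return inferred

graph = [
    ("ex:weight", "rdfs:range", "xsd:decimal"),   # schema-style assertion
    ("ex:John_Smith", "ex:weight", "_:1"),        # S3-style bNode value
    ("_:1", "rdf:value", '"10"'),
]
types = infer_types(graph)
print(types)
```

Whether the rdf:type triple was asserted directly or derived in this
way, a consumer of the graph sees the same typing information.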

>, and is flexible in allowing
>multiple type assignments (C4).
>
>Of course, deficiencies in (C2) and (C5) are the back side of the
>coin. All of (S1),(S2), and (S4) have a problem with (C2),

RDF has a problem with C2 in general, and it applies just as forcibly 
to S3 (see Q3)

>i.e. comparing typed values, but do well in compactness (C5).
>
>The semantics of datatyping has been investigated extensively in
>[PH,PPS1,PPS2]. It seems that (S3) fits well into the suggested
>theories.

S4 (at least in the version discussed in [PH] and [PPS1,PPS2]) fits here 
better. S3 doesn't even require such fitting; it fits into the 
current MT without alteration. However, it doesn't fit with 
widespread practice in literal usage in other languages, so I think 
it would be short-sighted to build it into RDF. Also, frankly, the 
use of rdf:value seems ugly and ad-hoc.

>In conclusion, my suggestion is to focus on (S3) and try to work out
>detailed use scenarios, limitations and work-arounds. Note that (S3)
>is orthogonal to the question whether namespaces are part of the
>model.

I'm not sure what you mean by 'model' here.  S3, like S4, does 
require that datatypes (the things denoted by datatype names) are in 
every satisfying interpretation, and have their 'natural' 
interpretation.

>It seems particularly desirable to be able to identify
>namespaces of properties that carry data types and download the
>associated schemas.

Right, I agree.

Pat

>Sergey
>
>
>REFERENCES
>==========
>
>[DC1]   Dan Connolly. http://www.w3.org/2001/01/ct24
>[DC2]   Dan Connolly.
>http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2001Oct/0338.html
>[JG]    Jan Grant. http://ioctl.org/rdf/literals
>[OL]    Ora Lassila.
>http://lists.w3.org/Archives/Public/www-rdf-logic/2001Oct/0099.html
>[PH]    Pat Hayes.
>http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2001Oct/0164.html
>[PPS1]  Peter F. Patel-Schneider.
>http://lists.w3.org/Archives/Public/www-rdf-comments/2001OctDec/0057.html
>[PPS2]  Peter F. Patel-Schneider.
>http://lists.w3.org/Archives/Public/www-rdf-interest/2001Oct/0054.html
>[PS]    Patrick Stickler.
>http://lists.w3.org/Archives/Public/www-rdf-interest/2001Oct/0051.html
>[SM1]   Sergey Melnik.
>http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2001Sep/0444.html
>[SM2]   Sergey Melnik.
>http://lists.w3.org/Archives/Public/www-rdf-interest/2001Feb/0090.html
>[TBL]   Tim Berners-Lee.
>http://www.w3.org/DesignIssues/InterpretationProperties.html
>[M&S]   http://www.w3.org/TR/1999/REC-rdf-syntax-19990222/


-- 
---------------------------------------------------------------------
IHMC					(850)434 8903   home
40 South Alcaniz St.			(850)202 4416   office
Pensacola,  FL 32501			(850)202 4440   fax
phayes@ai.uwf.edu 
http://www.coginst.uwf.edu/~phayes

Received on Monday, 22 October 2001 16:16:35 UTC