DATATYPES: mental dump.

After the recent email flurry I think I can distinguish five 
proposals and summarize their pros and cons.  They can be 
distinguished on a primary axis of degree of localization of 
datatyping information, ie how 'far away' the datatyping information 
relevant to a literal can be from that literal itself.

X. (Patrick)

Very local indeed; every literal is required to have its datatype 
included as part of the literal label itself.  Example (I may have 
the URV syntax wrong):

aaa eg:prop <xsd:integer:10> .

In fact, these literal-thingies can be regarded as a form of URI 
(URV), so that there are in fact no literals at all. (I will go on 
referring to these URVs as 'literal labels' in what follows, however, 
for consistency.)

Datatype names play no role in the RDFS syntax.
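
A rough sketch, in Python, of how an X-style label might be taken apart. The label syntax is the one guessed at above (which may well be wrong), and `parse_urv`, `urv_value` and the `LEXICAL_TO_VALUE` table are names made up here for illustration; none of this comes from the proposal itself.

```python
# Hypothetical lexical-to-value mappings, keyed by datatype name.
LEXICAL_TO_VALUE = {
    "xsd:integer": int,
}

def parse_urv(label):
    """Split a label like '<xsd:integer:10>' into (datatype, lexical form)."""
    if not (label.startswith("<") and label.endswith(">")):
        raise ValueError("not a URV-style label: %r" % label)
    # Assumes the datatype name has exactly one colon (prefix:local).
    prefix, local, lexical = label[1:-1].split(":", 2)
    return prefix + ":" + local, lexical

def urv_value(label):
    """The label alone fixes the value; no other triples are consulted."""
    datatype, lexical = parse_urv(label)
    return LEXICAL_TO_VALUE[datatype](lexical)
```

The point of the sketch is only that everything needed to find the value is inside the label.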

S. (Sergey)

Quite local, in that literals are required to be linked directly to 
bNodes by edges labelled with the datatype name. The bNode denotes 
the value of the literal;  all literals denote strings.  Example:

aaa eg:prop _:x .
_:x xsd:integer "10" .

Datatype names are names of properties.
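
As a sketch of how the S idiom might be read off a graph (triples as Python tuples, literals written with their quotes; `s_values` and the `LEXICAL_TO_VALUE` table are invented here, not part of the proposal):

```python
# Hypothetical lexical-to-value mappings, keyed by datatype property name.
LEXICAL_TO_VALUE = {"xsd:integer": int}

GRAPH = {
    ("aaa", "eg:prop", "_:x"),
    ("_:x", "xsd:integer", '"10"'),
}

def s_values(graph, bnode):
    """Values assigned to bnode by the datatype-labelled arcs leaving it."""
    values = set()
    for s, p, o in graph:
        # Only arcs whose label is a known datatype and whose object is a
        # literal (a quoted string) contribute a value.
        if s == bnode and p in LEXICAL_TO_VALUE and o.startswith('"'):
            values.add(LEXICAL_TO_VALUE[p](o.strip('"')))
    return values
```

The literal itself stays a plain string throughout; only the arc label says how to read it.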

DC. (Dan)

Similar; all literals are strings, and similar use of a bNode, but 
with separate arcs for the literal and the datatype. Example:

aaa eg:prop _:x .
_:x rdf:label "10" .
_:x rdf:type xsd:integer .

Datatype names are names of classes.

P. (Peter)

Not local at all, in that literals are assigned a datatype 
indirectly, by declaring a datatype to be the range of the property 
used in the triple. The range information might be anywhere in the 
graph, and need not be 'close' to the triple including the literal. 
Example:

aaa eg:prop "10" .
...
eg:prop rdfs:range xsd:integer .

Notice that the literal label does *not* automatically denote a 
string in this case, in contrast to S and DC. In fact, this proposal 
requires allowing different occurrences of the same literal label to 
have different interpretations. Notice also that rdfs:range is the 
only way to specify a datatype constraint.

Datatype names are names of classes.
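
A small sketch of the non-local lookup this implies, assuming triples are Python tuples. The second property `eg:name`, the subject `bbb`, and the helper `p_values` are made up for illustration; the same label "10" comes out differently depending on the range declared for the property.

```python
# Hypothetical lexical-to-value mappings, keyed by datatype class name.
LEXICAL_TO_VALUE = {"xsd:integer": int, "xsd:string": str}

GRAPH = {
    ("aaa", "eg:prop", '"10"'),
    ("bbb", "eg:name", '"10"'),
    ("eg:prop", "rdfs:range", "xsd:integer"),
    ("eg:name", "rdfs:range", "xsd:string"),
}

def p_values(graph, triple):
    """Values for the literal in `triple`, found via the property's range."""
    _, prop, literal = triple
    lexical = literal.strip('"')
    # The range statements may sit anywhere in the graph.
    ranges = {o for s, p, o in graph if s == prop and p == "rdfs:range"}
    return {LEXICAL_TO_VALUE[r](lexical) for r in ranges if r in LEXICAL_TO_VALUE}
```

With no range declared for the property, the result is empty, which corresponds to the literal being left uncommitted.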

P++. (Pat)

Either local or not, in that *any* piece of RDF(S) that entails that 
a literal is in a datatype class is sufficient to fix the datatype, 
including range information but also including local rdf:type 
information applied to the literal directly. This is therefore an 
extension to P.  In practice, it is only a real extension if literals 
are allowed to be subjects, so this proposal involves extending 
Ntriples notation to Ntriples++ and allowing literals as subjects. 
The P and S examples both work here, but so does the following (in 
Ntriples++):

aaa eg:prop _:x:"10"  .
_:x rdf:type xsd:integer .

ie the three-node graph

aaa---eg:prop--->"10"---rdf:type--->xsd:integer

(BTW, compare this to the S version, also a three-node graph:

aaa---eg:prop--->[]---xsd:integer--->"10" )

Datatype names can be names of classes or names of properties, or both.

-----------------

OK, now some of the issues that arise. First, the P and P++ proposals 
both require a lot more semantic machinery. They require RDF graphs 
to be non-tidy on literal nodes, since literal meanings are 
contextual; they require extensions to the model theory to be able to 
handle the 'connection' between datatyping information and the 
literals to which that information is supposed to be applied.  (We 
can do all that, but it does take some effort to be able to follow it 
all, and some of the issues that come up are subtle.)  These two 
proposals also require any datatyping scheme to be 'proper' (a term I 
just invented) in the sense that Patrick identified, viz that the 
lexical-to-value mappings must be upward compatible in the datatype 
class hierarchy. XML Schema is proper in this sense, but some of the 
artificial examples that have been used (especially the use of 
incompatible integer encodings) are not.
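
One way to read 'proper' concretely, as a sketch only: whenever a datatype is a subclass of another, every lexical form legal in the subclass must map to the same value under the superclass's mapping. The datatype tables below are toy stand-ins with tiny finite lexical spaces (so the check can be exhaustive), not real XML Schema definitions; `eg:integer`, `eg:smallint` and `is_proper` are names invented here.

```python
# Each toy datatype: (finite lexical space, lexical-to-value mapping).
DATATYPES = {
    "eg:integer":  (["7", "10", "12"], lambda s: int(s, 10)),
    "eg:smallint": (["7", "10"],       lambda s: int(s, 10)),
    "xxd:octal":   (["7", "10"],       lambda s: int(s, 8)),
}

def is_proper(subclass_of):
    """True if each subclass agrees with its superclass on its own lexical forms."""
    for sub, sup in subclass_of:
        forms, sub_map = DATATYPES[sub]
        _, sup_map = DATATYPES[sup]
        if any(sub_map(f) != sup_map(f) for f in forms):
            return False
    return True
```

Declaring the octal encoding a subclass of the decimal one fails the check, since "10" reads as 8 under one mapping and as 10 under the other.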

The P++ proposal, in addition, requires extending Ntriples syntax 
and allowing literals to be subjects, which breaks RDF/XML.

None of the first three proposals require all this elaboration 
(although they are not incompatible with it), since they all assume 
that literal meanings are completely specified by the literal label 
(to be a single literal value in X, or to be a string in S and DC), 
and the datatype class hierarchy, if it exists, is invisible to RDFS. 
They can all be straightforwardly handled in RDF/XML.

The S and DC proposals require that users conform to a given 'idiom', 
and are often incompatible with current common usage in which 
literals are used to refer to things other than strings; in contrast, 
such usage is handled by P and P++.  Also, such idioms may be 
incompatible with extensions to RDFS, in particular with DAML. (This 
needs to be checked more carefully.)

The X proposal is incompatible with all current usage as it requires 
all literals to be replaced with URVs. However, the translation from 
current usage into the new form is straightforward and mechanical, 
and does not require any change to the triples structure (eg does not 
introduce any new bNodes).

The DC proposal uses more triples than S, and has been criticized on 
the grounds that a merge with several different labels would be 
ambiguous, eg:

aaa eg:prop _:x .
_:x rdf:label "10" .
_:x rdf:type xxd:octal .
_:x rdf:label "1000" .
_:x rdf:type xxd:binary .

Contrast with how this would be done in S:

aaa eg:prop _:x .
_:x xxd:octal "10" .
_:x xxd:binary "1000" .
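
To make the contrast concrete, here is a small sketch (triples as Python tuples; `xxd:octal` and `xxd:binary` are the made-up datatypes from the examples above, and the helper names are invented here). In the merged DC graph nothing says which label goes with which type, so a reader has to consider every label/type pairing; in S each arc carries its own pairing.

```python
from itertools import product

# Hypothetical lexical-to-value mappings for the two made-up datatypes.
LEXICAL_TO_VALUE = {
    "xxd:octal": lambda s: int(s, 8),
    "xxd:binary": lambda s: int(s, 2),
}

DC_GRAPH = [
    ("aaa", "eg:prop", "_:x"),
    ("_:x", "rdf:label", '"10"'),
    ("_:x", "rdf:type", "xxd:octal"),
    ("_:x", "rdf:label", '"1000"'),
    ("_:x", "rdf:type", "xxd:binary"),
]

S_GRAPH = [
    ("aaa", "eg:prop", "_:x"),
    ("_:x", "xxd:octal", '"10"'),
    ("_:x", "xxd:binary", '"1000"'),
]

def dc_candidate_values(graph, node):
    """Every value obtainable by pairing any label on node with any type on node."""
    labels = [o.strip('"') for s, p, o in graph if s == node and p == "rdf:label"]
    types = [o for s, p, o in graph if s == node and p == "rdf:type"]
    return sorted({LEXICAL_TO_VALUE[t](l) for l, t in product(labels, types)})

def s_values(graph, node):
    """Values given by the datatype-labelled arcs; each arc fixes its own pairing."""
    return sorted({LEXICAL_TO_VALUE[p](o.strip('"'))
                   for s, p, o in graph if s == node and p in LEXICAL_TO_VALUE})
```

Under these assumptions the DC reading leaves three candidate values (2, 8 and 512), while the S reading gives just 8, since octal "10" and binary "1000" agree.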

On the other hand, DC shares with P and P++ the ability to express a 
value being a literal without saying what its datatype is:

aaa eg:prop _:x .
_:x rdf:label "10" .

or, in P(++), simply:

aaa eg:prop "10" .

Such 'uncommitted' use of a literal label is syntactically impossible 
in X or S. (It is not clear whether this counts as a pro or a con; it 
seems to depend on whether or not one wishes to be able to check RDF 
for semantic integrity or conformity to an external schema.)

-------------------

Here's a table summing up all this as various cons and pros (view in 
Courier). The brackets indicate a qualified answer, eg X doesn't 
strictly conform to current usage, but the change is minimal. There 
may be other issues not listed here, of course. In particular, I have 
not gone into the issues that arise if we want to be able to 
*describe* datatyping schemes in RDF(S) itself, rather than simply 
refer to them.

                                  X    S    DC   P    P++
CONS
requires literals as subjects                         x
requires change to MT                            x    x
requires DTs to be 'proper'                      x    x
requires user conform to idiom   (x)   x    x
(requires literals to be typed)   x    x                    (pro or con?)
cannot express 'clashing' types   x         x   (x)  (x)

PROS
fully general                                         x
conforms to current usage        (x)             x    x
allows free type merging               x
compatible with DAML             (?)             x    ?

-----------

Hope this helps; anyway, I've done a dump of *my* mental state, thank goodness.

I have to say, on balance, S looks like the best option; simple, 
compact, requires no changes to the MT or to RDF/XML, doesn't require 
commitment to a new URI scheme, doesn't go beyond the charter, and is 
able to handle even 'improper' datatyping schemes. It also makes 
sense within the proposed MT extension, so would be "upward 
compatible" if ever anyone wanted to extend RDF in this more 
elaborate way in the future. The fact that it can't handle untyped 
literals may not be serious, and in any case could be hacked around 
in practice. My only serious worry is whether it might break current 
DAML+OIL usage of literals. Peter??

Pat

-- 
---------------------------------------------------------------------
IHMC					(850)434 8903   home
40 South Alcaniz St.			(850)202 4416   office
Pensacola,  FL 32501			(850)202 4440   fax
phayes@ai.uwf.edu 
http://www.coginst.uwf.edu/~phayes

Received on Friday, 9 November 2001 13:49:56 UTC