Correction and Comments ( was RE: DATATYPES: mental dump.)

Please have a look at my X proposal summary, which includes
a lot more than the use of URVs (which are in fact only a
minor component).

It suggests (in the graph notation of the proposal; sorry,
I don't know exactly how to do this in NTriples):

  [1,S]
    |
    --- subject ----> [2,U,#aaa]
    |
    --- predicate --> [3,U,{eg:prop}]
    |
    --- object -----> [4,U,xsd:integer:10]
                         ^
  [5,S]                  |
    |                    |
    --- subject ----------
    |
    --- predicate --> [6,U,{rdf:type}]
    |
    --- object -----> [7,U,{xsd:integer}]

This latter statement [5,S] may be implicitly assumed
by a system having knowledge about the xsd:integer URV
scheme.
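
As a rough illustration of that implicit assumption, here is a
minimal Python sketch. The function name and the assumed URV shape
(prefix, datatype and lexical form separated by colons) are
illustrative assumptions only, not part of the X proposal.

```python
# Hypothetical helper: derives the implied typing statement for a URV.
def implied_type_statement(urv):
    """Return a (subject, predicate, object) triple typing the URV, or None."""
    # Assumed URV shape: "<prefix>:<datatype>:<lexical form>", e.g. "xsd:integer:10".
    parts = urv.split(":", 2)
    if len(parts) != 3 or parts[0] != "xsd":
        return None  # not a URV scheme this sketch knows about
    datatype = parts[0] + ":" + parts[1]
    return (urv, "rdf:type", datatype)

print(implied_type_statement("xsd:integer:10"))
```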

Note that this is a graph notation and thus shouldn't be
compared with NTriples in terms of conciseness.

This is very similar in principle to the P++ proposal in
that it treats literals as subjects and adopts the concept
of literal and uriref labels for nodes. However, the graph
model (which is the key) is statement-centric rather than
resource-centric, and all statements are reified. The
present RDF graph model is a "view" or interpretation of
the graph model used in the X proposal.

Please have a look at my recent summary of the X proposal
for all the gory details.

Note especially that the X proposal, as defined in my
summary today, does not require that literals be encoded
as URVs. There are practical benefits to URV encoding,
which are outlined in the summary, but one can use an
X proposal approach and never use URVs.

Cheers, 

Patrick


-----Original Message-----
From: ext Pat Hayes [mailto:phayes@ai.uwf.edu]
Sent: 09 November, 2001 20:50
To: w3c-rdfcore-wg@w3.org
Cc: Peter F. Patel-Schneider
Subject: DATATYPES: mental dump.


After the recent email flurry I think I can distinguish five proposals and
summarize their pros and cons.  They can be distinguished on a primary axis
of degree of localization of datatyping information, ie how 'far away' the
datatyping information relevant to a literal can be from that literal
itself.


X. (Patrick)


Very local indeed; every literal is required to have its datatype included
as part of the literal label itself.  Example (I may have the URV syntax
wrong):


aaa eg:prop <xsd:integer:10> .


In fact, these literal-thingies can be regarded as a form of URI (URV), so
that there are in fact no literals at all. (I will go on referring to these
URVs as 'literal labels' in what follows, however, for consistency.)


Datatype names play no role in the RDFS syntax.


S. (Sergey)


Quite local, in that literals are required to be linked directly to bNodes
by edges labelled with the datatype name. The bNode denotes the value of the
literal;  all literals denote strings.  Example:


aaa eg:prop _:x .
_:x xsd:integer "10" .


Datatype names are names of properties.
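

A minimal Python sketch of how a consumer might read the S idiom,
assuming triples are plain tuples of strings. The mapping table and
the function name are hypothetical, introduced only for illustration.

```python
# Hypothetical lexical-to-value mappings, keyed by datatype property name.
LEXICAL_TO_VALUE = {"xsd:integer": int}

def s_value(triples, bnode):
    """Return the value denoted by bnode under the S idiom, or None."""
    for subj, pred, obj in triples:
        # The arc label names the datatype; the object is always a string.
        if subj == bnode and pred in LEXICAL_TO_VALUE:
            return LEXICAL_TO_VALUE[pred](obj)
    return None

triples = [("aaa", "eg:prop", "_:x"), ("_:x", "xsd:integer", "10")]
print(s_value(triples, "_:x"))
```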


DC. (Dan)


Similar; all literals are strings, and similar use of a bNode, but with
separate arcs for the literal and the datatype. Example:


aaa eg:prop _:x .
_:x rdf:label "10" .
_:x rdf:type xsd:integer .


Datatype names are names of classes.


P. (Peter)


Not local at all, in that literals are assigned a datatype indirectly, by
declaring a datatype to be the range of the property used in the triple. The
range information might be anywhere in the graph, and need not be 'close' to
the triple including the literal. Example:


aaa eg:prop "10" .
...
eg:prop rdfs:range xsd:integer .


Notice that the literal label does *not* automatically denote a string in
this case, in contrast to S and DC. In fact, this requires that different
occurrences of the same literal may have different interpretations. Notice
also that rdfs:range is the only way to specify a datatype constraint.


Datatype names are names of classes.
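

A minimal Python sketch of the non-local lookup in P, assuming
triples are plain tuples of strings. The mapping table, the function
name and the eg:name property are hypothetical. It only illustrates
that the same literal label can be read differently depending on the
declared range; it does not model the semantic machinery P needs.

```python
# Hypothetical lexical-to-value mappings, keyed by datatype class name.
LEXICAL_TO_VALUE = {"xsd:integer": int, "xsd:string": str}

def p_value(triples, statement):
    """Interpret the literal in statement via the rdfs:range of its property."""
    _subj, pred, literal = statement
    for s, p, o in triples:
        # The range triple may sit anywhere in the graph.
        if s == pred and p == "rdfs:range" and o in LEXICAL_TO_VALUE:
            return LEXICAL_TO_VALUE[o](literal)
    return None  # no usable range declared, so the datatype is unknown

ranges = [
    ("eg:prop", "rdfs:range", "xsd:integer"),
    ("eg:name", "rdfs:range", "xsd:string"),  # eg:name is a made-up property
]
print(p_value(ranges, ("aaa", "eg:prop", "10")))
print(p_value(ranges, ("aaa", "eg:name", "10")))
```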


P++. (Pat)


Either local or not, in that *any* piece of RDF(S) that entails that a
literal is in a datatype class is sufficient to fix the datatype, including
range information but also including local rdf:type information applied to
the literal  directly. This is therefore an extension to P.  In practice, it
is only a real extension if literals are allowed to be subjects, so this
proposal involves extending Ntriples notation to Ntriples++ and allowing
literals as subjects. The P and S examples both work here, but so does the
following (in Ntriples++):


aaa eg:prop _:x:"10"  .
_:x rdf:type xsd:integer .


ie the three-node graph


aaa---eg:prop--->"10"---rdf:type--->xsd:integer


(BTW, compare this to the S version, also a three-node graph:


aaa---eg:prop--->[]---xsd:integer--->"10" )


Datatype names can be names of classes or names of properties, or both.


-----------------


OK, now some of the issues that arise. First, the P and P++ proposals both
require a lot more semantic machinery. They require RDF graphs to be
non-tidy on literal nodes, since literal meanings are contextual; they
require extensions to the model theory to be able to handle the 'connection'
between datatyping information and the literals to which that information is
supposed to be applied.  (We can do all that, but it does take some effort
to be able to follow it all, and some of the issues that come up are
subtle.)  These two proposals also require any datatyping scheme to be
'proper' (a term I just invented) in the sense that Patrick identified, viz
that the lexical-to-value mappings must be upward compatible in the datatype
class hierarchy. XML schema is proper in this sense, but some of the
artificial examples that have been used (especially the use of incompatible
integer encodings) are not.


The P++ proposal, in addition, requires extending Ntriples syntax and
allowing literals to be subjects, which breaks RDF/XML.


None of the first three proposals require all this elaboration (although
they are not incompatible with it), since they all assume that literal
meanings are completely specified by the literal label (to be a single
literal value in X, or to be a string in S and DC), and the datatype class
hierarchy, if it exists, is invisible to RDFS. They can all be
straightforwardly handled in RDF/XML.


The S and DC proposals require that users conform to a given 'idiom', and
are often incompatible with current common usage in which literals are used
to refer to things other than strings; in contrast, such usage is handled by
P and P++.  Also, such idioms may be incompatible with extensions to RDFS,
in particular with DAML. (This needs to be checked more carefully.)


The X proposal is incompatible with all current usage as it requires all
literals to be replaced with URVs. However, the translation from current
usage into the new form is straightforward and mechanical, and does not
require any change to the triples structure (eg does not introduce any new
bNodes).
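

A minimal Python sketch of such a mechanical translation. It assumes
triples are tuples of strings, that literals are written with
surrounding double quotes, and that the caller supplies the datatype
for each property; where that datatype information comes from is not
specified here, and the function name is hypothetical. The URV syntax
follows the X example above and may be equally wrong.

```python
def to_urv_form(triples, datatype_of_property):
    """Rewrite quoted literals as URVs, leaving the triple structure unchanged."""
    out = []
    for subj, pred, obj in triples:
        if obj.startswith('"') and obj.endswith('"'):
            datatype = datatype_of_property[pred]  # KeyError if no datatype is known
            obj = "<" + datatype + ":" + obj[1:-1] + ">"
        out.append((subj, pred, obj))  # one output triple per input triple
    return out

current = [("aaa", "eg:prop", '"10"')]
print(to_urv_form(current, {"eg:prop": "xsd:integer"}))
```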


The DC proposal uses more triples than S, and has been criticized on the
grounds that a merge with several different labels would be ambiguous, eg:


aaa eg:prop _:x .
_:x rdf:label "10" .
_:x rdf:type xxd:octal .
_:x rdf:label "1000" .
_:x rdf:type xxd:binary .


Contrast with how this would be done in S:


aaa eg:prop _:x .
_:x xxd:octal "10" .
_:x xxd:binary "1000" .
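

A minimal Python sketch of why the merged DC graph is ambiguous while
the S graph is not. It assumes triples are tuples of strings and that
xxd:octal and xxd:binary mean base 8 and base 2; the function names
are hypothetical. In DC the labels and types hang off separate arcs,
so every label/type pairing is a candidate reading; in S each arc
carries its own pairing.

```python
from itertools import product

# Assumed meaning of the example datatype names.
BASES = {"xxd:octal": 8, "xxd:binary": 2}

def dc_candidate_values(triples, bnode):
    """All values obtainable by pairing any label with any type (DC idiom)."""
    labels = [o for s, p, o in triples if s == bnode and p == "rdf:label"]
    types = [o for s, p, o in triples if s == bnode and p == "rdf:type"]
    return {int(label, BASES[t]) for label, t in product(labels, types)}

def s_values(triples, bnode):
    """Values under the S idiom, where each arc pairs a type with a label."""
    return {int(o, BASES[p]) for s, p, o in triples if s == bnode and p in BASES}

dc_graph = [
    ("aaa", "eg:prop", "_:x"),
    ("_:x", "rdf:label", "10"), ("_:x", "rdf:type", "xxd:octal"),
    ("_:x", "rdf:label", "1000"), ("_:x", "rdf:type", "xxd:binary"),
]
s_graph = [
    ("aaa", "eg:prop", "_:x"),
    ("_:x", "xxd:octal", "10"), ("_:x", "xxd:binary", "1000"),
]
print(sorted(dc_candidate_values(dc_graph, "_:x")))
print(sorted(s_values(s_graph, "_:x")))
```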


On the other hand, DC shares with P and P++ the ability to express a value
being a literal without saying what its datatype is:


aaa eg:prop _:x .
_:x rdf:label "10" .


or, in P(++), simply:


aaa eg:prop "10" .


Such 'uncommitted' use of a literal label is syntactically impossible in X
or S. (It is not clear whether this counts as a pro or a con; it seems to
depend on whether or not one wishes to be able to check RDF for semantic
integrity or conformity to an external schema.)


-------------------


Here's a table summing up all this as various cons and pros (view in
Courier). The brackets indicate a qualified answer, eg X doesn't strictly
conform to current usage, but the change is minimal. There may be other
issues not listed here, of course. In particular, I have not gone into the
issues that arise if we want to be able to *describe* datatyping schemes in
RDF(S) itself, rather than simply refer to them.


                                X   S   DC  P   P++
CONS
requires literals as subjects                   x
requires change to MT                       x   x
requires DTs to be 'proper'                 x   x
requires user conform to idiom (x)  x   x
(requires literals to be typed) x   x              (pro or con?)
cannot express 'clashing' types x       x  (x) (x)


PROS
fully general                                   x
conforms to current usage      (x)          x   x
allows free type merging            x 
compatible with DAML           (?)          x   ?


-----------


Hope this helps; anyway, I've done a dump of *my* mental state, thank
goodness.


I have to say, on balance, S looks like the best option; simple, compact,
requires no changes to the MT or to RDF/XML, doesn't require commitment to a
new URI scheme, doesn't go beyond the charter, and is able to handle even
'improper' datatyping schemes. It also makes sense within the proposed MT
extension, so would be "upward compatible" if ever anyone wanted to extend
RDF in this more elaborate way in the future. The fact that it can't handle
untyped literals may not be serious, and in any case could be hacked around
in practice. My only serious worry is whether it might break current
DAML+OIL usage of literals. Peter??


Pat


-- 

---------------------------------------------------------------------
IHMC                                       (850)434 8903   home
40 South Alcaniz St.                        (850)202 4416   office
Pensacola,  FL 32501                      (850)202 4440   fax
phayes@ai.uwf.edu
http://www.coginst.uwf.edu/~phayes

Received on Monday, 12 November 2001 15:29:03 UTC