Proposal to incorporate datatyping into the model theory (was Re: datatyping discussion)

Martyn Horner <martyn.horner@profium.com> wrote:

>Brian McBride wrote:
>.....
>>  > (C4) are multiple type assignments allowed? (e.g. US dollar, decimal)
>>
>>  As above, I don't see either of these as a 'type', so I'm not sure this
>>  criterion is well formed.  Nor is it a criterion, unless a 
>>preference for one or
>>  the other is specified.
>
>Seriously, dollars and decimals are not types but encodings of data of a
>certain type... surely?
>The unit chosen maps the integer value into a sequence of numerals in
>the same way that the choice of radix does. Therefore `decimal' and
>`pounds' belong in the same syntactic position. The type which selects
>the semantic domain belongs elsewhere. Radix and unit have the same role
>as `lang' - they stipulate how the characters are to be mapped into a
>semantic sub-domain which itself has a particular type.

Yes, I would agree. I think we shouldn't be blinkered about what 
someone might want to consider a datatype, and should try to keep 
our treatment as general-purpose as possible.

I think the general picture that has emerged from my learning about 
datatypes from Peter provides a very general framework that can 
accommodate all the proposals made so far in a uniform way with a 
single semantics.

Here's a sketch of how it goes.

Literals are lexical items which can be somehow generically 
distinguished from urirefs. (And that is all we say about them in 
general.) The basic idea is that (unlike urirefs) they are understood 
to be assigned a common meaning by some 'global' conventions that are 
used independently of the particular interpretation; however, there 
may be several such 'global' conventions, so we need a general 
mechanism for indicating which convention is intended.

A datatype is a rule which embodies some such 'global' conventions 
for determining the meaning of a literal, ie (mathematically 
speaking) it is a function from literals L to values LV. (And that is 
all we say about them.) Each literal has a *fixed* interpretation in 
a given datatype. (This is what it means to say that the 
interpretation of the literal is 'independent' of the particular 
interpretation - in the MT sense - of an RDF graph.)
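
Concretely, and just as a throwaway sketch (the function names here 
are mine, not proposed vocabulary), a datatype is nothing more than 
a literal-to-value mapping:

def integer_datatype(lexical_form):
    # maps the literal string "1234" to the number 1234, always
    return int(lexical_form)

def string_datatype(lexical_form):
    # strings simply denote themselves
    return lexical_form

# The same literal gets a *fixed* value within each datatype,
# independently of any particular RDF interpretation.
assert integer_datatype("1234") == 1234
assert string_datatype("1234") == "1234"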

However, the choice of datatype to be used in interpreting any 
particular literal label may depend upon, or be influenced by, other 
information which is encoded in the graph, and therefore may depend 
on the particular interpretation. (This is what it means to say that 
the meaning of any particular literal label may depend on the 
interpretation.)

A datatyping scheme is a set of datatypes and some method of 
assigning them to occurrences of literals.  (And that is about all we 
say about them.)

Datatyping schemes can be defined in various ways depending on the 
method used. One way is to incorporate a syntactic label for the 
datatype into the literal itself, and require that it be used to 
interpret the literal string. Another way is to regard datatypes as 
objects in the domain and make assertions about their relationships 
to the literal strings. Another, more like a conventional model 
theory, is not to give any such explicit method, but to talk about a 
'datatyping interpretation' that assigns datatypes in some systematic 
way, and then state interpretation conditions which restrict the 
possible assignments which would make the typed assertions true. This 
last provides the most flexibility and has the others as special 
cases, and is therefore the most general solution, but (unlike the 
first) it provides no principled way to isolate datatype reasoning 
from general inference.
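
For concreteness, here is a rough illustration of the first two 
styles (the surface syntax and the names are made up purely for 
illustration; none of this is settled notation):

# Hypothetical surface forms for the same fact under the first two
# styles; the third style needs no special syntax at all.

style_1 = 'ex:Jenny ex:age "10"^^xsd:integer .'
# datatype label carried inside the literal itself

style_2 = """
ex:Jenny ex:age _:v .
_:v rdf:value "10" .
_:v rdf:type xsd:integer .
"""
# datatype as an object in the domain, related to the literal
# string by ordinary assertions

# Style 3 adds no syntax: a 'datatyping interpretation' assigns
# datatypes to literal occurrences, constrained by semantic
# conditions like 1-3 below.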

Formally, a datatype scheme is a set DT of things called types 
together with two functions: DTS: DT -> (L -> LV), which maps each 
type to its datatype (a literal-to-value function), and DTC: DT -> 
powerset of LV, which maps each type to the range of that datatype 
(integers, strings, etc.). A datatyping of a set is a function from 
that set into DT, ie an assignment of a datatype to everything in 
the set. A typed interpretation <I,D> of a graph is an 
interpretation I of the vocabulary plus a datatyping D of the nodes 
which satisfies the following conditions. (The first condition isn't 
a mathematical condition on the structures involved, but it is 
required in order to make the datatype scheme usable in any web 
language.):

1. If nnn is any uri of a datatype, then I(nnn) is in DT.
2. ICEXT(d) is a subset of DTC(d) for any d in (DT intersect IR)
3. LV(n)=DTS(D(n))(label(n))

Notice that n is a node and label(n) is its label, ie the literal 
itself, and that D occurs only in equation 3.
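
Here is a toy rendering of those definitions, just to show how the 
pieces fit together; it covers condition 3 and the shapes of DTS and 
DTC only, the type and node names are mine, and DTC is given as a 
membership test rather than an actual element of the powerset:

# A toy datatype scheme: DT is a set of types, DTS maps each type to
# a literal-to-value function, DTC maps each type to (a test for
# membership in) the range of that function.
DT  = {"xsd:integer", "xsd:string"}

DTS = {
    "xsd:integer": lambda lex: int(lex),
    "xsd:string":  lambda lex: lex,
}

DTC = {
    "xsd:integer": lambda v: isinstance(v, int),
    "xsd:string":  lambda v: isinstance(v, str),
}

# A datatyping D of some literal nodes: an assignment of a type in DT
# to each node. label(n) is the literal string labelling node n.
label = {"n1": "1234", "n2": "1234"}
D     = {"n1": "xsd:integer", "n2": "xsd:string"}

def LV(n):
    # condition 3: LV(n) = DTS(D(n))(label(n))
    return DTS[D[n]](label[n])

assert LV("n1") == 1234       # the same literal string "1234"...
assert LV("n2") == "1234"     # ...denotes different values under different typings
assert all(DTC[D[n]](LV(n)) for n in label)   # each value lies in the range DTC gives its type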

This provides just the amount of alignment between datatyping and 
interpretations to allow things like rdfs:range assertions to 
restrict the ICEXT mappings sufficiently to 'force' the node labelled 
with a literal to be properly typed. In effect, you can think of the 
datatyping D as a kind of variable which gets restricted by the 
various assertions made by a graph in just the right way to 'select' 
the proper way to interpret the literals. If there isn't enough 
information to do that, then its not completely clear what the RDF 
assertion is saying; but then its not entirely clear what any RDF 
assertion is 'really' saying, and in this case the relevant options 
are are least relatively clear. The relevant information can come 
from anywhere, in general, but we can restrict that by adding further 
conditions.
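
To illustrate that 'selection' effect, here is a tiny self-contained 
sketch; the triples in the comments and all the names are 
hypothetical:

# Hypothetical graph:
#    ex:age rdfs:range xsd:integer .
#    ex:Jenny ex:age _:n .          where label(_:n) = "35"
#
# The range assertion puts the value of _:n inside
# ICEXT(I(xsd:integer)), which by condition 2 lies inside
# DTC(xsd:integer); condition 3 then rules out any datatyping of _:n
# that maps "35" somewhere else.
DTS = {"xsd:integer": int, "xsd:string": str}
DTC = {"xsd:integer": lambda v: isinstance(v, int),
       "xsd:string":  lambda v: isinstance(v, str)}

lit = "35"
consistent = [t for t in DTS if DTC["xsd:integer"](DTS[t](lit))]
assert consistent == ["xsd:integer"]   # the graph has 'selected' D(_:n)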

To get the first, 'explicit syntactic' kind of datatyping, you just 
add one more condition, which might be written as:

4. D(n)=I(type-label(n))

for every literal node n in the graph. If you substitute this into 3 you get

LV(n)= DTS(I(type-label(n)))(label(n))

and the mapping D is then completely eliminated from the equations, 
which shows that in this syntactically restricted case we don't 
really need to consider explicit datatypings at all. Allowing 
explicit datatypings does not, however, forbid syntactic datatyping 
if one wants to use it, and indeed this shows that syntactic and 
interpreted datatyping techniques can be used together without 
interfering with one another.
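
Here is the syntactically explicit case as a sketch; the pairing of 
label and type-label, and the names, are made up, and the only point 
is that D drops out:

DTS = {"xsd:integer": int, "xsd:string": str}

def LV(node):
    # The node carries its own type-label, so no separate datatyping
    # D is needed: LV(n) = DTS(I(type-label(n)))(label(n)).
    lexical_form, type_label = node
    return DTS[type_label](lexical_form)

# e.g. a node standing for the (hypothetical) typed literal "1234"^^xsd:integer
assert LV(("1234", "xsd:integer")) == 1234
assert LV(("1234", "xsd:string"))  == "1234"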

The 'bnode' suggestions can also be handled in this framework, as a 
rather peculiar-seeming semantic condition on rdf:value:

5. <x, y> is in IEXT(I(rdf:value)) iff x = y

This isn't the 'intended' interpretation, I realize, but it does make 
everything work out right. What this does is to read, for example

_:1 rdf:value 1234 .

not as meaning

   '_:1 is a thing which would be written as the unicode string "1234"'

but rather as

   '_:1 is a thing which is gotten by interpreting the string "1234"
   (using the correct datatyping scheme)'

which of course is just saying that _:1 is equal to 1234 (using the 
correct datatyping scheme). However, the conditions 1-3 above 
guarantee that if _:1 is known to be in a class identified by a 
datatype uri, then an appropriate datatyping scheme will be used, so 
there is no need to say that explicitly.
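
And here is the bnode idiom under the same machinery, reading 
rdf:value as equality; again just a toy, with made-up names:

# Hypothetical graph:
#    ex:Jenny ex:age _:1 .
#    _:1 rdf:value "1234" .
#    _:1 rdf:type xsd:integer .
#
# Condition 5 reads rdf:value as equality, so I(_:1) is whatever
# "1234" denotes. The rdf:type triple puts I(_:1) in
# ICEXT(I(xsd:integer)), hence (condition 2) in DTC(xsd:integer),
# which forces the datatyping of "1234" to be the integer datatype.
DTS = {"xsd:integer": int, "xsd:string": str}
DTC = {"xsd:integer": lambda v: isinstance(v, int),
       "xsd:string":  lambda v: isinstance(v, str)}

lit = "1234"
admissible = [t for t in DTS if DTC["xsd:integer"](DTS[t](lit))]
assert admissible == ["xsd:integer"]
value_of_bnode = DTS[admissible[0]](lit)   # so I(_:1) = 1234
assert value_of_bnode == 1234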

Notice that the 'intended' reading is semantically anomalous since it 
requires us to take the literal, er, literally, rather than 
interpreting it in any way; it has a kind of use/mention glitch built 
into it. (Admittedly this is kind of harmless for strings, since they 
do denote themselves; that is why we are able to reinterpret 
rdf:value in the above way as meaning equality, and get away with 
it.) Notice also that it makes rdf:value seem kind of silly; if it 
means equality, and can only be used with literals, why not just 
substitute the literal for the blank node and get rid of the blank 
node? (Current answer: Because that would require us to allow 
literals as subjects if we want to write the equivalent of _:1 
rdf:type xsd:integer . Response: So, let's have literals as subjects, 
why not? Or at any rate, let us face up to the fact that this 
prohibition is purely an ad-hoc syntactic restriction imposed for no 
semantic reason.)

One way or another, this model theory extension seems to be able to 
handle any kind of datatyping that anyone has so far suggested. As 
(I gather from Peter P-S) it can also handle all of XML datatyping, 
and all of DAML+OIL (in fact it will probably be built into the next 
DAML+OIL model theory), I would therefore suggest that we adopt it 
as our standard.

I will work up a draft extension to the MT document which covers it 
and explains the alternatives, and then people can discuss it, how's 
that?

Pat

-- 
---------------------------------------------------------------------
IHMC					(850)434 8903   home
40 South Alcaniz St.			(850)202 4416   office
Pensacola,  FL 32501			(850)202 4440   fax
phayes@ai.uwf.edu 
http://www.coginst.uwf.edu/~phayes
