Re: Datatypes: the Bermuda Triangle, and how to fly over it.

Pat,

excellent analysis (once again)! 

Recall that there are three places where datatyping is addressed: in
RDF/XML syntax, in the abstract graph
syntax, and in the model-theoretic interpretation. More precisely, while
defining datatyping we have to explain the mapping between RDF/XML and
the graph syntax, as well as the mapping between the graph syntax and an
interpretation.

You did a great job of demonstrating what it takes to make S and TDL
styles coexist by tuning the mapping between the graph syntax and
interpretations. Doesn't look like a piece of cake...

In [1] I tried to take another path and reconcile both styles in
the RDF/XML-to-graph mapping. Although this alternative approach has the
same trouble (or feature) of using pairs for representing typed
elements, it seems simpler, and does not beg for an additional
datatyping layer quite yet...

Sergey

[1] http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2002Feb/0007.html
[2] http://www-db.stanford.edu/~melnik/rdf/datatyping/

-- 
E-Mail:      melnik@db.stanford.edu (Sergey Melnik)
WWW:         http://www-db.stanford.edu/~melnik
Tel:         OFFICE: 1-650-725-4312 (USA)
Address:     Room 438, Gates, Stanford University, CA 94305, USA



Pat Hayes wrote:
> 
> (I'm CCing this msg to Jim Hendler. Jim, FYI particularly the last 2
> paragraphs, relevance to 'layering'. More on that tomorrow. )
> 
> Sorry this is such a long msg, but I think it might solve a lot of problems.
> 
> Tinkering with the TDL MT, and offline conversations with Graham,
> convince me that there is a central problem with several of the
> proposed datatyping schemes. Basically, rdf:type just doesn't handle
> datatypes properly, no matter how much one wriggles. You can push the
> problem around, but I don't think there is any way of getting rid of
> it without doing something different. I have a proposal for what to
> do.
> 
> First the central problem. (Much of this is a kind of review of what
> other people have noticed, but it helped me get my head clear.)
> Consider any graph with the following kind of 'triangle' with a blank
> node in it (eg see section 3.1 of
> http://lists.w3.org/Archives/Public/www-archive/2002Jan/att-0133/02-tdl.htm,
> or 'idiom B' and 'idiom P' in
> http://lists.w3.org/Archives/Public/www-archive/2002Jan/att-0133/01-s.htm)
> 
> eg:thing---eg:prop--->_:1
> _:1 ---AAA---> UUU
> _:1 ---rdf:type---> DDD
> 
> where AAA is some 'special' name like rdf:namedBy or rdf:value or
> whatever, UUU is a Unicode string, ie a literal, and DDD is a
> datatype name. The idea is to state some kind of ingeniously worded
> conditions on interpretations of those three triples which guarantee
> that any satisfying interpretation of such a graph will be one where
> the bnode denotes the value of UUU under the datatype mapping
> associated with the datatype DDD, so that the eg:prop of eg:thing is
> that value. (By 'value' here I mean an element of the value space of
> the datatype, eg for xsd:integer it would be an integer.)
> 
> The problem is that no such conditions *on RDF triples* are going to
> work properly. Whatever we do, we can't keep a value in a Bermuda
> triangle. Something will fail and the right value will be lost, never
> to be seen again.
> 
> Here's why.  In order for the first triple to mean what it is
> supposed to mean, the blank node has to denote the value of UUU under
> the datatype mapping identified by DDD; but that value does not
> provide enough information to enable one to state semantic conditions
> on the other two triples which would restrict UUU to be interpreted
> by the DDD mapping, since the same value may well be the value of a
> different literal under a different datatype mapping.
> 
> TDL fixes the last problem by making the bnode denote a pair; but
> then the first triple doesn't make sense. (But we hope some other
> application software will figure it out. Hmm, I wonder if all
> application software is savvy enough to do that...) In Patrick's
> terminology, what we need is some way to state conditions which
> enable the pairing of UUU and DDD to together impose a value on the
> bnode; and the problem is that if they are only connected at the
> middle blank node, and if its value is the one that any other triple
> would expect it to be (eg Bill eg:age _:x expects the value to be an
> integer, say) then it can't do the pairing all by itself.
> 
> Now, it would be possible to fix this problem by stating appropriate
> semantic conditions on the entire subgraph containing the last two
> triples (details below). That option amounts to implicitly giving
> such 'datatyping doublets' a special datatyped-RDF status, and assigning
> them a meaning that goes beyond their meanings as two separate
> triples, so that a 'datatype doublet' would be more than just a
> conjunction (in a datatype-sensitive interpretation); it really
> amounts to assigning those graphs a special semantic status that
> stands outside the conventional model theory in a particular way.
> That is a slightly unconventional idea in normal logical model
> theory, but it seems to work (and it might be a neat way to approach
> 'layering' on RDF in any case, an idea I am trying to write up for
> the webont WG). I'll come back to this proposal later.
> 
> However, even if we were to state such 'doublet conditions', we still
> have another problem. If DDD is an rdfs:subClassOf EEE, then this
> graph entails another triangle:
> 
> eg:thing---eg:prop--->_:1
> _:1---AAA--->UUU
> _:1---rdf:type--->DDD
> _:1---rdf:type--->EEE
> 
> and now any condition that we can state on AAA, UUU and DDD in order
> to restrict I(UUU) to have the relationship to I(_:1) sanctioned by
> I(DDD) is going to apply just as well to I(EEE).  Even if we state
> conditions on doublets, there are two doublets in this graph.  But if
> EEE is a datatype, there is no guarantee that the lexical-to-value
> mapping associated with EEE is compatible with that associated with
> DDD.
> 
> One reaction is to just stop this happening by some kind of fiat:
> forbid subClassOf to be applied to datatypes ....
> (objection: in order to make use of RDFS reasoning on datatypes, we
> presumably want to be able to say that one datatype is a sub-datatype
> of another, and it seems natural to do that by saying that one of
> them is an rdfs:subClassOf the other. In fact, it's more than natural:
> it's the *only* way we have of doing it, if datatypes are classes.
> Another objection: it might follow from normal class reasoning that
> one datatype value space is a subset of another. After all, they are,
> as a matter of fact.)
> .....or declare that we will only allow the use of rationally
> constructed datatype schemes....
> (objection: XSD isn't rational enough.)
> But none of these seem satisfactory as a solution.
> 
> Now, the key point about this second problem is not the details of
> the semantic conditions or the use of bnodes or whatever: it is RDFS
> class inheritance. That is what causes all this trouble.  And that
> arises from our using rdf:type in the triangle in the first place.
> Even if we allowed literals as subjects, so that we could put the
> literal and the datatype name in a single triple (which is about as
> 'local' as one could get in a triples notation with datatype class
> names) :
> 
> UUU rdf:type DDD .
> 
> the problem would still be there.
> 
> So here is my proposal to fix this second problem: DON'T use rdf:type
> to associate datatypes with nodes. Instead use a new property, which
> I will call rdf:dtype. This can be required to be an
> rdfs:subPropertyOf rdf:type, so we can do a fair amount of
> class-membership reasoning using it. But now the bad inferences are
> blocked, because
> 
> foo rdf:dtype DDD .
> DDD rdfs:subClassOf EEE .
> 
> does not entail
> 
> foo rdf:dtype EEE .
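
A minimal sketch of why this inference is blocked, assuming a toy forward-chaining closure that knows only two RDFS rules (lifting along rdfs:subPropertyOf, and inheritance of rdf:type along rdfs:subClassOf). The function name rdfs_closure and the encoding of triples as Python tuples are illustrative assumptions, not part of the proposal or of any RDF toolkit.

```python
TYPE, DTYPE = "rdf:type", "rdf:dtype"
SUBPROP, SUBCLASS = "rdfs:subPropertyOf", "rdfs:subClassOf"

def rdfs_closure(triples):
    """Toy closure using only the two RDFS rules relevant here."""
    graph = set(triples)
    while True:
        new = set()
        for (s, p, o) in graph:
            for (s2, p2, o2) in graph:
                # x p y . p rdfs:subPropertyOf q .  =>  x q y .
                if p2 == SUBPROP and s2 == p:
                    new.add((s, o2, o))
                # x rdf:type c . c rdfs:subClassOf d .  =>  x rdf:type d .
                if p == TYPE and p2 == SUBCLASS and s2 == o:
                    new.add((s, TYPE, o2))
        if new <= graph:          # nothing new: fixpoint reached
            return graph
        graph |= new

closed = rdfs_closure({
    (DTYPE, SUBPROP, TYPE),       # rdf:dtype rdfs:subPropertyOf rdf:type .
    ("foo", DTYPE, "DDD"),
    ("DDD", SUBCLASS, "EEE"),
})

print(("foo", TYPE, "EEE") in closed)    # True: class membership is inherited
print(("foo", DTYPE, "EEE") in closed)   # False: the datatyping link is not
```

Under these two rules nothing ever produces a new rdf:dtype triple, which is the whole point: class membership flows upward, but the datatyping link stays where it was asserted.
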
> 
> The utility of this is that it enables us to carefully limit the
> extent to which this datatyping property gets inherited. For example,
> we might want to impose a 'ranging condition':
> 
> If:
> foo rdfs:range DDD
> xxx foo yyy
> then:
> yyy rdf:dtype DDD
>
> whenever I(DDD) is a datatype. Then we can use ordinary subclass
> reasoning on the ranges, using rdfs:subClassOf, but that will only give
> us assertions using rdf:type, not rdf:dtype, so they will not get the
> datatyping constraints confused.  (Recall that foo rdfs:range baz and
> baz rdfs:subClassOf bar does *not* entail foo rdfs:range bar, even
> though rdf:type is inherited by superclasses.)
> 
> I think the only change this would require to the current MT
> would be to add
> 
> rdf:dtype rdfs:subPropertyOf rdf:type .
> 
> to #1 in the RDFS closure.
> 
> It would be perfectly *permissible* to use rdf:type in a doublet, of
> course, and in 'simple' cases it would have the same effect (which I
> am willing to bet would cover all the legacy uses)  but it is
> potentially dangerous, since it might lead to inconsistencies  with
> more complicated datatype hierarchies. (Notice that it would be quite
> harmless to conclude that DDD was a subclass of EEE as long as EEE
> wasn't a datatype class; that would have no effect on anything.) So
> we can still allow it, but deprecate it in favor of rdf:dtype. The
> 'simple' cases would be those in which rdf:type rdfs:subPropertyOf
> rdf:dtype, so that the two were equivalent in meaning; and if someone
> were willing to assert that, then they could use rdf:type freely to
> attach datatypes to literals and it would work just fine from normal
> rdfs reasoning, as long as there were no datatype mapping inheritance
> inconsistencies.  (And even then, all that would happen is that some
> particularly dumb RDF graphs might give silly or inconsistent
> results; the reasoners wouldn't actually break.) Of course, a DPH
> could achieve the same effect, and run the same risks, just by
> ignoring the difference between rdf:type and rdf:dtype, and we could
> even point this out to said DPH, with our blessing.
> 
> Now, with this change, we can solve the Bermuda triangle problem by
> stating semantic conditions on doublets as follows. (This is an
> adaptation of an idea from Graham):
> 
> A dt-interpretation I of E   - with respect to some externally
> defined set D of datatypes, where each datatype d in D has an
> associated lexical-to-value mapping LV(d)  - is an
> rdfs-interpretation of E where for every doublet
> 
> S rdf:value "uuu"
> S rdf:dtype ddd
> 
> in E, if I(ddd) is in D, then I(S)=LV(I(ddd))(uuu)
> 
> Notice that this really is a condition on the doublet because it
> mentions both ddd and uuu. It is consistent with I("uuu") being the
> string uuu, so we could make graphs tidy on literals (keeps Sergey
> and Dan C happy and simplifies the graph syntax.)  It is a
> restriction on an rdfs-interpretation, so every dt-interpretation is
> an rdfs-interpretation, so dt-entailment is a strengthening of
> rdfs-entailment, as we would want.
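
A minimal sketch of the doublet condition, under some loud assumptions: datatype names denote themselves (so I(ddd) is just the name), each lexical-to-value mapping LV(d) is modelled as a Python function, each subject carries at most one rdf:value triple, and eg:octal is a made-up datatype used only to get a second, incompatible mapping. The helper names are hypothetical.

```python
# Assumed set D of datatypes with their mappings LV(d).
LV = {
    "xsd:integer": int,
    "eg:octal": lambda s: int(s, 8),   # hypothetical datatype
}

def forced_values(triples, lv=LV):
    """Map each doublet subject S to the values the condition forces on I(S)."""
    literal = {s: o for (s, p, o) in triples if p == "rdf:value"}
    forced = {}
    for (s, p, d) in triples:
        if p == "rdf:dtype" and d in lv and s in literal:
            # I(S) = LV(I(ddd))(uuu)
            forced.setdefault(s, set()).add(lv[d](literal[s]))
    return forced

def doublets_satisfiable(triples, lv=LV):
    """Necessary check only: no subject may be forced to two different values."""
    return all(len(vals) == 1 for vals in forced_values(triples, lv).values())

one = {("_:1", "rdf:value", "10"), ("_:1", "rdf:dtype", "xsd:integer")}
two = one | {("_:1", "rdf:dtype", "eg:octal")}

print(forced_values(one))           # {'_:1': {10}}
print(doublets_satisfiable(one))    # True
print(doublets_satisfiable(two))    # False: "10" would have to be both 10 and 8
```

The second graph is what the bad inheritance would produce if rdf:dtype were passed up to a superclass whose mapping disagrees, which is why the condition is stated on rdf:dtype rather than rdf:type.
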
> 
> BTW, the easiest way to understand rdf:value here would be that it is
> a union (disjunction) of inverses of canonical submappings of the
> datatype mappings in D, ie IEXT(I(rdf:value)) = {<x,y> : for some
> datatype d in D, LV(d)(y)=x} . But this really doesn't matter all
> that much (as long as it can be given some consistent
> interpretation), since the only significant role of the rdf:value
> triple is to be part of the datatyping doublet. Also, we could call
> that property anything we like, as long as it is reserved for this
> special use.
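
A rough sketch of that extension over a finite sample of lexical forms, assuming the mappings are Python functions that raise ValueError outside their lexical space. It ignores the "canonical submapping" refinement and simply takes the union of the inverses; eg:octal is again a made-up datatype and value_extension is a hypothetical helper.

```python
def value_extension(lv, lexical_forms):
    """Pairs <x, y> such that LV(d)(y) = x for some datatype d in lv."""
    pairs = set()
    for mapping in lv.values():
        for y in lexical_forms:
            try:
                pairs.add((mapping(y), y))
            except ValueError:
                pass   # y is not in the lexical space of this datatype
    return pairs

LV = {"xsd:integer": int, "eg:octal": lambda s: int(s, 8)}
ext = value_extension(LV, ["10", "8"])

print(sorted(ext))   # [(8, '10'), (8, '8'), (10, '10')]
```

The string "10" is paired with both 8 and 10, so the rdf:value triple on its own cannot fix the value; only the doublet can.
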
> 
> Notice that this uses exactly the same RDF graph as the TDL proposal
> (apart from the rdf:dtype change). It's just a different MT.  It works
> pretty much the same way for rdfs:range as well.
> 
> As stated, this only works for 'local typing', ie where the (rdfs
> closure of the) graph contains an actual triple with "rdf:dtype" in
> it.  What about ranges and so on? Well, we can introduce yet a
> further semantic extension, which would be a range-dt-interpretation,
> which also satisfies the ranging condition described above, ie that
> if both
> 
> fff rdfs:range ddd
> xxx fff yyy
> 
> are true then
> 
> yyy rdf:dtype ddd
> 
> must be true also. Obviously this has a corresponding closure rule
> that would add the appropriate rdf:dtype assertions to a graph, and
> then we can characterize 'remote' datatyping by saying that it
> amounts to doing local datatyping in the range-dt-closure of the
> graph. Then we would have an entailment lemma for each kind of
> datatyping as an extension of the previous kind, extending the
> 'layers' that we already have for simple/rdf/rdfs/...entailment.
> Expressed more operationally, that is just saying that in order to
> check 'remote' datatyping, we need to do some extra inferencing of a
> particular kind. Most of it will be normal rdfs inference, but it
> needs to pay special attention to rdf:dtype.
> 
> We could even express something like a preference for local over
> global, by saying that one is supposed to check dt-entailment before
> checking for range-dt-entailment. That makes the distinction
> rigorously clear in MT terms without actually including barbaric
> things like defaults into the MT itself :-) Notice that something can
> be range-dt-inconsistent but still be dt-consistent; that would be a
> range d.t. contradicting a local d.t.. But if we decide to stick with
> simple dt-entailment and only worry about that, and never check
> range-dt-entailment, then we will never notice the higher-level
> inconsistency, and indeed the graph will be consistent as far as we
> are concerned. And all these graphs are perfectly fine seen as simple
> RDFS, though they aren't saying much about the values of the bnodes
> in the triangles when looked at from that low a level.
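
A small sketch of a graph that is dt-consistent but range-dt-inconsistent, assuming the range property is written rdfs:range, datatype names denote themselves, and mappings are Python functions. The datatype eg:octal and all helper names are hypothetical, and the consistency test is only the necessary check that no doublet subject is forced to two values.

```python
LV = {"xsd:integer": int, "eg:octal": lambda s: int(s, 8)}  # eg:octal is made up

def forced_values(graph):
    """Values that doublets (rdf:value + rdf:dtype) force on each subject."""
    literal = {s: o for (s, p, o) in graph if p == "rdf:value"}
    forced = {}
    for (s, p, d) in graph:
        if p == "rdf:dtype" and d in LV and s in literal:
            forced.setdefault(s, set()).add(LV[d](literal[s]))
    return forced

def dt_consistent(graph):
    return all(len(vals) == 1 for vals in forced_values(graph).values())

def range_dt_closure(graph):
    """fff rdfs:range ddd . xxx fff yyy .  =>  yyy rdf:dtype ddd .  (ddd in D)"""
    closed = set(graph)
    for (f, r, d) in graph:
        if r == "rdfs:range" and d in LV:
            closed |= {(y, "rdf:dtype", d) for (x, p, y) in graph if p == f}
    return closed

graph = {
    ("eg:Bill", "eg:age", "_:x"),
    ("_:x", "rdf:value", "10"),
    ("_:x", "rdf:dtype", "eg:octal"),         # local typing
    ("eg:age", "rdfs:range", "xsd:integer"),  # range typing
}

local_ok = dt_consistent(graph)                    # only the local doublet is seen
range_ok = dt_consistent(range_dt_closure(graph))  # the range adds a second doublet

print(local_ok, range_ok)   # True False
```

A checker that stops at the first call never notices the clash, which matches the "consistent as far as we are concerned" reading in the paragraph above.
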
> 
> <datatyping proposal />
>   <more general issue>
> 
> Now, all this business of defining more and more different notions of
> interpretation and associated kinds of entailment might seem kind of
> a crock: after all, RDF is supposed to be a single language, not a
> whole series of languages, right? But in fact, I would argue that it
> is appropriate, since RDF(S) is supposed to be a 'ground layer' for
> the whole SWeb, so one would expect that it should support a whole
> lot of different kinds of interpretation, as one puts more and more
> things into it that are supposed to have extra meanings over and
> above the basic RDF meaning. In a sense, it *is* a series of
> languages, but which are all encoded in the same RDF triples-graph
> syntax, so we have to define a whole range of different kinds of
> entailment for the same syntax.  And this is all OK, as long as those
> special 'extra' meanings are indeed genuinely extra, ie if you ignore
> the extra constraints, then the RDF triples that are there can be
> understood as normal RDF triples at some 'lower' level of the layer
> cake, and all their conclusions will be valid there (and will not
> screw up any higher levels). Which they are in this case. (Though not
> when we get to DAML, which is a problem for the webont WG; but that's
> another story.) Then each 'layer' can be seen as a licence to use
> some extra kind of inferencing or processing power on the same graph,
> rather than an extension to the notation/syntax of the language.
> Presumably the licence to use such extra power would come from some
> kind of enclosing tags, like the <daml> tags that DAML uses; or, we
> could introduce the new vocabulary with a new Qname prefix, and write
> rdfd:type and rdfd:value to indicate that they constitute a new
> name-space with special meanings, maybe. (Can we, the RDF WG, do
> that?).
> 
> The overall picture one gets is something like Tim's famous
> layer-cake, but with finer structure in the layers. (But then we
> ought to expect that when we get closer, we can see finer structure,
> right?) Like this:
> 
> Full datatype entailment (where datatypes on rdfs:ranges of properties
> can fix meanings of literals)
> ---------------------------------
> Local datatype entailment (where only the explicit typing triples fix
> meanings of literals)
> ---------------------------------
> RDFS entailment (where the rdf: and rdfs: vocabulary is properly
> interpreted but literals are just strings)
> ---------------------------------
> RDF entailment (where rdf:type and rdf:Property are properly
> interpreted: it might not be worth distinguishing this layer, in
> practice.)
> ---------------------------------
> Simple RDF entailment (just from the raw triples)
> ---------------------------------
> (I guess XML is under here somewhere :-)
> 
> It's the same syntax everywhere: sets of RDF triples. But each layer
> adds some inferential power to the layer below, expressed in the MT
> as a new set of semantic constraints on allowable interpretations
> imposed on some part of the vocabulary. When those restrictions have
> to mention more than one triple (as here) then in a sense we could
> say that this amounts to encoding a different, higher-level, syntax
> in the RDF graph, but we aren't *obliged* to talk about it that way.
> 
> BTW, since the top two are relative to D, this is really a tree, if
> there are several datatyping schemes in use. But RDFS is all in the
> trunk.
> 
> Pat
> 
> PS. More later on the S idioms and how they can be included into this
> picture. They don't need the use of rdf:dtype to avoid inappropriate
> inheritance.
> 
> --
> ---------------------------------------------------------------------
> IHMC                                    (850)434 8903   home
> 40 South Alcaniz St.                    (850)202 4416   office
> Pensacola,  FL 32501                    (850)202 4440   fax
> phayes@ai.uwf.edu
> http://www.coginst.uwf.edu/~phayes

Received on Friday, 1 February 2002 15:10:42 UTC