FW: One final step to datatyping convergence and closure?

--
               
Patrick Stickler              Phone: +358 50 483 9453
Senior Research Scientist     Fax:   +358 7180 35409
Nokia Research Center         Email: patrick.stickler@nokia.com


------ Forwarded Message
From: Patrick Stickler <patrick.stickler@nokia.com>
Date: Wed, 13 Feb 2002 12:53:22 +0200
To: Pat Hayes <phayes@ai.uwf.edu>
Cc: "McBride, Brian" <bwm@hplb.hpl.hp.com>
Subject: Re: One final step to datatyping convergence and closure?

On 2002-02-12 23:42, "ext Pat Hayes" <phayes@ai.uwf.edu> wrote:

>> Thus, those applications which can achieve that merging won't need it.
> 
> I fail to follow this. It might *want* to keep track of the literal
> forms, was my point. For example, it might be important to know that
> this value (of a bnode) has been found somewhere in the corpus
> written in this way (a literal which can be string-matched) using
> these conventions (reference to datatype). If all one wants to do is
> to get at the values, your point might be well-taken; but that isn't
> all that everyone needs to do.

But you can get at both by (internally) simply storing the values
on/in/as the bNodes -- just as you illustrate in your latest
summary, where 35 is on the bNode itself. This is not RDF, but
an implementational representation -- implied by RDF -- and it is
only in a specific implementation that such a representation is
possible, since actual value representations are implementation
specific.

Insofar as RDF is concerned, that actual value is not on/in/eq the
bNode, but an application can put the value there to make queries more
efficient -- and queries based on value comparison would be
binding variables to the values, not to the nodes, so why should
it matter if the nodes are shared/merged/tidy?
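
As a minimal sketch of what "putting the value on the bNode" could mean in the implementation space (not in RDF itself): the store layout, the PARSERS table and both function names below are hypothetical, and int() merely stands in for a real xsd:integer parser.

```python
# Hypothetical in-memory store; nothing here is part of RDF or of any real engine.
# The triples are what the graph records: lexical forms plus a datatype reference.
triples = [
    ("ex:Bob", "ex:age", "_:1"),
    ("_:1", "rdf:value", "30"),
    ("_:1", "rdf:dtype", "xsd:integer"),
    ("ex:Jane", "ex:age", "_:2"),
    ("_:2", "rdf:value", "030"),
    ("_:2", "rdf:dtype", "xsd:integer"),
]

# Parsers for the datatypes this particular application happens to support.
PARSERS = {"xsd:integer": int}


def attach_values(triples):
    """Map each doublet bNode to the value it denotes (implementation side only)."""
    lexical = {s: o for s, p, o in triples if p == "rdf:value"}
    dtype = {s: o for s, p, o in triples if p == "rdf:dtype"}
    values = {}
    for node, form in lexical.items():
        parser = PARSERS.get(dtype.get(node))
        if parser is not None:
            values[node] = parser(form)
    return values


def subjects_with_value(triples, values, prop, wanted):
    """Query by value: compare the denoted values, not the bNode identities."""
    return sorted(s for s, p, o in triples if p == prop and values.get(o) == wanted)


values = attach_values(triples)
print(subjects_with_value(triples, values, "ex:age", 30))  # ['ex:Bob', 'ex:Jane']
```

Both subjects are found although _:1 and _:2 remain distinct, unmerged bNodes, which is the sense in which the query binds to values rather than to nodes.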

>> You seem to be hoping that the datatype triple idiom will get
>> you "values in the graph" but it can't do that reliably and
>> consistently except in a context where you actually *can*
>> achieve values in the implementation-specific graph (e.g. in
>> the internal RDF engine, attach the actual value to the bNode
>> on import,
> 
> Wait, wait. What do you mean, 'attach the actual value'? There is no
> way to attach *actual values* (other than strings) to nodes in an RDF
> graph. We don't have any RDF syntax for actual values.

We have been talking about the implementation space the whole time.
I expected that to have been clear. The whole point of your arguments
has been regarding implementational utility. So of course if I say
'attach a value to the bNode' I'm talking about the implementation
space. I'm not stupid. I've been working with RDF long enough to
know that you can't attach a value to a node in the RDF graph proper,
and I would have expected you to give me that benefit of the doubt
and consider that you had perhaps misunderstood me. Thank you very
much.

And the fact that I said 'in the *internal* RDF engine' should
have been a strong hint about the application specific context.

Before you conclude that I've said something clearly stupid, I would
very much appreciate it if you would please consider first that I
haven't, but that you have misunderstood me, and read it again, OK?

> You seem (??) to be assuming a use mode in which RDF is used for
> information interchange between systems that immediately translate it
> into some 'internal' non-RDF form, and that datatyping is primarily
> relevant at the moment of such translation.

Yes and no. Yes insofar as datatype values are concerned, because
RDF can only denote them, but cannot represent them. No, for
pretty much everything else (though I'm sure there are other
exceptions).

> That isn't what I see as
> the typical use of RDF at all: I see it as more like a
> general-purpose way of *recording* information (eg on a website)
> where it can be accessed and used by a variety of processes, which
> will typically themselves produce more RDF, often by manipulating RDF
> content from a number of sources, which will itself get published in
> RDF for use by other agents.

Right. Like HTML and XML record information on a website and
are interpreted in-the-raw rather than transformed into the
DOM, or a SAX event stream, or an XML infoset, or a browser
display pane... right, just like that ;-)

And my arguments speak precisely to such a scenario, such
that the RDF graph *records* knowledge but is not necessarily
the actual literal data structure used by any "RDF engine"
to store such knowledge -- but rather that "RDF engine"
provides an abstract interface emulating the logical RDF
graph structure on top of a (presumably) more efficient internal
representation -- and in the case of knowledge which is implicit
(albeit unambiguously implied) in the graph, such as the actual
value that some TDL pairing denotes, that "RDF engine" may
provide higher level access which is value based rather than
directly idiom/graph based.

The very same higher level "view" can be provided for all
kinds of equalities and equivalences, such as those defined
by rdfs:subPropertyOf, rdfs:subClassOf, daml:equivalentTo,
etc. etc.

>> (a) queries have to be three-part rather than just two-part
> 
> ?? WHY? If one of the idioms entails the other, then you only have to
> query the one that is 'downstream'.

You are presuming an "RDF engine" that has built-in RDFS support.

The point of a 'local' idiom is to be able to express and query
typed data literals without additional RDFS expressed knowledge
(with the exception being any knowledge mandated by the MT such
as produced by closure rules, etc. which any given implementation
must and can take into account).

That is explicitly in the desiderata, and our proposals/solutions
should address the desiderata.

> If the other idiom is used, the
> query will be satisfied in either case. The doublet form entails the
> triple, but not the reverse. So if both idioms are used, one only
> needs to check the triple form in the query.

The datatype triple idiom is entailed by a closure rule that
depends on explicit RDFS knowledge about the datatyping property
in the graph. The idiom itself, sans this extra, explicit
knowledge not provided by the MT, is not sufficient to recognize
datatyping properties as such.

>> (b) queries have to specify schema statements for the datatype
>>     triple idiom (again, it's not local)
> 
> Again, see previous messages for this. First, not really true;

See my replies... it is true ;-)  Without the rdfs:subClassOf
rdfs:Datatype and rdfs:subPropertyOf rdf:value statements, the
datatype property cannot be generically recognized as a datatype
property and therefore it is not a local idiom.


>> But not offering the utility you attribute to it. Consider a context
>> of mass syndication of knowledge from many many sources, in real-time.
>> In order for the idiom to offer some representation of equivalent
>> values, merging must be done on the introduction of every new statement,
>> and if such merging is being done, the application is likely going
>> to just insert the values and be done with it.
> 
> In the sense you are using the term here, the bnode IS the 'value'.
> Suppose for example we know that
> 
> _:t34276 rdf:value "the phone number of the man in the red hat" .
> 
> and later we figure out, and add the graph:
> 
> _:t34276 xsd:number "8504348903"

Firstly, it is hard to really consider your example since
you're using fictitious, possibly fanciful datatypes, but
presuming that xsd:number is analogous or equivalent to
xsd:integer, the above case would be in error, since
"the...red hat" is not a valid lexical form for xsd:integer.


>> Thus, there are very very few conceivable contexts where (a) the application
>> cannot perform the merging itself and (b) it is sure that all possible
>> mergings have occurred. Thus, again, the actual real-world utility of
>> the merged bNodes denoting value equality is an illusion. It just doesn't
>> offer what you think it does. Not in practice.
> 
> I disagree. If I were writing apps, I would be using this all the
> time. In the RKF project if it gets merged with DAML, we will use
> this kind of technique centrally in our reasoning engines (which make
> several slightly naughty closed-world assumptions behind the scenes
> in order to get answers shipped in a reasonable time.)

The key phrase here is "in our reasoning engines".

What happens within/inside a given application is one thing. What
is part of the standard that *everyone* has to use is another.

I have repeatedly said, and I'll say it again, that your motivations
for keeping the datatype triple idiom seem based solely on utility
it *might* offer particular implementations with regards to
*internal* representations relevant to *application specific*
processes -- and that such arguments have little to no basis for
including some functionality in a global standard which is intended
to provide an economical, portable, consistent, and optimal
representation of knowledge.

>> My point was that, just because two datatype triple bNodes
>> are not the same bNode does not mean they don't denote the
>> same value -- unless the RDF engine fully keeps up with
>> all mergings of all equal values,
> 
> Well, all the ones it knows about, sure. But that would be easy to do.

Agreed, but with no need of merged bNodes.

>> But the need for such coreference in the actual graph is limited
>> at best. If an RDF application supports the datatypes in question
>> then it does not care about coreference since it can determine
>> that itself.
> 
> But there may be many other aspects to the entity apart from its
> value. In the phone-number example,

The phone number example is broken. Can you provide another, please?


>>>  That is exactly what I am doing. The RDF *is* the internal knowledge
>>>  base, and the nodes are the canonical internal reps..
>> 
>> Fine. Then use the datatype triple idiom as an application specific
>> representation if you like (though there are better ways). But insofar
>> as knowledge interchange is concerned, it has no utility and therefore
>> no place in the standard.
> 
> But the standard isn't ONLY to be used for knowledge interchange
> between applications. It also is intended to be used for recording
> and publishing content; I would say  primarily to be used for that,
> seems to me.

If you really need to capture that explicitly in the graph, then
just merge the doublet bNodes into a bag. Then, an application
knows if/which doublet values are not accounted for and which
are considered equal.

>> Given my comments above and elsewhere, since I consider the utility
>> offered by the datatype triple to not apply in contexts of either
>> mass syndication or datatype aware applications -- which I consider
>> to cover nearly all RDF contexts
> 
> Well, maybe that is the problem here. There are many potential uses
> of the SW that do not fall into those two categories. The B2B
> examples which Tim talks about do not, for example.

Huh?! Of course they do. Please explain how they do not. It's B2B,
eCommerce, Web Services, whatever you want to call it, that has been
screaming the most for strict datatyping in RDF -- and it's such
industries that must deal continually with masses of knowledge
coming from disparate sources -- partners, clients, suppliers,
regulatory agencies, etc. etc.

B2B is probably the *best* example of a context with a high degree
of multi-source syndication and acute need of datatype aware
processing.

>> Yes, but inferencing will be based on values, not idioms, and
>> thus the idioms used to *interchange* the knowledge will become
>> transparent or even discarded in the application
> 
> No, no, not at all!! Very important point !! RDF is to be used to
> support inference DIRECTLY. One does inference *in* RDF. And
> inference is based on syntactic forms, which include what we have been
> calling 'idioms' . They will not become transparent or discarded;
> they are the very medium in which inference takes place, the
> syntactic substrate of inferences. RDF(S) is the 'logic', not
> something that gets converted or translated into some other logic.

Then query by value will never succeed since literals are not
required to be canonical lexical forms.

Since the RDF graph can *never* contain values as syntactic
components, the graph itself will *never* provide all that is
required for determining equivalence of values.
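
A small sketch of why purely syntactic matching falls short here, assuming three lexical forms that all denote the integer thirty; the list of forms is invented for illustration, and int() is only a stand-in for a real xsd:integer parser.

```python
# Lexical forms of one and the same xsd:integer value, as they might arrive
# from different sources (none of them is required to be canonical).
forms = ["30", "+30", "030"]

# Syntactic query: match on the literal string only.
string_matches = [f for f in forms if f == "30"]

# Value-based query: map each lexical form to its value first.
# int() stands in for a real xsd:integer parser (an assumption of this sketch).
value_matches = [f for f in forms if int(f) == 30]

print(string_matches)  # ['30']
print(value_matches)   # ['30', '+30', '030']
```

Only the datatype-aware comparison finds all three, which is the step the graph alone cannot supply.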

I understand that you want the shared bNode of the datatype
triple idiom to serve that purpose, but it never can do
so reliably, consistently, practically without the help
of datatype aware applications, and in such a context, it
is not needed.

The lexical form is just a means to an end. Applications must
use lexical forms and datatyping idioms because RDF has no
native datatypes and values are not part of the graph grammar.

But any application that cares about typed data literals does
not care about the lexical form, but about the value itself.

Granted, some may find the historical aspects of which lexical
form was used interesting -- and when it comes to re-expressing
some value in an RDF graph for further interchange, some
lexical form must be chosen (along with an actual datatype)
but these are secondary issues, just like round tripping of
qnames. They are not central to the whole point of typed
data literals. Really.

What you want to accomplish with the datatype triple can be
accomplished in many other (better) ways within a given
application, and the coreference between values need not
be explicit in the RDF graph.


>> Again, I'll repeat, since I seem to be failing to communicate
>> this one point:
>> 
>> 1. We must have a fully local idiom.
>> 2. The datatype triple idiom is not a fully local idiom.
>> 3. The doublet idiom is the only fully local idiom.
>> 4. If we must choose one, we must choose the doublet idiom.
>> 5. There is no real utility offered by the datatype triple idiom.
>> 6. There is no sufficiently motivating reason to include the datatype
>>    triple idiom.
>> 
> 
> None of the idioms are fully local in your sense, and there is
> genuine utility offered by the triple form.

We really do seem to be at loggerheads about these two points.


>> 2. My preference has always been for untidy literals and literals
>>    as subjects (a'la P++) in conjunction with rdfs:range for
>>    global typing.
> 
> Well, I've given up on literals as subjects.

Fair enough.

>> 3. If you can make that work, great.
>> 
>>    Bob ex:age _:1:"30" .               It's just an (untidy) literal.
>> 
>>    Bob ex:age _:1:"30" .               It's an integer.
>>    _:1:"30" rdf:dtype xsd:integer .
>> 
>>    Bob ex:age _:1:"30" .               It's an integer.
>>    ex:age rdfs:range xsd:integer .
> 
> That could work, sure, and I like that also since the datatyping is
> 'about' the literal, which seems intuitively correct.  But I thought
> that tidy-literals was now kind of a done deal because of Dan C's
> legacy use arguments.

Well, lots of discussion on the WG list and on the interest and logic
lists clarified that there really is no conflict -- that it was up to the
query user or engine to decide whether datatyping was to be taken
into account or not, and that there was even no problem with having
tidy literals in that context since the datatype interpretation
was not based solely on the literal.

Dan never conceded to that evidence, even though everyone else did.

The bNode global idiom was just a way to move forward around the
issue -- a real political compromise.

However, I think that in retrospect, the mandatory bNode has one
very nice (unexpected at the time) feature, that the bNode
consistently denotes the actual value, and in a specific
datatype aware implementation, can be replaced or augmented
with the actual value as a means of optimization/enhancement.
Just as your example pics in your summary suggest, where the
value is depicted on the bNode.

>> The datatype triple idiom cannot be distinguished (generically)
>> from any other triple without the schema knowledge.
> 
> NONE of the idioms can be.

I've already addressed this above. The doublet idiom can.

> They are all perfectly well-formed and
> meaningful RDF when used with a non-datatype uriref. One might make
> an informed guess that the use of rdf:dtype is a strong hint that its
> object is intended to be a datatype name, but its only a guess.

It's far from a guess. The range of rdf:dtype is rdfs:Datatype, and
that is mandated by the MT, which an application has a right to
presume, even if it is not explicit in the graph.

> In 
> the case of the d.t. triple, you can make an analogous informed
> guess: if the property arc of a triple with a literal object is
> anything other than rdf:value, then it's probably intended to be a
> datatype name.

Uhhhh, like

   xxx foo:widget _:1 .
   _:1 abc:wombat "34971918374" .

where supposedly 'abc:wombat' is a datatype and "34971918374"
is a lexical form of that datatype...? Nope. Wrong.

_:1 is simply some kind of qualified value with some
property 'abc:wombat' (whatever that is) with a literal
value "34971918374". It's *not* evident that it is
a typed data literal, insofar as can be determined
from the actual idiom/subgraph.

Now, if we had either

   xxx foo:widget _:1 .
   _:1 rdf:value "34971918374" .
   _:1 rdf:dtype abc:wombat .

or

   xxx foo:widget _:1 .
   _:1 abc:wombat "34971918374" .

    *plus* somewhere else:

   abc:wombat rdfs:subPropertyOf rdf:value .
   abc:wombat rdfs:subClassOf rdfs:Datatype .

then it *would* be clear that abc:wombat is
a datatype property and "34971918374" is a lexical form
for that datatype.

Now...  which of the above two idioms is independent of
statements using vocabulary from RDFS?

Which of the two idioms is truly 'local'?

The doublet idiom.
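
The comparison can be sketched as two recognizers over a set of triples. This is only an illustration under the assumptions stated in this message (rdf:dtype marks the doublet; the triple idiom needs the two schema statements); the function names and the set-of-tuples graph are hypothetical.

```python
def doublet_datatype(graph, node):
    """Return (lexical form, datatype) if node carries the doublet idiom, else None.

    Looks only at the arcs leaving the node itself; no schema statements needed.
    """
    form = next((o for s, p, o in graph if s == node and p == "rdf:value"), None)
    dtype = next((o for s, p, o in graph if s == node and p == "rdf:dtype"), None)
    return (form, dtype) if form is not None and dtype is not None else None


def triple_datatype(graph, node):
    """Same for the datatype triple idiom; relies on schema statements in the graph."""
    for s, p, o in graph:
        if s != node:
            continue
        is_value_prop = (p, "rdfs:subPropertyOf", "rdf:value") in graph
        is_datatype = (p, "rdfs:subClassOf", "rdfs:Datatype") in graph
        if is_value_prop and is_datatype:
            return (o, p)
    return None


doublet = {
    ("xxx", "foo:widget", "_:1"),
    ("_:1", "rdf:value", "34971918374"),
    ("_:1", "rdf:dtype", "abc:wombat"),
}
triple = {
    ("xxx", "foo:widget", "_:1"),
    ("_:1", "abc:wombat", "34971918374"),
}
schema = {
    ("abc:wombat", "rdfs:subPropertyOf", "rdf:value"),
    ("abc:wombat", "rdfs:subClassOf", "rdfs:Datatype"),
}

print(doublet_datatype(doublet, "_:1"))         # ('34971918374', 'abc:wombat')
print(triple_datatype(triple, "_:1"))           # None
print(triple_datatype(triple | schema, "_:1"))  # ('34971918374', 'abc:wombat')
```

The second call returns None because, without the schema statements, abc:wombat looks like any other property.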
   
>> No. For the doublet idiom, the presence of the rdf:dtype property
>> tells us it is a datatype.
> 
> No, it does not. Really. The semantics only says that rdf:dtype is a
> subproperty of rdf:type, and that IF its object is a datatype THEN
> some special conditions apply. It does not require that it is, in
> fact, a datatype; if it isn't, then those special conditions do not
> apply, is all.

My understanding of the proposal was that the
object of rdf:dtype is an rdfs:Datatype. I.e. that

   rdf:dtype rdfs:range rdfs:Datatype

would be a closure rule (or otherwise specified) in the MT
and should be presumed by all applications whether explicit
in the graph or not.

If this is not the case, then it should be. Otherwise we have no
truly local idiom (per my definition of 'local' which is what
I think the desiderata means by 'local').

I also still think that we need something like rdfs:drange
to differentiate between rdf:type and rdf:dtype assertions.

It may be that I wish to only assert a range constraint
on the value space of a given property, but don't want to
create a whole new non-lexical type to do so. I.e. I
may want to say that all values of ex:age are integer
values, but I don't care about the actual datatype used,
and thus would say

  ex:age rdfs:range xsd:integer .

which simply says that I expect all values to be integers
even if locally typed differently, and that would thus entail
rdf:type and not rdf:dtype.

If I further wanted to constrain property values to lexical
representations of xsd:integer, then I'd say

  ex:age rdfs:drange xsd:integer .

which would entail rdf:dtype for property values.

In the MT we would then specify that the following are 'given'

   rdfs:drange rdfs:subPropertyOf rdfs:range .
   rdfs:drange rdfs:range rdfs:Datatype .

Thus, the combination of the two statements above along
with 

   rdf:dtype rdfs:range rdfs:Datatype .

allows applications to reliably recognize the doublet idiom
as a datatyping idiom and the URIref value of rdf:dtype
as a datatype irrespective of anything else in the graph.
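
A rough sketch of how the two range properties might drive different entailments. rdfs:drange is only the proposal made in this message, not an existing RDFS term, and the function name and graph representation are hypothetical. For brevity the sketch derives only the most specific typing triple for each property value.

```python
def apply_range_rules(graph):
    """Return the typing triples entailed for property values.

    rdfs:range yields rdf:type; the proposed rdfs:drange yields rdf:dtype.
    Only the most specific typing triple is derived in this sketch.
    """
    inferred = set()
    for prop, rel, cls in graph:
        if rel not in ("rdfs:range", "rdfs:drange"):
            continue
        typing = "rdf:dtype" if rel == "rdfs:drange" else "rdf:type"
        for s, p, o in graph:
            if p == prop:
                inferred.add((o, typing, cls))
    return inferred


value_space_only = {
    ("ex:Bob", "ex:age", "_:1"),
    ("ex:age", "rdfs:range", "xsd:integer"),
}
lexical_constraint = {
    ("ex:Bob", "ex:age", "_:1"),
    ("ex:age", "rdfs:drange", "xsd:integer"),
}

print(apply_range_rules(value_space_only))    # {('_:1', 'rdf:type', 'xsd:integer')}
print(apply_range_rules(lexical_constraint))  # {('_:1', 'rdf:dtype', 'xsd:integer')}
```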

> but then we 
> could do the same kind of thing for the triple idiom as well.

How? We cannot reference specific datatypes in the MT.

Patrick

------ End of Forwarded Message

Received on Wednesday, 13 February 2002 05:52:49 UTC