RE: The X Datatype Proposal

> -----Original Message-----
> From: ext Pat Hayes [mailto:phayes@ai.uwf.edu]
> Sent: 14 November, 2001 01:44
> To: Stickler Patrick (NRC/Tampere)
> Cc: w3c-rdfcore-wg@w3.org
> Subject: Re: The X Datatype Proposal
> 
> 
> >           Definition of X Proposal, with examples
> 
> ....
> 
> >GLOSSARY OF TERMS
> >
> >representation space
> >
> >         A set of concrete representations mapping to values in a
> >         value space which facilitate automated operations
> >         in terms of those values -- e.g. the reification of
> >         a value space within a computer system
> 
> If I follow you, this is what I was calling a datatype mapping, ie a
> mapping from a domain of lexical literal forms into a set of literal 
> values; an example might be the standard mapping from decimal 
> numerals to natural numbers, right?

Right. But I don't see how it is possible (or even useful) to try
to define such a mapping in or for RDF, because to do so requires
defining a canonical representation for all values in a given
value space which means RDF having its own native, internal
data type scheme.

Since RDF itself is not an application, and applications interpret
RDF encoded data, all that can be accomplished is a mapping from
lexical space to canonical lexical space, which will still require
a mapping from that RDF defined canonical lexical space into the
internal representation space of an application.

I think that by trying to define that latter mapping, we are stepping
outside the reasonable bounds of "RDF Space".
  

> >canonical lexical space
> >
> >         A lexical space where each value in the value space
> >         has only one possible representation in the lexical space
> 
> I fail to follow the distinction between 'representation' and 
> 'lexical' in your usage.

A representation need not (necessarily) be a lexical representation. 
It could be e.g. a binary value within a computer system. I.e. the 
values are not represented by lexical forms encoded as strings which
must be parsed and interpreted to obtain the value. It *might*
be a lexical representation, but not necessarily. 

All internal representations in a computer are canonical, in 
that any given value has but one realization in the system for
a given value space, and in fact that realization may serve
as the intersection of several value spaces.

A lexical representation need not be canonical. E.g. "05" and "5"
are both lexical representations that map to the integer value
'five' but the internal representation in my computer may be
the sequence of binary digits '101'.
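To make the lexical vs. canonical distinction concrete, here is a minimal Python sketch. The function names are hypothetical and the decimal integer space is just an illustrative choice; nothing here is defined by RDF.

```python
def to_value(lexical: str) -> int:
    """Map a decimal lexical form to the integer value it denotes."""
    return int(lexical, 10)


def canonical_form(value: int) -> str:
    """Map an integer value to its single canonical decimal lexical form."""
    return str(value)


five_a = to_value("05")
five_b = to_value("5")

# Two distinct lexical forms, one value.
same_value = five_a == five_b

# The canonical lexical space has exactly one form per value.
canonical = canonical_form(five_a)

# One possible internal (non-lexical) representation, shown as binary digits.
binary = format(five_a, "b")
```

The point is only that many lexical forms can collapse onto one value, while the canonical form and the internal representation are each unique for that value.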

A given application could e.g. use 'lit:' defined URVs for its 
representation space, since such URVs are required to define
canonical lexical spaces.

> >data type
> >
> >         An explicit lexical space whose members map to
> >         values in an explicit value space
> >
> >(RDF) literal
> >
> >         A string
> >
> >typed (RDF) literal
> >
> >         A lexical form
> >
> >local type
> >
> >         A data type associated directly with an occurrence of a
> >         value serving as the object of a statement
> 
> 1. I do not know what 'associated directly' means.

  {some literal} rdf:type {some data type} .

> 2. Why is the datatype - a *lexical* space - associated with the
> occurrence of a *value* ??

Sorry, bad choice of words. "value" here means "property value"
not "member of a data type value space". Apologies.

> >prescriptive range
> >
> >         A range constraint for a particular predicate defining a global
> >         type which all local types for all values must be equivalent to
> >         (either identical to, or a subclass of, the defined range class)
> 
> I see no difference here between prescriptive and descriptive. The 
> former seems to be the same as the latter with the proviso added
> that everything must be consistent; but that is a vacuous condition 
> in an assertional language.

There is a *huge* difference. It's as significant as the difference
between XML well formedness and XML validity.

Just because an instance is well formed, does not mean it is
valid.

Just because some literal is assigned a type does not mean that
the type is acceptable.

The ambiguity arises here because rdfs:range is used for *both*
purposes, depending on context, to assign a type to a literal
or to constrain the type of a literal. I.e.

Context                       Application
----------------------------------------------------------
Local type + property range   Prescriptive (type ~ range)
Local type only               n/a
Property range only           Descriptive (range -> type)

I don't know how to explain it any more clearly than that. 

The difference is significant.
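
As a rough illustration of the table above, the three contexts can be sketched in Python. The subclass table, the function names, and the returned strings are all hypothetical; this is one reading of the distinction, not a definition.

```python
from typing import Optional, Tuple

# Hypothetical subclass relations: child type -> parent type.
SUBCLASS_OF = {"foo:hexInt": "xsd:integer"}


def is_subtype(t: Optional[str], ancestor: str) -> bool:
    """True if t is identical to ancestor or reaches it via SUBCLASS_OF."""
    while t is not None:
        if t == ancestor:
            return True
        t = SUBCLASS_OF.get(t)
    return False


def apply_range(local_type: Optional[str],
                prop_range: Optional[str]) -> Tuple[Optional[str], str]:
    """Return (effective type of the literal, role played by the range)."""
    if local_type and prop_range:
        # Prescriptive: the range constrains the local type but never replaces it.
        ok = is_subtype(local_type, prop_range)
        return local_type, "prescriptive: " + ("valid" if ok else "invalid")
    if local_type:
        # No range to check against.
        return local_type, "n/a"
    if prop_range:
        # Descriptive: the range supplies the type.
        return prop_range, "descriptive"
    return None, "untyped"
```

Note that in the prescriptive case the local type is returned even when the check fails: the range judges the typing, it does not override it.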

Perhaps someone else who groks this can offer a better
explanation (maybe in mathematical terms).

> >node facet
> >
> >         A primitive property of a graph node serving as the
> >         label of an arc
> 
> ?? What about two different arcs coming out of a single node? I don't 
> see any utility to this idea of a 'facet'.

The reason for calling node properties "facets" is to distinguish
them from RDF properties in general. Facets are primitives of the
graph model, not RDF properties that are defined by RDF constructs
or governed by RDFS property relations. You can't e.g. relate
facets via rdfs:subPropertyOf.

Facets are not members of the class rdfs:Property.

There is no problem with a given node having multiple facets, but
the specific facets for each type of graph node are fixed in the
node model. You cannot define arbitrary facets from arbitrary
ontologies. It is a bounded set defined by the model.

The example implementations for Java and Relational Tables should
make this quite clear.


> >LNode
> >
> >         A node representing a resource labeled by an RDF Literal
> >
> >UNode
> >
> >         A node representing a resource labeled by a URI Reference
> >
> >SNode
> >
> >         A node representing an RDF Statement
> 
> Interesting, I was not aware there were any such nodes.

There are, in my proposed model. This model extends the
concept of bNodes to a taxonomy of graph nodes which provide
the basis for interpretation.

SNodes facilitate the reification and qualification of statements, 
as well as provide a basis for constraining the behavior of query 
and inference processes in the interest of preserving critical
relations between literals (LNodes) and local type definitions
or original statement properties necessary for their reliable
interpretation.

UNodes facilitate the concise definition of compression operations
which are critical for efficient storage and interaction of
RDF encoded knowledge.
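
A minimal Python sketch of such a node taxonomy follows, with a fixed set of facets per node type. The facet names (node_id, label, subject, predicate, object) are my assumption, loosely following the diagram later in this message; they are not taken from the Java or relational example implementations.

```python
from dataclasses import dataclass, fields
from typing import List, Union


@dataclass(frozen=True)
class LNode:
    """Node for a resource labeled by an RDF literal."""
    node_id: int
    label: str  # the literal string


@dataclass(frozen=True)
class UNode:
    """Node for a resource labeled by a URI reference."""
    node_id: int
    label: str  # the URI reference


@dataclass(frozen=True)
class SNode:
    """Node modelling an RDF statement."""
    node_id: int
    subject: UNode
    predicate: UNode
    object: Union[UNode, LNode]


def facets(node) -> List[str]:
    """List the facet names fixed by the node's type."""
    return [f.name for f in fields(node)]


stmt = SNode(1, UNode(2, "urn:foo"), UNode(3, "urn:someProperty"), LNode(4, "bar"))
```

Here the facets are simply the fields declared on each class, so they form a bounded set fixed by the model rather than something an ontology could extend.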
 
> >literal match
> >
> >         The binding of a statement to a query where the statement and
> >         query are expressed in the same vocabulary and in terms of the
> >         same data typing scheme
> 
> We don't really have any notion of 'query' yet, other than in terms 
> of entailment.

But we need one, IMO, at least insofar as constraints on
the binding of property values to superordinate properties
by inference -- so that critical context needed for interpretation
of property values is not lost.

This proposal offers a minimal but sufficient definition of
such constraints.
 
> >Typed literals constitute lexical forms within a given lexical
> >space and which map to values in a given value space.
> >
> >The proper interpretation of a typed literal requires both the
> >lexical form and the identity of the lexical and value space for
> >which the lexical form is expressed.
> 
> It also requires the mapping between them; what you called the 
> representation space and I earlier called the datatype mapping.

No. RDF must avoid defining such a mapping itself. See my 
arguments in my recent posting


 
> >Separation of a lexical form from either the lexical space or
> >value space for which it was originally expressed renders it
> >uninterpretable in a reliable manner.
> 
> That isn't obvious.

OK, let me try (again) to make it obvious.

If we have

   _:X _:someSubProperty "12" .
   _:someSubProperty rdfs:range foo:hexInt .
   foo:hexInt rdfs:subClassOf xsd:integer .
   _:someSubProperty rdfs:subPropertyOf _:someSuperProperty .
   _:someSuperProperty rdfs:range xsd:integer .

and we have a query

   _:X _:someSuperProperty ?V .

which binds ?V to "12", implying the statement

   _:X _:someSuperProperty "12" .

and then an application attempts to interpret the literal "12"
in terms of the type defined for someSuperProperty by rdfs:range,
namely xsd:integer, it will get the value 'twelve' but in fact, 
the value is actually 'eighteen' !!!
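
The arithmetic can be checked with a few lines of Python. The parser table is hypothetical, and foo:hexInt is assumed to use bare hexadecimal digits, as the example implies.

```python
# Hypothetical parsers keyed by data type name.
PARSERS = {
    "xsd:integer": lambda s: int(s, 10),  # decimal lexical forms
    "foo:hexInt": lambda s: int(s, 16),   # assumed: bare hexadecimal digits
}

literal = "12"

# Interpreted per the range of the property in the original statement.
intended = PARSERS["foo:hexInt"](literal)

# Interpreted per the range of the superordinate property.
mistaken = PARSERS["xsd:integer"](literal)
```

Both calls succeed, which is exactly the danger: nothing signals that the second interpretation is wrong.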

Let's take a similar example, but with more focus on 
lexical space compatibility:

If we have

   _:X _:someSubProperty "#x12" .
   _:someSubProperty   rdfs:range scm:integer .
   scm:integer rdfs:subClassOf xsd:integer .
   _:someSubProperty   rdfs:subPropertyOf _:someSuperProperty .
   _:someSuperProperty rdfs:range xsd:integer .

(note that Scheme integers support lexical representations
in various base notations, not just decimal)

and we have a query

   _:X _:someSuperProperty ?V .

which binds ?V to "#x12", implying the statement

   _:X _:someSuperProperty "#x12" .

and then an application attempts to interpret the literal "#x12"
in terms of the type defined for someSuperProperty by rdfs:range,
namely xsd:integer, it will get a parse error, as "#x12" is
not a member of the lexical space for xsd:integer.
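
A small Python sketch of that failure, under simplifying assumptions: the regular expression for xsd:integer covers an optional sign followed by decimal digits, and the Scheme parser handles only the "#x" hexadecimal prefix rather than full Scheme number syntax. The function names are hypothetical.

```python
import re

# Lexical space of xsd:integer: optional sign, then decimal digits.
XSD_INTEGER = re.compile(r"[+-]?[0-9]+")

# Simplified subset of Scheme integer syntax: hexadecimal with a "#x" prefix.
SCM_HEX = re.compile(r"#x([0-9a-fA-F]+)")


def parse_xsd_integer(lexical: str) -> int:
    """Parse a literal against the xsd:integer lexical space."""
    if not XSD_INTEGER.fullmatch(lexical):
        raise ValueError(f"{lexical!r} is not in the lexical space of xsd:integer")
    return int(lexical, 10)


def parse_scm_hex(lexical: str) -> int:
    """Parse a literal written in Scheme hexadecimal notation."""
    match = SCM_HEX.fullmatch(lexical)
    if not match:
        raise ValueError(f"{lexical!r} is not a Scheme hexadecimal integer")
    return int(match.group(1), 16)
```

With these definitions, parse_scm_hex("#x12") yields eighteen, while parse_xsd_integer("#x12") raises ValueError.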

Does that help make it a bit more obvious?

> >The rdfs:range property may function as either prescriptive
> >or descriptive, depending on the presence or absence of a local
> >type for the object of a statement.
> 
> Again, I fail to see the meaning of this distinction.

See discussion above, and please, anyone else feel free to
jump in here to explain this distinction better than I am,
as it's a significant distinction and if we don't all understand
it, we will not arrive at a reasonable solution.

> >In order for rdfs:range to function prescriptively, there must
> >be both:
> >a. a range value defined for the property of a statement
> >b. a local type defined for the object of the statement
> >
> >In the absence of a local type, and in the presence of a range
> >definition for a given property, the type of the object of a statement
> >is taken to be that defined as the range of the property.
> 
> And in the presence of a local type, it is taken to be the local 
> type, provided that is consistent with the range statement, right? 

It is taken as the local type, regardless of the range statement.

A statement is a statement is a statement, and whether that
statement is acceptable in a given context does not affect
the knowledge embodied in that statement.

If I say that "green" is of type xsd:lang, it may be wrong, but
the statement must be preserved, and the type that I give to
the literal must be taken into account in all processing.

The rdfs:range *constraint*, in the context of the presence of
a local data type, allows for one to determine the suitability
of such local typing, not whether the typing is defined at all.

See my table above showing the descriptive vs. prescriptive
application of rdfs:range based on the presence or absence
of a locally defined type.

> The inferences involved are the same in both cases: all the 
> information that can be obtained about the datatype of the literal, 
> by any means, local or global, is combined, provided it is 
> consistent. (If it isn't consistent, something is wrong. )

You are simply missing the critical distinction between
declaration and constraint.

These are not the same.

> >Query processes, while not explicitly defined by the RDF specification,
> >should be taken into account with regards to the representation and
> >interpretation of RDF encoded knowledge.
> >
> >Query processes which employ inference based on rdfs:subPropertyOf
> >relations may bind objects to predicates which are superordinate to
> >the predicate of the original statement.
> >
> >Query processes which employ inference based on rdfs:subClassOf
> >relations may bind literals to types which are superordinate to
> >the type originally defined for the literals.
> >
> >Query processes which bind a non-locally typed literal to a superordinate
> >predicate different from that of the original statement and which
> >may have a range defined which differs from the range defined
> >for the original predicate effectively separate the lexical form
> >embodied in that literal from the lexical space for which it was
> >originally expressed, rendering it uninterpretable in a reliable
> >manner.
> 
> Again,  that begs some important questions.

Yes, some *very* important questions. Namely, how do we preserve
the relation between a literal and its locally defined type, or
between an untyped literal and the range defined for the property
of the original statement of which the literal is the object?

This statement-centric model provides the basis for this,
and the above constraints ensure that this critical information 
is never lost.
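
One way to picture the constraint is a query that returns each literal together with the predicate of the statement it came from, so the original range is still available at interpretation time. This is a Python sketch only: the ex: names, the in-memory tables, and both functions are hypothetical, and only direct subproperty links are followed.

```python
# Hypothetical schema and data mirroring the hexInt example.
SUBPROPERTY_OF = {"ex:someSubProperty": "ex:someSuperProperty"}
RANGE = {
    "ex:someSubProperty": "foo:hexInt",
    "ex:someSuperProperty": "xsd:integer",
}
STATEMENTS = [("ex:X", "ex:someSubProperty", "12")]


def query(subject, predicate):
    """Yield (literal, original predicate) for matching statements.

    A statement matches if its predicate is the queried one or a direct
    subproperty of it. The original predicate travels with the binding.
    """
    for s, p, o in STATEMENTS:
        if s == subject and (p == predicate or SUBPROPERTY_OF.get(p) == predicate):
            yield o, p


def interpret(literal, original_predicate):
    """Interpret a literal using the range of its original predicate."""
    base = 16 if RANGE[original_predicate] == "foo:hexInt" else 10
    return int(literal, base)


bindings = list(query("ex:X", "ex:someSuperProperty"))
values = [interpret(lit, pred) for lit, pred in bindings]
```

Because the binding still names ex:someSubProperty, the literal "12" is read as hexadecimal even though the query was phrased against the superordinate property.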
 
And the same representation and mechanisms that provide for
"type safety" also provide for qualification of statements.

A pretty good bargain if you ask me (but of course I'm biased ;-)


> >The basis for the graph representation, and all operations and
> >interpretations, should be the explicit reification of the
> >statement.
> 
> NO!!  I refuse to have anything to do with a proposal that requires 
> global reification just to handle literals. It is unworkable, 
> impossibly baroque, incompatible with all known uses of RDF 
> (including DAML ) and with XML, and semantically confused.

Eh? I think you're having a "knee jerk" reaction here...

Are you telling me that one cannot derive the present resource-centric
graph representation from this model? 

Is not the foundation of the RDF conceptual model based on the
statement?

How is this model more baroque than the present graph model which
requires *two* representations for each statement just to reify
the statement, one that is resource centric and one that is
statement centric?

And it doesn't require global reification in the sense of reification
per the current graph model, which I agree would result in a grossly
baroque and obese graph. Nor is it just for handling typing of
literals: the same model addresses that, but also the (IMO critical)
issues of statement qualification (scope, source, authority, etc.),
which I'm sure are of great interest to the community at large.

And it is *NOT* incompatible with any existing RDF applications
as it is trivial to provide a logical resource-centric interpretation
of this model per the current graph model. I.e.


                    application
           ----------------------------
                resource-centric API
           ----------------------------
              statement-centric model
 
Thus, it is not getting in between the current RDF model and
current applications, but providing a foundation below the
current resource-centric graph "view" that provides a better
(IMMHO) basis for addressing the issues of data typing and
statement qualification.
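
The layering can be sketched in a few lines of Python. The class and method names are hypothetical; the sketch only shows that a resource-centric view can be derived on demand from a store whose primary unit is the statement.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, NamedTuple


class Statement(NamedTuple):
    sid: int        # identity of the statement itself
    subject: str
    predicate: str
    object: str


class StatementStore:
    """Statement-centric model: every statement is explicit and has an ID."""

    def __init__(self) -> None:
        self._statements: Dict[int, Statement] = {}

    def add(self, subject: str, predicate: str, obj: str) -> int:
        sid = len(self._statements) + 1
        self._statements[sid] = Statement(sid, subject, predicate, obj)
        return sid

    def __iter__(self) -> Iterator[Statement]:
        return iter(self._statements.values())


class ResourceView:
    """Resource-centric API derived from the underlying statement store."""

    def __init__(self, store: StatementStore) -> None:
        self._store = store

    def properties(self, resource: str) -> Dict[str, List[str]]:
        """Collect the arcs leaving a resource as predicate -> objects."""
        arcs = defaultdict(list)
        for st in self._store:
            if st.subject == resource:
                arcs[st.predicate].append(st.object)
        return dict(arcs)


store = StatementStore()
first_id = store.add("urn:foo", "urn:someProperty", "bar")
view = ResourceView(store)
```

An application written against ResourceView sees the familiar resource-centric arcs, while the statement IDs remain available underneath for anything that needs to refer to a statement as such.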

Finally, from the perspective of a software engineer who has to
make all this stuff work, it is *MUCH MORE* workable than the 
present model and provides the explicit mechanisms by which
disparate applications can have a standardized and portable
solution for interchange, query behavior, type integrity,
and even shared, distributed knowledge bases.

The resource-centric view of the present RDF model is useful
for humans, surely, and we can continue to think in terms of
that view, but a statement-centric model is IMO a much better
foundation for RDF to address the many important issues that 
it is presently faced with.

I hope that my examples for statement qualification and
graph compression bear that out.

> >An RDF graph should represent the statements which
> >constitute knowledge,
> 
> Quite.  Not statements that *describe* the statements that 
> represent knowledge.

The proposed graph model does not make statements, it represents
statements.

An SNode is not a statement about an RDF Statement, it is
the model of an RDF Statement.

> There is a well-known dodge referred to in Krep circles as 'escaping 
> to the metalevel'. When things get awkward, just *describe the 
> syntax* rather than trying to get the meaning straight.

That's *not* what this proposal does. Sorry. Nope. Read it again.

It simply inverts the explicit/implicit relation of the 
resource-centric view and statement-centric view. I.e.

It does not add an additional meta-level not already defined by
the RDF conceptual model for statement reification, it just adopts
reified statements as the key representation of knowledge.

> Syntax is 
> usually better-behaved than meanings, so it will be easier. However, 
> this doesn't solve the problems, it just takes out a kind of 
> intellectual loan. In order to be of actual inferential use, 
> something is going to have to figure out what to actually DO with the 
> expressions that you are now describing.

I believe I've addressed that with regards to qualification of
statements and constraints on query behavior.

The proposed model is precisely intended to allow us to more
easily figure out what to DO with the knowledge.
  
> >and the present RDF graph model should be
> >seen as a higher level resource-centric view or interpretation
> >of that underlying statement-centric graph.
> >
> >Thus, rather than the present graph representation:
> >
> >    [urn:foo] --- urn:someProperty ---> "bar"
> >
> >we should have instead, for every statement, a canonical
> >underlying representation as follows:
> >
> >       [ ]
> >        |
> > ...
> 
> I rest my case.

I don't see that you have a case. Not in terms of your
comments here.

That first example was an abstraction, and in fact is
what is embodied in the resource-centric representation.
And in fact is very similar to the knowledge embodied
in your P++ proposal! E.g.

 <urn:foo> urn:someProperty "bar" .

implies 

 [ nodeID "1"; label <urn:foo> ] 
    [ nodeID "2"; label urn:someProperty ] 
       [ nodeID "3"; label "bar" ] .

which expands into essentially the same abstraction
(apart from node types):

      [ ]
       |
       ---- ID ----------> 1
       |     
       ---- subject -----> [ ]
       |                    |
       |                    ------ ID ------> 2
       |                    | 
       |                    ------ label ---> <urn:foo>
       |
       ---- predicate ---> [ ]
       |                    |
       |                    ------ ID ------> 3
       |                    |    
       |                    ------ label ---> <urn:someProperty>
       |
       ---- object ------> [ ]
                            |
                            ------ ID ------> 4
                            |
                            ------ label ---> "bar"

Thus, we're really talking about comparable models, the
key difference being that in my view, the explicit
statement should be the basis for the model, rather than 
leaving it implicit in some resource-centric view.

The whole problem with the resource-centric representation
is that statements *are* implicit and therefore one cannot
qualify them.

I don't think you are being quite fair here in your dismissal
of this proposal. I don't think you have considered the full
implications of the resource-centric model with regards to
qualification of statements (after all, how can you qualify
something that either doesn't exist explicitly or requires
a secondary, additional representation that is redundant to
the resource-centric representation?!)

Perhaps you can explain, in terms of the current graph model,
how to address the many issues that I have identified. I've 
not yet seen any real solutions based on the current graph
model. At least this proposal *provides* solutions.

Regards,

Patrick

Received on Wednesday, 14 November 2001 05:06:20 UTC