Re: Datatyping: questions about TDL proposal from Jeremy Carroll on 2002-01-31 (w3c-rdfcore-wg@w3.org from January 2002)

From: Jeremy Carroll <jjc@hplb.hpl.hp.com>
Date: Thu, 31 Jan 2002 11:00:22 -0000
To: <w3c-rdfcore-wg@w3.org>
Message-ID: <CEECKEAMDAJDDEDGJNBEIEBICAAA.jjc@hpl.hp.com>
Thanks Pat for the detailed analysis.
This message expands significantly on what in the TDL document I glossed
over as "technical reasons".


Short version:
=============

I am not surprised that you were not happy with the TDL model theory.

The basic choice to use pairs rather than just (typed) values in the model
theory was due to limits of my technical competence. I tried and failed to
come up with an account that uses values.

As far as I know, no one actually advocates using pairs.

If you can come up with an account that addresses your Q/Cs about pairs,
using my hack (C16) for union datatypes, and otherwise resurrect the P++
model theory (I think), then please do. I will be relieved.

I provide some motivation for literal-value pairs in my response to your Q8
& Q9 below.
This is of the sort "I didn't want to use pairs, but I had to in order to
..."

My response to C15, in a separate e-mail, gives some idea of why I don't see
the use of literal-value pairs as disastrous, but merely an inconvenient
technicality.

Aside:
======

If I have understood your comments correctly you have not identified any new
fundamental mistakes with my MT (like the self-entailment bug). As such I
achieved my objective: giving an existence proof for there being at least
one formal framework for Patrick's pairing, PD framework. I see your
comments as listing other desirable characteristics that I failed to
achieve.

Detailed response:
==================

> Q1. Definition of TDL refers to a 'pairing'. Does that mean some kind
> of syntactic combination operation, or is this just a mathematical
> definition of some abstract entity? And what exactly is a 'datatype
> identity'?

For Patrick, as I understand it, the key idea is that datatyping is about
pairs.
Each pair being a string from the input document and the type with which to
interpret the string.
I find this pairing a little simplistic, in that in the syntactic idiom of
range constraints for example, many types can be applied. Thus if we wanted
to treat the pairing of a string and its type formally, we would have to
introduce a new anonymous type being the intersection of all the given
types. I considered this and rejected it.
I also considered and rejected suggesting to Patrick that the emphasis on
string-type pairs should be dropped. As a first approximation to what is
going on, I believe the string-type pairing idea helpful, particularly to
the reader who does not wish to understand all the details. I also believe
that the string-type pair is a very plausible implementation technique.

I think the document suffers stylistically because my key input was also a
pairing, but a different one! (The literal-value pair).

> Q2. In the figure immediately below, what is meant by 'internal
> value' and 'application value space'?

I will leave this one to Patrick.
Although see my (separately posted) response to C15, which I think touches
on similar issues.


> C[3.  Style comment. I suggest it would be better to not modify the
> definition of RDF interpretation, but to introduce a new notion of
> 'datatyped interpretation' or whatever. That would enable us to keep
> all the different notions of entailment straight. ]

Good point.
If we conclude that the doc is worth a major revision I will put this on the
to-do list.

> Q4. Terminology section refers to 'before' and 'after' datatyping. I
> have trouble understanding what this means. Do you have in mind that
> there is some kind of process which 'datatypes' an RDF graph? If so,
> what is the difference between the graph before and the graph after
> that operation? Or does this mean something  else altogether?

Yes the terminology is poor.
Try:
 For clarity, we use disjoint terminology for literals in the graph syntax,
and their interpretations and meanings both in the model theory's universe.
[ bullet pointed defns]
 The literal-value pairs occur in the model theory's universe. The intent is
that RDF applications may manipulate either or both of the Unicode string or
the typed value.


> C5. The literal-value pairs are welldefined but seem odd, since they
> pair a unicode string with a semantic value, ie a denotation. Is that
> really what you mean? If so, then a set of these things would be a
> datatype mapping, right?

This is all true.

> C6. "....A datatype class corresponds to its map, ie a set of pairs..."

> Well, OK, but this seems a very odd decision. First, the natural RDF
> object corresponding to a set of pairs is a property (extension), not
> a class. Second, while a property can of course have a class
> extension as well as its property extension, there isnt any implied
> connection between them in RDF, so if you treat a set of pairs as a
> class then that amounts to saying that the fact that it is a class of
> *pairs* is irrelevant to its behavior as a class. Third, there is in
> fact no way to specify in RDF that any particular class is a class of
> pairs; whereas if you had characterized this as a property, then the
> RDF semantic conditions imply that it has a property extension (if it
> is ever used in a triple).

At no point does TDL use a datatype as a property as in S-A. Therefore it
would be odd to include such properties in the model. I am wondering if you
wrote your comments as you were going through or on a second reading. It
feels to me as if in this para above you haven't really appreciated how I am
trying to consistently follow what I agree was an odd decision (C5). Your
later comments seem to have accepted this more.

> Q/C7. Interpretation.  "...the type information is checked by
> requiring this pair to be a member of each class associated with this
> node. "
> What does 'associated with this node' mean?? I think what you mean is
> 'each class which the denotation of the node is required to be a
> member of', right? (That is what a range constrai...sorry, assertion
> of a triple using rdfs:Range, would imply, for example.) If so, that
> is what the RDF MT says already. But notice that according to your
> convention about datatype classes, that says that the node labelled
> with the unicode string denotes a pair, not the value inside the
> pair. Is that really what you want it to say? That would mean that
> the for example the 'same' date written using different date formats
> are different dates, and so on. In fact, as far as identity is
> concerned, it means that any two values from any two distinct
> datatyping schemes are never the same value.

Yes, yes, yes.
This was intended to be consistent with your MT document in its usage of
rdfs:Range.
And yes, your date example illustrates a limitation of TDL (not shared by
S-A).

> Q8. That same paragraph refers to 'untyped Unicode nodes'. Does that
> imply that there are two kinds of Unicode node? If so, how are they
> distinguished in the syntax?

No. There are only the same old (untidy) literals that we have always had.
The untyped ones are just those that are not subject to any type
constraints.
So a straight triple with no constraints like:

_:a <foo> "fred".

The "fred" is an "untyped Unicode node" of this paragraph.
Once we apply an rdfs:range constraint to <foo>, then "fred" is no longer
untyped.

Similarly the two triples:

_:a <foo> _:b .
_:b <rdf:value> "bar" .

leave "bar" as untyped, but an rdf:type triple on _:b, along with the
semantics of <rdf:value> can apply a type to "bar".

I think this is an appropriate place to expand on the "technical reasons".
The TDL MT is intended to systematically follow an open world assumption on
both triples and types.

So when seeing either the single "fred" triple or the two "bar" triples,
without any other information, we want to have some set of interpretations.
When we also have a range constraint available, and we "know" the datatype
of the range constraint or we have a triple like

_:b <rdf:type> <xsd:string> .

(applying to the "bar" example), we wish to have a subset of those
interpretations.

Hence, as I saw it, the set of possible typed values corresponding to an
'untyped Unicode node' (e.g. "fred" in the single triple example) is
unbounded. As intelligible type information is added we monotonically reduce
the set of possible typed values. The cardinality of the set of
possibilities is unbounded in the untyped case and 1 or 0 in the typed case
(excepting union datatypes).

The literal-value pair was motivated partly because I wasn't prepared to
stomach a wholly unrestricted interpretation of 'untyped Unicode nodes'.
Thus the first component is always uniquely restricted and the second
component is restricted or not depending on type information.
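
A minimal sketch of this narrowing, in Python rather than model theory. It
assumes a datatype can be modelled as a partial lexical-to-value function
that returns None outside its lexical space; the function names and the
simplified xsd:integer pattern are illustrative and are not part of TDL.

```python
import re
from typing import Callable, Iterable, Optional, Set

# Assumption: a datatype is a partial function from Unicode strings to values.
Datatype = Callable[[str], Optional[object]]


def xsd_string(s: str) -> Optional[object]:
    """Every string is in the lexical space and maps to itself."""
    return s


def xsd_integer(s: str) -> Optional[object]:
    """Simplified decimal integer mapping; None if s is not in the lexical space."""
    return int(s) if re.fullmatch(r"[+-]?[0-9]+", s) else None


def possible_values(label: str, types: Iterable[Datatype]) -> Optional[Set[object]]:
    """Second components x such that <label, x> lies in every datatype map.

    Returns None for the untyped case, where x is unrestricted.
    """
    values = [t(label) for t in types]
    if not values:
        return None  # untyped Unicode node: the set of possible x is unbounded
    first = values[0]
    if any(v is None or v != first for v in values):
        return set()  # the type constraints cannot all be satisfied
    return {first}


print(possible_values("fred", []))                         # None
print(possible_values("fred", [xsd_string]))               # {'fred'}
print(possible_values("fred", [xsd_string, xsd_integer]))  # set()
```

The first component of the pair is the label throughout; only the second
component is narrowed, and adding a type never enlarges the result.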


> Q9. In section 3.1 example 1, the figure has this new kind of (green
> hexagonal) node in it. What is this thing, exactly? (Is it an
> extension to the RDF graph syntax? Or some kind of external addition
> to the graph?? Or what?  If it is just an annotation, then this is
> one of the old proposals (called DC in
> http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2001Nov/0295.html);
> but it reinterprets the xsd: classes in an odd way that makes the
> assertions wrong, for some reason best known to you guys :-)

It is just an annotation and is not in the syntax.
This idiom is the DC idiom.

I understand the "makes the assertions wrong" part of the question as:
"But why on earth are you interpreting the bnode as a pair rather than
simply the typed value?"

So answering my (possibly incorrect) rephrasing of your question.

1] I wanted a monotonic semantics. What does

aaa eg:prop _:x .
_:x rdf:value "10" .

mean without an rdf:type?

What does it mean with both an eg:decimalInteger and an eg:octalInteger type
(assuming we have both available)?

aaa eg:prop _:x .
_:x rdf:value "10" .
_:x rdf:type eg:decimalInteger .
_:x rdf:type eg:octalInteger .

2] (variant of [1])

I wouldn't want to rule out multiple types, because sometimes it works.

3] Even in the daml idiom

aaa eg:prop _:x .
_:x rdf:value "10" .
_:x rdf:type xsd:integer .

How are we to know that the rdf:value needs to use the xsd:integer mapping
function, and not the eg:octal mapping function? Both would associate an
integer to _:x, so either would satisfy

_:x rdf:type xsd:integer .

if we understand that as merely operating in the value space.

4] All of the issues [1], [2], [3] can be replicated for examples based around

aaa eg:prop "10" .

and using range constraints on eg:prop.
This is particularly important when we consider [2].
If we have a data document, a schema document, and a second alternative
schema document all independently authored, all three documents may be
usable together (consistent) despite a lack of direct collaboration. The
ability to ignore minor unimportant variations in range constraints seems
desirable. (And the ability to distinguish between important and unimportant
variations in range constraints).
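
Points [1] to [3] can be made concrete with a small Python sketch. The two
mapping functions below are hypothetical stand-ins for eg:decimalInteger (or
xsd:integer) and eg:octalInteger, and the value-space check is a deliberately
naive reading of "rdf:type xsd:integer" that looks only at values.

```python
import re
from typing import Callable, Optional, Tuple


def decimal_integer(s: str) -> Optional[int]:
    """Hypothetical base-10 mapping (unsigned, for brevity)."""
    return int(s, 10) if re.fullmatch(r"[0-9]+", s) else None


def octal_integer(s: str) -> Optional[int]:
    """Hypothetical base-8 mapping."""
    return int(s, 8) if re.fullmatch(r"[0-7]+", s) else None


def in_integer_value_space(x: object) -> bool:
    """Value-only reading of the type triple: is x an integer at all?"""
    return isinstance(x, int) and not isinstance(x, bool)


def in_map(pair: Tuple[str, object], datatype: Callable[[str], Optional[int]]) -> bool:
    """Pair reading: is <label, x> a member of the datatype's map?"""
    label, x = pair
    v = datatype(label)
    return v is not None and v == x


# [3] Value-only reading: both readings of "10" pass the type check.
print(in_integer_value_space(decimal_integer("10")))  # True (value 10)
print(in_integer_value_space(octal_integer("10")))    # True (value 8)

# Pair reading: only the decimal pair is in the decimal map.
print(in_map(("10", 10), decimal_integer))  # True
print(in_map(("10", 8), decimal_integer))   # False

# [2] Multiple types sometimes work: "7" means 7 under both mappings.
print(in_map(("7", 7), decimal_integer) and in_map(("7", 7), octal_integer))  # True
```

Under the value-only reading nothing selects between 10 and 8; under the
pair reading the label travels with the value, so the type triple does.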


I believe that the model theory I produced is monotonic and does cover the
whole range of syntactic possibilities, not just the recommended idiom. I
have not understood how Peter's work or your (unfinished) P++ work address
[1], [2], [3], [4] above.


> Q/C10. Model Theoretic Interpretation of local idiom. "...Hence x is
> the integer 30."  OK, but what this  graph asserts is that the age of
> Bob is the pair <"30",30>, right? Not that the age of Bob is 30. (If
> that is wrong, how do the interpretation rules for the Bob ex:age ...
> triple manage to extract the second item in the denoted pair?)

Yes, you have understood. I always use the literal-value pair, everywhere.
There are no bare values in the model, there are no bare literals in the
model.
The quoted phrase "Hence ..." was intended less formally, and would perhaps
have been better had it mentioned an RDF application treating x as the value
30.

> Also, if what you say about rdf:value is correct, then since the
> unicode node has to have the same denotation as the blank node, and
> since that denotation has to be in the class xsd:integer, it has to
> be a pair; so the unicode node itself has to denote a pair.  And the
> first item of that pair is the unicode string itself, right?


Yes.

> C.11  Relevant to the above: look, you don't need to have *pairs* in
> the class. They are just getting in the way. If the xsd:integer class
> were the value space of the datatype (and if rdf:value were identity)
> then this idiom would work just fine, and Bob's age would be 30.  You
> do need some semantic constraint to interpret the unicode strings
> properly, but then you need that anyway. The pairs don't seem to help
> any.

I agree the pairs are inconvenient.
In your view, how do we *not* get Bob's age to be 24 with an octal reading
of the string "30"?

> Q12. In section 3.2, global idiom: "Per the following, the lexical
> form "30" is required to be a member of the lexical space of the
> datatype xsd:integer". HOW? I really don't see how this works. Since
> xsd:integer denotes a set of pairs, the range of ex:age must be a set
> of pairs, so whatever the unicode node denotes must be a pair. But
> you have it marked as 'representing a value'; and you also say that
> the lexical form is thereby required to be a member of a lexical
> space. As far as I can see, this is saying the following. The unicode
> string denotes a pair <a,x> consisting of a unicode string and a
> datatype value (eg <"13", 13>, which is a member of the extension of
> the xsd:integer datatype mapping), and it thereby 'represents' a
> value, and also simultaneously is 'required' to be a member of a
> lexical space. BUt it can't be all three of "13", 13 and <"13",13> at
> the same time, right?

On the graph there is the literal label "13".
In the model theoretic interpretation there is the pair <"13",13>.
The application may well decide to only use the second component (13) of the
pair from the model theoretic interpretation.

In this sense it is all three at the same time.

Now, I will try to explain HOW the constraint of the string to the lexical
space happens.
This supplements the subsection entitled "Model Theoretic Interpretation of
Global Idiom".

The text '"30" is required to be a member of the lexical space' was
informal, intended for RDF/XML document authors. Clarifying what the
'requirement' is may be as follows:

 "When using this idiom you are required to use a string that is in the
lexical space of the datatype, [ or else your document will be RDFS
inconsistent ]"

I take you to be interested in the last bit about RDFS inconsistency.
I will use an example, in which instead of "30" we will have "foo" as the
literal label.

The interpretation of "foo" is required to be a pair < "foo", x > for some
x.

Using the schema closure rules from RDF model theory we effectively have a
triple of the form:

"foo" rdf:type xsd:integer .

(the literal node "foo" is intended as the very same literal node that is
the object of the ex:age triple. The schema rules as I understand them are
all actually constraints  concerning IEXT, but I find them much more
difficult to state)

This datatype class membership, as always in the TDL MT, is understood as
being about the pair being in the map. There are no pairs < "foo", x > in
the map of the datatype xsd:integer, and hence whatever x we choose the
interpretation does not satisfy the schema constraints.

Thus:
  For RDF (without schema), the two triples are consistent, with
interpretations of the "foo" node being < "foo", x > for arbitrary x
   Whereas for RDFS, the schema processing makes the two triples
inconsistent.


As I understand it, the lexical space of xsd:integer is defined as the
domain of the map.
The example above shows, at least when we have RDF schema processing, the
two triples are inconsistent unless the label of the literal node is in the
domain of the map of the datatype. Hence the label is constrained to be in
the lexical space of the datatype.
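
The "foo" example can be sketched in Python as follows. This is only an
illustration under stated assumptions: triples are plain tuples, the
DATATYPES registry and both function names are hypothetical, and the check
covers lexical-space membership only, not agreement between several types.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]


def xsd_integer(s: str) -> Optional[int]:
    """Simplified xsd:integer map; None when s is outside its domain."""
    return int(s) if re.fullmatch(r"[+-]?[0-9]+", s) else None


# Hypothetical registry of the datatype maps we "know".
DATATYPES: Dict[str, Callable[[str], Optional[object]]] = {
    "xsd:integer": xsd_integer,
}


def implied_types(schema: List[Triple], prop: str) -> List[str]:
    """Classes that the rdfs:range rule attaches to objects of prop."""
    return [o for (s, p, o) in schema if s == prop and p == "rdfs:range"]


def literal_consistent(schema: List[Triple], prop: str, label: str,
                       use_schema: bool) -> bool:
    """Is there some x with <label, x> in the map of every implied datatype?"""
    if not use_schema:
        return True  # plain RDF: <label, x> for arbitrary x will do
    for name in implied_types(schema, prop):
        datatype = DATATYPES.get(name)
        if datatype is not None and datatype(label) is None:
            return False  # label is not in the domain of the map
    return True


schema = [("ex:age", "rdfs:range", "xsd:integer")]
print(literal_consistent(schema, "ex:age", "foo", use_schema=False))  # True
print(literal_consistent(schema, "ex:age", "foo", use_schema=True))   # False
print(literal_consistent(schema, "ex:age", "30", use_schema=True))    # True
```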

> (Also, re.  'required': what if the  unicode string is NOT a member
> of the lexical space of the asserted datatype? Is that just an
> inconsistency?)
Yes.

> C13. (Following on from C11). It seems clear from the diagram that
> what this RDF is supposed to mean is that the range of ex:age is
> integers (according to xsd:integer, ie the value space), and that the
> age of Bob is 30. What's wrong with just declaring that that is
> indeed what it does mean? See
> http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2001Nov/0011.html.
> The problem with the old P(++) proposals was the nasty flaw detected
> by Patrick , which is that a super-datatype-class of a datatype class
> might have a different lexical-to-value mapping. But your 'pairs'
> proposal has exactly the same problem. In fact it is worse, since if
> the lexical-to-value mapping is different, then the datatypes are not
> even subclasses of one another,  with your convention; so there isn't
> even any way to *say* that one datatype is 'sub' another. (Of course,
> just refusing to say that, say, xxsd:octal is a subclass of
> xxsd:number, or whatever the example is that screws up datatype
> inheritance, was always an option in the old P(++)-style proposal as
> well. )


Ummm, no.

I did look carefully at your referenced message, it's still in my browser
cache!

As indicated under Q9 I saw no way of keeping a model that interprets the
strings solely in the value space while addressing the inconsistent
lexical=>value mappings problem.

I don't think the inconsistent mappings problem is restricted to subclasses,
it occurs whenever two datatypes have overlapping value spaces.

I believe my solution is consistent with XML Schema Datatypes class
hierarchy which I understand as being essentially the subset hierarchy on
the mappings.

i.e. if A is an xsd subtype of B

then
    A.map is subset of B.map

If I have misunderstood that aspect of XML Schema Datatypes, the TDL model
theory will at least need a revision.
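
A rough Python check of the "A.map is subset of B.map" condition, again
treating a datatype as a partial lexical-to-value function. The three
mappings are simplified stand-ins (the octal one is hypothetical, and the
nonNegativeInteger pattern ignores forms such as "-0"). A finite sample can
refute the subset claim but cannot prove it.

```python
import re
from typing import Callable, Iterable, Optional

Datatype = Callable[[str], Optional[int]]


def integer(s: str) -> Optional[int]:
    return int(s) if re.fullmatch(r"[+-]?[0-9]+", s) else None


def non_negative_integer(s: str) -> Optional[int]:
    return int(s) if re.fullmatch(r"\+?[0-9]+", s) else None


def octal(s: str) -> Optional[int]:
    return int(s, 8) if re.fullmatch(r"[0-7]+", s) else None


def map_is_subset(a: Datatype, b: Datatype, sample: Iterable[str]) -> bool:
    """True if every sampled pair <s, a(s)> in a's map is also in b's map."""
    for s in sample:
        value = a(s)
        if value is not None and b(s) != value:
            return False
    return True


sample = ["0", "7", "10", "30", "-5", "foo"]
print(map_is_subset(non_negative_integer, integer, sample))  # True
print(map_is_subset(integer, non_negative_integer, sample))  # False ("-5")
print(map_is_subset(octal, integer, sample))                 # False ("10")
```

The octal case fails even though its value space lies inside the integers,
which is the inconsistent-mappings situation discussed above.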

> C14. The paragraph "Whether the rdfs:range statement....property in
> question." isn't going to work in ANY model theory, unless we
> effectively redefine RDF syntax to provide some way to distinguish
> local from global. The MT has to be defined on triples, not on
> triples in some kind of undefined 'context'. (How far out do we have
> to look in the graph, or on the web, to see if there is a 'more
> local' assertion?)

This paragraph is again intended for RDF/XML document & RDF schema authors.

If a schema author uses a range constraint there has been a long standing
discussion as to whether this is "a constraint" that documents must satisfy
or a means for generating implicit triples. Of course, this long standing
discussion is groundless. However, there is a community that expects clarity
on this issue.

For you, Pat, I will deconstruct the informal text, in model theoretic
terms. However, I do not think it appropriate that our documents should be
targeted at model theoreticians alone. The TDL document has clearly
identified sections with model theory; the other sections are, and are
intended to be, less formal. I have proof-read them to check that there is (IMO)
a consistent reading with the model theory.

"Whether the rdfs:range statement constitutes a constraint on the allowed
datatypes depends on whether there exists any local (explicit) type
assignment."
Model theoretically when both global range constraints and local type
triples are present in a graph both apply.
"If there is no local typing for the literal value whatsoever, then
rdfs:range can only serve as a global (implicit) type assignment."
If there is no local typing, only global typing applies, and a contradiction
based on two types clashing is not possible. (This doesn't exclude a
contradiction because the unicode string is not in the lexical space of the
type).
"However, if the literal has one or more types defined locally, and any
locally specified datatype is not compatible with all datatypes globally
implied by rdfs:range for the property, one can treat such a case as a
contradition to a constraint on the expected or required datatype(s) for the
property in question."
When both local and global types are given, a possible contradiction is that
the intersection of the mappings of all applicable types is empty, (or does
not include the unicode string in its domain).
Consider when:
 - all the triples apart from the local type triple are consistent
 - all the triples including the local type triple are inconsistent.
This constrains the local type in the sense that a different choice of local
type by the RDF/XML document author would result in a consistent document.



> C15. The list of Satisfactions looks good, but omits the one rather
> central one which I guess people didnt think to write out explicitly:
> that the idiom used actually means what it ought to mean.

I understand this as being a plea to use values rather than literal-value pairs
in the model.
See separate response.

> C16. Neat hack for almost handling union datatypes, ie ignore the
> part that gives all the trouble. If I can use that same hack, I can
> do them too :-)

Please do. Notice this is part of the systematic monotonicity of TDL model
theory.


See ya

Jeremy
Received on Thursday, 31 January 2002 06:00:58 UTC