Re: Tidy/untidy: that's all about assumptions, folks

[Patrick Stickler, Nokia/Finland, (+358 40) 801 9690, patrick.stickler@nokia.com]


----- Original Message ----- 
From: "ext Sergey Melnik" <melnik@db.stanford.edu>
To: "RDF Core" <w3c-rdfcore-wg@w3.org>
Sent: 25 September, 2002 20:00
Subject: Tidy/untidy: that's all about assumptions, folks


> 
> In the heat of our argument about tidiness, we seem to be forgetting 
> about a critical assumption that was suggested to justify untidy 
> literals. Below, I'm questioning this assumption. If it holds, then 
> untidy literals are a natural decision to make (and I voted for it last 
> time), if it does not, there is no sufficient justification for 
> introducing untidiness. 

Fair enough.

> What I'm arguing for is that we simply have to 
> remove the prism we've been looking through recently, and untidiness 
> goes away.

Well, as my comments below will suggest, I think it is the tidy
view that is looking at things through a prism, based on the 
unfounded presumption or expectation of canonical lexical forms.

> Recall the motivating example from the RDF 1.0 Spec:
> 
> foo dc:Creator "John Smith"
> 
> Is "John Smith" supposed to represent a person or a string? 

From a data markup perspective, we can say that the form of
the expression is an ambiguous name, since it is not a URIref.

From a knowledge representation perspective, I think it's pretty
intuitive that we're talking about a person in the real world,
and not a string.

If RDF is for data markup, then fine, let's say the meaning of
the object is a string. But then, why not just use XML...

But if, as I believe is the case, RDF is for knowledge representation,
then it's rather odd (to use a polite word ;-) to consider that
the object denotes anything other than some thing in the real
world. We simply need to provide the machinery so that this can
be made clear in the RDF itself. 

> The key 
> argument behind untidiness is that "John Smith" (or "10") cannot 
> possibly be meant to be a string, so it has to be something else, whose 
> meaning can be deduced using a right bit of logic and AI.

Right, but the basis for that argument is not really that the literal
cannot possibly mean a string, but that given the role and purpose
of RDF as a language for making statements about the world, it is
rather bizarre for it to mean a string. It is a string in the
RDF/XML because we can't write the integer value ten or the person
John Smith in XML, but surely it is the things in the world that
are meant and not some aspects relating to the form of expression.

> Or, consider our beaten up
> 
> :x age "10"
> :y shoeSize "10"
> 
> Again, the claim of proponents of untidiness is that "10" cannot 
> possibly be meant to denote a string, in both cases. Why? Because we can 
> infer
> 
> :x age :z
> :y shoeSize :z

Well, I think the better example in this case is

  :x title "10"      (string)
  :y age "10"        (integer)

and further

  :z payday "10"     (monthday)
  :q model "10"      (token)

etc.

where the ultimate interpretation conflicts with the 
string equality tests.

> supposedly meaning that the age of :x is the shoeSize of :y. Ok, we have 
> datatyping now, so let's do it right:
> 
> :x age int"10"
> :y shoeSize int"10"
> 
> Now we got it! int"10" is not a string now; it's what we want it to 
> mean: an integer. Damn. The entailment
> 
> :x age :z
> :y shoeSize :z
> 
> still holds...

Er, why is it a problem that it holds? It should hold. But given

  :x title <xsd:string>"10"
  :y age <xsd:integer>"10"
  :z payday <xsd:gMonthDay>"10"
  :q model <xsd:token>"10"

then we can be very happy that the following entailment does *not* hold

  :x title :a
  :y age :a
  :z payday :a
  :q model :a

Also, what will the impact be on applications expecting tidy semantics
when

  :x age "10"
  :y shoeSize "010"

does *not* entail

  :x age :z
  :y shoeSize :z

???
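
To make the two cases above concrete, here is a rough sketch in Python.
It covers only two of the four datatypes, and the L2V table and the
value_of helper are hypothetical names standing in for whatever
lexical-to-value mapping a datatype actually defines; nothing here is
taken from any RDF API.

```python
# Hypothetical lexical-to-value mappings, keyed by datatype name.
# Only two datatypes are sketched; real XML Schema mappings are richer.
L2V = {
    "xsd:string": str,
    "xsd:integer": int,
}

def value_of(datatype: str, lexical: str):
    """Interpret a lexical form under the given datatype."""
    return L2V[datatype](lexical)

title = value_of("xsd:string", "10")    # the string "10"
age = value_of("xsd:integer", "10")     # the integer ten
shoe = value_of("xsd:integer", "010")   # also the integer ten

# Same lexical form, different datatypes: the values differ.
title_equals_age = (title == age)

# Different lexical forms, same datatype: the values coincide.
age_equals_shoe = (age == shoe)

print(title_equals_age, age_equals_shoe)
```

A plain string comparison gets both cases backwards: it equates the
title with the age, and it separates "10" from "010".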


> Is there something wrong with the above modeling practice? Should 
> int"10" itself be considered untidy, like those untyped literals?
> Are all those folks who chose the above modeling style dumb? NO, they 
> are not. Above, the properties age and shoeSize are merely used to 
> restrict the valid interpretations of :x and :y. There is no claim that 
> shoeSize is a property that holds between shoes and "shoe sizes". It's a 
> property that holds between shoes and integers, thereby restricting the 
> interpretation of :y. 

Sure. I'm not sure anyone was assuming otherwise.

> Just as well, shoeSize could be defined as a 
> property that holds between shoes and strings/reals/etc.

Well, one problem with saying that the interpretation is based on
a property that holds between resources and lexical representations
is that there is not, and IMO can never be, any restriction against
non-canonical lexical forms. Therefore, even if you were to take a
tidy approach where inline literals denote themselves, you would 
*still* have to evaluate cases such as

   :x :p "10"
   :x :p "10.0"
   :x :p "010"
   :x :p "010.0"
   etc.

in terms of a lexical to value mapping to determine actual equality
of the objects. Thus, the benefit of equality tests on tidy literals
is an illusion that has no guarantee in the real world. Folks will
be dumping lexical representations as they exist in various auxiliary
systems, and not normalizing them to any canonical representations.
And to expect that one will always encounter the literal "10" when
the integer ten is meant is a fantasy. The literal "10" will probably
be the most common lexical representation for ten, but there is *no*
guarantee that it will be the only lexical representation for ten, and
if we are to support and respect XML Schema datatypes, which allow for
such non-canonical representations, we must accept this reality.

At the end of the day, if some application wants to be absolutely sure
that two lexical representations denote the same thing, it must
evaluate them in terms of either a non-canonical to canonical mapping
or a lexical to value mapping of some datatype.

Just saying the literals are tidy doesn't do it. You still have to
deal with synonymous variants.
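
A minimal sketch of that point in Python, assuming a decimal-like
datatype whose lexical space admits all four forms above. The to_value
helper is a hypothetical stand-in for the datatype's lexical-to-value
mapping, here approximated with Python's decimal module.

```python
from decimal import Decimal

# The lexical forms from the example above; all are meant to denote ten.
forms = ["10", "10.0", "010", "010.0"]

def to_value(lexical: str) -> Decimal:
    """Hypothetical lexical-to-value mapping for a decimal-like datatype."""
    return Decimal(lexical)

# Comparing the literals as strings: every form is distinct.
distinct_strings = len(set(forms))

# Comparing by value: every form collapses to the same number.
distinct_values = len({to_value(f) for f in forms})

print(distinct_strings, distinct_values)
```

The string test sees four different objects where the value test sees
one, so an application relying on tidy string equality alone would miss
three of the four matches.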

> An overwhelming majority of applications use exactly this metaphor. For 
> example, look at the AdobeXMP documentation, where the range of
> xapDynA:Volume is defined to be a Real. Did those folks want to assert 
> that the abstract concept of volume coincides with real numbers? No. Or, 
> what about CC/PP's
> 
> :x displayWidth int"640"  ?
> 
> After all, display width is not measured in integers, but in inches or 
> centimeters...
> 
> My conclusion is that it is not necessary to claim that "John Smith" 
> represents a person (and call for untidy literals), in order to achieve 
> correct modeling. And, by no means have applications and APIs to be 
> changed to reflect this "insight". The applications, and their 
> developers, possess a consistent conceptual model of what dc:Creator or 
> age or shoeSize mean. These apps run just fine. For the lack of 
> conceptual necessity of "thinking untidy" I'm suggesting: don't touch 
> running systems.

If all we cared about were individual closed systems, which just happened 
to use some common API as a convenience, fine, then who cares about
where the interpretation of literals happens.

However, if we are concerned about the interchange of knowledge
between disparate systems, which may or may not have the same
internal implicit assumptions about the interpretation of literals,
then it's a very big deal whether or not the standard MT for
interchange, the RDF MT, specifies what their interpretation is
in a portable, consistent manner.

I believe that the primary purpose of RDF is as a language for
interchange of knowledge (not just structured markup), and as such, 
the more explicitly that meaning can be expressed in that language 
the better. 

Thus, rather than saying that, in the case of 

   :x displayWidth "640" .

the meaning of "640" is some string, I'd rather see a schema that
explicitly states what the interpretation of that string is, e.g.

   displayWidth rdfs:range xsd:integer .
   displayWidth x:unitOfMeasure foo:inch .

etc. etc.

This says that the object denotes some integer value which
is a magnitude of a particular unit of measure, inches.
Now *that* is knowledge that is useful for one system to
tell another.

And surely we don't want to define complex labeled nodes that
embody all that information regarding the interpretation
of values in each occurrence of a value itself! E.g.

((rdf:type,xsd:integer)(x:unitOfMeasure,foo:inch))"640"
((rdf:type,xsd:integer)(x:unitOfMeasure,foo:inch))"100"
((rdf:type,xsd:integer)(x:unitOfMeasure,foo:inch))"30"
((rdf:type,xsd:integer)(x:unitOfMeasure,foo:inch))"4800"
((rdf:type,xsd:integer)(x:unitOfMeasure,foo:inch))"10"
...

How silly.

Rather, the longstanding and accepted best practice is to capture 
the general information at a higher level (the property) and
express only that portion which must be expressed for each
occurrence (the lexical form). This is just plain good design.
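
As a rough illustration of that design, here is a Python sketch in
which the datatype and unit are recorded once, against the property,
and each occurrence carries only its lexical form. SCHEMA, PARSERS,
PropertyInfo and interpret are hypothetical names invented for this
example, and x:unitOfMeasure is modelled as a plain field.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PropertyInfo:
    """What the schema says once about a property (hypothetical shape)."""
    range_datatype: str   # stands in for rdfs:range
    unit: str             # stands in for x:unitOfMeasure

# Stated once, at the property level.
SCHEMA = {"displayWidth": PropertyInfo("xsd:integer", "foo:inch")}

# Hypothetical lexical-to-value mappings, keyed by datatype name.
PARSERS = {"xsd:integer": int}

def interpret(prop: str, lexical: str):
    """Interpret a bare lexical form using the property's schema entry."""
    info = SCHEMA[prop]
    return PARSERS[info.range_datatype](lexical), info.unit

# Each occurrence supplies only the lexical form.
widths = [interpret("displayWidth", s)
          for s in ["640", "100", "30", "4800", "10"]]

print(widths[0])
```

Every value is read as an integer number of inches, yet none of the
five occurrences has to repeat that information.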

Choosing a model for datatyping which encourages ambiguity and
leaves interpretations implicit and system-specific seems to me
to be contrary to the very purpose of RDF.

We already have a standard for structured markup, XML. We don't
need another. RDF is about saying things about the world, and
having the RDF MT assign meaning to literals reflecting the
form of expression rather than their intended denotation in
the world weakens the language and hinders the explicit
and unambiguous interchange of knowledge on the SW.

Patrick

Received on Thursday, 26 September 2002 03:59:18 UTC