Re: On equivalence of tidy/untidy (was: Re: Reopening tidy/untidy decision) from Frank Manola on 2002-10-01 (w3c-rdfcore-wg@w3.org from September 2002)

From: Frank Manola <fmanola@mitre.org>
Date: Mon, 30 Sep 2002 21:11:59 -0400
To: Sergey Melnik <melnik@db.stanford.edu>
CC: Brian McBride <bwm@hplb.hpl.hp.com>, Patrick Stickler <patrick.stickler@nokia.com>, w3c-rdfcore-wg@w3.org, ext Eric Miller <em@w3.org>
Message-ID: <3D98F65F.1090804@mitre.org>
Sergey Melnik wrote:

> Summary: it seems that tidy/untidy is an implementation detail...


Sergey--

I don't think your "tidy/untidy" distinction is the one we're talking 
about.  More below.


> 
> Frank Manola wrote:
> 
>> Sergey--
>>
>> I'd like to see some further discussion of points (a) and (3) you're 
>> making here, since I think that, while they are key points, I don't 
>> feel that they are entirely "substantiated" (at least not yet to my 
>> satisfaction), and I'd like some more details.  So adding this stuff 
>> to the document is great. I don't feel the same about point (b) 
>> because I agree with it, but I don't think it matters that much.  I 
>> don't think anyone has claimed that, via specifying a datatype like 
>> integer for a value, you are going to capture all the application 
>> semantics that are associated with the use of that value in a 
>> property, and hence automatically forbid things like comparing ages 
>> and shoe sizes.
> 
> 
> 
> My impression was that key arguments for untidiness built on the 
> assumption that using strings as ranges of properties such as dc:Creator 
> or :age was unacceptable, and had to be effectively forbidden by 
> treating untyped literals as a kind of labeled existential variables. 
> All I wanted to clarify is that doing so simply elevates the problem of 
> heterogeneity one level higher, and does not help applications to 
> interoperate.


I believe that some of Patrick's arguments in favor of untidiness tended 
to go too far in this direction, but I think your argument here is too 
"binary".  That is, it seems to say that providing *some* additional 
information in the form of a datatype doesn't help interoperability; 
you must either provide *all* the semantics (a never-ending task), or 
you do no good.  I don't agree with this.  While the information that a 
given literal value is intended to be an integer doesn't help me 
distinguish between an integer age value and an integer shoesize value 
(ignoring the property names in this case), it *does* tell the 
application that you intend for the value to be operated on in certain 
ways (based on the semantics of the integer datatype) as opposed to 
others, information that wouldn't be available if you just gave the 
application a string.  Strings can be useful too, of course, and 
applications can be designed to interpret them as integers.  The point 
is simply that a datatype provides added information as to your intent. 
  However, this doesn't really get at the key issue.  See below.
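
To make the point about intent concrete, here is a minimal sketch. The two helper functions are hypothetical and not part of any RDF API; they only show that the same pair of literals behaves differently depending on whether the application is told to treat them as strings or as integers.

```python
# Hypothetical helpers for illustration only; not part of any RDF API.

def less_than_as_strings(a: str, b: str) -> bool:
    # Lexicographic comparison: all an application can do if it is
    # handed plain strings with no datatype information.
    return a < b


def less_than_as_integers(a: str, b: str) -> bool:
    # Numeric comparison: the kind of operation an integer datatype
    # tells the application is intended.
    return int(a) < int(b)


print(less_than_as_strings("10", "9"))   # True: "1" sorts before "9"
print(less_than_as_integers("10", "9"))  # False: 10 is not less than 9
```

Neither function knows whether "10" is an age or a shoe size; the datatype only narrows which operations make sense.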


> 
>> If you want to go to additional lengths to further specify the types 
>> (like defining types for age and shoe size, as some people would do), 
>> you can further constrain the interpretations, but clearly most people 
>> draw the line somewhere.  Not to mention the fact that you might not 
>> want to preclude yourself from doing some data mining type of 
>> operation that you hadn't thought of when you designed the type system 
>> that involves comparing people's ages and shoe sizes [this gets into 
>> my point about wanting different comparison operators, which I'll not 
>> get into here].  It seems to me the point we're trying to address here 
>> is somewhat simpler:  we've now introduced a datatype facility into 
>> RDF, where literals can be typed in several ways.  The question is 
>> (unless I'm mistaken), how does *RDF* interpret those literals that 
>> haven't been explicitly assigned a datatype by one of these 
>> mechanisms?  Do we say they have an implicit datatype of some sort (or 
>> have a fixed interpretation in some other way), or do we say they are 
>> the lexical things we talk about in the datatype facility, but we 
>> don't know what type they are?  Either way, applications are going to 
>> associate additional semantics with the values they get from RDF, and 
>> RDF won't know anything about those semantics.
> 
> 
> 
> I absolutely agree with your conclusion. I think part of the problem is 
> that "RDF" does not interpret anything ;) Now, seriously, imagine that 
> there is an application layer that is common to every RDF application 
> (this is where "RDF" interpretation kicks in). This layer is capable of 
> parsing RDF/XML documents into graphs, and provides a set of routines 
> for traversing and updating the graphs. (This is, I guess, a rough 
> characterization of what "RDF APIs" currently do). This "API" layer has 
> no schema support, knows nothing about rules, and has no built-in 
> semantics of any RDF properties.
> 
> As you formulated the question above, we are talking about two ways of 
> implementing this API layer. In one case, all occurrences of an untyped 
> literal having the same string content map to one graph node, in the 
> other case, each occurrence results in a separate node. These separate 
> nodes have internal structure: they contain a single string label. 
> Notice that even if they contain say some system IDs in a concrete 
> implementation, these IDs are supposed to be transparent to applications 
> and the layer itself: each such ID can be replaced by another unique ID 
> without change in semantics.
> 
> The funny thing is that both ways of dealing with the untyped literals 
> sketched above are isomorphic. In more formal terms, the information 
> capacity of each of the two data models is equivalent. That is, there is 
> a bijective function between the set of "tidy" graphs and the set of 
> "untidy" graphs. In fact, each edge of an untidy graph (s, p, o), where 
> o is an untidy literal, can be mapped to an edge (s, p, 
> stringValueOf(o)) of a tidy graph. A reverse mapping takes (s, p, o) as 
> input and creates (s, p, uniqueUntidy(o)) for each untyped o.
> 
> The above effectively proves that each conceivable application that 
> assumes untidy (or tidy) semantics behaves equivalently if we change the 
> graph semantics to tidy (or untidy) and plug in an intermediate 
> "conversion" layer between the application and the original untidy (or 
> tidy) API layer. That is, "RDF" does not care about (un)tidiness. 
> Consider the following "Melnik" test (modestly called after Turing test):
> 
> Given: an application X that communicates with the external world using 
> RDF/XML documents.
> Goal: find out whether X assumes tidy or untidy semantics for untyped 
> literals.
> 
> My conjecture is that there is no way to distinguish whether an 
> application deploys tidy or untidy semantics. Therefore, it's an 
> implementation detail, which matters only for defining a standard, 
> W3C-blessed RDF API, and is irrelevant for the spec we are working on.
> 


I may be wrong, but it seems to me that what you're talking about above 
isn't the tidy/untidy distinction we're trying to sort out.  You seem to 
be distinguishing between different ways of *representing* the literal 
values.  We've been trying to distinguish between different *semantics*. 
  The fact that the two approaches you describe behave the same way 
seems to illustrate that you're only talking about one kind of 
semantics. This potential confusion is one of the reasons why, in an 
earlier message, I asked that people clarify what definitions of "tidy" 
and "untidy" they were using.  To state the tidy/untidy distinction I 
think we're trying to deal with, in an earlier message to rdf-comments 
(using the <Jenny> <ageinyears> "10" example), Brian characterized the 
difference as:

tidy--the <ageinyears> property takes a value which is a numeral, i.e., 
a string

untidy--the <ageinyears> property takes a value which is some datatype 
value whose string representation is "10", but without further 
information, such as a range constraint, we can't tell exactly what the 
value is, e.g., the string might be in octal.
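
As a rough sketch of that characterization (my own illustration, not anything from Brian's message): the datatype names and both functions below are assumptions made up for the example. Under the tidy reading the value is always the string itself; under the untidy reading the value stays unknown until something like a range constraint supplies a datatype.

```python
from typing import Callable, Dict, Optional

# Hypothetical lexical-to-value mappings; the datatype names are
# illustrative assumptions, not identifiers from any specification.
DATATYPES: Dict[str, Callable[[str], int]] = {
    "decimalInteger": lambda lexical: int(lexical, 10),
    "octalInteger": lambda lexical: int(lexical, 8),
}


def tidy_value(lexical: str) -> str:
    # Tidy reading: the literal denotes the string itself, everywhere.
    return lexical


def untidy_value(lexical: str, range_datatype: Optional[str]) -> Optional[int]:
    # Untidy reading: the literal denotes some datatype value whose
    # lexical form is `lexical`; without a datatype we cannot say which.
    if range_datatype is None:
        return None
    return DATATYPES[range_datatype](lexical)


print(tidy_value("10"))                      # 10 (as a string)
print(untidy_value("10", None))              # None: value not determined
print(untidy_value("10", "decimalInteger"))  # 10
print(untidy_value("10", "octalInteger"))    # 8
```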

Similarly, the last RDF Datatyping working proposal (Part 1 of which we 
have already adopted) says, in Part 2 section C:

"If inline literals are to be addressed by RDF Datatyping, then a choice 
must be made between interpreting inline literals as having string 
semantics (also called tidy semantics) such that each literal would be 
treated as a global string constant, or interpreting literals as having 
value semantics (also called untidy semantics) such that the literal is 
taken to denote a datatype value and its interpretation depends upon the 
context within which it is used, such as the property and any datatype 
range defined for the property."

It's not clear to me that the two alternative implementations you've 
described get at this distinction.
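
For comparison, here is a minimal sketch of the two representations from the quoted message, as I understand them. The tuple-based triple encoding and the function names `to_tidy` and `to_untidy` are my own assumptions, standing in for stringValueOf and uniqueUntidy. The round trip preserves the string content, which is what you would expect if only the representation differs; nothing in it touches the question of what value "10" denotes.

```python
from itertools import count
from typing import List, Set, Tuple

# Assumed encodings for illustration only.
TidyTriple = Tuple[str, str, str]                # object is the shared string
UntidyTriple = Tuple[str, str, Tuple[int, str]]  # object is (node id, string label)


def to_tidy(untidy: List[UntidyTriple]) -> Set[TidyTriple]:
    # Stand-in for stringValueOf: drop the node id, keep the label.
    return {(s, p, node[1]) for s, p, node in untidy}


def to_untidy(tidy: Set[TidyTriple]) -> List[UntidyTriple]:
    # Stand-in for uniqueUntidy: give every literal occurrence a fresh id.
    ids = count()
    return [(s, p, (next(ids), o)) for s, p, o in sorted(tidy)]


tidy = {("Jenny", "ageinyears", "10"), ("Bob", "shoesize", "10")}
untidy = to_untidy(tidy)

# The two "10" occurrences become distinct nodes with the same label.
print(untidy)
# Mapping back recovers exactly the tidy triples we started with.
print(to_tidy(untidy) == tidy)  # True
```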

--Frank



-- 
Frank Manola                   The MITRE Corporation
202 Burlington Road, MS A345   Bedford, MA 01730-1420
mailto:fmanola@mitre.org       voice: 781-271-8147   FAX: 781-271-875
Received on Monday, 30 September 2002 20:56:50 UTC