RE: on documents and terms [was: RE: [WNET] new proposal WN URIs and related issues] from Pat Hayes on 2006-04-25 (public-swbp-wg@w3.org from April 2006)

From: Pat Hayes <phayes@ihmc.us>
Date: Tue, 25 Apr 2006 11:12:23 -0500
To: "Booth, David (HP Software - Boston)" <dbooth@hp.com>
Cc: "Pat Hayes" <phayes@ihmc.us>, <public-swbp-wg@w3.org>, "Frank Manola" <fmanola@acm.org>
Message-Id: <p06230901c073f1ab5115@[10.100.0.24]>
>  > From: Pat Hayes [mailto:phayes@ihmc.us]
>>  . . .
>>  Not minor at all, Frank. This might get to one of the hearts
>>  of the issue.
>>
>>  >"Definitely not" may be technically correct, but
>>  >I think a bit more context is needed here.  The
>>  >TAG Architecture document says:
>>  >
>>  >"It is conventional on the hypertext Web to
>>  >describe Web pages, images, product catalogs,
>>  >etc. as "resources". The distinguishing
>>  >characteristic of these resources is that all of
>>  >their essential characteristics can be conveyed
>>  >in a message. We identify this set as
>>  >"information resources."
>>  >
>>  >This document is an example of an information
>>  >resource. It consists of words and punctuation
>>  >symbols and graphics and other artifacts that
>>  >can be encoded, with varying degrees of
>>  >fidelity, into a sequence of bits. There is
>>  >nothing about the essential information content
>>  >of this document that cannot in principle be
>>  >transferred in a message. In the case of this
>>  >document, the message payload is the
>>  >representation of this document."
>>
>>  OK, reading the above carefully, in the light of
>>  David's comment, I seem to discern an implicit
>>  distinction between several things. Let me be
>>  excruciatingly pedantic here for a second, and
>>  make very, very careful distinctions between
>>  several things involved in a hypothetical HTTP
>>  GET, which to keep things as simple as possible I
>>  will assume is the successful getting of an XHTML
>>  web page from a server, with a 2xx code, no
>>  problems. There seem to be several entities
>>  involved in this.
>>
>>  1.  An "HTTP endpoint", which is a computational
>>  process running on hardware, which processes the
>>  GET request and emits http codes and bit-strings.
>
>Yes.  This is the "information resource", defined in an operational style.
>
>Side note: I actually used the term "logical 
>HTTP endpoint", not just "HTTP endpoint", 
>because an "information resource" is associated 
>with an entire URL minus the fragment 
>identifier, whereas a Web server is (normally?) 
>associated with only the domain+server part 
>(ignoring the path and query string parts).  For 
>example, given the URI 
>http://example.org/foo?bar#fum , 
>http://example.org/ , http://example.org/foo and 
>http://example.org?bar may all correspond to 
>different "information resources", even though 
>they would be served by the same Web server 
>associated with example.org.

Yes. I really don't care about distinctions like 
this for the present discussion. But still, I 
take it, by 'logical endpoint', you do mean to 
refer to some kind of computational process 
running on a machine, however this is 
conceptualized. If not, I am unable to follow 
your meaning; if so, then this is not the same 
kind of thing as a document, a piece of XML, or a 
bit- or byte-stream.
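
For reference, the URI split described in the side note quoted 
above can be sketched as follows. This is only an illustration 
using Python's standard urllib.parse module; the variable names 
are made up for the example, and nothing here is meant as a 
definition of "logical HTTP endpoint".

```python
from urllib.parse import urldefrag, urlsplit

uri = "http://example.org/foo?bar#fum"

# Everything except the fragment identifier: on the quoted account,
# this is what an "information resource" is associated with.
without_fragment, fragment = urldefrag(uri)

# The individual components of the same URI.
parts = urlsplit(uri)
server_part = parts.netloc   # what a Web server is normally associated with
path_part = parts.path
query_part = parts.query

print(without_fragment, fragment, server_part, path_part, query_part)
```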

>  >
>>  2. The sequence of bits or bytes whose
>>  transmission from (1) constitutes the successful
>>  completion of the GET request.
>
>I'm not sure what you mean here.  If you're 
>including the bits that are part of the HTTP 
>protocol handshake itself, then it would be more 
>than just the "representation".  If not, then I 
>don't know how you mean #2 to be different from 
>#4 below.

As to the first point, I would like to know how 
you distinguish those parts of the transmission 
which you consider to be the 'representation' 
from those you do not, and say what it is about 
the former that makes them particularly 
'representational'. As to the second, the 
difference between #2 and #4 is that #2 is 
transmitted, while #4 resides on, or at, the 
network location described as #1. I have a mental 
picture here of #2 being pretty much a read-out 
or copy of #4.
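
A minimal sketch of that mental picture, under loud assumptions: 
the page text, the variable names and the choice of UTF-8 are all 
invented for the illustration, and no real HTTP exchange takes 
place. The numbers in the comments refer to the numbered items 
quoted in this message.

```python
# (3) The Web page considered as a document: a sequence of characters.
web_page = "<html><body><p>Concert program</p></body></html>"

# (4) The encoding that resides at the server (UTF-8 assumed here).
stored_encoding = web_page.encode("utf-8")

# (2) The byte sequence that is transmitted: a read-out or copy of (4).
transmitted = bytes(bytearray(stored_encoding))

# (5) What the requesting browser reconstructs from (2) before rendering.
received_page = transmitted.decode("utf-8")

print(transmitted == stored_encoding, received_page == web_page)
```

On this picture the byte sequences are copies of one another, and 
each one encodes the same characters.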

>  >
>>  3. The Web page itself: a document, consisting of
>>  characters, which conform to XHTML syntactic
>>  rules.
>
>If it conforms to XHTML syntactic rules it 
>sounds like you are talking about a particular 
>instance of a document rather than a document in 
>the abstract sense (which may change over time)

No, a document does not change over time, in 
either the abstract or concrete sense. To refer 
to documents changing over time is simply an 
ontological error. There is nothing in the XML 
spec that refers to documents changing over time. 
Literary documents, legal documents and other 
documents do not change over time. In many cases, 
it is part of the very reason for having the 
document that it does not change over time. RDF 
graphs do not change over time. According to the 
TAG and REST, resources are defined to be able to 
change over time (more properly, to be functions 
from times to representations) but that does not 
imply that documents are resources: this is in 
fact one of the issues that we need to get clear. 
It seems that they cannot be, in fact, for this 
very reason: the only way to describe this 
situation coherently is to say that a resource 
can be a function from times to documents (the 
'version' at that time).
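
One way to picture a resource as a function from times to 
documents is sketched below. This is an illustration only: the 
Document and Resource classes, and the publish and at methods, are 
hypothetical names invented here, not anything taken from the TAG 
documents or from REST.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Document:
    """A document in the sense used above: once it exists, it never changes."""
    text: str


class Resource:
    """A resource modeled as a function from times to documents."""

    def __init__(self) -> None:
        # (time from which the version applies, the document), kept sorted by time
        self._versions: List[Tuple[datetime, Document]] = []

    def publish(self, when: datetime, doc: Document) -> None:
        """Record that `doc` is the version in force from `when` onward."""
        self._versions.append((when, doc))
        self._versions.sort(key=lambda pair: pair[0])

    def at(self, when: datetime) -> Optional[Document]:
        """Return the version in force at `when`, or None if there is none yet."""
        current = None
        for stamp, doc in self._versions:
            if stamp <= when:
                current = doc
            else:
                break
        return current


page = Resource()
v1 = Document("first version")
v2 = Document("second version")
page.publish(datetime(2006, 4, 1), v1)
page.publish(datetime(2006, 4, 20), v2)
```

Here page.at(datetime(2006, 4, 10)) yields v1 and 
page.at(datetime(2006, 4, 25)) yields v2. The resource gives 
different answers at different times, while neither document is 
ever modified.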

>, so this sounds to me like a "representation".

I cannot follow you here. Something is classified 
as a 'representation' simply by virtue of it not 
changing over time? This is the most 
extraordinary idea, and bears absolutely no 
relationship to the normal uses of this 
terminology. How can an abstract document be a 
representation of one of its own instances or 
tokens? This simply does not make sense: it would 
seem to make the representing relationship 
circular.

>  > 4. The encoding of the Web page (3) which is used
>>  by the process (1) to produce the bitsequence (2)
>
>This also sounds like the "representation", 
>though stated more specifically.  The WebArch 
>says: 'HTTP . . . uses the "Content-Type" and 
>"Content-Encoding" header fields to further 
>identify the format of the representation'.  See 
>http://www.w3.org/TR/webarch/#intro

The WebArch seems to use language incoherently. 
Part of my goal here is to try to disentangle 
what its authors intended to say. Citing it as 
authoritative is about as useful as quoting 
scripture to an atheist. I have no clear idea 
what the document means, in particular, by 
'representation', other than it is clearly not 
what the rest of the world means.

>  > 5. The encoding of the Web page (3) which is
>>  produced from the bit sequence (2) in the browser
>>  which issued the GET request and used by it to
>>  render a visual form of the Web page (3) on the
>>  user's screen.
>
>This sounds like an internal browser-dependent 
>version of the "representation".

You refer to 'the representation' in the 
singular, which seems to indicate that some of 
these distinctions are irrelevant. Fair enough; 
but can you indicate which of my cases you would 
lump together as being (versions of ?) 'the' 
representation, and what you mean by a 'version'?

>  > and we could of course go further, distinguishing
>>  the image on the screen from its binary
>>  representation, the state of the process from the
>>  process itself, and so on (and on.)
>>
>>  Now, I tend to blur some of these distinctions,
>>  myself. For example, I tend to think of 2 through
>>  5 as simply being 'the Web page'; or if I am
>>  being more careful, to identify 2, 4 and 5 as
>>  'renderings' or 'encodings' or 'tokens' of the
>>  single, abstract, Web page (3). And I often don't
>>  bother to distinguish between 1 and 4. This gives
>>  a simplified picture, which is adequate for many
>>  purposes, in which we happily ignore the
>>  type/token distinction (as we normally do in
>>  English) and where issuing a GET is a bit like
>>  asking an usher for A concert program, at which
>>  she then hands you a copy from her pile of
>>  identical copies, and you take it away and read
>>  it without bothering her further, and if anyone
>>  asks you what you are reading you say, THE
>>  program. (You could say that each copy is a
>>  'representation' of the great concert program in
>>  the sky, or of all the other copies, or of the
>>  state of the printing platen at the moment the
>>  ink hit the paper, but there's not usually much
>>  point in being that picky about these
>>  distinctions.)
>
>True, but if one is discussing the TAG's WebArch 
>document ( at http://www.w3.org/TR/webarch/ ), 
>it is essential to make this distinction, 
>because the difference between a 
>"representation" and an "information resource" 
>is essential to the WebArch.

I repeat, the WebArch is incomprehensible. The 
point, for me, of this entire discussion is to 
try to make sense of it. I know that the WebArch 
makes this distinction between "representation" 
and "information resource", but it never defines 
either of these terms, so I have no idea WHAT 
distinction this is supposed to actually BE. To 
be told that an incomprehensible distinction is 
'essential' is not very much help.

>
>>  . . .
>>  Just to clarify another source of muddle, I would
>>  not call any of these things "representations" of
>>  any of the others. In my usage of the word
>>  "representation", there is no representation of
>>  anything involved in the entire architectural
>>  story of how an http GET is processed. Nothing
>>  represents anything here, because there are no
>>  semantic relationships involved. The various
>>  bitstrings are simply copies of one another, and
>>  the relationship of a document to its bitstring
>>  encoding is that of a rendering or encoding,
>>  rather than a representation: a token/type
>>  relationship. (The bitstring does not *describe*
>>  the document it encodes. If it did, it would have
>>  to describe it using a syntax, but bitstrings,
>>  pretty much by their very nature, do not have any
>>  syntax.)
>
>I agree.  I don't like the term "representation" 
>either, but I guess the TAG needed a term and 
>that was the term they picked.

Fine, provided that they gave some indication of 
what they intended it to mean, in this new 
technical sense. But they do not, and never have 
done. They did not explicate a notion and then 
say, we will call this "representation". They 
simply used the word, as though their readers 
shared in this usage, and refused to give any 
explication of what they mean. Your message 
continues in this unhelpful tradition, in fact.

>
>>  . . .
>>  An RDF ontology, at any rate, is either an RDF
>>  graph or an RDF/XML XML document. Either way, it
>>  is not an HTTP endpoint or an abstraction of an
>>  HTTP endpoint. So it cannot be an information
>>  resource in David's sense, seems to me.
>
>Yes, it can be if instances of it are intended to be served via HTTP.

No, I am sorry, it cannot. The fact is that an 
HTTP endpoint, given your answer above to my 
question, is not even in the same category as an 
RDF ontology: it is not the same KIND of thing. So 
if an information resource is an HTTP endpoint, 
then it cannot possibly be an RDF ontology. If 
you want an RDF ontology to be an information 
resource, then you must change your definition. 
This has got nothing to do with the transfer 
protocol.

>  My proposed definition[1] is very narrow in at 
>least two ways: (1) it ignores "documents" that 
>are never intended to be served (because they 
>are not very relevant to the "information 
>resource"/"representation" discussion); and (2) 
>it is restricted to the HTTP protocol, because 
>that's where the issue of resource identity (and 
>the httpRange-14 issue) comes up.

Fine, I don't want to take issue with either of 
those restrictions. My point is more basic: 
running code at a network communication endpoint, 
on the one hand; and documents or ontologies, on 
the other, are simply not the same kind of thing. 
If an information resource is defined to be the 
former, then one of the latter can't be an 
information resource.

Pat


>
>[1] http://lists.w3.org/Archives/Public/public-swbp-wg/2006Apr/0053.html
>
>David Booth


-- 
---------------------------------------------------------------------
IHMC		(850)434 8903 or (650)494 3973   home
40 South Alcaniz St.	(850)202 4416   office
Pensacola			(850)202 4440   fax
FL 32502			(850)291 0667    cell
phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes
Received on Tuesday, 25 April 2006 16:12:42 UTC