RE: on documents and terms [was: RE: [WNET] new proposal WN URIs and related issues] from Booth, David (HP Software - Boston) on 2006-04-24 (public-swbp-wg@w3.org from April 2006)

From: Booth, David (HP Software - Boston) <dbooth@hp.com>
Date: Mon, 24 Apr 2006 18:38:04 -0400
To: "Pat Hayes" <phayes@ihmc.us>, <public-swbp-wg@w3.org>
Cc: "Frank Manola" <fmanola@acm.org>
Message-ID: <EBBD956B8A9002479B0C9CE9FE14A6C20B92FA@tayexc19.americas.cpqcorp.net>
> From: Pat Hayes [mailto:phayes@ihmc.us] 
> . . . 
> Not minor at all, Frank. This might get to one of the hearts 
> of the issue.
> 
> >"Definitely not" may be technically correct, but
> >I think a bit more context is needed here.  The 
> >TAG Architecture document says:
> >
> >"It is conventional on the hypertext Web to
> >describe Web pages, images, product catalogs, 
> >etc. as "resources". The distinguishing 
> >characteristic of these resources is that all of 
> >their essential characteristics can be conveyed 
> >in a message. We identify this set as 
> >"information resources."
> >
> >This document is an example of an information
> >resource. It consists of words and punctuation 
> >symbols and graphics and other artifacts that 
> >can be encoded, with varying degrees of 
> >fidelity, into a sequence of bits. There is 
> >nothing about the essential information content 
> >of this document that cannot in principle be 
> >transfered in a message. In the case of this 
> >document, the message payload is the 
> >representation of this document."
> 
> OK, reading the above carefully, in the light of 
> David's comment, I seem to discern an implicit 
> distinction between several things. Let me be 
> excruciatingly pedantic here for a second, and 
> make very, very careful distinctions between 
> several things involved in a hypothetical HTTP 
> GET, which to keep things as simple as possible I 
> will assume is the successful getting of an XHTML 
> web page from a server, with a 2xx code, no 
> problems. There seem to be several entities 
> involved in this.
> 
> 1.  An "HTTP endpoint", which is a computational 
> process running on hardware, which processes the 
> GET request and emits http codes and bit-strings.

Yes.  This is the "information resource", defined in an operational style.

Side note: I actually used the term "logical HTTP endpoint", not just "HTTP endpoint", because an "information resource" is associated with an entire URL minus the fragment identifier, whereas a Web server is (normally?) associated with only the domain+server part (ignoring the path and query string parts).  For example, given the URI http://example.org/foo?bar#fum , the URIs http://example.org/ , http://example.org/foo and http://example.org/?bar may all correspond to different "information resources", even though they would be served by the same Web server associated with example.org.
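
As a rough illustration of the side note above, here is a minimal Python sketch. It assumes that "the entire URL minus the fragment identifier" can be approximated with the standard library's urllib.parse.urldefrag, and that the "domain+server part" corresponds to the URL's network location. The helper names information_resource_key and server_part are hypothetical, introduced only for this example.

```python
from urllib.parse import urldefrag, urlsplit


def information_resource_key(uri: str) -> str:
    """Return the URI with its fragment identifier removed (hypothetical helper)."""
    return urldefrag(uri).url


def server_part(uri: str) -> str:
    """Return only the host (and port, if any), ignoring path and query (hypothetical helper)."""
    return urlsplit(uri).netloc


# The example URIs from the side note.
uris = [
    "http://example.org/foo?bar#fum",
    "http://example.org/",
    "http://example.org/foo",
    "http://example.org/?bar",
]

# Each URI may name a different "information resource" ...
keys = {information_resource_key(u) for u in uris}

# ... even though a single Web server would answer for all of them.
servers = {server_part(u) for u in uris}
```

Under these assumptions, keys holds four distinct entries while servers holds only example.org, which is the distinction the word "logical" is meant to capture.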

> 
> 2. The sequence of bits or bytes whose 
> transmission from (1) constitutes the successful 
> completion of the GET request.

I'm not sure what you mean here.  If you're including the bits that are part of the HTTP protocol handshake itself, then it would be more than just the "representation".  If not, then I don't know how you mean #2 to be different from #4 below.

> 
> 3. The Web page itself: a document, consisting of 
> characters, which conform to XHTML syntactic 
> rules.

If it conforms to XHTML syntactic rules it sounds like you are talking about a particular instance of a document rather than a document in the abstract sense (which may change over time), so this sounds to me like a "representation".

> 
> 4. The encoding of the Web page (3) which is used 
> by the process (1) to produce the bitsequence (2)

This also sounds like the "representation", though stated more specifically.  The WebArch says: 'HTTP . . . uses the "Content-Type" and "Content-Encoding" header fields to further identify the format of the representation'.  See http://www.w3.org/TR/webarch/#intro
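
To make the role of those two header fields concrete, here is a minimal Python sketch of how a recipient might turn the transmitted bytes back into the characters of the page. It is only an illustration under stated assumptions: the function decode_representation is hypothetical, only the gzip content coding is handled, and falling back to UTF-8 when no charset parameter is given is an assumption of this example rather than anything the WebArch prescribes.

```python
import gzip


def decode_representation(body: bytes, headers: dict) -> str:
    """Recover the document's characters from the transmitted bytes (hypothetical helper)."""
    # Content-Encoding says how the bytes were transformed for transfer.
    if headers.get("Content-Encoding", "").lower() == "gzip":
        body = gzip.decompress(body)

    # Content-Type names the format and, optionally, the character set.
    charset = "utf-8"  # assumption: default used only for this sketch
    for param in headers.get("Content-Type", "").split(";")[1:]:
        name, _, value = param.strip().partition("=")
        if name.lower() == "charset":
            charset = value.strip('"')

    return body.decode(charset)


# A tiny page (roughly item 3), and the bytes actually sent for it (items 2 and 4).
page = "<html><body>caf\u00e9</body></html>"
headers = {
    "Content-Type": "application/xhtml+xml; charset=utf-8",
    "Content-Encoding": "gzip",
}
wire_bytes = gzip.compress(page.encode("utf-8"))

recovered = decode_representation(wire_bytes, headers)
```

Here wire_bytes differs from the page's own characters, yet the headers carry enough information to recover the page, which is one way to read the statement that the header fields "further identify the format of the representation".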
 
> 
> 5. The encoding of the Web page (3) which is 
> produced from the bit sequence (2) in the browser 
> which issued the GET request and used by it to 
> render a visual form of the Web page (3) on the 
> user's screen.

This sounds like an internal browser-dependent version of the "representation".

> 
> and we could of course go further, distinguishing 
> the image on the screen from its binary 
> representation, the state of the process from the 
> process itself, and so on (and on.)
> 
> Now, I tend to blur some of these distinctions, 
> myself. For example, I tend to think of 2 through 
> 5 as simply being 'the Web page'; or if I am 
> being more careful, to identify 2, 4 and 5 as 
> 'renderings' or 'encodings' or 'tokens' of the 
> single, abstract, Web page (3). And I often don't 
> bother to distinguish between 1 and 4. This gives 
> a simplified picture, which is adequate for many 
> purposes, in which we happily ignore the 
> type/token distinction (as we normally do in 
> English) and where issuing a GET is a bit like 
> asking an usher for A concert program, at which 
> she then hands you a copy from her pile of 
> identical copies, and you take it away and read 
> it without bothering her further, and if anyone 
> asks you what you are reading you say, THE 
> program. (You could say that each copy is a 
> 'representation' of the great concert program in 
> the sky, or of all the other copies, or of the 
> state of the printing platen at the moment the 
> ink hit the paper, but there's not usually much 
> point in being that picky about these 
> distinctions.)

True, but if one is discussing the TAG's WebArch document ( at http://www.w3.org/TR/webarch/ ), it is essential to make this distinction, because the difference between a "representation" and an "information resource" is essential to the WebArch.

> . . .
> Just to clarify another source of muddle, I would 
> not call any of these things "representations" of 
> any of the others. In my usage of the word 
> "representation", there is no representation of 
> anything involved in the entire architectural 
> story of how an http GET is processed. Nothing 
> represents anything here, because there are no 
> semantic relationships involved. The various 
> bitstrings are simply copies of one another, and 
> the relationship of a document to its bitstring 
> encoding is that of a rendering or encoding, 
> rather than a representation: a token/type 
> relationship. (The bitstring does not *describe* 
> the document it encodes. If it did, it would have 
> to describe it using a syntax, but bitstrings, 
> pretty much by their very nature, do not have any 
> syntax.)

I agree.  I don't like the term "representation" either, but I guess the TAG needed a term and that was the term they picked.

> . . . 
> An RDF ontology, at any rate, is either an RDF 
> graph or an RDF/XML XML document. Either way, it 
> is not an HTTP endpoint or an abstraction of an 
> HTTP endpoint. So it cannot be an information 
> resource in David's sense, seems to me.

Yes, it can be if instances of it are intended to be served via HTTP.  My proposed definition[1] is very narrow in at least two ways: (1) it ignores "documents" that are never intended to be served (because they are not very relevant to the "information resource"/"representation" discussion); and (2) it is restricted to the HTTP protocol, because that's where the issue of resource identity (and the httpRange-14 issue) comes up.  

[1] http://lists.w3.org/Archives/Public/public-swbp-wg/2006Apr/0053.html

David Booth
Received on Monday, 24 April 2006 22:52:22 UTC