- From: Henry S. Thompson <ht@inf.ed.ac.uk>
- Date: Wed, 08 Dec 2004 23:25:16 +0000
- To: public-webarch-comments@w3.org
These comments underlie our response of "be published . . . with minor changes" but "do _not_ [publish] without significant revision" to the WebArch Call for Review. That is, the document _should_ be published, without going back down the Recommendation track, but only after some non-trivial improvements have been made.

General Structural Comments:

1) We would urge the TAG to keep in mind that this document is not just an engineering specification but something approaching a "philosophy of the Web". We appreciate that philosophical terminology that would obviously be useful (intentionality/extensionality) has been left out of the document to avoid possible philosophical "rat-holes". Nonetheless, attempting to produce a document that makes statements going beyond the Web per se (as, for instance, in the definition of resource) without any reference to philosophy runs the risk of engendering more confusion than it avoids.

2) While the "Glossary" at the end is very useful, the primary terms and their relationships are often used without being defined. For example, the picture of the Oaxaca weather example is critical, yet the term "Representation" is not defined until section 3.2. It would be useful if the primary terms were defined together at the beginning of the document, perhaps in conjunction with the weather report example.

3) The document often lacks precision as to whether it is about the Web or the Internet as a whole. For example, the URI->resource->representation conceptual map may work fine as a foundation of the Web, but as a model for the entire Internet it is doubtful. When one uses the ftp URI scheme, one by nature gets a file, which would be an information resource, which simplifies much of the architecture. The mailto and news URI schemes, on the other hand, don't in general support retrieval at all.
Much of the document, including issues such as content negotiation, does not apply to many URI schemes. It would be good if sections like 3.2.1 and 3.2.2 made clear that fragment identifiers and content negotiation are only relevant to the http URI scheme and a few close relatives thereof, not to all URI schemes in general.

6) While the Semantic Web may in due course become a profoundly significant use of the Web, it is only just emerging from the research prototype phase. It is unclear how much our current imperfect expectations of how it will prosper should be enshrined in the Architecture of the WWW. In particular, the references to "owl:sameAs" and "inverseFunctionalProperty" as means of finding out whether two URIs identify the same resource seem at best premature.

7) Since the Web is defined as an "information space", one type of resource is an "information resource", and a representation is defined as "data that encodes information about resource state", it is clear that the notion of "information" is fundamental to this document. But exactly what the document means by it is not at all clear. There are multiple notions of information: Shannon's theory of information is a theory of encoding information with no regard to content, Dretske has elaborated a theory of information that deals with content, and Kolmogorov has a theory of information related to complexity. Given that there are at least three and possibly more notions of information that could be assumed, each with distinct underlying assumptions, the document should be more explicit, even if only to acknowledge that the term is underspecified. Adding a definition of information, or at least a reference in the glossary, would help.

Specific Comments:

1) Section 2: In the definition of "information resource", there is an appeal to the concept of "message", which is defined as "a unit of communication between agents".
We assume this is an attempt to root the concept of information in Shannon's theory of information. The appeal is made in the following sentence: "The distinguishing characteristic of these resources is that all of their essential characteristics can be conveyed in a message". Then the next paragraph says "There is nothing about the essential information content of this document that cannot in principle be transferred in a representation." First, the word "representation" has yet to be defined. Did you mean "There is nothing about the essential information content of this document that cannot in principle be transferred in a *message*"? If so, is there a real difference between a message and a representation? Are all representations messages, or only some, and vice versa? Also, it is strange that there is no definition of the complement of the set of information resources.

2) Section 3.2: The definition of representation is very problematic. It is first described as "A Representation is data that encodes information about resource state.", which one would think means that there is some implicit connection between a representation and a resource. The next sentence then makes the statement less clear: "Representations do not necessarily describe the resource, or portray a likeness of the resource, or represent the resource in other senses of the word "represent"." Does this mean there is no connection between a resource and its representation? If we, as the URI owner, say that a URI identifies "green cheese" but serve a representation about "the moon", are we correct or incorrect in doing so? We would like to see a "Good Practice" statement saying that "Within reason, a representation of a resource should make it clear to the agent what the resource is".
This is implicit in much of the rest of the document, such as the sentence in 3.1: "Assuming that a representation has been successfully retrieved, the expressive power of the representation's format will affect how precisely the representation provider communicates resource state." Alternatively, one could simply delete the second sentence in 3.2 or clarify what it means.

3) Section 3.1.1: If a resource can be anything, then as stated, what resource a representation is about is determined by its owner, not by the users or readers of its content. Yet this leads to some interesting problems in the description of URI collision: "If the representation communicates the state of the resource inaccurately, this inaccuracy or ambiguity may lead to confusion about what the resource is. If different users reach different conclusions about what the resource is, this may lead to URI collision". Does this mean that the users of the representation of the resource, as opposed to the owner of the URI, now determine the resource? This sentence should be either removed or refactored.

4) The distinction between "indirectly" and "directly" identifying something with a URI is often unclear where URI collisions are concerned. First, the issue in point 3) above needs to be resolved. The problem is introduced by this passage: "To say that the URI "mailto:nadia@example.com" identifies both an Internet mailbox and Nadia, the person, introduces a URI collision. However, we can use the URI to indirectly identify Nadia. Identifiers are commonly used in this way. Local policy establishes what they indirectly identify. Suppose that nadia@example.com is Nadia's email address. The organizers of a conference Nadia attends might use "mailto:nadia@example.com" to refer indirectly to her (e.g., by using the URI as a database key in their database of conference participants). This does not introduce a URI collision."
We don't see how this does *not* introduce a URI collision if the owner of the URI determines what resource a URI identifies. This paragraph seems to say that "locally" a user can determine the resource, but globally only the "owner" can. Where does "local policy" end and "global" begin? We would like to see a statement that the use of a URI for indirect identification is not "good practice", and a "Good Practice" that says "If you are going to use the same URI to directly identify one resource and indirectly identify another, it is good practice to create two separate URIs for the two separate resources". To do otherwise undermines many other good practices mentioned in this document.

5) Fragment Identifier semantics: Since fragment identifiers are dealt with purely on the client side, and the owner of a URI determines what a resource is about, why is the following added to the definition of fragment identifiers: "The secondary resource may be some portion or subset of the primary resource, some view on representations of the primary resource, or some other resource defined or described by those representations."? The second clause seems consistent with how fragment identifiers are used as defined by http, but the third seems not only inconsistent but also to strip the entire definition of secondary resource of meaning: a secondary resource is just another resource entirely, as said in the next paragraph. The definition of fragment identifiers in terms of resources, as opposed to just representations, seems bizarre given the wording of Section 3.2.1, which is much more accurate as regards how fragment identifiers are actually used on the Web. It would clarify things immensely, with no loss of power (except in the extreme case of some current doubtful Semantic Web conventions, as detailed in Point 6), to just delete the entire section 2.6 and either replace it with something more in line with section 3.2.1 or reference section 3.2.1.
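
The client-side handling that point 5 relies on can be sketched as follows. This is only an illustration, assuming Python's standard urllib.parse module; the example.org URIs and the helper name request_target are hypothetical and do not come from the WebArch document.

```python
from urllib.parse import urldefrag


def request_target(uri: str) -> str:
    """Return the URI with its fragment removed.

    A client resolves the fragment itself against the retrieved
    representation, so the fragment is not part of what is requested.
    """
    base, _fragment = urldefrag(uri)
    return base


# Hypothetical URIs: a document, a section within it, and an empty-fragment form.
page = "http://example.org/doc"
section = "http://example.org/doc#section-2"
empty_fragment = "http://example.org/doc#"

# All three lead to a request for the same primary resource.
assert request_target(page) == request_target(section) == request_target(empty_fragment)
```

Under these assumptions the server cannot tell which of the three URIs the client started from, which is why any meaning attached to the fragment has to be worked out from the retrieved representation.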
6) In general, it seems that points 2-5 above identify problems which arise from trying to incorporate into this document the current, somewhat unclear and only just emerging consensus as to the expected Architecture of the _Semantic_ Web, something which should only be done once the Semantic Web has more fully emerged and stabilised. First, in Semantic Web practice a resource is often used as a name, without any representation being needed. However, this document states clearly that this is not good practice. Second, in the Semantic Web the direct vs. indirect identification issue is sometimes dealt with by adding hashes to the end of a URI. So, "www.cogsci.ed.ac.uk/~ht" identifies Henry Thompson's web-page, while "www.cogsci.ed.ac.uk/~ht#" identifies Henry Thompson the person. There has been a long-standing argument about this, and http://rdfweb.org/topic/HashSlashIssue provides some interesting links. In our opinion the use of hash for indirect identification should be considered not good practice, as should the use of URIs to denote resources without representations. Lastly, it should be good practice that the representation of a resource clearly communicates to the user of the representation the intent of the owner of the URI about what resource the URI identifies.

Minor point: Section 3.2: Why in "A Representation is data..." is 'representation' capitalised? Other words are not capitalised when introduced.

Outstanding Dissents:

We believe our recommended changes will satisfactorily address the legitimate concerns expressed by Stickler and Kopecky. We believe the change to 4.5.2 in the current draft adequately addresses the HTML WG's concern.

Executive Summary:

1) Add definition of "information" to Glossary at end. Suggest "Information is a difficult concept, but as it stands information is anything that can be encoded into a message that delivers knowledge to an agent".
2) Section 2.2: Change "There is nothing about the essential information content of this document that cannot in principle be transferred in a representation" to "There is nothing about the essential information content of this document that cannot in principle be transferred in a message". Add the sentence "In the case of this document, the message is the representation of this document."

3) Section 3.1.1: Replace the sentence "If the representation communicates the state of the resource inaccurately, this inaccuracy or ambiguity may lead to confusion about what the resource is. If different users reach different conclusions about what the resource is, this may lead to URI collision" with "If the representation communicates the state of the resource inaccurately, this inaccuracy or ambiguity may lead to confusion about what the resource is among users. If different users reach different conclusions about what the resource is, they may interpret this as a URI collision."

4) Section 3.2: "A Representation is data..." should say "A representation is data", thus making the word "representation" lowercase.

5) Look long and hard at places where hostages to fortune are being given with respect to the future trajectory of the Semantic Web project.

6) Section 3.2: Add a Good Practice: "Interpretable Representations": "Within reason, the representation of a resource should make it clear to the agent what the resource is."

7) Section 2.2.3: Add a Good Practice: "Avoid Indirect Identification": "If one is going to use a single URI to directly identify one resource and indirectly identify another resource, instead use two distinct URIs for the two resources." Properly understood, we believe this takes a firm stance with respect to the vexed httpRange-14 issue, and rules out, at least for the time being, the 'semantic' overloading of empty fragment identifiers (.../#).

Henry S. Thompson, Harry Halpin, University of Edinburgh

--
Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
Half-time member of W3C Team
2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: ht@inf.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/
[mail really from me _always_ has this .sig -- mail without it is forged spam]
Received on Wednesday, 8 December 2004 23:25:21 UTC