University of Edinburgh comments on Architecture of the WWW from Henry S. Thompson on 2004-12-08 (public-webarch-comments@w3.org from December 2004)

From: Henry S. Thompson <ht@inf.ed.ac.uk>
Date: Wed, 08 Dec 2004 23:25:16 +0000
To: public-webarch-comments@w3.org
Message-ID: <f5b3byge5k3.fsf@erasmus.inf.ed.ac.uk>
These comments underlie our response of "be published . . . with minor
changes" but "do _not_ [publish] without significant revision" to the
WebArch Call for Review.  That is, the document _should_ be published,
without reverting back down the Recommendation track, but only after
some non-trivial improvements have been made.

General Structural Comments:

1) We would urge the TAG to keep in mind that this document is not
just an engineering specification but something approaching a sort of
"philosophy of the web". As such, although we appreciate the fact that
philosophical terminology that obviously would be useful
(intensionality/extensionality) has been left out of the document to
avoid possible philosophical "rat-holes", nonetheless attempting to
produce a document that makes statements that go beyond the Web per se
(as for instance in the definition of resource) without any reference to
philosophy runs the risk of engendering more confusion than it avoids.

2) While the "Glossary" at the end is very useful, in general the
primary terms and their relationships are used often without being
defined. For example, the picture of the Oaxaca weather example is
critical, yet the term "Representation" is not defined until
section 3.2. It would be useful if the primary terms were defined
together at the beginning of the document, indeed perhaps in
conjunction with the weather report example.

3) Often the document lacks precision as to whether it is about the
Web or the Internet as a whole. For example, the
URI->resource->representation conceptual map as a foundation of the
Web may work fine, but as a model for the entire Internet it is
doubtful. For example, when one uses the ftp URI scheme, one by nature
gets a file, which would be an information resource, which simplifies
much of the architecture.  The mailto and news URI schemes, on the
other hand, don't in general support retrieval at all.  Much of the
document, including issues such as content negotiation, does not apply
to many URI schemes.  It would be good if sections like 3.2.1 and
3.2.2 made clear that fragment identifiers and content negotiation are
only relevant to the http URI scheme and a few close relatives
thereof, not all URI schemes in general.
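
As a rough illustration of how differently these schemes are shaped, the
following Python sketch splits a few URIs with the standard library's
urllib.parse module. The mailto URI is the one quoted later in these
comments; the other URIs and the describe() helper are hypothetical and
exist only for this example. The sketch shows syntax only: whether a
scheme supports retrieval or content negotiation is a matter for that
scheme's specification, not something a parser can report.

```python
from urllib.parse import urlsplit

# Illustrative URIs. Only the mailto URI appears in these comments;
# the rest are hypothetical.
EXAMPLES = [
    "http://example.org/weather/oaxaca#today",
    "ftp://ftp.example.org/pub/report.txt",
    "mailto:nadia@example.com",
    "news:comp.infosystems.www.misc",
]


def describe(uri):
    """Return (scheme, has_authority, fragment) for a URI string."""
    parts = urlsplit(uri)
    return parts.scheme, bool(parts.netloc), parts.fragment


for uri in EXAMPLES:
    print(uri, describe(uri))
```

The mailto and news URIs have no authority component at all, which is
one visible sign that they do not fit the same retrieval model as http.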

6) While the Semantic Web may in due course become a profoundly
significant use of the Web, it is only just emerging from the research
prototype phase. It is unclear how much our current imperfect
expectations of how it will prosper should be enshrined in the
Architecture of the WWW. In particular, references to "owl:sameAs" and
"inverseFunctionalProperty" in finding out whether two URIs identify
the same resource seem at best premature.

7) Since the Web is defined as an "information space", one type of
resource is an "information resource", and a representation is defined
as "data that encodes information about resource state", it is clear
that the notion of "information" is fundamental to this document. But
exactly what the document means by it is not at all clear. There are
multiple notions of information.  Shannon's theory of information is a
theory of encoding information with no regard to content, Dretske has
elaborated a theory of information that deals with content, and
Kolmogorov has a theory of information related to complexity. Given
that there are at least three and possibly more notions of information
that could be being assumed, with distinct underlying assumptions,
the document should be more explicit, even if only to
acknowledge that the term is underspecified.  Adding a definition of
information or at least a reference in the glossary would help.

Specific Comments:

1) Section 2: In the definition of "information resource", there is
an appeal to the concept of "message", which is defined as "a
unit of communication between agents". We assume this is an attempt to root
the concept of information in Shannon's theory of information. This appeal
is in the following sentence:

 "The distinguishing characteristic of these resources is that all of
  their essential characteristics can be conveyed in a message".

Then in the next paragraph it says "There is nothing about the
essential information content of this document that cannot in
principle be transfered in a representation." First, the word
"representation" has yet to be defined.  Did you mean "There is
nothing about the essential information content of this document that
cannot in principle be transfered in a *message*"? If so, is there a
real difference between a message and a representation? Are all
representations messages, or only some, and vice versa? Also, it is
strange that there is no definition of the complement of the set of
information resources.

2) Section 3.2: The definition of representation is very problematic. It
is first described as "A Representation is data that encodes information
about resource state.", which one would think would mean that there is
some implicit connection between a representation and a resource. Then,
the next sentence makes the statement less clear:

 "Representations do not necessarily describe the resource, or
  portray a likeness of the resource, or represent the resource in
  other senses of the word "represent"."

Does this mean there is no connection between a resource and its
representation? If we, as the URI owner, say that a URI identifies
"green cheese" but serve a representation about "the moon", are we
correct or incorrect in doing so?

We would like to see a "Good Practice" statement saying that "Within
reason, a representation of a resource should make it clear to the
agent what the resource is". This is implicit in much of the rest
of the document, such as the sentence in 3.1:

 "Assuming that a representation has been successfully retrieved, the
  expressive power of the representation's format will affect how
  precisely the representation provider communicates resource state."

Alternatively, one could simply delete the second sentence in 3.2 or
clarify what it means.

3) Section 3.1.1: If a resource can be anything, then as stated, what
resource a representation is about is determined by the owner of the
URI, not by the users or readers of its content. Yet, this leads to
some interesting problems in the description of URI collision:

  "If the representation communicates the state of the resource
  inaccurately, this inaccuracy or ambiguity may lead to confusion
  about what the resource is.  If different users reach different
  conclusions about what the resource is, this may lead to URI
  collision".

Does this mean that the users of the representation of the
resource, as opposed to the owner of the URI, now determine the
resource? This sentence should either be removed or refactored.

4) The distinction between identifying "indirectly" with a URI and
identifying "directly" with a URI is often unclear in URI collisions. First,
the issue in point 3) above needs to be resolved. The problem is
introduced by this sentence:

 "To say that the URI "mailto:nadia@example.com" identifies both an
  Internet mailbox and Nadia, the person, introduces a URI collision.
  However, we can use the URI to indirectly identify Nadia.
  Identifiers are commonly used in this way.

 "Local policy establishes what they indirectly identify.  Suppose
  that nadia@example.com is Nadia's email address. The organizers of a
  conference Nadia attends might use "mailto:nadia@example.com" to
  refer indirectly to her (e.g., by using the URI as a database key in
  their database of conference participants).  This does not introduce
  a URI collision."

We don't see how this does *not* introduce a URI collision if the
owner of the URI determines what resource a URI identifies. This
paragraph seems to say that "locally" a user can determine the
resource, but globally only the "owner" can. Where does the boundary
of "local policy" begin, and "global" end?  We would like to see a
statement that the use of a URI as an indirect identification is
not "good practice", and a "Good Practice" that says that "If you
are going to use the same URI to directly identify one resource and
indirectly identify another, it is good practice to create two
separate URIs for the two separate resources". To do otherwise
undermines many other good practices mentioned in this document.
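
The proposed practice can be sketched as a small registry in which each
URI may identify only one resource. Everything here is an assumption made
for illustration: the Registry class, the URICollision exception, and the
second URI (http://example.com/people/nadia) are hypothetical and appear
nowhere in the WebArch document.

```python
class URICollision(Exception):
    """Raised when one URI is asked to identify two different resources."""


class Registry:
    """Toy model of URI ownership: one URI, one resource."""

    def __init__(self):
        self._identifies = {}

    def assign(self, uri, resource):
        """Bind uri to resource, refusing a second, different binding."""
        existing = self._identifies.get(uri)
        if existing is not None and existing != resource:
            raise URICollision(
                f"{uri} already identifies {existing!r}, not {resource!r}"
            )
        self._identifies[uri] = resource

    def lookup(self, uri):
        """Return the resource identified by uri, or None if unassigned."""
        return self._identifies.get(uri)


registry = Registry()
# The mailbox and the person get two separate URIs, as the proposed
# Good Practice suggests. The second URI is hypothetical.
registry.assign("mailto:nadia@example.com", "Nadia's mailbox")
registry.assign("http://example.com/people/nadia", "Nadia, the person")
```

Under this model, trying to bind "mailto:nadia@example.com" to Nadia the
person as well would raise URICollision, whether the second binding is
called "direct" or "indirect".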

5) Fragment Identifier semantics: Since fragment identifiers are dealt
with purely on the client side, and the owner of a URI determines what a
resource is about, why is the following added to the definition of
fragment identifiers:

 "The secondary resource may be some portion or subset of the primary
 resource, some view on representations of the primary resource, or
 some other resource defined or described by those representations."

The second clause seems consistent with how fragment identifiers are
used as defined by http, but the third seems not only inconsistent but
to give the entire definition of secondary resource no meaning: a
secondary resource is just another resource entirely, as said in the
next paragraph.

The definition of fragment identifiers in terms of resources as
opposed to just representations seems bizarre given the wording of
Section 3.2.1, which is much more accurate as regards how fragment
identifiers are actually used on the Web. It would clarify things
immensely with no loss of power (except in the extreme case of some
current doubtful Semantic Web conventions, as detailed in Point 6) to
just delete the entire section 2.6 and replace it with something more
in line with section 3.2.1 or reference section 3.2.1.
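
The client-side handling of fragment identifiers mentioned at the start
of this point can be sketched in Python with urllib.parse.urldefrag from
the standard library. The split_for_retrieval() helper and the example
URI are hypothetical. This is a minimal sketch of the usual http
behaviour, in which the client retrieves the URI without its fragment
and interprets the fragment itself against the representation it gets
back.

```python
from urllib.parse import urldefrag


def split_for_retrieval(uri):
    """Return (uri_to_request, fragment) as an http client might split them.

    Only the first element would be used to form the request; the fragment
    stays on the client.
    """
    result = urldefrag(uri)
    return result.url, result.fragment


# Hypothetical URI, used only for illustration.
request_uri, fragment = split_for_retrieval("http://example.org/doc#section2")
print(request_uri, fragment)
```

Because the server never sees the fragment in this model, whatever the
fragment identifies has to be worked out from the retrieved
representation, which is the reading that Section 3.2.1 supports.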

6) In general, it seems like points 2-5 above identify problems which
arise from trying to incorporate the current somewhat unclear and only
just emerging consensus as to the expected Architecture of the
_Semantic_ Web into this document, something which should only be done
once the Semantic Web has more fully emerged and stabilised.  First,
often in Semantic Web practice a resource is used as a name, without
any representation being needed. However, this document states clearly
this is not good practice. Second, in the Semantic Web the direct
vs. indirect identification issue is sometimes dealt with by adding
hashes to the end of a URI. So, "www.cogsci.ed.ac.uk/~ht" identifies
Henry Thompson's web-page, while "www.cogsci.ed.ac.uk/~ht#" identifies
Henry Thompson the person. There has been a long-standing argument
about this, and http://rdfweb.org/topic/HashSlashIssue provides some
interesting links.  In our opinion the use of hash for indirect
identification should be considered not good practice in addition to
the use of URIs to denote resources without representations. Lastly,
it should be good practice that the representation of a resource
clearly communicates the intent of the owner of the URI about what
resource the URI identifies to the user of the representation.

Minor point:

Section 3.2: Why in "A Representation is data..." is 'representation'
capitalised? Other words are not capitalised when introduced.

Outstanding Dissents:

We believe our recommended changes will satisfactorily address the
legitimate concerns expressed by Stickler and Kopecky.  We believe the
change to 4.5.2 in the current draft adequately addresses the HTML
WG's concern.

Executive Summary:

1) Add definition of "information" to Glossary at end.  Suggest "Information
is a difficult concept, but as it stands information is anything that can
be encoded into a message that delivers knowledge to an agent".

2) Section 2.2: Change "There is nothing about the essential
information content of this document that cannot in principle be
transfered in a representation" to "There is nothing about the
essential information content of this document that cannot in
principle be transfered in a message". Add sentence, "In the case of
this document, the message is the representation of this document."

3) Section 3.1.1 Replace the sentence: "If the representation communicates
the state of the resource inaccurately, this inaccuracy or ambiguity may
lead to confusion about what the resource is. If different users reach
different conclusions about what the resource is, this may lead to URI
collision"
with
"If the representation communicates the state of the resource inaccurately,
this inaccuracy or ambiguity may lead to confusion about what the resource
is among users. If different users reach different conclusions about what
the resource is, they may interpret this as a URI collision."

4) Section 3.2: "A Representation is data..." should say "A representation
is data", thus making lowercase the word "representation".

5) Look long and hard at places where hostages to fortune are being
given with respect to the future trajectory of the Semantic Web
project.

6) Section 3.2 Add a Good Practice: "Interpretable Representations":
"Within reason, the representation of a resource should make it clear to
the agent what the resource is."

7) Section 2.2.3: Add a Good Practice: "Avoid Indirect Identification":
"If one is going to use a single URI to directly identify one resource and
indirectly identify another resource, instead use two distinct URIs for
the two resources."  Properly understood we believe this firmly takes
a stance wrt the vexed httpRange-14 issue, and rules out at least for
the time being the 'semantic' overloading of empty fragment
identifiers (.../#).

Henry S. Thompson, Harry Halpin, University of Edinburgh


-- 
 Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
                     Half-time member of W3C Team
    2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
            Fax: (44) 131 650-4587, e-mail: ht@inf.ed.ac.uk
                   URL: http://www.ltg.ed.ac.uk/~ht/
[mail really from me _always_ has this .sig -- mail without it is forged spam]
Received on Wednesday, 8 December 2004 23:25:21 UTC