review of Web Architecture document from Frank Manola on 2004-03-10 (public-webarch-comments@w3.org from March 2004)

From: Frank Manola <fmanola@acm.org>
Date: Wed, 10 Mar 2004 17:14:03 -0500
To: public-webarch-comments@w3.org
Cc: w3c-rdfcore-wg@w3.org
Message-ID: <404F932B.2050105@acm.org>
The following are some (personal) comments on 
http://www.w3.org/TR/2003/WD-webarch-20031209/
Sorry they're late.

--Frank

====

Section 1:  [[A travel scenario is used throughout this document to 
illustrate typical behavior of Web agents -- people or software (on 
behalf of a person, entity, or process) acting on this information space.]]

This sentence defines "Web agents" as including both people and software 
(as opposed to just software).  However, the usage of terms like 
"agent", "user agent", etc. throughout the document isn't always 
consistent with including people in addition to software in the 
definition (and sometimes, but not always, the phrase "software agent" 
is used explicitly in places where only software is clearly meant). 
Further comments will illustrate this point in specific instances.

====

[[This scenario illustrates the three architectural bases of the Web 
that are discussed in this document:

    1. Identification. Each resource is identified by a URI. In this 
travel scenario, the resource is about the weather in Oaxaca and the URI 
is "http://weather.example.com/oaxaca".]]

It would be more consistent with the rest of the example if "the 
resource is about the weather in Oaxaca" read "the resource is *a 
report* about the weather in Oaxaca."

Further, given the potential generality of the things URIs can identify, 
it would be helpful if there were some discussion somewhere about 
distinguishing between URIs identifying such distinct things as:

a. a report about the weather in Oaxaca
b. another report about the weather in Oaxaca produced by a second 
organization
c. "the weather in Oaxaca" (which no one "owns", but which may be 
described in multiple reports created (and owned) by multiple 
independent parties)
d. data on the weather in Oaxaca reported by on-line weather 
instrumentation which may be accessible to anyone, and which the 
creators of (a) and (b) combine with additional information (satellite 
photos, radar sweeps) to produce the reports (a) and (b).

More on this later.

====

Section 1.1:  [[    * The addition a conformance section is not likely 
to increase the utility of the document.
]]

The addition *of* a conformance section.

====

Section 1.1.2, first para:  [[The section on Architectural 
Specifications includes references.]]

This sentence seems to end abruptly.  References to what?

====

Section 1.1.3:  [[Authors of protocol specifications in particular 
should invest time in understanding the REST model and consider the role 
to which of its principles could guide their design: statelessness, 
clear assignment of roles to parties, uniform address space, and a 
limited, uniform set of verbs.]]

This sentence has an "interesting" structure.  For one thing, "the role 
to which of its principles could guide their design" seems to mix 
several more usual constructions, e.g., either "the role its principles 
could [or should] *play* in their designs" or "the *extent* to which 
each of its principles could [or should] guide their designs".  For 
another, it seems as if the list of principles should follow 
"principles" rather than "design", as in something like:  "Authors of 
protocol specifications in particular should invest time in 
understanding the REST model and consider the role its principles -- 
statelessness, clear assignment of roles to parties, uniform address 
space, and a limited, uniform set of verbs -- could play in their designs.

====

Section 1.2.1, second para:  [[...The fact, for example, that the an 
image can be identified...]]

"an image" rather than "the an image"

====

[[...PNG and SVG to evolve independent of...]]

independently of

====

Section 1.2.2, second last para:  [[For example, from early on in the 
Web, HTML agents followed the convention of ignoring unknown elements.]]

"from early on in the Web" seems a little clumsy.  How about "For 
example, HTML agents have historically followed..."?

====

Section 1.2.3:  [[A user agent acts on behalf of the user and therefore 
is expected to help the user understand the nature of errors, and 
possibly overcome them. User agents that correct errors without the 
consent of the user are not acting on the user's behalf.]]

Is "user agent" intended to be *any* kind of "agent" (human or software, 
as previously defined) acting on behalf of someone else (the "user", so 
far undefined), or just *software* that acts on behalf of a human?

Also, the text seems to equate "act on behalf of the user" with that 
action necessarily being helpful, which is not necessarily the way "act 
on behalf of" is always interpreted.  The real point would seem to be 
that user agents that correct errors in this way may in some sense be 
acting on the user's behalf, but they aren't helping the user by doing it.

====

Same section, third bullet:  [[* An agent that encounters unrecognized 
content...]]

Given the context, this seems a bit ambiguous, since it might be taken 
to refer to "user agent", as well as more generally to "agent" (assuming 
these are different;  are they?)

====

Section 2:  First para:  [[Parties who wish to communicate must agree 
upon a shared set of identifiers and on their meanings.]]

"identifiers (names for things)"?

====

Second para:  [[It follows that a resource should be assigned a URI if a 
third party might reasonably want to link to it...]]

Why "a third party" (what are the first two parties)?

====

Third para:  [[Resources exist before URIs...]]

This sounds like it should be "Resources exist independently of URIs".

====

[[Designers should expect that it will prove useful to be able to share 
a URI across applications, even if that utility is not initially evident.]]

Why is the reference here to "designers"?  Designers of what?  This 
sounds like it should instead be "resource owners", as mentioned in the 
principle "URI assignment" just below.

However, there is also a related issue, which is that the terms 
"resource owner", "URI owner", and "URI producer" are all used in 
further discussion.  I can imagine situations in which these (or at 
least the first two) might have distinct meanings, but they don't 
necessarily seem to be used that way.  If they are distinct, it would 
help to have precise definitions.  If they aren't distinct, it would 
help to pick one term and use it consistently (along with some further 
explanation of why they are the same).

In particular, "Resource owner" and "URI owner" would appear to be 
equivalent when discussing URIs that refer to resources with retrievable 
representations.  However, given that URIs can be created to refer to 
other kinds of resources, it would seem that multiple URIs might be 
created to refer to the same resource, and those URIs would have 
different owners.  For example, suppose I (the person Frank Manola) am 
the resource.  It seems to me I can reasonably claim to be the owner of 
that resource (whether I have assigned myself a URI or not;  recall that 
resources can have zero URIs).  Independently, other people may create 
resources with retrievable representations (e.g., reports) that refer to 
me and, perhaps not knowing the URI I have assigned to myself (even if I 
*have* assigned one), can create URIs to refer to me (say, in RDF 
statements). It seems to me those other people can reasonably claim to 
be the owners of those latter URIs (e.g., they determine that those URIs 
denote me), even though they don't own the resource the URIs identify 
(me).  Moreover, these other people (more effectively) can create URIs 
for the resources with retrievable representations (reports referring 
to, among other things, Frank Manola), and those resources (the reports) 
are distinct from the resource Frank Manola.  In this case, it seems to 
me that those other people are the owners of both the resources (the 
reports) and the URIs that identify them.

====

[[Principle:  URI assignment:  A resource owner SHOULD assign a URI to 
each resource that others will expect to refer to.]]

This seems as if it should read "A resource owner SHOULD assign a URI to 
each resource that the owner expects others will want to refer to." 
(How can others expect to refer to resources they don't necessarily know 
about?)

====

Section 2.1: [[URI producers should be conservative about the number of 
different URIs they produce for the same resource. For example, the 
parties responsible for weather.example.com should not use both 
"http://weather.example.com/Oaxaca" and 
"http://weather.example.com/oaxaca" to refer to the same resource; 
agents will not detect the equivalence relationship by following 
specifications. On the other hand, there may be good reasons for 
creating similar-looking URIs. For instance, one might reasonably create 
URIs that begin with "http://www.example.com/tempo" and 
"http://www.example.com/tiempo" to provide access to resources by users 
who speak Italian and Spanish.]]

Why does the first sentence refer to "URI producers" that "produce" URIs 
rather than "resource owners" that "create" them (which would be more 
consistent with earlier text).  I also note that words "assign", 
"create", and "produce" (and possibly others) are all used for what 
seems to be the same idea.

Also, the rest of this illustration seems to have a funny interaction 
with the URI opacity principle in Section 2.5 (especially the discussion 
there about the travel example), since the Section 2.1 text above seems 
to suggest there is value in being able to convey information to an 
accessing "agent" (a human in this case) via the form of the URI itself 
(i.e., if URIs are to be totally opaque to the "agent", why would there 
be value in using one language over another?).  Of course, this may be 
just another problem in allowing "agent" to refer to people.  However, 
the problem seems somewhat more acute if the result of dereferencing 
URIs in different languages is the retrieval of the report in the 
corresponding languages because, while this kind of makes sense, it also 
invites determining the language of the report from the language of the URI.

====

Section 2.2:  [[The requirement for URIs to be unambiguous demands that 
different agents do not assign the same URI...]]

Now we have *agents* assigning URIs rather than, e.g., resource or URI 
owners.  It's not clear that this is consistent with prior discussion.

====

[[The concept of URI ownership is especially visible in the case of the 
HTTP protocol, which enables the URI owner to serve authoritative 
representations of a resource.]]

This text is pertinent to the point raised earlier about resource vs. 
URI ownership, and might be expanded on a bit to clarify that 
relationship.  In particular, when dealing with URIs that have 
retrievable representations, it is straightforward to demonstrate 
ownership;  non-owners can't determine what is returned when 
dereferencing such URIs, while owners can.


====

Section 2.3:  [[URI ambiguity should not be confused with ambiguity in 
natural language. The English statement "'http://www.example.com/moby' 
identifies 'Moby Dick'" is ambiguous because one could understand the 
phrase "Moby Dick" to refer to distinct resources: a particular printing 
of this work, or the work itself in an abstract sense, or the fictional 
white whale, or a particular copy of the book on the shelves of a 
library (via the Web interface of the library's online catalog), or the 
record in the library's electronic catalog which contains the metadata 
about the work, or the Gutenberg project's online version]]

This example illustrates an ambiguous natural language statement, but 
it's not clear that it doesn't also illustrate an ambiguous URI, since 
the text doesn't say anything about how example.org, or other parties 
citing http://www.example.com/moby, actually intepret it.

====

Section 2.3.1:  [[In Web architecture, URIs identify resources. Outside 
the bounds of Web architecture specifications, URIs can be useful for 
other purposes, for example, as database keys...]]

It seems to me this paragraph mixes a few things.  Just because a URI is 
used as a database key doesn't necessarily mean it's being used for a 
different purpose.  If a URI is used as a key in a relational table that 
associates metadata with the Web resources identified by those keys, and 
does so correctly (i.e., distinguishes between metadata about Nadia and 
metadata about her mailbox), it seems as if this is the *same* use of 
the URI (to identify a Web resource), even though it may also be used in 
the database to identify a distinct row in the table.  Moreover, the 
database might exhibit URI ambiguity in the same way the Web might, 
e.g., by mixing metadata about both Nadia and her mailbox in the same 
row.  At the same time, the use of "mailto:nadia@example.com" as an 
identifier for Nadia rather than her mailbox seems just as likely to 
occur in a Web context as in this database one (people seem to want to 
do it in RDF, for example;  or is this not the part of the Web you're 
talking about?).

Also, in the same paragraph:  [[URI ambiguity arises a URI is used to 
identify two different Web resources.]]

URI ambiguity arises *when* a URI...

====

Section 2.4:  [[Because of these costs, if a URI scheme exists that 
meets the needs of an application, designers should use it rather than 
invent one.]]

This sentence refers to "designers", but the "Good Practice" point below 
refers to "authors of specifications".  Shouldn't the same terms be used 
in both places?

====

Section 2.5:  [[It is tempting to guess the nature of a resource by 
inspection of a URI that identifies it. However, the Web is designed so 
that agents communicate resource state through representations, not 
identifiers.]]

This is another place where including people in the definition of 
"agents" seems to create a possible difficulty.  If agents include 
people, then people quite frequently communicate information about the 
nature of a resource by inspection of URIs, and it's very helpful.  For 
example, "http://weather.example.com/oaxaca" certainly suggests that the 
resource it identifies has something to do with the weather in oaxaca 
(as is noted further on), and that's very useful information (e.g., when 
people pass those URIs around).  That's certainly information about "the 
nature of a resource", and Internet Media Types aren't the only things 
relevant to people.  This all, of course, reads much better if "agents" 
are restricted to software.  Pursuing this point in the subsequent text:

[[Agents making use of URIs MUST NOT attempt to infer properties of the 
referenced resource except as licensed by relevant specifications.]]

This is good practice for software "agents".  For people "agents", given 
the "must not", how do you propose to stop them?

Further to this point, the text goes on:

[[The example URI used in the travel scenario 
("http://weather.example.com/oaxaca") suggests that the identified 
resource has something to do with the weather in Oaxaca. A site 
reporting the weather in Oaxaca could just as easily be identified by 
the URI "http://vjc.example.com/315". And the URI 
"http://weather.example.com/vancouver" might identify the resource "my 
photo album."]]

This is certainly true.  But while it's good practice for software to 
treat URIs opaquely, it seems to me that given the discussion in Section 
2.1, which seems to license creating "descriptive" URIs in different 
languages to enable people speaking those languages to more easily 
access a resource (and which reflects the use of text in URIs as a means 
for conveying information to people), you might want to suggest that, 
given this "dual purpose" of URIs, it's *not* good practice to use the 
URI "http://weather.example.com/vancouver" to identify the resource "my 
photo album", even though one could, and it would be irrelevant to software.


====

Section 2.7.2:  Reference [RDF10] cites the RDF M&S Recommendation, 
rather than the new Recommendation set, and should be updated (the OWL 
reference should be updated as well).  If *one* of the new RDF documents 
is to be cited, I would suggest Concepts.

====

Section 3:  First sentence:
[[Communication between agents over a network about resources...]]

Too many qualifying phrases here (e.g., "about resources" presumably 
qualifies "Communication", rather than "a network", but it's kind of 
distant...)

====

Section 3.3.1:  [[Note that one can use a URI with a fragment identifier 
even if one does not have a representation available for interpreting 
the fragment identifier (one can compare two such URIs, for example). 
Parties that draw conclusions about the interpretation of a fragment 
identifier without retrieving a representation do so at their own risk; 
such interpretations are not authoritative.]]

This is a place where some qualifying context about the nature of the 
Web to which this architecture applies would have been helpful.  For 
example, suppose I have a collection of RDF or OWL statements having as 
subjects the URI "http://www.example.com/images/nadia#hat", and the 
RDF/OWL statements assert that the subject is of class "Hat" in some 
ontology, that it's blue, and so on.  On one hand, it seems as if one 
could reasonably draw conclusions about the interpretation of this 
fragment identifier (or rather the whole URI including it) *from the 
RDF/OWL* without dereferencing the URI (using the URI to retrieve a 
representation, whose media type specifies the authoritative 
interpretation), assuming that the RDF/OWL itself is from a sufficiently 
"authoritative" (in some sense) representation somewhere.  Saying "such 
intepretations are not authoritative" without any further qualification 
or discussion, while it makes perfect sense given the way the Web works 
now, doesn't seem to take such additional usage (which, after all, is 
described in W3C Recommendations) into account.


====

Section 3.3.2:  It would be helpful if the text following the "story" 
explicitly answered the question posed in the "story".  For example, if 
the idea here is that the fragment should always identify Nadia's hat in 
any graphic representation provided by dereferencing the URI containing 
the fragment, it would help to say that.

====

Section 3.4: First para:  [[To give these parties the confidence that 
they are all talking about the same thing when they refer to "the 
resource identified by the following URI ..." the design choice for the 
Web is, in general, that the owner of a resource assigns the 
authoritative interpretation of representations of the resource.]]

The text "owner of a resource" links to Section 2.2 titled "URI 
Ownership".  So why say "owner of a resource" rather than "owner of a 
URI"?  Also, Section 3.3 just got through telling us that if the URI 
contains a fragment identifier, then the Internet Media Type of the 
retrieved representation specifies the authoritative interpretation of 
the fragment identifier.  I realize that in one case it's the 
authoritative interpretation *of the fragment* and in the other its the 
authoritative interpretation *of representations of the resource*, but 
the use of "authoritative interpretation" in both places (particularly 
when they're so close together) seems potentially confusing.

====

Section 3.4.1: [[User agents should detect such inconsistencies but 
should not resolve them without involving the user.]]

Now the term is "user agent" rather than "agents".  Is there some 
particular reason for distinguishing between these terms?

====

Section 3.5:  [[Nadia's retrieval of weather information (an example of 
a read-only query or lookup) qualifies as a "safe" interaction; a safe 
interaction is one where the agent does not incur any obligation beyond 
the interaction. An agent may incur an obligation through other means 
(such as by signing a contract). If an agent does not have an obligation 
before a safe interaction, it does not have that obligation afterwards]]

Here, "agent" is used in a sense where it might well be a person 
("signing a contract").  Can software agents "incur obligations" in the 
sense used here?

====

[[Other Web interactions resemble orders more than queries.]]

Is this "orders" in the sense of "placing an order", "that's an order, 
soldier", or both?

====

[[Principle:  Safe retrieval
Agents do not incur obligations by retrieving a representation.]]

Shouldn't this be "should not" rather than "do not"?  "Do not" suggests 
that it doesn't happen, rather than that it's incorrect if it does (as 
described in the next sentence).

====

Section 3.6:  Following the story appears:
[[The usefulness of a resource depends on good management by its owner. 
As is the case with many human interactions, confident interactions with 
a resource depend on stability and predictability. The value of a URI 
increases with the predictability of interactions using that URI. 
Avoiding unnecessary URI aliases is one aspect of proper resource 
management.]]

While the last sentence above certainly seems true, it's not clear what 
it has to do with the story, since the sentence refers to URI aliases, 
but the story describes multiple uses of the *same* URI returning wildly 
different results.  Is this supposed to refer to "URI ambiguity" 
instead?  It's also not clear that the problem illustrated in the story 
is necessarily "inconsistent representation".  Given the diagram in 
Section 1, which distinguishes the URI from the resource it identifies, 
it seems possible to distinguish (at least conceptually) between:

(a) returning pages about weather information and auto insurance as 
inconsistent representations of the same resource, and

(b) returning the same two pages as indicating that the URI owner has 
changed the resource the URI identifies

====

Section 3.6.3:  It might help clarify the point made in this section if 
some examples of mistaken attempts to restrict the use of URIs were 
given, rather than just the building security analogy.  Also, it's not 
clear whether or not the principle described here (and the further 
discussion in the "Deep Linking" finding) deals with all possible 
situations of this sort.  For example, it certainly used to be the case 
(and may be the case now) that US Defense Department documents could not 
only have a security classification, but their *titles* might also have 
a security classification (that is, the *existence* of the document was 
classified).  A classified document with an unclassified title could be 
referenced in the usual way, but a reader without the necessary 
clearance would be unable to access the referenced document (this would 
correspond to the situations already described).  On the other hand, 
classifying the title of the document would prevent the reader from even 
seeing the reference without the necessary clearance.  How would you 
suggest handling this situation (admittedly, opagueness of URIs would help!)

====

Section 4:  First para [[The first data format used on the Web was HTML. 
Since then, data formats have grown in number. ]]

The second sentence should probably say "...*Web* data formats..." 
(since data formats in general were growing well before HTML).

====

Second para: [[ Below we describe some characteristics of a data format 
make it easier to integrate into the Web architecture. ]]

"...*that* make it easier..."

====

Section 4.2.1:  [[There is typically a (long) transition period during 
which multiple versions of a format, protocol, or agent are 
simultaneously in use.]]

This is another passage that reads strangely if "agent" is considered to 
include people (I know I'm not the same person I once was, but multiple 
versions simultaneously in use?!).

====

[[Good practice: Version information
Format designers SHOULD provide for version information in language 
instances]]

What are "language instances"?

====

Section 4.2.2:  [[The policy sets expectations that the Working Group 
responsible for the namespace may modify it in any way until a certain 
point in the process ("Candidate Recommendation") at which point W3C 
constrains the set possible changes to the namespace in order to promote 
stable implementations.]]

"...constrains the set *of* possible changes..."

====

Section 4.2.3:  [[As part of defining an extensibility mechanism, a 
specification should set expectations about agent behavior in the face 
of unrecognized extensions.]]

The following good practice then says
[[Language designers SHOULD specify agent behavior in the face of 
unrecognized extensions.]]

It's not clear that a specification "setting expectations about" agent 
behavior is the same as it "specifying" it.  Why the difference in wording?

====

Section 4.2.4:  Third bullet [[    * RDF allows well-defined mixing of 
vocabularies, and allows text and XML to be used as a data type values 
within a statement having clearly defined semantics.
]]

"...allows text and XML to be used as data type values..." (delete the "a")?

Within the same statement?  What does "having clearly defined semantics" 
modify?  Should this be "...within statements having clearly defined 
semantics"?

====

Section 4.5.2:  [[XLink allows links to have multiple ends and to be 
expressed either inline or in "link bases" stored external to any or all 
of the resources identified by the links it contains.]]

This sentence seems a tad complicated.

====

Section 4.5.3:  [[How do the application designers ensure that there are 
no naming conflicts when they combine elements from different formats 
(for example, suppose that the "p" element is defined in two or more XML 
formats)?  "Namespaces in XML" [XMLNS] provides a mechanism for 
establishing a globally unique name that can be understood in any context.]]

I'd suggest rewriting this to avoid the rhetorical question.

Also, this may or may not be the right place to cite this example, but 
the way the XML Schema data types define a namespace for those types, 
allowing them to be used in RDF and OWL, might also be cited.  (This 
example illustrates the need sometimes to explicitly describe how URIs 
identifying language elements should be constructed using those 
namespace names).

====

Section 4.5.5:  The link for "rdfmsQnameUriMapping-6" needs a fragment 
to identify the specific issue.

====
Received on Wednesday, 10 March 2004 17:18:12 UTC