- From: Dan Connolly <connolly@w3.org>
- Date: 06 Sep 2002 16:16:26 -0500
- To: www-tag@w3.org
"Action DC: Redraft 2.2.1."
-- http://www.w3.org/2002/08/30-tag-summary
I'm not crazy about the organization of 2.2;
I needed the principle of unambiguity established
before I could treat the issue of URIEquivalence.
This is designed to follow the current 2.1; it replaces
all of what's currently in 2.2.1 and steals
a little bit from what's in 2.2.2. I'm not entirely sure
what the impact on stuff after 2.2.1 is.
This uses "absolute URI reference",
per the department of redundancy department...
=========
2.2 Identifier Scope and Resource Identity
While local naming is in some ways less constraining, Web
architecture adopts global naming in order to simplify
the problem of determining that two parties[@@messages? documents?]
refer to the same resource:
Absolute URI references are unambiguous: Each absolute URI
reference unambiguously identifies one resource.
So if two parties use the same absolute URI reference, they refer to
the same resource.
Note that they may use different relative abbreviations of the same
absolute URI reference; for example, <tt>doc1</tt> and <tt>./doc1</tt>
abbreviate the same absolute URI reference; if those references come
from <tt>http://example/a/b</tt>, and another document uses the
reference <tt>http://example/a/doc1</tt>, all three references
identify the same resource.
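
A minimal sketch of the resolution step described above, assuming Python's
standard urllib.parse.urljoin as the resolver; the helper name absolutize
and the constant BASE are made up for illustration:

```python
from urllib.parse import urljoin

# Base document from the example in the text.
BASE = "http://example/a/b"

def absolutize(reference, base=BASE):
    """Resolve a (possibly relative) URI reference against a base URI."""
    return urljoin(base, reference)

# Two relative abbreviations and one absolute reference.
references = ["doc1", "./doc1", "http://example/a/doc1"]

# Collect the distinct absolute forms; the set collapses duplicates.
resolved = {absolutize(r) for r in references}
```

If the three references really do abbreviate one absolute URI reference,
resolved ends up holding a single string.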
This does not mean that parties that use different absolute URI
references necessarily refer to different resources. Web architecture
does not constrain resources to be uniquely named. The problem of
determining whether two different absolute URI references refer to the
same resource or not is, in the general case, arbitrarily hard.
Fortunately, the problem does not need a complete or ubiquitously
deployed solution in order for the Web to operate usefully. Approaches
to the problem include avoiding the problem, formal approaches,
and heuristic approaches:
- One way to avoid the problem is to ignore the issue of resource
identity altogether, and simply compare the identifiers
themselves. In XPath, two QNames match if and only if the namespace
names, as well as the local names, are identical. The namespace
name in xmlns="http://WWW.EXAMPLE/" is no more identical to the
namespace name in xmlns="http://www.example/" or to the namespace
name in xmlns="http://WWW.EX%41MPLE/" than it is to the namespace
name in xmlns="mid:xyz@example". Absolute URI references are
strings; if any of the characters in the strings are different, the
identifiers are different [@@cite charmod?].
- HTTP avoids the problem more subtly, by not attempting to answer
it: the HTTP protocol does not specify any way for a client to
determine that two different absolute URI references identify the
same resource, and it provides only a few cases[ftH] in which a
client can determine that two different absolute URI references
identify different resources.
- Emerging Semantic Web technologies, including DAML+OIL[@@cite] and
OWL[@@cite], define RDF properties such as equivalentTo and
FunctionalProperty to state -- or at least claim --
formally, that two absolute URI references identify the same
resource. Whether such claims are to be trusted is a matter
of local policy.
- To decide whether to indicate to the user that a resource
identified in a link is one that the user has already visited, web
browsers use some inexpensive formal techniques and, since the cost
of a false positive is fairly small, they use some inexpensive and
fairly reliable heuristics. The HTTP specification formally
specifies that case insensitivity of domain names carries through:
<tt>http://EXAMPLE/</tt> identifies the same resource that
<tt>http://example/</tt> and <tt>http://example:80/</tt>
identify. Informally, due to some widely deployed HTTP server
filesystems and configuration defaults, it's quite likely that
<tt>http://example/x</tt> and <tt>http://example/x/</tt> and perhaps
even <tt>http://example/x/index.html</tt> identify the same
resource, or at least: that they identify resources that are
indistinguishable by the user.
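
To make two of the approaches above concrete, here is a rough sketch that
contrasts plain identifier comparison (as for namespace names) with the
cheap formal checks a browser might apply to http URIs. The function names
are hypothetical, and the second function covers only the host-case and
default-port rules mentioned above; it deliberately does not guess about
trailing slashes or index.html.

```python
from urllib.parse import urlsplit

def same_identifier(a: str, b: str) -> bool:
    """Compare identifiers as strings, character for character."""
    return a == b

def likely_same_http_resource(a: str, b: str) -> bool:
    """Apply host-case and default-port equivalence for http URIs only."""
    pa, pb = urlsplit(a), urlsplit(b)
    if pa.scheme != "http" or pb.scheme != "http":
        # Outside http, fall back to plain string comparison.
        return a == b

    def key(p):
        # urlsplit lower-cases hostname; treat a missing port as 80
        # and an empty path as "/".
        return (p.hostname, p.port or 80, p.path or "/", p.query, p.fragment)

    return key(pa) == key(pb)
```

For example, same_identifier("http://WWW.EXAMPLE/", "http://www.example/")
is false, while likely_same_http_resource("http://example/",
"http://example:80/") is true. Note that the %41 in http://WWW.EX%41MPLE/
is left alone here, so that URI still compares as different.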
"Be conservative in what you produce and liberal in what you
consume" is a maxim that applies, if somewhat counterintuitively,
to the problem of resource identification:
- information providers should be conservative by maximizing the
consistency of identifiers used to refer to any given resource, and
by ensuring sufficient difference between identifiers used for
different resources:
Good practice note: avoid aliases: (1) if you want to refer to a
resource and you know an absolute URI reference refers to it, you
SHOULD use the same absolute URI reference. (2) Algorithms for
computing absolute URI references from parameters should be
deterministic. In particular, use lower-case letters when using %xx
escaping to encode filenames, form fields, etc. in URIs.
Good practice note: when assigning absolute URI references to
resources, do not rely on case to distinguish
resources; that is: do not assign http://example/myStuff and
http://example/MyStuff to distinct resources; while absolute URI
references are specified to be case sensitive, many filesystems and
other underlying technologies are not; hence they will not support
the distinction.
- consumers should be liberal, allowing information providers
maximum freedom in naming resources. Even though producers
SHOULD NOT use MyStuff and myStuff to identify different
resources, they MAY, and clients that assume they refer
to the same resource do so at their own risk.
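
The "deterministic algorithm" advice for producers can be sketched as
follows. This is only an illustration: escape_segment, uri_for, and the
UNRESERVED set are made-up names, the set of characters left unescaped is
a conservative assumption rather than a quotation from the URI spec, and
the lower-case hex digits follow the good practice note above.

```python
# Characters copied through unescaped (an assumed, conservative set).
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

def escape_segment(text: str) -> str:
    """Escape one path segment: UTF-8 encode, then %xx with lower-case hex."""
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if ch in UNRESERVED:
            out.append(ch)
        else:
            out.append("%{:02x}".format(byte))
    return "".join(out)

def uri_for(base: str, filename: str) -> str:
    """Build an absolute URI reference from a base and a filename."""
    return base + escape_segment(filename)
```

The same input always yields the same string, so two producers using this
function will not mint aliases that differ only in escaping; for instance,
uri_for("http://example/", "my file.txt") gives
http://example/my%20file.txt.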
--- optional endnote
[ftH] Consider:
1. C->S: GET /X
S->C: 200 OK
Last-Modified: 3p
2. C->S: GET /Y
S->C: 200 OK
Last-Modified: 4p
3. C->S: GET /X
If-Modified-Since: 3p
S->C: 200 OK
Last-Modified: 5p
4. C->S: GET /Y
If-Modified-Since: 4p
S->C: 304 Not Modified
After the 4th transaction, the client knows that /X refers to a
resource that was modified at 5pm, but /Y refers to a resource that
was not modified since 4pm; so they cannot refer to the same resource.
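
The inference in this endnote can be sketched in code. The Exchange record
and provably_different are hypothetical names, times are simplified to
whole hours (3p = 15), and the sketch assumes the server answers honestly
and that the first exchange happened before the second:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Exchange:
    path: str
    status: int
    last_modified: Optional[int] = None      # hour reported by the server
    if_modified_since: Optional[int] = None  # hour sent by the client

def provably_different(earlier: Exchange, later: Exchange) -> bool:
    """True if the two exchanges cannot concern the same resource.

    earlier must be a 200 carrying Last-Modified; later must be a 304
    answering a conditional GET made afterwards.
    """
    if earlier.status != 200 or later.status != 304:
        return False
    if earlier.last_modified is None or later.if_modified_since is None:
        return False
    # One resource cannot be modified at 5p and unmodified since 4p.
    return earlier.last_modified > later.if_modified_since

# Transactions 3 and 4 from the endnote.
step3 = Exchange("/X", 200, last_modified=17, if_modified_since=15)
step4 = Exchange("/Y", 304, if_modified_since=16)
```

With these two records, provably_different(step3, step4) holds. The
converse tells the client nothing: when the check fails, /X and /Y may or
may not be the same resource.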
=========
It's probably longer than it needs to be; I've pretty much
exhausted my ability to work on it for the day, however.
I think perhaps it addresses
http://www.w3.org/2001/tag/ilist#URIEquivalence-15
except that it could use an explicit example of
escaped non-ascii stuff, ala Andre
http://www.w3.org/2000/10/rdf-tests/rdfcore/rdf-charmod-uris/test001.rdf
cited from
http://www.w3.org/TR/rdf-concepts/#section-Graph-URIref
I wonder if we should deprecate/forbid the use
of %xx-escaped stuff that has an unescaped analog...
i.e. not go as far as saying that http://example/A%42C is
the same identifier as http://example/ABC , but to say
that the former is broken/deprecated/illegal, and
suggest that it's sorta reasonable for clients to
(a) halt and catch fire, or (b) notify the user and
change it into the latter as a means of error recovery.
The case of B vs %42 isn't as interesting as the
case of non-ascii characters, but I don't keep a
UTF-8 encoder in my head. The case of space, <, >,
and " are also kinda interesting.
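
For what option (b) might look like, here is a rough sketch, not a
proposal for normative text. The names unescape_redundant and
has_redundant_escape are invented, and the SAFE set (the characters
treated as having an "unescaped analog") is an assumption; space, <, >,
and " are deliberately left escaped.

```python
import re

# Characters assumed safe to write unescaped (an illustrative choice).
SAFE = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")

def unescape_redundant(uri: str) -> str:
    """Replace %xx escapes of SAFE characters with the characters themselves."""
    def repl(match):
        ch = chr(int(match.group(1), 16))
        # Keep the escape untouched unless it hides a SAFE character.
        return ch if ch in SAFE else match.group(0)
    return _ESCAPE.sub(repl, uri)

def has_redundant_escape(uri: str) -> bool:
    """True if the URI escapes something that need not be escaped."""
    return unescape_redundant(uri) != uri
```

A client could call has_redundant_escape to decide whether to notify the
user, then use unescape_redundant for recovery: it turns
http://example/A%42C into http://example/ABC and leaves
http://example/a%20b as it is.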
--
Dan Connolly, W3C http://www.w3.org/People/Connolly/
Received on Friday, 6 September 2002 17:16:26 UTC