URIEquivalence-15: Identifier Scope and Resource Identity from Dan Connolly on 2002-09-06 (www-tag@w3.org from September 2002)

From: Dan Connolly <connolly@w3.org>
Date: 06 Sep 2002 16:16:26 -0500
To: www-tag@w3.org
Message-Id: <1031346987.21110.65.camel@dirk>
"Action DC: Redraft 2.2.1."
 -- http://www.w3.org/2002/08/30-tag-summary

I'm not crazy about the organization of 2.2;
I needed the principle of unambiguity established
before I could treat the issue of URIEquivalence.
This is designed to follow the current 2.1; it replaces
all of what's currently in 2.2.1 and steals
a little bit from what's in 2.2.2. I'm not entirely sure
what the impact on stuff after 2.2.1 is.

This uses "absolute URI reference",
per the department of redundancy department...


=========

2.2 Identifier Scope and Resource Identity

While local naming is in some ways less constraining, Web
architecture adopts global naming in order to simplify
the problem of determining that two parties[@@messages? documents?]
refer to the same resource:

  Absolute URI references are unambiguous: Each absolute URI
  reference unambiguously identifies one resource.

So if two parties use the same absolute URI reference, they refer to
the same resource.

 Note that they may use different relative abbreviations of the same
 absolute URI reference; for example, <tt>doc1</tt> and <tt>./doc1</tt>
 abbreviate the same absolute URI reference. If those references
 appear in the document <tt>http://example/a/b</tt>, and another
 document uses the reference <tt>http://example/a/doc1</tt>, all
 three references identify the same resource.
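
The resolution of relative references in the note above can be
checked mechanically. The following is a minimal sketch using
Python's standard urllib.parse.urljoin; the helper name resolve and
the constant BASE are made up for this illustration and are not part
of the draft.

```python
from urllib.parse import urljoin

# Base URI of the document in which the relative references appear.
BASE = "http://example/a/b"


def resolve(reference: str, base: str = BASE) -> str:
    """Expand a (possibly relative) reference against the base URI."""
    return urljoin(base, reference)


if __name__ == "__main__":
    for reference in ("doc1", "./doc1", "http://example/a/doc1"):
        print(f"{reference!r:28} -> {resolve(reference)}")
```

All three references expand to the same string, so a plain string
comparison of the expanded forms is enough to see that they name the
same resource.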

This does not mean that parties that use different absolute URI
references necessarily refer to different resources.  Web architecture
does not constrain resources to be uniquely named. The problem of
determining whether two different absolute URI references refer to the
same resource or not is, in the general case, arbitrarily hard.

Fortunately, the problem does not need a complete or ubiquitously
deployed solution in order for the Web to operate usefully. Approaches
to the problem include avoiding the problem, formal approaches,
and heuristic approaches:

  - One way to avoid the problem is to ignore the issue of resource
  identity altogether, and simply compare the identifiers
  themselves. In XPath, two QNames match if and only if the namespace
  names, as well as the local names, are identical.  The namespace
  name in xmlns="http://WWW.EXAMPLE/" is no more identical to the
  namespace name in xmlns="http://www.example/" or to the namespace
  name in xmlns="http://WWW.EX%41MPLE/" than it is to the namespace
  name in xmlns="mid:xyz@example".  Absolute URI references are
  strings; if any of the characters in the strings are different, the
  identifiers are different [@@cite charmod?].

  - HTTP avoids the problem more subtly, by not attempting to answer
  it: the HTTP protocol does not specify any way for a client to
  determine that two different absolute URI references identify the
  same resource, and it provides only a few cases[ftH] in which a
  client can determine that two different absolute URI references
  identify different resources.

  - Emerging Semantic Web technologies, including DAML+OIL[@@cite] and
  OWL[@@cite], define RDF properties such as equivalentTo and
  FunctionalProperty to state -- or at least claim -- formally
  that two absolute URI references identify the same
  resource. Whether such claims are to be trusted is a matter
  of local policy.

  - To decide whether to indicate to the user that a resource
  identified in a link is one that the user has already visited, web
  browsers use some inexpensive formal techniques and, since the cost
  of a false positive is fairly small, some inexpensive and
  fairly reliable heuristics. The HTTP specification formally
  specifies that host names compare case-insensitively and that an
  omitted port is equivalent to the default port 80, so
  <tt>http://EXAMPLE/</tt>, <tt>http://example/</tt>, and
  <tt>http://example:80/</tt> all identify the same resource.
  Informally, due to some widely deployed HTTP server
  filesystems and configuration defaults, it's quite likely that
  <tt>http://example/x</tt> and <tt>http://example/x/</tt> and perhaps
  even <tt>http://example/x/index.html</tt> identify the same
  resource, or at least that they identify resources that are
  indistinguishable by the user.
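
Two of the approaches in the list above can be put side by side in
code: comparing the identifiers as plain strings, and applying the
HTTP comparison rules (case-insensitive host, default port 80)
before comparing. This is only a sketch in Python. The function
names are invented for this illustration, and the normalizer
assumes simple http URIs without userinfo or IPv6 literals; it makes
no attempt at the informal trailing-slash or index.html heuristics.

```python
from urllib.parse import urlsplit, urlunsplit


def same_identifier(a: str, b: str) -> bool:
    """Character-by-character comparison, as for namespace names."""
    return a == b


def normalize_http(uri: str) -> str:
    """Apply the HTTP comparison rules to an http URI.

    Lower-cases the host, drops an explicit port 80, and treats an
    empty path as "/" (also an HTTP comparison rule). URIs in other
    schemes are returned unchanged. Assumes no userinfo and no IPv6
    literal in the authority.
    """
    parts = urlsplit(uri)
    if parts.scheme != "http":  # urlsplit already lower-cases the scheme
        return uri
    host = parts.hostname or ""  # hostname is returned lower-cased
    netloc = host if parts.port in (None, 80) else f"{host}:{parts.port}"
    path = parts.path or "/"
    return urlunsplit(("http", netloc, path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(same_identifier("http://WWW.EXAMPLE/", "http://www.example/"))
    for uri in ("http://EXAMPLE/", "http://example:80/", "http://example/x/"):
        print(uri, "->", normalize_http(uri))
```

Under the string rule the upper-case and lower-case namespace names
are simply different identifiers; under the HTTP rules the first two
URIs in the loop come out as the same string, while
http://example/x and http://example/x/ stay distinct because their
equivalence is only a heuristic.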

"Be conservative in what you produce and liberal in what you
consume" is a maxim that applies, if somewhat counterintuitively,
to the problem of resource identification:

  - information providers should be conservative by maximizing the
  consistency of identifiers used to refer to any given resource, and
  by ensuring sufficient difference between identifiers used for
  different resources:

  Good practice note: avoid aliases: (1) if you want to refer to a
  resource and you know an absolute URI reference refers to it, you
  SHOULD use that same absolute URI reference. (2) Algorithms for
  computing absolute URI references from parameters should be
  deterministic. In particular, use lower-case letters when using %xx
  escaping to encode filenames, form fields, etc. in URIs.

  Good practice note: when assigning absolute URI references to
  resources, do not rely on case to distinguish
  resources; that is, do not assign http://example/myStuff and
  http://example/MyStuff to distinct resources. While absolute URI
  references are specified to be case sensitive, many filesystems and
  other underlying technologies are not; hence they will not support
  the distinction.

  - consumers should be liberal, allowing information providers
  maximum freedom in naming resources. Even though producers
  SHOULD NOT use MyStuff and myStuff to identify different
  resources, they MAY, and clients that assume they refer
  to the same resource do so at their own risk.
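
The two good practice notes above lend themselves to small checks on
the producer side. The sketch below, in Python, shows a
deterministic escaper that always emits lower-case hex digits, and a
check for identifiers that differ only by case. Both function names
are invented here. The set of characters left unescaped and the use
of UTF-8 for non-ASCII text are assumptions of this sketch, not
something the draft specifies.

```python
# Characters left unescaped (an assumption for this sketch).
SAFE = frozenset(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789-_.~"
)


def escape_segment(text: str) -> str:
    """Escape one path segment, always using lower-case %xx digits.

    The text is encoded as UTF-8 first (an assumption), so the same
    input always yields the same output string.
    """
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        out.append(ch if ch in SAFE else f"%{byte:02x}")
    return "".join(out)


def case_collisions(uris):
    """Return pairs of URIs that differ only in letter case."""
    seen = {}
    clashes = []
    for uri in uris:
        key = uri.lower()
        if key in seen and seen[key] != uri:
            clashes.append((seen[key], uri))
        else:
            seen.setdefault(key, uri)
    return clashes


if __name__ == "__main__":
    print("http://example/" + escape_segment("my file.txt"))
    print(case_collisions(["http://example/myStuff", "http://example/MyStuff"]))
```

A provider could run something like case_collisions over the
identifiers it is about to publish; a consumer, per the second
bullet, should not apply the same folding when comparing.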



--- optional endnote

[ftH] Consider:
  1. C->S: GET /X
     S->C: 200 OK
           Last-Modified: 3p

  2. C->S: GET /Y
     S->C: 200 OK
           Last-Modified: 4p

  3. C->S: GET /X
           If-Modified-Since: 3p
     S->C: 200 OK
           Last-Modified: 5p

  4. C->S: GET /Y
           If-Modified-Since: 4p
     S->C: 304 Not Modified

After the 4th transaction, the client knows that /X refers to a
resource that was modified at 5pm, but /Y refers to a resource that
was not modified since 4pm; so they cannot refer to the same resource.
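
The inference in this endnote can be written down as a small
predicate. The following Python sketch is only an illustration: the
Exchange record and the function name are invented here, clock times
are simplified to integer hours (3p = 15, and so on), and it assumes
both requests went to the same server, that the server answers
consistently, and that the earlier exchange was observed before the
later one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Exchange:
    """One request/response pair, with times as integer hours."""
    path: str
    status: int
    if_modified_since: Optional[int] = None  # sent by the client, if any
    last_modified: Optional[int] = None      # returned by the server, if any


def provably_distinct(earlier: Exchange, later: Exchange) -> bool:
    """True if the two paths cannot name the same resource.

    If `earlier` reported a modification time T, and a later
    conditional request on another path got 304 for a date before T,
    a single resource could not have produced both answers.
    """
    return (
        earlier.last_modified is not None
        and later.status == 304
        and later.if_modified_since is not None
        and earlier.last_modified > later.if_modified_since
    )


if __name__ == "__main__":
    x3 = Exchange("/X", 200, if_modified_since=15, last_modified=17)
    y4 = Exchange("/Y", 304, if_modified_since=16)
    print(provably_distinct(x3, y4))
```

With transactions 3 and 4 from the endnote the predicate holds. With
transaction 1 in place of 3 (Last-Modified: 3p) it does not, since a
resource last modified at 3p could well answer 304 to
If-Modified-Since: 4p.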

=========


It's probably longer than it needs to be; I've pretty much
exhausted my ability to work on it for the day, however.

I think perhaps it addresses
  http://www.w3.org/2001/tag/ilist#URIEquivalence-15
except that it could use an explicit example of
escaped non-ascii stuff, ala Andre

http://www.w3.org/2000/10/rdf-tests/rdfcore/rdf-charmod-uris/test001.rdf
cited from
http://www.w3.org/TR/rdf-concepts/#section-Graph-URIref

I wonder if we should deprecate/forbid the use
of %xx-escaped stuff that has an unescaped analog...
i.e. not go as far as saying that http://example/A%42C is
the same identifier as http://example/ABC , but to say
that the former is broken/deprecated/illegal, and
suggest that it's sorta reasonable for clients to
(a) halt and catch fire, or (b) notify the user and
change it into the latter as a means of error recovery.
The case of B vs %42 isn't as interesting as the
case of non-ascii characters, but I don't keep a
UTF-8 encoder in my head. The case of space, <, >,
and " are also kinda interesting.
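
For what it's worth, the check contemplated above is easy to sketch
in Python. The function names are invented for this illustration,
and the set of characters treated as having an unescaped analog
(ASCII letters, digits, and - . _ ~) is an assumption of the sketch.
The standard urllib.parse.quote is used only to show what the
escaped forms of a non-ASCII character (as UTF-8) and of space, <, >,
and " look like.

```python
import re
import string
from urllib.parse import quote

# Characters assumed to have a plain, unescaped analog.
PLAIN = set(string.ascii_letters + string.digits + "-._~")
ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def needless_escapes(uri: str) -> list:
    """List the %xx escapes in uri that stand for a PLAIN character."""
    return [
        m.group(0)
        for m in ESCAPE.finditer(uri)
        if chr(int(m.group(1), 16)) in PLAIN
    ]


def unescape_needless(uri: str) -> str:
    """Error recovery (b): rewrite needless escapes, keep the rest."""
    def repl(m):
        ch = chr(int(m.group(1), 16))
        return ch if ch in PLAIN else m.group(0)
    return ESCAPE.sub(repl, uri)


if __name__ == "__main__":
    print(needless_escapes("http://example/A%42C"))
    print(unescape_needless("http://example/A%42C"))
    print(quote("\u00e9"))   # e-acute, escaped as its UTF-8 bytes
    print(quote(' <>"'))     # space, <, >, and "
```

A client taking option (a) would refuse any URI for which
needless_escapes returns a non-empty list; one taking option (b)
would notify the user and continue with the output of
unescape_needless.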


-- 
Dan Connolly, W3C http://www.w3.org/People/Connolly/
Received on Friday, 6 September 2002 17:16:26 UTC