- From: Dan Connolly <connolly@w3.org>
- Date: 06 Sep 2002 16:16:26 -0500
- To: www-tag@w3.org
"Action DC: Redraft 2.2.1." -- http://www.w3.org/2002/08/30-tag-summary I'm not crazy about the organization of 2.2; I needed the principle of unambiguity established before I could treat the issue of URIEquivalence. This is designed to follow the current 2.1; it replaces all of what's currently in 2.2.1 and steals a little bit from what's in 2.2.2. I'm not entirely sure what the impact on stuff after 2.2.1 is. This uses "absolute URI reference", per the department of redundancy department... ========= 2.2 Identifier Scope and Resource Identity While local naming is in some ways less constraining, Web architecture adopts global naming in order to optimize the problem of determining that two parties[@@messages? documents?] refer to the same resource: Absolute URI references are unambiguous: Each absolute URI reference unambiguously identifies one resource. So if two parties use the same absolute URI reference, they refer to the same resource. Note that they may use different relative abbrevations of the same absolute URI referece; for example, <tt>doc1</tt> and <tt>./doc1</tt> abbreviate the same absolute URI reference; if those references come from <tt>http://example/a/b</tt>, and another document uses the reference <tt>http://example/a/doc1</tt>, all three references identify the same resource. This does not mean that parties that use different absolute URI references necessarily refer to different resources. Web architecture does not constrain resources to be uniquely named. The problem of determining whether two different absolute URI references refer to the same resource or not is, in the general case, arbitrarily hard. Fortunately, the problem does not need a complete nor ubiquitously deployed solution in order for the Web to operate usefully. 

Approaches to the problem include avoiding the problem, formal approaches, and heuristic approaches:

- One way to avoid the problem is to ignore the issue of resource identity altogether, and simply compare the identifiers themselves. In XPath, two QNames match if and only if the namespace names, as well as the local names, are identical. The namespace name in xmlns="http://WWW.EXAMPLE/" is no more identical to the namespace name in xmlns="http://www.example/" or to the namespace name in xmlns="http://WWW.EX%41MPLE/" than it is to the namespace name in xmlns="mid:xyz@example". Absolute URI references are strings; if any of the characters in the strings are different, the identifiers are different [@@cite charmod?].

- HTTP avoids the problem more subtly, by not attempting to answer it: the HTTP protocol does not specify any way for a client to determine that two different absolute URI references identify the same resource, and it provides only a few cases[ftH] in which a client can determine that two different absolute URI references identify different resources.

- Emerging Semantic Web technologies, including DAML+OIL[@@cite] and OWL[@@cite], define RDF properties such as equivalentTo, FunctionalProperty, and such to state -- or at least claim -- formally, that two absolute URI references identify the same resource. Whether such claims are to be trusted is a matter of local policy.

- To decide whether to indicate to the user that a resource identified in a link is one that the user has already visited, web browsers use some inexpensive formal techniques and, since the cost of a false positive is fairly small, they use some inexpensive and fairly reliable heuristics.

  The HTTP specification formally specifies that case insensitivity of domain names carries through: <tt>http://EXAMPLE/</tt> identifies the same resource that <tt>http://example/</tt> and <tt>http://example:80/</tt> identify.

  Informally, due to some widely deployed HTTP server filesystems and configuration defaults, it's quite likely that <tt>http://example/x</tt> and <tt>http://example/x/</tt> and perhaps even <tt>http://example/x/index.html</tt> identify the same resource, or at least that they identify resources that are indistinguishable by the user.

"Be conservative in what you produce and liberal in what you consume" is a maxim that applies, if somewhat counterintuitively, to the problem of resource identification:

- information providers should be conservative by maximizing the consistency of identifiers used to refer to any given resource, and by ensuring sufficient difference between identifiers used for different resources:

    Good practice note: avoid aliases: (1) if you want to refer to a resource and you know an absolute URI reference refers to it, you SHOULD use the same absolute URI reference. (2) Algorithms for computing absolute URI references from parameters should be deterministic. In particular, use lower-case letters when using %xx escaping to encode filenames, for fields, etc. in URIs.

    Good practice note: when assigning absolute URI references to resources, do not rely on case to distinguish resources; that is: do not assign http://example/myStuff and http://example/MyStuff to distinct resources; while absolute URI references are specified to be case sensitive, many filesystems and other underlying technologies are not; hence they will not support the distinction.

- consumers should be liberal, allowing information providers maximum freedom in naming resources. Even though producers SHOULD NOT use MyStuff and myStuff to identify different resources, they MAY, and clients that assume they refer to the same resource do so at their own risk.

--- optional endnote

[ftH] Consider:

 1. C->S: GET /X
    S->C: 200 OK
          Last-Modified: 3p

 2. C->S: GET /Y
    S->C: 200 OK
          Last-Modified: 4p

 3. C->S: GET /X
          If-Modified-Since: 3p
    S->C: 200 OK
          Last-Modified: 5p

 4. C->S: GET /Y
          If-Modified-Since: 4p
    S->C: 304 Not Modified

After the 4th transaction, the client knows that /X refers to a resource that was modified at 5pm, but /Y refers to a resource that was not modified since 4pm; so they cannot refer to the same resource.

=========

It's probably longer than it needs to be; I've pretty much exhausted my ability to work on it for the day, however.

I think perhaps it addresses
  http://www.w3.org/2001/tag/ilist#URIEquivalence-15
except that it could use an explicit example of escaped non-ascii stuff, ala Andre
  http://www.w3.org/2000/10/rdf-tests/rdfcore/rdf-charmod-uris/test001.rdf
cited from
  http://www.w3.org/TR/rdf-concepts/#section-Graph-URIref

I wonder if we should deprecate/forbid the use of %xx-escaped stuff that has an unescaped analog... i.e. not go as far as saying that http://example/A%42C is the same identifier as http://example/ABC , but to say that the former is broken/deprecated/illegal, and suggest that it's sorta reasonable for clients to (a) halt and catch fire, or (b) notify the user and change it into the latter as a means of error recovery.

The case of B vs %42 isn't as interesting as the case of non-ascii characters, but I don't keep a UTF-8 encoder in my head. The cases of space, <, >, and " are also kinda interesting.

-- 
Dan Connolly, W3C http://www.w3.org/People/Connolly/
Received on Friday, 6 September 2002 17:16:26 UTC