- From: Larry Masinter <LMM@acm.org>
- Date: Mon, 2 Dec 2002 08:52:12 -0800
- To: <tbray@textuality.com>
- Cc: <www-tag@w3.org>
I think at the root of many of the difficulties in this discussion is what seems like an assumption that there is (or should be) only one notion of "equivalence" for URIs. But among the set of all possible strings (sequences of characters), there are many different equivalence relationships.

The base equivalence relationship is character-by-character: "string A is equivalent to string B if they consist of exactly the same characters, compared character by character". But it is easy to construct other equivalence relationships. Every mapping F from sequences of characters into some other space defines an equivalence relationship F=, where A F= B if F(A) = F(B).

To name two important such mappings: the "normalization" mappings (the various modes of Unicode normalization) each create an equivalence relationship, where all sequences that normalize to the same string are equivalent. The mapping HE, which transforms an arbitrary string by hex-encoding all characters normally disallowed in URIs (space, and anything outside the 7-bit ASCII repertoire), creates another equivalence relationship: IRI1 HE= IRI2 if HE(IRI1) = HE(IRI2).

Every application that requires a notion of equivalence needs to specify which equivalence relationship it uses. The application of "maintaining a web cache" can use an aggressive equivalence relationship based on its knowledge of default ports and the case independence of host names and scheme names. But the application of XML parsers comparing namespace names should use the computationally simpler and more stable relationship of "character-by-character" equivalence.

You can't get rid of multiple equivalence relationships or mandate them out of existence, although it might help to minimize the number of different equivalence relationships in use.
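
A minimal sketch in Python of the idea that each mapping F induces its own relationship F=. The helper names (equivalent_under, identity, nfc, he, cache_key) and the example URIs are illustrative assumptions, not anything defined in a specification. In particular, he only approximates the HE mapping by using urllib.parse.quote, and cache_key is a simplified stand-in for what a cache might do (it ignores user info and IPv6 literals).

```python
import unicodedata
from urllib.parse import quote, urlsplit, urlunsplit


def equivalent_under(f):
    """Return the relation F= induced by mapping f: A F= B iff f(A) == f(B)."""
    return lambda a, b: f(a) == f(b)


def identity(s):
    # Character-by-character equivalence is the relation induced by identity.
    return s


def nfc(s):
    # One of the Unicode normalization mappings (NFC chosen for illustration).
    return unicodedata.normalize("NFC", s)


def he(s):
    # Approximation of HE: percent-encode space and non-ASCII characters,
    # leaving URI punctuation and existing %XX escapes untouched.
    return quote(s, safe=":/?#[]@!$&'()*+,;=%")


DEFAULT_PORTS = {"http": 80, "https": 443}  # assumed table, for illustration


def cache_key(s):
    # Coarser, cache-style mapping: lowercase scheme and host, drop default port.
    parts = urlsplit(s)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port is not None and port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{port}"
    return urlunsplit((scheme, host, parts.path, parts.query, parts.fragment))


char_eq = equivalent_under(identity)
nfc_eq = equivalent_under(nfc)
he_eq = equivalent_under(he)
cache_eq = equivalent_under(cache_key)

a, b = "HTTP://Example.COM:80/a", "http://example.com/a"
print(char_eq(a, b))   # different strings
print(cache_eq(a, b))  # same key after cache-style mapping
print(he_eq("http://example.com/a b", "http://example.com/a%20b"))
print(nfc_eq("caf\u00e9", "cafe\u0301"))
```

The same pair of strings can be equivalent under one relationship and distinct under another, which is why each application has to say which mapping it is using.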
And it would be helpful for applications that use a fine-grained equivalence relationship to avoid the situation where two identifiers that are equivalent under some coarser relationship are used differently, for example by mandating that once a namespace name is assigned, no other string that is equivalent under coarser rules should also be used as a namespace name.

Note that this message does not talk about 'resource equivalence', or even about the resources identified and the relationship of those resources. You might think that you could use the mapping from Resource Identifiers to Resources (ID) to say "two Resource Identifiers are ID= equivalent if the resources they identify are the same", but we don't have a good definition of 'same' for 'resources', or any way of creating algorithms based on it.

So I think it is useful to keep the discussion in the domain of string manipulation, and different kinds of string equivalences.

Larry
Received on Monday, 2 December 2002 11:52:59 UTC