Re: Posted draft of URI comparison finding from Larry Masinter on 2002-12-02 (www-tag@w3.org from December 2002)

From: Larry Masinter <LMM@acm.org>
Date: Mon, 2 Dec 2002 08:52:12 -0800
To: <tbray@textuality.com>
Cc: <www-tag@w3.org>
Message-ID: <001101c29a23$29e52090$6ace8642@MASINTER>

I think at the root of many of the difficulties
in this discussion is what seems like an
assumption that there is (or should be) only
one notion of "equivalence" for URIs.

But among the set of all possible strings
(sequences of characters), there are many different
equivalence relationships.

The base equivalence relationship is
character-by-character: 
"string A is equivalent to string B if
they consist of exactly the same characters,
compared character by character". 

But it is easy to construct other equivalence
relationships. Every mapping F from
sequences of characters into some
other space defines an equivalence relationship
F=, where A F= B if F(A)=F(B). To name two
important such mappings, the "normalization"
mappings (various modes of Unicode normalization)
each create an equivalence relationship, where
all sequences that normalize to the same string
are equivalent. The mapping HE that transforms
an arbitrary string by hex-encoding all characters
normally disallowed in URIs (such as space, and anything
outside the 7-bit ASCII repertoire) creates another
equivalence relationship: IRI1 HE= IRI2 if HE(IRI1) = HE(IRI2).
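
To make the "F=" construction concrete, here is a rough sketch in
Python. The names equivalent_under, nfc and he are made up for this
illustration. The he function is only one plausible reading of HE: it
assumes UTF-8 as the byte encoding and leaves existing "%" characters
untouched, neither of which is specified above.

```python
import unicodedata
from urllib.parse import quote


def equivalent_under(f, a, b):
    """A F= B exactly when F(A) == F(B)."""
    return f(a) == f(b)


def nfc(s):
    """One of the Unicode normalization mappings (NFC chosen arbitrarily)."""
    return unicodedata.normalize("NFC", s)


def he(s):
    """Hypothetical HE: hex-encode characters not normally allowed in URIs.

    Assumption: non-ASCII characters are encoded as UTF-8 bytes, and
    reserved characters plus '%' are passed through unchanged.
    """
    return quote(s, safe="!#$%&'()*+,/:;=?@[]~")


precomposed = "http://example.org/caf\u00e9"   # e-acute as one code point
decomposed = "http://example.org/cafe\u0301"   # 'e' + combining acute accent

# Character by character, the two strings differ.
print(precomposed == decomposed)
# Under NFC= they are equivalent.
print(equivalent_under(nfc, precomposed, decomposed))
# Under HE= (as sketched here) they are not.
print(equivalent_under(he, precomposed, decomposed))
```

The same pair of strings is equivalent under one mapping and distinct
under another, which is the point: the answer depends on which F the
application picked.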

Every application which requires a notion of
equivalence needs to specify which equivalence
relationship it uses.  The application of
"maintaining a web cache" can use an aggressive
equivalence relationship based on its knowledge
about default ports, case independence of host
names and scheme names. But the application of XML
parsers comparing namespace names should use
the computationally simpler and more stable
relationship of "character-by-character"
equivalence.
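
As an illustration of the contrast, the following Python sketch puts
the two relationships side by side. The function names cache_key and
namespace_equal and the DEFAULT_PORTS table are invented for this
example, and the cache mapping covers only the three things named
above (default ports, host case, scheme case); a real cache would
likely do more. The sketch drops userinfo and does not handle IPv6
literals.

```python
from urllib.parse import urlsplit

# Assumed default ports, for illustration only.
DEFAULT_PORTS = {"http": 80, "https": 443}


def cache_key(uri):
    """Aggressive mapping: fold scheme and host case, drop a default port."""
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    key = f"{scheme}://{netloc}{parts.path}"  # path case is left alone
    if parts.query:
        key += "?" + parts.query
    return key


def namespace_equal(a, b):
    """Character-by-character comparison, as for XML namespace names."""
    return a == b


a = "HTTP://Example.COM:80/ns"
b = "http://example.com/ns"

print(cache_key(a) == cache_key(b))   # the cache treats them as one entry
print(namespace_equal(a, b))          # the XML parser treats them as two names
```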

You can't get rid of multiple equivalence
relationships or mandate them out of existence,
although it might help to minimize the number
of different equivalence relationships in use.

And it would be helpful for applications that
use a fine-grained equivalence relationship
to avoid the situation where two identifiers
that are equivalent under some coarser relationship
are used differently: for example, mandate that once
a namespace name is assigned, no other string
equivalent to it under coarser rules may also
be used as a namespace name.
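
A check along these lines could be sketched as follows. This is only
an illustration: coarse_collisions is a hypothetical helper, and
str.lower stands in for whatever coarser mapping one actually cares
about.

```python
from collections import defaultdict


def coarse_collisions(names, coarse):
    """Group distinct strings that a coarser mapping sends to the same value."""
    groups = defaultdict(set)
    for name in names:
        groups[coarse(name)].add(name)
    # Only groups with more than one distinct string are a problem.
    return [sorted(group) for group in groups.values() if len(group) > 1]


assigned = [
    "http://example.org/ns",
    "http://EXAMPLE.org/ns",    # differs only in host case
    "http://example.org/other",
]

# str.lower is a stand-in for a real coarser relationship.
for group in coarse_collisions(assigned, str.lower):
    print("distinct names, equivalent under coarser rules:", group)
```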

Note that this message does not talk about
'resource equivalence', or even about
the resources identified and the relationship
of those resources. You might think that you
could use the mapping from Resource Identifiers
to Resources (ID) to say "two Resource Identifiers
are ID= equivalent if the resources they
identify are the same", but we don't have a
good definition of 'same' for 'resources',
or any way of creating algorithms based on it.

So I think it is useful to keep the discussion
in the domain of string manipulation, and different
kinds of string equivalences.

Larry

Received on Monday, 2 December 2002 11:52:59 UTC