URIEquivalence-15 and IRIs from Martin Duerst on 2002-05-27 (www-tag@w3.org from May 2002)

From: Martin Duerst <duerst@w3.org>
Date: Mon, 27 May 2002 19:10:54 +0900
To: www-tag@w3.org
Cc: w3c-i18n-ig@w3.org
Message-Id: <4.2.0.58.J.20020501175158.02bcadd0@localhost>

Dear TAG,

Here is my input on the issue of URI/IRI equivalence, for
your consideration. This is a very important issue for IRIs.

First and foremost, while it's okay to call the issue
'URIEquivalence-15', its resolution should really be a solution
both for URI equivalence and for IRI equivalence. While the
choices are the same in both cases, IRIs bring in additional
considerations.

The core choices from the view of IRIs are:

a) 'character-by-character equivalence'
    (taking a %hh-escaping as three characters)
b) '%hh-escape equivalence' (treating %hh-escape sequences as
    equivalent to the characters they stand for, based on
    US-ASCII/UTF-8, except for reserved characters!)
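
To make the two choices concrete, here is a minimal sketch in Python.
It is only an illustration under stated assumptions, not text from any
spec: the function names are made up, the unreserved set is the one
from RFC 2396 section 2.3, and the hex digits of escapes that are left
in place are not case-folded (casing is a separate question).

```python
import re

# Unreserved characters per RFC 2396 section 2.3: alphanumerics plus marks.
UNRESERVED = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "-_.!~*'()"
)

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def char_equivalent(a: str, b: str) -> bool:
    """Choice (a): plain comparison; a %hh escape counts as three characters."""
    return a == b


def escape_key(identifier: str) -> str:
    """Bring an identifier to one encoding level for choice (b).

    Hypothetical helper: escapes of unreserved US-ASCII characters are
    decoded, non-ASCII characters are escaped via UTF-8, and all other
    escapes (e.g. of reserved characters) are left exactly as written.
    """
    def decode_unreserved(match):
        ch = chr(int(match.group(1), 16))
        return ch if ch in UNRESERVED else match.group(0)

    ascii_level = _ESCAPE.sub(decode_unreserved, identifier)

    out = []
    for ch in ascii_level:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)


def escape_equivalent(a: str, b: str) -> bool:
    """Choice (b): compare after bringing both to the same encoding level."""
    return escape_key(a) == escape_key(b)
```

With this sketch, http://example.org/~user and
http://example.org/%7Euser (example.org is just a placeholder) differ
under (a) but are the same under (b), whereas a/b and a%2Fb stay
different under both, because '/' is reserved.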

The difference is more important for IRIs because the mapping
from IRIs to URIs is based on (UTF-8 and) %hh-escaping. Because
some protocols/formats/APIs will support IRIs whereas others
(older/lower level) may not, having both escaped and unescaped
versions of the same IRI is probably more frequent than for
URIs (where %7E / ~ is the only example I have seen).
This is a strong argument for %hh-escape equivalence.

Because conversion from a URI to an IRI is not guaranteed to succeed,
and even if it succeeds, is not guaranteed to produce the correct
result (i.e. the original characters), it is important to convert
from IRIs to URIs as late as possible. For %hh-escape equivalence,
this means that %hh-escaping is only done for the actual comparison,
but that the original IRI is always retained. This costs either time
(escaping again for each comparison) or space (keeping the escaped
form alongside the original).
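
A minimal sketch of this 'convert as late as possible' approach, in
Python. The class name RetainedIRI and the helper to_uri_form are
hypothetical, and the helper only escapes non-ASCII characters via
UTF-8; it is not a full implementation of the mapping in the IRI
draft. The escaped form is computed on the first comparison and then
cached, which is the 'space' side of the trade-off; dropping the cache
would trade that space for time.

```python
from functools import cached_property  # Python 3.8 or later


def to_uri_form(iri: str) -> str:
    """Escape every non-ASCII character as the %hh form of its UTF-8 bytes."""
    return "".join(
        ch if ord(ch) < 128
        else "".join("%{:02X}".format(b) for b in ch.encode("utf-8"))
        for ch in iri
    )


class RetainedIRI:
    """Keeps the original IRI; the escaped form exists only for comparison."""

    def __init__(self, text: str):
        self.text = text  # the original characters are never overwritten

    @cached_property
    def comparison_key(self) -> str:
        # Computed lazily, on the first comparison or hash.
        return to_uri_form(self.text)

    def __eq__(self, other):
        if not isinstance(other, RetainedIRI):
            return NotImplemented
        return self.comparison_key == other.comparison_key

    def __hash__(self):
        return hash(self.comparison_key)
```

Two RetainedIRI objects built from an unescaped IRI and from its
UTF-8-escaped form compare equal, while each one still reports the
text it was created from.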

The argument has been made that using character-by-character equivalence
would create strong pressures to not convert from IRIs to URIs prematurely,
which would be a good thing. It is difficult to judge whether this will
be the case; if things go well, it may indeed provide desirable
reinforcement, but if things go wrong, it may create additional confusion.

It is conceivable to specify IRI equivalence as character-by-character
equivalence for ASCII characters and %hh-escape equivalence
for non-ASCII characters. But the chance that this gets implemented
is probably very low.


The current version of the IRI draft
(http://www.ietf.org/internet-drafts/draft-duerst-iri-00.txt)
has been interpreted as prescribing %hh-escape equivalence,
because the draft clearly says that an IRI and the URI that it
is mapped to identify the same resource:

 >>>>
2.3 Mapping of IRIs to URIs
...
    This mapping has two purposes:
...
       b) Interpretational: URIs identify resources in various ways.
          IRIs also identify resources.  The resource that an IRI
          identifies is the same as the one identified by the URI
          obtained after converting the IRI according to the procedure
          defined here.  This means that there is no need to define the
          association between identifier and resource again on the IRI
          level.
 >>>>

But there is another interpretation: Arbitrary URIs can identify
the same resource, e.g.
    http://search.ietf.org/internet-drafts/draft-duerst-iri-00.txt and
    http://www.ietf.org/internet-drafts/draft-duerst-iri-00.txt and
    http://www.w3.org/International/2002/draft-duerst-iri-00.txt
all identify the same resource, although this cannot be deduced
from their syntax. Any artefacts (specs, software) that need to be
able to identify two resources as the same will therefore need a
mechanism for doing that without relying on URI/IRI syntax anyway.
In other words, resource identity and resource identifier equivalence
are two different things.

So for example, RDF could use character-for-character equivalence,
and something such as daml:sameIndividualAs can be used to indicate
that two URIs or IRIs refer to the same resource. It then becomes
mainly an issue of careful wording, to make sure that readers
do not confuse resource identifiers with resources.

Anyway, we plan to adapt the wording in the IRI draft after
the TAG decision, to reflect the decision and to make the
implications clearer.

In any case, it should be noted that while some specifications,
such as XML Namespaces or RDF, have to choose a single definition
of URI/IRI equivalence, other specifications and implementations
may choose to exploit additional knowledge. For example, proxies
will try to make as many assumptions as they can safely make
to reduce misses. Also, specifications that are closely related
to URI/IRI resolution may want to make similar assumptions.
For an example, see RFC 2616 (HTTP 1.1), section 3.2.3.
For another example, which specifically treats IRIs, see the XML
Catalogs spec, in particular
http://oasis-open.org/committees/entity/spec-2001-08-06.html#sysid-norm.


While the question of whether or not to treat %hh sequences as
equivalent to the characters they stand for is the most important
aspect of URIEquivalence-15 for IRIs, there are other aspects.

First, it should be explicitly noted that equivalence on the
character level is applied after resolving different notations
in a 'carrier' (host) representation. As an example,

       xlink:href='http://www.w3.org'
       xlink:href='http://www.w&#x33;.org'
    must be the same.
    [assuming xlink refers to the XLink namespace,
     and knowing that U+0033 is the digit '3']

This of course depends on the carrier (host) language;
if you put http://www.w&#x33;.org into plain text email,
that's not a legal URI, and not the same as http://www.w3.org.
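
As a small illustration of 'resolve the carrier notation first, then
compare', here is a sketch using Python's standard XML parser. The
element name 'a' and the function name href_of are made up for the
example; the namespace URI is the XLink one. The parser resolves the
character reference before the attribute value is seen, so both
attributes yield the same string, while the raw text (as it would
appear in plain text email) does not match.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"


def href_of(fragment: str) -> str:
    """Return the xlink:href value after the XML parser has resolved
    character references in the carrier document."""
    element = ET.fromstring(fragment)
    return element.get("{%s}href" % XLINK_NS)


# Hypothetical carrier documents; only the notation of the '3' differs.
doc_plain = (
    "<a xmlns:xlink='http://www.w3.org/1999/xlink' "
    "xlink:href='http://www.w3.org'/>"
)
doc_charref = (
    "<a xmlns:xlink='http://www.w3.org/1999/xlink' "
    "xlink:href='http://www.w&#x33;.org'/>"
)

# Equal once the carrier (XML) notation has been resolved.
same_in_xml = href_of(doc_plain) == href_of(doc_charref)

# Not equal when the same text is taken literally, as in plain text.
same_as_plain_text = "http://www.w3.org" == "http://www.w&#x33;.org"
```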

Second, in some cases casing equivalences can be relevant.
In particular, the I18N WG has discussed whether e.g.
       http://www.w3.org/XM%4C and
       http://www.w3.org/XM%4c
should be the same identifier, independently of whether this is
the same identifier as
       http://www.w3.org/XML
There is an argument for making %4C and %4c the same, because
there is no clear convention of using upper case or lower case
(in contrast to http:, where lower case is dominant). Also, there
is never any doubt that they refer to the same resource.
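
If %4C and %4c were to be made the same, one simple way would be to
fold the hex digits of every escape to one case before comparing. The
sketch below (Python; the function name and the choice of upper case
are arbitrary assumptions, since there is no clear convention) touches
only the two hex digits after each '%' and leaves everything else,
including the question of %4C versus L, alone.

```python
import re

_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def fold_escape_case(identifier: str) -> str:
    """Upper-case the hex digits of each %hh escape; nothing else changes."""
    return _ESCAPE.sub(lambda match: match.group(0).upper(), identifier)


upper = fold_escape_case("http://www.w3.org/XM%4C")
lower = fold_escape_case("http://www.w3.org/XM%4c")
same_after_folding = upper == lower

# Folding the case of escapes does not decode them, so this stays distinct.
still_distinct_from_xml = upper != "http://www.w3.org/XML"
```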

In general, case equivalence for characters outside ASCII is
language-dependent, and therefore should be avoided.



The following is language collected from all three URI RFCs,
showing that %hh-equivalence would be a valid choice.
(I collected these quite a while ago, and wanted to make
sure they are not missed.)

    The current URI spec says:

    http://www.ietf.org/rfc/rfc2396.txt, section 2.3:

    >>>>
    Unreserved characters can be escaped without changing the semantics
    of the URI, but this should not be done unless the URI is being used
    in a context that does not allow the unescaped character to appear.
    >>>>

    (to go directly to the relevant section:
    http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2396.html#sec-2.3)

    [In section 2.4.2, "When to Escape and Unescape", the escaping
    differences for reserved characters are defined as scheme-specific.]

    Earlier URI/URL specs say:

    http://www.ietf.org/rfc/rfc1738.txt, section 2.2:

    >>>>
    Usually a URL has the same interpretation when an octet is
    represented by a character and when it encoded. However, this is not
    true for reserved characters: encoding a character reserved for a
    particular scheme may change the semantics of a URL.
    >>>>

    (to go directly to the relevant section:
    http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc1738.html#sec-2.2)

    And from http://www.ietf.org/rfc/rfc1630.txt:

    >>>>
       There is a conflict between the need to be able to represent many
       characters including spaces within a URI directly, and the need to
       be able to use a URI in environments which have limited character
       sets or in which certain characters are prone to corruption.  This
       conflict has been resolved by use of an hexadecimal escaping
       method which may be applied to any characters forbidden in a given
       context.  When URLs are moved between contexts, the set of
       characters escaped may be enlarged or reduced unambiguously.

    REDUCED OR INCREASED SAFE CHARACTER SETS

       The same encoding method may be used for encoding characters whose
       use, although technically allowed in a URI, would be unwise due to
       problems of corruption by imperfect gateways or misrepresentation
       due to the use of variant character sets, or which would simply be
       awkward in a given environment.  Because a % sign always indicates
       an encoded character, a URI may be made "safer" simply by encoding
       any characters considered unsafe, while leaving already encoded
       characters still encoded.  Similarly, in cases where a larger set
       of characters is acceptable, % signs can be selectively and
       reversibly expanded.

       Before two URIs can be compared, it is therefore necessary to
       bring them to the same encoding level.

       However, the reserved characters mentioned above have a quite
       different significance when encoded, and so may NEVER be encoded
       and unencoded in this way.

    ...

    Example 1

    The URIs

                 http://info.cern.ch/albert/bertram/marie-claude

    and

                 http://info.cern.ch/albert/bertram/marie%2Dclaude

    are identical, as the %2D encodes a hyphen character.
    >>>>


Regards,    Martin.
Received on Monday, 27 May 2002 07:46:21 UTC