Re: URIEquivalence-15 and IRIs from Martin Duerst on 2002-07-09 (www-tag@w3.org from July 2002)

From: Martin Duerst <duerst@w3.org>
Date: Tue, 09 Jul 2002 18:30:45 +0900
To: www-tag@w3.org
Cc: w3c-i18n-ig@w3.org
Message-Id: <4.2.0.58.J.20020708184305.080e7c98@localhost>
Dear TAG,

Misha has already said that there is a new version of the IRI
draft; this is now also officially available at
http://www.ietf.org/internet-drafts/draft-duerst-iri-01.txt.

I would like to draw your attention to Section 2.3,
IRI Equivalence and Normalization, in particular to:

     In some scenarios, such as XML Namespaces  ([XMLNamespace]), a
     definite answer to the question of IRI equivalence is needed that is
     independent of the scheme used and always can be calculated quickly
     and without accessing a network.  In such cases, two IRIs SHOULD be
     defined as equivalent if and only if they are character-by-character
     equivalent (which is the same as byte-by-byte equivalent if the
     character encoding for both IRIs is the same).  In such a case, the
     comparison function MUST NOT map the IRIs to URIs.

Please note that this gives an explicit interpretation of
'character-by-character', in accordance with what we understand
to be current practice.
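
As a rough illustration of the comparison described in the quoted
paragraph, here is a minimal Python sketch. The function name iri_equal
and the example.org IRIs are made up for this example and do not come
from the draft.

```python
def iri_equal(a: str, b: str) -> bool:
    """Character-by-character comparison: no %hh mapping, no network access."""
    return a == b


# Hypothetical IRIs, used only for illustration.
iri = "http://example.org/ros\u00e9"
escaped = "http://example.org/ros%C3%A9"

# An IRI is equivalent to itself, but not to its %hh-escaped URI form,
# because the comparison does not map IRIs to URIs.
same_chars = iri_equal(iri, iri)
same_as_escaped = iri_equal(iri, escaped)

# Byte-by-byte comparison agrees only if both sides use the same encoding.
same_bytes = iri.encode("utf-8") == iri.encode("utf-8")
mixed_bytes = iri.encode("utf-8") == iri.encode("utf-16-le")
```

Under this definition, an IRI and the URI it maps to are different
identifiers, even though they identify the same resource.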

We plan to make some last edits to this document around July 22nd,
and then plan to send it off to the IESG. We would be glad to
change the above if the TAG decides that something different
is needed, but we would need a decision fairly soon.

Many thanks in advance,     Martin.


At 19:10 02/05/27 +0900, Martin Duerst wrote:
>Dear TAG,
>
>Here is my input on the issue of URI/IRI equivalence, for
>your consideration. This is a very important issue for IRIs.
>
>First and foremost, while it's okay to call the issue
>'URIEquivalence-15', its resolution should really be a solution
>both for URI equivalence and for IRI equivalence. While the
>choices are the same in both cases, IRIs bring in additional
>considerations.
>
>The core choices from the view of IRIs are:
>
>a) 'character-by-character equivalence'
>    (taking a %hh-escaping as three characters)
>b) '%hh-escape equivalence' (treating %hh-escape
>    sequences as equivalent to the characters (based on
>    US-ASCII/UTF-8) they stand for, except for reserved characters!)
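
The two choices above can be sketched in Python as follows. This is only
one possible reading of option (b), under stated assumptions: the
function names are invented for this example, the unreserved set is the
one from RFC 2396 section 2.3, and the case of hex digits in escapes
that are kept is deliberately left alone.

```python
import re

# Unreserved characters per RFC 2396 section 2.3: alphanumerics plus marks.
UNRESERVED = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "-_.!~*'()"
)

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def char_equal(a: str, b: str) -> bool:
    """Option (a): a %hh escape simply counts as three characters."""
    return a == b


def escape_key(iri: str) -> str:
    """Comparison key for option (b); the input string is not modified."""
    # Step 1: map non-ASCII characters to %hh escapes of their UTF-8 bytes.
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    ascii_form = "".join(out)

    # Step 2: unescape %hh only when it stands for an unreserved character.
    # Escapes of reserved characters (e.g. %2F for "/") are kept as they are.
    def unescape(match):
        ch = chr(int(match.group(1), 16))
        return ch if ch in UNRESERVED else match.group(0)

    return _ESCAPE.sub(unescape, ascii_form)


def escape_equal(a: str, b: str) -> bool:
    """Option (b): compare after bringing both sides to the same level."""
    return escape_key(a) == escape_key(b)
```

With these definitions, http://example.org/%7Euser and
http://example.org/~user differ under (a) but are equivalent under (b),
while a%2Fb and a/b stay different under both.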
>
>The difference is more important for IRIs because the mapping
>from IRIs to URIs is based on (UTF-8 and) %hh-escaping. Because
>some protocols/formats/APIs will support IRIs whereas others
>(older/lower level) may not, having both escaped and unescaped
>versions of the same IRI is probably more frequent than for
>URIs (where %7E / ~ is the only example I have seen).
>This is a strong argument for %hh-escape equivalence.
>
>Because conversion from a URI to an IRI is not guaranteed to succeed,
>and even if it succeeds, is not guaranteed to produce the correct
>result (i.e. the original characters), it is important to convert
>from IRIs to URIs as late as possible. For %hh-escape equivalence,
>this means that %hh-escaping is only done for the actual comparison,
>but that the original IRI is always retained. This would need a
>certain amount of resources (time or space).
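
A minimal sketch of "convert as late as possible", assuming Python 3.8
or later: the original IRI is always kept, and an escaped form is
derived only when a comparison needs it. The class name RetainedIRI is
hypothetical, and the use of urllib.parse.quote with this particular
safe set is a simplification of the draft's mapping, not a restatement
of it.

```python
from functools import cached_property
from urllib.parse import quote


class RetainedIRI:
    """Keeps the original IRI; derives an escaped key only for comparison."""

    def __init__(self, text: str):
        self.text = text  # the original characters are never overwritten

    @cached_property
    def key(self) -> str:
        # Computed on first comparison, then cached: this trades space for
        # time. Dropping the cache would trade time for space instead.
        # "%" and the reserved characters are marked safe so that existing
        # escapes and delimiters are left untouched.
        return quote(self.text, safe="%/:?#[]@!$&'()*+,;=~")

    def __eq__(self, other):
        return isinstance(other, RetainedIRI) and self.key == other.key

    def __hash__(self):
        return hash(self.key)


a = RetainedIRI("http://example.org/ros\u00e9")
b = RetainedIRI("http://example.org/ros%C3%A9")
```

Here a and b compare equal, yet a.text still holds the original
non-ASCII characters, so nothing is lost by the comparison.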
>
>The argument has been made that using character-by-character equivalence
>would create strong pressures to not convert from IRIs to URIs prematurely,
>which would be a good thing. It is difficult to judge whether this will
>be the case; if things go well, it may indeed provide desirable
>reinforcement, but if things go wrong, it may create additional confusion.
>
>It is conceivable to specify IRI equivalence by specifying
>character-by-character equivalence for ASCII characters, and
>%hh-escape equivalence for non-ASCII characters. But the chance
>that this gets implemented is probably very low.
>
>
>The current version of the IRI draft
>(http://www.ietf.org/internet-drafts/draft-duerst-iri-00.txt)
>has been interpreted as prescribing %hh-escape equivalence,
>because the draft clearly says that an IRI and the URI that it
>is mapped to identify the same resource:
>
> >>>>
>2.3 Mapping of IRIs to URIs
>...
>    This mapping has two purposes:
>...
>       b) Interpretational: URIs identify resources in various ways.
>          IRIs also identify resources.  The resource that an IRI
>          identifies is the same as the one identified by the URI
>          obtained after converting the IRI according to the procedure
>          defined here.  This means that there is no need to define the
>          association between identifier and resource again on the IRI
>          level.
> >>>>
>
>But there is another interpretation: Because arbitrary URIs
>can identify the same resource, e.g.
>    http://search.ietf.org/internet-drafts/draft-duerst-iri-00.txt and
>    http://www.ietf.org/internet-drafts/draft-duerst-iri-00.txt and
>    http://www.w3.org/International/2002/draft-duerst-iri-00.txt
>all identify the same resource, although this cannot be deduced
>from their syntax, any artefacts (specs, software) that need to be
>able to identify two resources as the same will need a mechanism
>for doing that without relying on URI/IRI syntax anyway.
>In other words, resource identity and resource identifier equivalence
>are two different things.
>
>So for example, RDF could use character-for-character equivalence,
>and something such as daml:sameIndividualAs can be used to indicate
>that two URIs or IRIs refer to the same resource. It then becomes
>mainly an issue of careful wording, to make sure that readers
>do not confuse resource identifiers with resources.
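
The distinction between identifier equivalence and resource identity
can be sketched as two separate checks. The explicit table below only
stands in for an assertion mechanism such as daml:sameIndividualAs; the
names SAME_RESOURCE, identifiers_equal and resources_same are
assumptions made for this example.

```python
# Explicitly asserted groups of identifiers that name the same resource.
# Nothing in the syntax of these three URIs reveals that they belong together.
SAME_RESOURCE = [
    frozenset({
        "http://search.ietf.org/internet-drafts/draft-duerst-iri-00.txt",
        "http://www.ietf.org/internet-drafts/draft-duerst-iri-00.txt",
        "http://www.w3.org/International/2002/draft-duerst-iri-00.txt",
    }),
]


def identifiers_equal(a: str, b: str) -> bool:
    """Identifier equivalence: purely syntactic, character for character."""
    return a == b


def resources_same(a: str, b: str, groups=SAME_RESOURCE) -> bool:
    """Resource identity: relies on assertions made outside the syntax."""
    return a == b or any(a in g and b in g for g in groups)


ietf = "http://www.ietf.org/internet-drafts/draft-duerst-iri-00.txt"
w3c = "http://www.w3.org/International/2002/draft-duerst-iri-00.txt"
```

For ietf and w3c, identifiers_equal is false while resources_same is
true, which is exactly the case where careful wording matters.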
>
>Anyway, we plan to adapt the wording in the IRI draft after
>the TAG decision, to reflect the decision and to make the
>implications clearer.
>
>In any case, it should be noted that while some specifications,
>such as XML Namespaces or RDF, have to choose a single definition
>of URI/IRI equivalence, other specifications and implementations
>may choose to exploit additional knowledge. For example, proxies
>will try to make as many assumptions as they can safely make
>to reduce misses. Also, specifications that are closely related
>to URI/IRI resolution may want to make similar assumptions.
>For an example, see RFC 2616 (HTTP 1.1), section 3.2.3.
>For another example, which specifically treats IRIs, see the XML
>Catalogs spec, in particular
>http://oasis-open.org/committees/entity/spec-2001-08-06.html#sysid-norm.
>
>
>While the question of whether to treat %hh sequences equivalent to
>the characters they stand for or not is the most important aspect
>of URIEquivalence-15 for IRIs, there are other aspects.
>
>First, it should be explicitly noted that equivalence on the
>character level is applied after resolving different notations
>in a 'carrier' (host) representation. As an example,
>
>       xlink:href='http://www.w3.org'
>       xlink:href='http://www.w&#x33;.org'
>    must be the same.
>    [assuming xlink refers to the XLink namespace,
>     and knowing that U+0033 is the digit '3']
>
>This of course depends on the carrier (host) language;
>if you put http://www.w&#x33;.org into plain text email,
>that's not a legal URI, and not the same as http://www.w3.org.
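
The carrier-level step can be sketched with Python's standard XML
parser, which resolves numeric character references before the
application sees the attribute value. The wrapper element names
(links, a) are invented for this example.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"

# Two links that differ only in how the carrier (XML) spells the digit 3.
doc = (
    '<links xmlns:xlink="http://www.w3.org/1999/xlink">'
    '<a xlink:href="http://www.w3.org"/>'
    '<a xlink:href="http://www.w&#x33;.org"/>'
    "</links>"
)

root = ET.fromstring(doc)
# After parsing, the character reference has already been resolved.
hrefs = [a.get("{%s}href" % XLINK) for a in root]

# In plain text there is no such carrier-level resolution.
plain_text = "http://www.w&#x33;.org"
```

Both entries in hrefs are the string http://www.w3.org, so a
character-by-character comparison succeeds, whereas plain_text remains
a different (and illegal) string.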
>
>Second, in some cases casing equivalences can be relevant.
>In particular, the I18N WG has discussed whether e.g.
>       http://www.w3.org/XM%4C and
>       http://www.w3.org/XM%4c
>should be the same identifier, independently of whether this is
>the same identifier as
>       http://www.w3.org/XML
>There is an argument for making %4C and %4c the same, because
>there is no clear convention of using upper-case or lower case
>(in contrast to http:, where lower-case is dominant). Also, there
>is never any possibility that they would refer to different resources.
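
If %4C and %4c were to be treated as the same, one possible approach is
to normalize the hex digits of each escape before comparing, without
touching anything else. The helper name upper_hex is hypothetical, and
the choice of upper case is arbitrary.

```python
import re

_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def upper_hex(uri: str) -> str:
    """Upper-case only the hex digits inside %hh escapes."""
    return _ESCAPE.sub(lambda m: m.group(0).upper(), uri)


a = upper_hex("http://www.w3.org/XM%4C")
b = upper_hex("http://www.w3.org/XM%4c")
c = upper_hex("http://www.w3.org/XML")
```

After this step a and b are identical, while c is still a different
string: whether %4C equals L remains a separate decision.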
>
>In general, case equivalence for characters outside ASCII is
>language-dependent, and therefore should be avoided.
>
>
>
>The following contains collected language from all three URI
>RFCs showing that %hh-equivalence would be a valid choice:
>(I collected these quite a while ago, and wanted to make
>sure they are not missed.)
>
>    The current URI spec says:
>
>    http://www.ietf.org/rfc/rfc2396.txt, section 2.3:
>
>    >>>>
>    Unreserved characters can be escaped without changing the semantics
>    of the URI, but this should not be done unless the URI is being used
>    in a context that does not allow the unescaped character to appear.
>    >>>>
>
>    (to go directly to the relevant section:
>    http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2396.html#sec-2.3)
>
>    [Section 2.4.2, "When to Escape and Unescape", defines the escaping
>    differences for reserved characters as scheme-specific.]
>
>    Earlier URI/URL specs say:
>
>    http://www.ietf.org/rfc/rfc1738.txt, section 2.2:
>
>    Usually a URL has the same interpretation when an octet is
>    represented by a character and when it encoded. However, this is not
>    true for reserved characters: encoding a character reserved for a
>    particular scheme may change the semantics of a URL.
>
>    (to go directly to the relevant section:
>    http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc1738.html#sec-2.2)
>
>    And from http://www.ietf.org/rfc/rfc1630.txt:
>
>    >>>>
>       There is a conflict between the need to be able to represent many
>       characters including spaces within a URI directly, and the need to
>       be able to use a URI in environments which have limited character
>       sets or in which certain characters are prone to corruption.  This
>       conflict has been resolved by use of an hexadecimal escaping
>       method which may be applied to any characters forbidden in a given
>       context.  When URLs are moved between contexts, the set of
>       characters escaped may be enlarged or reduced unambiguously.
>
>    REDUCED OR INCREASED SAFE CHARACTER SETS
>
>       The same encoding method may be used for encoding characters whose
>       use, although technically allowed in a URI, would be unwise due to
>       problems of corruption by imperfect gateways or misrepresentation
>       due to the use of variant character sets, or which would simply be
>       awkward in a given environment.  Because a % sign always indicates
>       an encoded character, a URI may be made "safer" simply by encoding
>       any characters considered unsafe, while leaving already encoded
>       characters still encoded.  Similarly, in cases where a larger set
>       of characters is acceptable, % signs can be selectively and
>       reversibly expanded.
>
>       Before two URIs can be compared, it is therefore necessary to
>       bring them to the same encoding level.
>
>       However, the reserved characters mentioned above have a quite
>       different significance when encoded, and so may NEVER be encoded
>       and unencoded in this way.
>
>    ...
>
>    Example 1
>
>    The URIs
>
>                 http://info.cern.ch/albert/bertram/marie-claude
>
>    and
>
>                 http://info.cern.ch/albert/bertram/marie%2Dclaude
>
>    are identical, as the %2D encodes a hyphen character.
>    >>>>
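
Example 1 from RFC 1630 can be checked with Python's standard library.
Note the caveat in the comments: urllib.parse.unquote decodes every
escape, including those of reserved characters, so it is shown here
only to illustrate the example and not as a general comparison
function. The a%2Fb path is made up for the illustration.

```python
from urllib.parse import unquote

plain = "http://info.cern.ch/albert/bertram/marie-claude"
escaped = "http://info.cern.ch/albert/bertram/marie%2Dclaude"

# Bringing both URIs to the same encoding level makes them identical,
# because %2D encodes the hyphen, which is not a reserved character.
same_level = unquote(escaped) == plain

# Caveat: unquote also decodes reserved characters such as "/", which
# RFC 1630 says must never be converted between the two forms.
collapsed = unquote("http://info.cern.ch/a%2Fb")
```

The value of collapsed is http://info.cern.ch/a/b, which shows why a
real comparison function has to leave escapes of reserved characters
alone.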
>
>
>Regards,    Martin.
Received on Tuesday, 9 July 2002 06:18:22 UTC