Re: URIEquivalence-15 and IRIs from Chris Lilley on 2002-07-09 (www-tag@w3.org from July 2002)

From: Chris Lilley <chris@w3.org>
Date: Tue, 9 Jul 2002 13:15:11 +0200
To: www-tag@w3.org, Martin Duerst <duerst@w3.org>
CC: w3c-i18n-ig@w3.org
Message-ID: <103336743671.20020709131511@w3.org>
On Tuesday, July 9, 2002, 11:30:45 AM, Martin wrote:


MD> Dear TAG,

MD> Misha has already said that there is a new version of the IRI
MD> draft; this is now also officially available at
MD> http://www.ietf.org/internet-drafts/draft-duerst-iri-01.txt.

MD> I would like to draw your attention to Section 2.3,
MD> IRI Equivalence and Normalization, in particular to:

MD>      In some scenarios, such as XML Namespaces  ([XMLNamespace]), a
MD>      definite answer to the question of IRI equivalence is needed that is
MD>      independent of the scheme used and always can be calculated quickly
MD>      and without accessing a network.  In such cases, two IRIs SHOULD be
MD>      defined as equivalent if and only if they are character-by-character
MD>      equivalent (which is the same as byte-by-byte equivalent if the
MD>      character encoding for both IRIs is the same).  In such a case, the
MD>      comparison function MUST NOT map the IRIs to URIs.

MD> Please note that this makes an explicit interpretation of
MD> 'character-by-character', in accordance with what we understand
MD> to be current practice.

So, it means that if normalization has been done, two IRIs that look
the same will compare the same; if it has not, then the two might not
compare as equal and no software is going to fix that for you.

And it (the paragraph quoted) means that ~ and %7E are not the same,
whereas your text below seems to say that they are (because of late
conversion, just before comparison). Please clarify which is correct.
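
To make the two readings concrete, here is a minimal Python sketch of
both comparison functions. The function names are made up for
illustration, the unreserved set is the one from RFC 2396 section 2.3,
and only ASCII escapes are handled; it is not the draft's normative
definition of either choice.

```python
import re

# Unreserved characters per RFC 2396, section 2.3 (alphanum plus mark).
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "-_.!~*'()"
)

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def char_equivalent(a: str, b: str) -> bool:
    """Character-by-character: a %hh escape counts as three characters."""
    return a == b


def _unescape_unreserved(uri: str) -> str:
    """Replace %hh by its character only if that character is unreserved."""
    def repl(match):
        ch = chr(int(match.group(1), 16))
        return ch if ch in UNRESERVED else match.group(0)
    return _ESCAPE.sub(repl, uri)


def escape_equivalent(a: str, b: str) -> bool:
    """%hh-escape equivalence (ASCII only): compare after unescaping."""
    return _unescape_unreserved(a) == _unescape_unreserved(b)


print(char_equivalent("~", "%7E"))    # False
print(escape_equivalent("~", "%7E"))  # True
```

Escapes of reserved characters such as %2F are left alone on purpose,
so "a/b" and "a%2Fb" stay different under both functions.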

MD> We plan to make some last edits on this document around July 22nd,
MD> and then plan to send it off to the IESG. We would be glad to
MD> change the above if the TAG decides that something different
MD> is needed, but we would need a decision fairly soon.

MD> Many thanks in advance,     Martin.


MD> At 19:10 02/05/27 +0900, Martin Duerst wrote:
>>Dear TAG,
>>
>>Here is my input on the issue of URI/IRI equivalence, for
>>your consideration. This is a very important issue for IRIs.
>>
>>First and foremost, while it's okay to call the issue
>>'URIEquivalence-15', its resolution should really be a solution
>>both for URI equivalence and for IRI equivalence. While the
>>choices are the same in both cases, IRIs bring in additional
>>considerations.

Noted. I don't think the issue name needs to be changed as long as
that scope is clear.

>>The core choices from the view of IRIs are:
>>
>>a) 'character-by-character equivalence'
>>    (taking a %hh-escaping as three characters)
>>b) '%hh-escape equivalence' (treating %hh-escape
>>    sequences as equivalent to the characters (based on
>>    US-ASCII/UTF-8) they stand for, except for reserved characters!)
>>
>>The difference is more important for IRIs because the mapping
>>from IRIs to URIs is based on (UTF-8 and) %hh-escaping. Because
>>some protocols/formats/APIs will support IRIs whereas others
>>(older/lower level) may not, having both escaped and unescaped
>>versions of the same IRI is probably more frequent than for
>>URIs (where %7E / ~ is the only example I have seen).
>>This is a strong argument for %hh-escape equivalence.
>>
>>Because conversion from a URI to an IRI is not guaranteed to succeed,
>>and even if it succeeds, is not guaranteed to produce the correct
>>result (i.e. the original characters), it is important to convert
>>from IRIs to URIs as late as possible. For %hh-escape equivalence,
>>this means that %hh-escaping is only done for the actual comparison,
>>but that the original IRI is always retained. This would need a
>>certain amount of resources (time or space).
>>
>>The argument has been made that using character-by-character equivalence
>>would create strong pressures to not convert from IRIs to URIs prematurely,
>>which would be a good thing. It is difficult to judge whether this will
>>be the case; if things go well, it may indeed provide desirable
>>reinforcement, but if things go wrong, it may create additional confusion.
>>
>>It is conceivable to specify IRI equivalence by specifying
>>character-by-character equivalence for ASCII characters, and %hh-escape
>>equivalence for non-ASCII characters. But the chance that this gets
>>implemented is probably very low.
>>
>>
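
The "convert as late as possible" approach described in the quoted
text above could be sketched roughly as follows in Python. This is an
assumption-laden illustration: the function names are hypothetical,
the path used in the example is invented, only non-ASCII characters
are escaped (via UTF-8), and any Unicode normalization step is
ignored.

```python
def iri_to_uri(iri: str) -> str:
    """Map an IRI to a URI by %hh-escaping the UTF-8 octets of every
    non-ASCII character. ASCII characters are copied unchanged."""
    out = []
    for ch in iri:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.extend("%{:02X}".format(octet) for octet in ch.encode("utf-8"))
    return "".join(out)


def late_escape_equivalent(a: str, b: str) -> bool:
    """Escape only for the comparison itself; the caller keeps the
    original strings, so no information is lost."""
    return iri_to_uri(a) == iri_to_uri(b)


# Hypothetical identifiers, used only for illustration.
iri = "http://www.w3.org/r\u00e9sum\u00e9"
uri = "http://www.w3.org/r%C3%A9sum%C3%A9"

print(iri == uri)                        # False
print(late_escape_equivalent(iri, uri))  # True
```

The cost mentioned in the quoted text shows up here as the extra pass
over both strings on every comparison (time), or as a cached escaped
copy kept next to the original (space).
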
>>The current version of the IRI draft
>>(http://www.ietf.org/internet-drafts/draft-duerst-iri-00.txt)
>>has been interpreted as prescribing %hh-escape equivalence,
>>because the draft clearly says that an IRI and the URI that it
>>is mapped to identify the same resource:
>>
>> >>>>
>>2.3 Mapping of IRIs to URIs
>>...
>>    This mapping has two purposes:
>>...
>>       b) Interpretational: URIs identify resources in various ways.
>>          IRIs also identify resources.  The resource that an IRI
>>          identifies is the same as the one identified by the URI
>>          obtained after converting the IRI according to the procedure
>>          defined here.  This means that there is no need to define the
>>          association between identifier and resource again on the IRI
>>          level.
>> >>>>
>>
>>But there is another interpretation: Because arbitrary URIs
>>can identify the same resource, e.g.
>>    http://search.ietf.org/internet-drafts/draft-duerst-iri-00.txt and
>>    http://www.ietf.org/internet-drafts/draft-duerst-iri-00.txt and
>>    http://www.w3.org/International/2002/draft-duerst-iri-00.txt
>>all identify the same resource, although that cannot be deduced
>>from their syntax, any artefacts (specs, software) that need to be
>>able to identify two resources as the same will need a mechanism
>>for doing that without relying on URI/IRI syntax anyway.
>>In other words, resource identity and resource identifier equivalence
>>are two different things.
>>
>>So for example, RDF could use character-for-character equivalence,
>>and something such as daml:sameIndividualAs can be used to indicate
>>that two URIs or IRIs refer to the same resource. It becomes then
>>mainly an issue of careful wording, to make sure that readers
>>do not confuse resource identifiers with resources.
>>
>>Anyway, we plan to adapt the wording in the IRI draft after
>>the TAG decision, to reflect the decision and to make the
>>implications clearer.
>>
>>In any case, it should be noted that while some specifications,
>>such as XML Namespaces or RDF, have to choose a single definition
>>of URI/IRI equivalence, other specifications and implementations
>>may choose to exploit additional knowledge. For example, proxies
>>will try to make as many assumptions as they can safely make
>>to reduce misses. Also, specifications that are closely related
>>to URI/IRI resolution may want to make similar assumptions.
>>For an example, see RFC 2616 (HTTP 1.1), section 3.2.3.
>>For another example, which specifically treats IRIs, see the XML
>>Catalogs spec, in particular
>>http://oasis-open.org/committees/entity/spec-2001-08-06.html#sysid-norm.
>>
>>
>>While the question of whether to treat %hh sequences as equivalent to
>>the characters they stand for or not is the most important aspect
>>of URIEquivalence-15 for IRIs, there are other aspects.
>>
>>First, it should be explicitly noted that equivalence on the
>>character level is applied after resolving different notations
>>in a 'carrier' (host) representation. As an example,
>>
>>       xlink:href='http://www.w3.org'
>>       xlink:href='http://www.w&#x33;.org'
>>    must be the same.
>>    [assuming xlink refers to the XLink namespace,
>>     and knowing that U+0033 is the digit '3']
>>
>>This of course depends on the carrier (host) language;
>>if you put http://www.w&#x33;.org into plain text email,
>>that's not a legal URI, and not the same as http://www.w3.org.
>>
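
The carrier-level point above can be checked with any XML parser. A
small Python sketch, assuming a made-up element name `a` and using the
standard library's ElementTree: the parser resolves the character
reference before the application ever sees the attribute value, so the
two values compare equal character by character.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"


def href_of(xml_text: str) -> str:
    """Return the xlink:href value as seen after XML parsing."""
    return ET.fromstring(xml_text).get("{%s}href" % XLINK_NS)


# The element name 'a' is hypothetical; only the attribute matters here.
plain = ("<a xmlns:xlink='http://www.w3.org/1999/xlink' "
         "xlink:href='http://www.w3.org'/>")
escaped = ("<a xmlns:xlink='http://www.w3.org/1999/xlink' "
           "xlink:href='http://www.w&#x33;.org'/>")

print(href_of(plain) == href_of(escaped))               # True
print("http://www.w&#x33;.org" == "http://www.w3.org")  # False
```

The second comparison is the plain-text case: with no carrier to
resolve the reference, the strings simply differ.
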
>>Second, in some cases casing equivalences can be relevant.
>>In particular, the I18N WG has discussed whether e.g.
>>       http://www.w3.org/XM%4C and
>>       http://www.w3.org/XM%4c
>>should be the same identifier, independently of whether this is
>>the same identifier as
>>       http://www.w3.org/XML
>>There is an argument for making %4C and %4c the same, because
>>there is no clear convention of using upper-case or lower case
>>(in contrast to http:, where lower-case is dominant). Also, there
>>is never any suggestion that they would refer to different resources.
>>
>>In general, case equivalence for characters outside ASCII is
>>language-dependent, and therefore should be avoided.
>>
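
If %4C and %4c were to be made the same, one way to do it (a sketch
only, with a hypothetical function name) is to upper-case the hex
digits of every escape before a character-by-character comparison. In
Python:

```python
import re

_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def normalize_escape_case(uri: str) -> str:
    """Upper-case the two hex digits of each %hh escape; leave every
    other character, including the escaped value itself, untouched."""
    return _ESCAPE.sub(lambda match: match.group(0).upper(), uri)


upper = "http://www.w3.org/XM%4C"
lower = "http://www.w3.org/XM%4c"
plain = "http://www.w3.org/XML"

print(normalize_escape_case(upper) == normalize_escape_case(lower))  # True
print(normalize_escape_case(upper) == plain)                         # False
```

This touches only the ASCII hex digits inside escapes, so it is
independent of the question of whether %4C and L are the same, and it
does not involve any language-dependent case mapping.
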
>>
>>
>>The following contains collected language from all three URI
>>RFCs showing that %hh-equivalence would be a valid choice:
>>(I collected these quite a while ago, and wanted to make
>>sure they are not missed.)
>>
>>    The current URI spec says:
>>
>>    http://www.ietf.org/rfc/rfc2396.txt, section 2.3:
>>
>>    >>>>
>>    Unreserved characters can be escaped without changing the semantics
>>    of the URI, but this should not be done unless the URI is being used
>>    in a context that does not allow the unescaped character to appear.
>>    >>>>
>>
>>    (to go directly to the relevant section:
>>    http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2396.html#sec-2.3)
>>
>>    [2.4.2. "When to Escape and Unescape", the escaping differences
>>    for reserved characters are defined as scheme-specific.]
>>
>>    Earlier URI/URL specs say:
>>
>>    http://www.ietf.org/rfc/rfc1738.txt, section 2.2:
>>
>>    Usually a URL has the same interpretation when an octet is
>>    represented by a character and when it encoded. However, this is not
>>    true for reserved characters: encoding a character reserved for a
>>    particular scheme may change the semantics of a URL.
>>
>>    (to go directly to the relevant section:
>>    http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc1738.html#sec-2.2)
>>
>>    And from http://www.ietf.org/rfc/rfc1630.txt:
>>
>>    >>>>
>>       There is a conflict between the need to be able to represent many
>>       characters including spaces within a URI directly, and the need to
>>       be able to use a URI in environments which have limited character
>>       sets or in which certain characters are prone to corruption.  This
>>       conflict has been resolved by use of an hexadecimal escaping
>>       method which may be applied to any characters forbidden in a given
>>       context.  When URLs are moved between contexts, the set of
>>       characters escaped may be enlarged or reduced unambiguously.
>>
>>    REDUCED OR INCREASED SAFE CHARACTER SETS
>>
>>       The same encoding method may be used for encoding characters whose
>>       use, although technically allowed in a URI, would be unwise due to
>>       problems of corruption by imperfect gateways or misrepresentation
>>       due to the use of variant character sets, or which would simply be
>>       awkward in a given environment.  Because a % sign always indicates
>>       an encoded character, a URI may be made "safer" simply by encoding
>>       any characters considered unsafe, while leaving already encoded
>>       characters still encoded.  Similarly, in cases where a larger set
>>       of characters is acceptable, % signs can be selectively and
>>       reversibly expanded.
>>
>>       Before two URIs can be compared, it is therefore necessary to
>>       bring them to the same encoding level.
>>
>>       However, the reserved characters mentioned above have a quite
>>       different significance when encoded, and so may NEVER be encoded
>>       and unencoded in this way.
>>
>>    ...
>>
>>    Example 1
>>
>>    The URIs
>>
>>                 http://info.cern.ch/albert/bertram/marie-claude
>>
>>    and
>>
>>                 http://info.cern.ch/albert/bertram/marie%2Dclaude
>>
>>    are identical, as the %2D encodes a hyphen character.
>>    >>>>
>>
>>
>>Regards,    Martin.



-- 
 Chris                            mailto:chris@w3.org
Received on Tuesday, 9 July 2002 07:15:34 UTC