Re: URIEquivalence-15 and IRIs from Misha.Wolf@reuters.com on 2002-07-09 (www-tag@w3.org from July 2002)

From: <Misha.Wolf@reuters.com>
Date: Tue, 09 Jul 2002 12:09:44 +0100
To: Martin Duerst <duerst@w3.org>
Cc: w3c-i18n-ig@w3.org, www-tag@w3.org
Message-ID: <T5bfbb393d1c407b706150@reuters.com>
Hi Martin,

I think the IRI spec [1] should state explicitly that by "character-by-
character equivalent" we mean that all of these (taken from a para a bit
further on) are different:
-  foo://example.com/XML
-  foo://example.com/XM%4C
-  foo://example.com/XM%4c

After all, the Namespaces spec [2] states that:
   [Definition:] URI references which identify namespaces are considered
   identical when they are exactly the same character-for-character.
and there has been discussion of what exactly this means.  Just repeating
it won't, IMO, clear up the confusion.
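
To make the reading I have in mind concrete, here is a minimal Python
sketch. It assumes that "character-by-character" means plain string
equality with no unescaping and no case folding; the name
same_identifier is made up for illustration and comes from neither spec.

```python
# The three example identifiers taken from the IRI draft.
CANDIDATES = [
    "foo://example.com/XML",
    "foo://example.com/XM%4C",
    "foo://example.com/XM%4c",
]


def same_identifier(a: str, b: str) -> bool:
    """Character-by-character comparison: no unescaping, no case folding."""
    return a == b


# Number of distinct identifiers among the three under this rule.
DISTINCT = len(set(CANDIDATES))
print(DISTINCT)  # prints 3
```

Under this reading all three count as different identifiers.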

[1] http://www.ietf.org/internet-drafts/draft-duerst-iri-01.txt
[2] http://www.w3.org/TR/REC-xml-names

Thanks,
Misha


On 09/07/2002 10:30:45 Martin Duerst wrote:
> Dear TAG,
>
> Misha has already said that there is a new version of the IRI
> draft; this is now also officially available at
> http://www.ietf.org/internet-drafts/draft-duerst-iri-01.txt.
>
> I would like to draw your attention to Section 2.3,
> IRI Equivalence and Normalization, in particular to:
>
>      In some scenarios, such as XML Namespaces  ([XMLNamespace]), a
>      definite answer to the question of IRI equivalence is needed that is
>      independent of the scheme used and always can be calculated quickly
>      and without accessing a network.  In such cases, two IRIs SHOULD be
>      defined as equivalent if and only if they are character-by-character
>      equivalent (which is the same as byte-by-byte equivalent if the
>      character encoding for both IRIs is the same).  In such a case, the
>      comparison function MUST NOT map the IRIs to URIs.
>
> Please note that this gives an explicit interpretation of
> 'character-by-character', in accordance with what we understand
> to be current practice.
>
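As an illustration of the Section 2.3 text quoted above, a small Python
sketch. The example.org identifiers and the function names are
assumptions for illustration only. The point is that comparing
characters and comparing bytes agree once both strings use the same
encoding, and that no IRI-to-URI mapping takes place.

```python
def char_equivalent(a: str, b: str) -> bool:
    """Compare two IRIs character by character, without mapping to URIs."""
    return a == b


def byte_equivalent(a: str, b: str, encoding: str = "utf-8") -> bool:
    """Compare the byte sequences of two IRIs under one shared encoding."""
    return a.encode(encoding) == b.encode(encoding)


# Hypothetical identifiers: an IRI and the URI it would be mapped to.
IRI = "http://example.org/r\u00e9sum\u00e9"
URI = "http://example.org/r%C3%A9sum%C3%A9"

for encoding in ("utf-8", "utf-16"):
    # Both comparisons give the same answer for a shared encoding.
    assert byte_equivalent(IRI, URI, encoding) == char_equivalent(IRI, URI)
```

Because nothing is mapped to a URI first, the IRI and its escaped form
compare as different here.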
> We plan to make some last edits on this document around July 22nd,
> and then plan to send it off to the IESG. We would be glad to
> change the above if the TAG decides that something different
> is needed, but we would need a decision fairly soon.
>
> Many thanks in advance,     Martin.
>
>
> At 19:10 02/05/27 +0900, Martin Duerst wrote:
> >Dear TAG,
> >
> >Here is my input on the issue of URI/IRI equivalence, for
> >your consideration. This is a very important issue for IRIs.
> >
> >First and foremost, while it's okay to call the issue
> >'URIEquivalence-15', its resolution should really be a solution
> >both for URI equivalence and for IRI equivalence. While the
> >choices are the same in both cases, IRIs bring in additional
> >considerations.
> >
> >The core choices from the view of IRIs are:
> >
> >a) 'character-by-character equivalence'
> >    (taking a %hh-escaping as three characters)
> >b) '%hh-escape equivalence' (treating %hh-escape sequences
> >    as equivalent to the characters (based on US-ASCII/UTF-8)
> >    they stand for, except for reserved characters!)
> >
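A rough Python sketch of option b) for the ASCII case. It assumes the
unreserved set of RFC 2396 (alphanumerics plus the "mark" characters);
the function names are hypothetical. Escapes of reserved characters and
of bytes above 0x7F are left untouched in this sketch.

```python
import re

# Unreserved characters per RFC 2396: alphanumerics plus the marks.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "-_.!~*'()"
)
_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def unescape_unreserved(uri: str) -> str:
    """Replace %hh by its character only when that character is unreserved."""

    def repl(match: re.Match) -> str:
        ch = chr(int(match.group(1), 16))
        return ch if ch in UNRESERVED else match.group(0)

    return _ESCAPE.sub(repl, uri)


def escape_equivalent(a: str, b: str) -> bool:
    """%hh-escape equivalence: compare after unescaping unreserved characters."""
    return unescape_unreserved(a) == unescape_unreserved(b)
```

With this, %7E and ~ compare as equal, while %2F and / do not, since
"/" is reserved.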
> >The difference is more important for IRIs because the mapping
> >from IRIs to URIs is based on (UTF-8 and) %hh-escaping. Because
> >some protocols/formats/APIs will support IRIs whereas others
> >(older/lower level) may not, having both escaped and unescaped
> >versions of the same IRI is probably more frequent than for
> >URIs (where %7E / ~ is the only example I have seen).
> >This is a strong argument for %hh-escape equivalence.
> >
> >Because conversion from a URI to an IRI is not guaranteed to succeed,
> >and even if it succeeds, is not guaranteed to produce the correct
> >result (i.e. the original characters), it is important to convert
> >from IRIs to URIs as late as possible. For %hh-escape equivalence,
> >this means that %hh-escaping is only done for the actual comparison,
> >but that the original IRI is always retained. This would need a
> >certain amount of resources (time or space).
> >
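The "convert late, keep the original" approach in the paragraph above
could look roughly like this in Python. The mapping is simplified: it
only %hh-escapes the UTF-8 bytes of non-ASCII characters (upper-case hex
digits are an assumption) and skips any other steps the draft defines.
The class and function names are hypothetical.

```python
from dataclasses import dataclass


def iri_to_uri(iri: str) -> str:
    """Simplified mapping: %hh-escape the UTF-8 bytes of non-ASCII characters."""
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)


@dataclass(frozen=True)
class Identifier:
    """Keeps the IRI exactly as received; escaping happens only to compare."""

    original: str

    def comparison_key(self) -> str:
        return iri_to_uri(self.original)

    def escape_equivalent(self, other: "Identifier") -> bool:
        return self.comparison_key() == other.comparison_key()
```

Recomputing the key on every comparison costs time; caching it next to
the original would cost space instead.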
> >The argument has been made that using character-by-character equivalence
> >would create strong pressures to not convert from IRIs to URIs prematurely,
> >which would be a good thing. It is difficult to judge whether this will
> >be the case; if things go well, it may indeed provide desirable
> >reinforcement, but if things go wrong, it may create additional confusion.
> >
> >It is conceivable to specify IRI equivalence as character-by-
> >character equivalence for ASCII characters and %hh-escape equivalence
> >for non-ASCII characters. But the chance that this gets implemented
> >is probably very low.
> >
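For what it is worth, the mixed rule above can be sketched in Python as
follows. This is only an illustration under assumptions: runs of %hh
escapes for bytes 0x80 and above are unescaped when the whole run is
valid UTF-8, everything else (including ASCII escapes) is compared
character by character. The names are made up.

```python
import re

# One or more consecutive %hh escapes whose byte value is 0x80 or above.
_HIGH_RUN = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def unescape_non_ascii(identifier: str) -> str:
    """Turn escaped UTF-8 sequences back into non-ASCII characters."""

    def repl(match: re.Match) -> str:
        raw = bytes.fromhex(match.group(0).replace("%", ""))
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            # Not valid UTF-8: leave the whole run escaped.
            return match.group(0)

    return _HIGH_RUN.sub(repl, identifier)


def hybrid_equivalent(a: str, b: str) -> bool:
    """ASCII part: character by character. Non-ASCII part: %hh-escape equivalence."""
    return unescape_non_ascii(a) == unescape_non_ascii(b)
```

Here an escaped and an unescaped non-ASCII character compare as equal,
but XM%4C and XML still differ.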
> >
> >The current version of the IRI draft
> >(http://www.ietf.org/internet-drafts/draft-duerst-iri-00.txt)
> >has been interpreted as prescribing %hh-escape equivalence,
> >because the draft clearly says that an IRI and the URI that it
> >is mapped to identify the same resource:
> >
> > >>>>
> >2.3 Mapping of IRIs to URIs
> >...
> >    This mapping has two purposes:
> >...
> >       b) Interpretational: URIs identify resources in various ways.
> >          IRIs also identify resources.  The resource that an IRI
> >          identifies is the same as the one identified by the URI
> >          obtained after converting the IRI according to the procedure
> >          defined here.  This means that there is no need to define the
> >          association between identifier and resource again on the IRI
> >          level.
> > >>>>
> >
> >But there is another interpretation. Arbitrary URIs can identify
> >the same resource; e.g.
> >    http://search.ietf.org/internet-drafts/draft-duerst-iri-00.txt and
> >    http://www.ietf.org/internet-drafts/draft-duerst-iri-00.txt and
> >    http://www.w3.org/International/2002/draft-duerst-iri-00.txt
> >all identify the same resource, and this cannot be deduced from
> >their syntax. Any artefacts (specs, software) that need to be
> >able to identify two resources as the same will therefore need a
> >mechanism for doing that without relying on URI/IRI syntax anyway.
> >In other words, resource identity and resource identifier equivalence
> >are two different things.
> >
> >So for example, RDF could use character-for-character equivalence,
> >and something such as daml:sameIndividualAs can be used to indicate
> >that two URIs or IRIs refer to the same resource. It then becomes
> >mainly an issue of careful wording, to make sure that readers
> >do not confuse resource identifiers with resources.
> >
> >Anyway, we plan to adapt the wording in the IRI draft after
> >the TAG decision, to reflect the decision and to make the
> >implications clearer.
> >
> >In any case, it should be noted that while some specifications,
> >such as XML Namespaces or RDF, have to choose a single definition
> >of URI/IRI equivalence, other specifications and implementations
> >may choose to exploit additional knowledge. For example, proxies
> >will try to make as many assumptions as they can safely make
> >to reduce misses. Also, specifications that are closely related
> >to URI/IRI resolution may want to make similar assumptions.
> >For an example, see RFC 2616 (HTTP 1.1), section 3.2.3.
> >For another example, which specifically treats IRIs, see the XML
> >Catalogs spec, in particular
> >http://oasis-open.org/committees/entity/spec-2001-08-06.html#sysid-norm.
> >
> >
> >While the question of whether to treat %hh sequences as equivalent to
> >the characters they stand for or not is the most important aspect
> >of URIEquivalence-15 for IRIs, there are other aspects.
> >
> >First, it should be explicitly noted that equivalence on the
> >character level is applied after resolving different notations
> >in a 'carrier' (host) representation. As an example,
> >
> >       xlink:href='http://www.w3.org'
> >       xlink:href='http://www.w&#x33;.org'
> >    must be the same.
> >    [assuming xlink refers to the XLink namespace,
> >     and knowing that U+0033 is the digit '3']
> >
> >This of course depends on the carrier (host) language;
> >if you put http://www.w&#x33;.org into plain text email,
> >that's not a legal URI, and not the same as http://www.w3.org.
> >
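The carrier-level point above can be checked with a short Python sketch
using the standard library XML parser. The element name "a" and the
helper href_of are assumptions for illustration; the namespace name is
the XLink one.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"


def href_of(xml_text: str) -> str:
    """Return xlink:href after the XML parser has resolved character references."""
    element = ET.fromstring(xml_text)
    return element.get(f"{{{XLINK_NS}}}href")


PLAIN = f"<a xmlns:xlink='{XLINK_NS}' xlink:href='http://www.w3.org'/>"
ESCAPED = f"<a xmlns:xlink='{XLINK_NS}' xlink:href='http://www.w&#x33;.org'/>"

# Inside XML, both attribute values resolve to the same characters.
SAME_IN_XML = href_of(PLAIN) == href_of(ESCAPED)

# As plain text, with no carrier to resolve &#x33;, the strings differ.
SAME_AS_PLAIN_TEXT = "http://www.w&#x33;.org" == "http://www.w3.org"
```

The comparison itself stays character by character; only the point at
which it is applied changes with the carrier.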
> >Second, in some cases casing equivalences can be relevant.
> >In particular, the I18N WG has discussed whether e.g.
> >       http://www.w3.org/XM%4C and
> >       http://www.w3.org/XM%4c
> >should be the same identifier, independently of whether this is
> >the same identifier as
> >       http://www.w3.org/XML
> >There is an argument for making %4C and %4c the same, because
> >there is no clear convention of using upper case or lower case
> >(in contrast to http:, where lower case is dominant). Also, there
> >is never any chance that they would refer to different resources.
> >
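If %4C and %4c were to be treated as the same, the comparison could fold
only the hex digits of escapes, as in this Python sketch. Folding to
upper case is an arbitrary assumption, and the function names are
hypothetical. Nothing is unescaped, so XM%4C and XML stay different.

```python
import re

_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def fold_escape_case(uri: str) -> str:
    """Upper-case the hex digits of every %hh escape; leave all else untouched."""
    return _ESCAPE.sub(lambda match: match.group(0).upper(), uri)


def equal_ignoring_escape_case(a: str, b: str) -> bool:
    """Character-by-character comparison after folding the case of %hh escapes."""
    return fold_escape_case(a) == fold_escape_case(b)
```

Case differences outside escapes, for example in the path, are kept.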
> >In general, case equivalence for characters outside ASCII is
> >language-dependent, and therefore should be avoided.
> >
> >
> >
> >The following contains collected language from all three URI
> >RFCs showing that %hh-equivalence would be a valid choice:
> >(I collected these quite a while ago, and wanted to make
> >sure they are not missed.)
> >
> >    The current URI spec says:
> >
> >    http://www.ietf.org/rfc/rfc2396.txt, section 2.3:
> >
> >    >>>>
> >    Unreserved characters can be escaped without changing the semantics
> >    of the URI, but this should not be done unless the URI is being used
> >    in a context that does not allow the unescaped character to appear.
> >    >>>>
> >
> >    (to go directly to the relevant section:
> >    http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2396.html#sec-2.3)
> >
> >    [In 2.4.2, "When to Escape and Unescape", the escaping differences
> >    for reserved characters are defined as scheme-specific.]
> >
> >    Earlier URI/URL specs say:
> >
> >    http://www.ietf.org/rfc/rfc1738.txt, section 2.2:
> >
> >    Usually a URL has the same interpretation when an octet is
> >    represented by a character and when it encoded. However, this is not
> >    true for reserved characters: encoding a character reserved for a
> >    particular scheme may change the semantics of a URL.
> >
> >    (to go directly to the relevant section:
> >    http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc1738.html#sec-2.2)
> >
> >    And from http://www.ietf.org/rfc/rfc1630.txt:
> >
> >    >>>>
> >       There is a conflict between the need to be able to represent many
> >       characters including spaces within a URI directly, and the need to
> >       be able to use a URI in environments which have limited character
> >       sets or in which certain characters are prone to corruption.  This
> >       conflict has been resolved by use of an hexadecimal escaping
> >       method which may be applied to any characters forbidden in a given
> >       context.  When URLs are moved between contexts, the set of
> >       characters escaped may be enlarged or reduced unambiguously.
> >
> >    REDUCED OR INCREASED SAFE CHARACTER SETS
> >
> >       The same encoding method may be used for encoding characters whose
> >       use, although technically allowed in a URI, would be unwise due to
> >       problems of corruption by imperfect gateways or misrepresentation
> >       due to the use of variant character sets, or which would simply be
> >       awkward in a given environment.  Because a % sign always indicates
> >       an encoded character, a URI may be made "safer" simply by encoding
> >       any characters considered unsafe, while leaving already encoded
> >       characters still encoded.  Similarly, in cases where a larger set
> >       of characters is acceptable, % signs can be selectively and
> >       reversibly expanded.
> >
> >       Before two URIs can be compared, it is therefore necessary to
> >       bring them to the same encoding level.
> >
> >       However, the reserved characters mentioned above have a quite
> >       different significance when encoded, and so may NEVER be encoded
> >       and unencoded in this way.
> >
> >    ...
> >
> >    Example 1
> >
> >    The URIs
> >
> >                 http://info.cern.ch/albert/bertram/marie-claude
> >
> >    and
> >
> >                 http://info.cern.ch/albert/bertram/marie%2Dclaude
> >
> >    are identical, as the %2D encodes a hyphen character.
> >    >>>>
> >
> >
> >Regards,    Martin.
>



Received on Tuesday, 9 July 2002 07:12:44 UTC