Re: URIEquivalence-15 and IRIs from Martin Duerst on 2002-07-09 (www-tag@w3.org from July 2002)

From: Martin Duerst <duerst@w3.org>
Date: Wed, 10 Jul 2002 00:04:26 +0900
To: Chris Lilley <chris@w3.org>, www-tag@w3.org
Cc: w3c-i18n-ig@w3.org
Message-Id: <4.2.0.58.J.20020710000140.0542ae88@localhost>

At 13:15 02/07/09 +0200, Chris Lilley wrote:
>On Tuesday, July 9, 2002, 11:30:45 AM, Martin wrote:
>
>
>MD> Dear TAG,
>
>MD> Misha has already said that there is a new version of the IRI
>MD> draft; this is now also officially available at
>MD> http://www.ietf.org/internet-drafts/draft-duerst-iri-01.txt.
>
>MD> I would like to draw your attention to Section 2.3,
>MD> IRI Equivalence and Normalization, in particular to:
>
>MD>      In some scenarios, such as XML Namespaces  ([XMLNamespace]), a
>MD>      definite answer to the question of IRI equivalence is needed that is
>MD>      independent of the scheme used and always can be calculated quickly
>MD>      and without accessing a network.  In such cases, two IRIs SHOULD be
>MD>      defined as equivalent if and only if they are character-by-character
>MD>      equivalent (which is the same as byte-by-byte equivalent if the
>MD>      character encoding for both IRIs is the same).  In such a case, the
>MD>      comparison function MUST NOT map the IRIs to URIs.
>
>MD> Please note that this gives an explicit interpretation of
>MD> 'character-by-character', in accordance with what we understand
>MD> to be current practice.
>
>So, it means that if normalization has been done, two IRIs that look the
>same will compare the same; if it has not, then the two might not
>compare as equal and no software is going to fix that for you.
>
>And it (the paragraph quoted) means that ~ and %7E are not the same.

Yes. I plan to add explicit examples tomorrow.
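As a minimal sketch of what the quoted paragraph implies (the function
name and the example.org addresses are made up for illustration; they
are not taken from the draft), character-by-character equivalence is
plain string comparison with no unescaping and no mapping to URIs:

```python
def iri_equal_charwise(a: str, b: str) -> bool:
    """Character-by-character equivalence: compare the strings as they are.

    No %hh-unescaping is done and the IRIs are not mapped to URIs first.
    """
    return a == b


# '~' and '%7E' are different character sequences, so they do not match.
print(iri_equal_charwise("http://example.org/~user", "http://example.org/~user"))    # True
print(iri_equal_charwise("http://example.org/~user", "http://example.org/%7Euser"))  # False
```
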


>Whereas your text below seems to say that they are (because of late
>conversion, just before conversion).

Which text below? Can you explain/correct 'conversion before conversion'?


>Please clarify which is correct.
>
>MD> We plan to make some last edits on this document around July 22nd,
>MD> and then plan to send it off to the IESG. We would be glad to
>MD> change the above if the TAG decides that something different
>MD> is needed, but we would need a decision fairly soon.
>
>MD> Many thanks in advance,     Martin.
>
>
>MD> At 19:10 02/05/27 +0900, Martin Duerst wrote:
> >>Dear TAG,
> >>
> >>Here is my input on the issue of URI/IRI equivalence, for
> >>your consideration. This is a very important issue for IRIs.
> >>
> >>First and foremost, while it's okay to call the issue
> >>'URIEquivalence-15', its resolution should really be a solution
> >>both for URI equivalence and for IRI equivalence. While the
> >>choices are the same in both cases, IRIs bring in additional
> >>considerations.
>
>Noted. I don't think the issue name needs to be changed as long as
>that scope is clear.

As I said: "while it's okay to call the issue 'URIEquivalence-15',"

In full agreement here.

Regards,    Martin.


> >>The core choices from the view of IRIs are:
> >>
> >>a) 'character-by-character equivalence'
> >>    (taking a %hh-escaping as three characters)
> >>b) '%hh-escape equivalence' (treating %hh-escape
> >>    sequences as equivalent to the characters (based on US-ASCII/UTF-8)
> >>    they stand for, except for reserved characters!)
> >>
> >>The difference is more important for IRIs because the mapping
> >>from IRIs to URIs is based on (UTF-8 and) %hh-escaping. Because
> >>some protocols/formats/APIs will support IRIs whereas others
> >>(older/lower level) may not, having both escaped and unescaped
> >>versions of the same IRI is probably more frequent than for
> >>URIs (where %7E / ~ is the only example I have seen).
> >>This is a strong argument for %hh-escape equivalence.
> >>
> >>Because conversion from a URI to an IRI is not guaranteed to succeed,
> >>and even if it succeeds, is not guaranteed to produce the correct
> >>result (i.e. the original characters), it is important to convert
> >>from IRIs to URIs as late as possible. For %hh-escape equivalence,
> >>this means that %hh-escaping is only done for the actual comparison,
> >>but that the original IRI is always retained. This would need a
> >>certain amount of resources (time or space).
> >>
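One way to picture %hh-escape equivalence with late conversion is the
rough sketch below. It is only an illustration under assumptions: the
names comparison_key and iri_equal_escaped and the example.org
addresses are invented here, the unreserved set is the one from
RFC 2396 section 2.3, and the case of hex digits in escapes is left
untouched. The key is built only for the comparison; the caller keeps
the original IRI.

```python
import re

# Unreserved characters per RFC 2396, section 2.3: alphanumerics plus marks.
UNRESERVED = set(
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-_.!~*'()"
)


def comparison_key(iri: str) -> str:
    """Build a throwaway comparison key; the original IRI is not modified."""
    # Step 1: %hh-escape every non-ASCII character via its UTF-8 bytes.
    escaped = "".join(
        ch if ord(ch) < 128 else "".join(f"%{b:02X}" for b in ch.encode("utf-8"))
        for ch in iri
    )

    # Step 2: unescape %hh only where it stands for an unreserved ASCII
    # character; escapes of reserved characters and of UTF-8 bytes stay.
    def unescape(match: re.Match) -> str:
        ch = chr(int(match.group(1), 16))
        return ch if ch in UNRESERVED else match.group(0)

    return re.sub(r"%([0-9A-Fa-f]{2})", unescape, escaped)


def iri_equal_escaped(a: str, b: str) -> bool:
    """%hh-escape equivalence, computed only at comparison time."""
    return comparison_key(a) == comparison_key(b)


print(iri_equal_escaped("http://example.org/~user", "http://example.org/%7Euser"))  # True
print(iri_equal_escaped("http://example.org/a/b", "http://example.org/a%2Fb"))      # False
```

The cost mentioned above shows up here as either recomputing the key
for every comparison (time) or caching it next to the original (space).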
> >>The argument has been made that using character-by-character equivalence
> >>would create strong pressures to not convert from IRIs to URIs prematurely,
> >>which would be a good thing. It is difficult to judge whether this will
> >>be the case; if things go well, it may indeed provide desirable
> >>reinforcement, but if things go wrong, it may create additional confusion.
> >>
> >>It is conceivable to specify IRI equivalence by specifying
> >>character-by-character equivalence for ASCII characters, and
> >>%hh-escape equivalence for non-ASCII characters. But the chance
> >>that this gets implemented is probably very low.
> >>
> >>
> >>The current version of the IRI draft
> >>(http://www.ietf.org/internet-drafts/draft-duerst-iri-00.txt)
> >>has been interpreted as prescribing %hh-escape equivalence,
> >>because the draft clearly says that an IRI and the URI that it
> >>is mapped to identify the same resource:
> >>
> >> >>>>
> >>2.3 Mapping of IRIs to URIs
> >>...
> >>    This mapping has two purposes:
> >>...
> >>       b) Interpretational: URIs identify resources in various ways.
> >>          IRIs also identify resources.  The resource that an IRI
> >>          identifies is the same as the one identified by the URI
> >>          obtained after converting the IRI according to the procedure
> >>          defined here.  This means that there is no need to define the
> >>          association between identifier and resource again on the IRI
> >>          level.
> >> >>>>
> >>
> >>But there is another interpretation: Because arbitrary URIs
> >>can identify the same resource, e.g.
> >>    http://search.ietf.org/internet-drafts/draft-duerst-iri-00.txt and
> >>    http://www.ietf.org/internet-drafts/draft-duerst-iri-00.txt and
> >>    http://www.w3.org/International/2002/draft-duerst-iri-00.txt
> >>all identify the same resource, although this cannot be deduced
> >>from their syntax, any artefacts (specs, software) that need to be
> >>able to identify two resources as the same will need a mechanism
> >>for doing that without relying on URI/IRI syntax anyway.
> >>In other words, resource identity and resource identifier equivalence
> >>are two different things.
> >>
> >>So for example, RDF could use character-for-character equivalence,
> >>and something such as daml:sameIndividualAs can be used to indicate
> >>that two URIs or IRIs refer to the same resource. It then becomes
> >>mainly an issue of careful wording, to make sure that readers
> >>do not confuse resource identifiers with resources.
> >>
> >>Anyway, we plan to adapt the wording in the IRI draft after
> >>the TAG decision, to reflect the decision and to make the
> >>implications clearer.
> >>
> >>In any case, it should be noted that while some specifications,
> >>such as XML Namespaces or RDF, have to choose a single definition
> >>of URI/IRI equivalence, other specifications and implementations
> >>may choose to exploit additional knowledge. For example, proxies
> >>will try to make as many assumptions as they can safely make
> >>to reduce misses. Also, specifications that are closely related
> >>to URI/IRI resolution may want to make similar assumptions.
> >>For an example, see RFC 2616 (HTTP 1.1), section 3.2.3.
> >>For another example, which specifically treats IRIs, see the XML
> >>Catalogs spec, in particular
> >>http://oasis-open.org/committees/entity/spec-2001-08-06.html#sysid-norm.
> >>
> >>
> >>While the question of whether to treat %hh sequences as equivalent to
> >>the characters they stand for or not is the most important aspect
> >>of URIEquivalence-15 for IRIs, there are other aspects.
> >>
> >>First, it should be explicitly noted that equivalence on the
> >>character level is applied after resolving different notations
> >>in a 'carrier' (host) representation. As an example,
> >>
> >>       xlink:href='http://www.w3.org'
> >>       xlink:href='http://www.w&#x33;.org'
> >>    must be the same.
> >>    [assuming xlink refers to the XLink namespace,
> >>     and knowing that U+0033 is the digit '3']
> >>
> >>This of course depends on the carrier (host) language;
> >>if you put http://www.w&#x33;.org into plain text email,
> >>that's not a legal URI, and not the same as http://www.w3.org.
> >>
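A small sketch of this point, assuming XML as the carrier and using
Python's standard xml.etree.ElementTree parser (the element name 'a'
and the helper href_of are invented for the illustration): the parser
resolves the numeric character reference, so the comparison sees the
same characters in both cases, while the raw source text differs.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"


def href_of(xml_text: str) -> str:
    """Return the xlink:href value after the XML parser has resolved
    character references in the attribute value."""
    elem = ET.fromstring(xml_text)
    return elem.attrib[f"{{{XLINK}}}href"]


plain = f"<a xmlns:xlink='{XLINK}' xlink:href='http://www.w3.org'/>"
with_ref = f"<a xmlns:xlink='{XLINK}' xlink:href='http://www.w&#x33;.org'/>"

# Equal once the carrier notation is resolved ...
print(href_of(plain) == href_of(with_ref))                    # True
# ... but not equal as raw text, e.g. in plain text email.
print("http://www.w3.org" == "http://www.w&#x33;.org")        # False
```
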
> >>Second, in some cases casing equivalences can be relevant.
> >>In particular, the I18N WG has discussed whether e.g.
> >>       http://www.w3.org/XM%4C and
> >>       http://www.w3.org/XM%4c
> >>should be the same identifier, independently of whether this is
> >>the same identifier as
> >>       http://www.w3.org/XML
> >>There is an argument for making %4C and %4c the same, because
> >>there is no clear convention of using upper case or lower case
> >>(in contrast to http:, where lower case is dominant). Also, there
> >>is never any possibility that they would refer to different resources.
> >>
> >>In general, case equivalence for characters outside ASCII is
> >>language-dependent, and therefore should be avoided.
> >>
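If %4C and %4c were to be treated as the same, one possible approach
(a sketch only; the function name is invented, and choosing upper case
as the target is an arbitrary assumption, not something any spec quoted
here prescribes) is to fold the hex digits of each escape to one case
before comparing, without unescaping anything:

```python
import re


def normalize_escape_case(uri: str) -> str:
    """Upper-case the two hex digits of every %hh escape; nothing is unescaped."""
    return re.sub(r"%[0-9A-Fa-f]{2}", lambda m: m.group(0).upper(), uri)


a = normalize_escape_case("http://www.w3.org/XM%4C")
b = normalize_escape_case("http://www.w3.org/XM%4c")

print(a == b)                          # True
print(a == "http://www.w3.org/XML")    # False: the escape itself is kept
```
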
> >>
> >>
> >>The following contains collected language from all three URI
> >>RFCs showing that %hh-equivalence would be a valid choice:
> >>(I collected these quite a while ago, and wanted to make
> >>sure they are not missed.)
> >>
> >>    The current URI spec says:
> >>
> >>    http://www.ietf.org/rfc/rfc2396.txt, section 2.3:
> >>
> >>    >>>>
> >>    Unreserved characters can be escaped without changing the semantics
> >>    of the URI, but this should not be done unless the URI is being used
> >>    in a context that does not allow the unescaped character to appear.
> >>    >>>>
> >>
> >>    (to go directly to the relevant section:
> >>    http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2396.html#sec-2.3)
> >>
> >>    [2.4.2. "When to Escape and Unescape", the escaping differences
> >>    for reserved characters are defined as scheme-specific.]
> >>
> >>    Earlier URI/URL specs say:
> >>
> >>    http://www.ietf.org/rfc/rfc1738.txt, section 2.2:
> >>
> >>    Usually a URL has the same interpretation when an octet is
> >>    represented by a character and when it encoded. However, this is not
> >>    true for reserved characters: encoding a character reserved for a
> >>    particular scheme may change the semantics of a URL.
> >>
> >>    (to go directly to the relevant section:
> >>    http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc1738.html#sec-2.2)
> >>
> >>    And from http://www.ietf.org/rfc/rfc1630.txt:
> >>
> >>    >>>>
> >>       There is a conflict between the need to be able to represent many
> >>       characters including spaces within a URI directly, and the need to
> >>       be able to use a URI in environments which have limited character
> >>       sets or in which certain characters are prone to corruption.  This
> >>       conflict has been resolved by use of an hexadecimal escaping
> >>       method which may be applied to any characters forbidden in a given
> >>       context.  When URLs are moved between contexts, the set of
> >>       characters escaped may be enlarged or reduced unambiguously.
> >>
> >>    REDUCED OR INCREASED SAFE CHARACTER SETS
> >>
> >>       The same encoding method may be used for encoding characters whose
> >>       use, although technically allowed in a URI, would be unwise due to
> >>       problems of corruption by imperfect gateways or misrepresentation
> >>       due to the use of variant character sets, or which would simply be
> >>       awkward in a given environment.  Because a % sign always indicates
> >>       an encoded character, a URI may be made "safer" simply by encoding
> >>       any characters considered unsafe, while leaving already encoded
> >>       characters still encoded.  Similarly, in cases where a larger set
> >>       of characters is acceptable, % signs can be selectively and
> >>       reversibly expanded.
> >>
> >>       Before two URIs can be compared, it is therefore necessary to
> >>       bring them to the same encoding level.
> >>
> >>       However, the reserved characters mentioned above have a quite
> >>       different significance when encoded, and so may NEVER be encoded
> >>       and unencoded in this way.
> >>
> >>    ...
> >>
> >>    Example 1
> >>
> >>    The URIs
> >>
> >>                 http://info.cern.ch/albert/bertram/marie-claude
> >>
> >>    and
> >>
> >>                 http://info.cern.ch/albert/bertram/marie%2Dclaude
> >>
> >>    are identical, as the %2D encodes a hyphen character.
> >>    >>>>
> >>
> >>
> >>Regards,    Martin.
>
>
>
>--
>  Chris                            mailto:chris@w3.org
Received on Tuesday, 9 July 2002 11:07:43 UTC