- From: <Misha.Wolf@reuters.com>
- Date: Tue, 09 Jul 2002 12:09:44 +0100
- To: Martin Duerst <duerst@w3.org>
- Cc: w3c-i18n-ig@w3.org, www-tag@w3.org
Hi Martin,
I think the IRI spec [1] should state explicitly that by "character-by-
character equivalent" we mean that all of these (taken from a para a bit
further on) are different:
- foo://example.com/XML
- foo://example.com/XM%4C
- foo://example.com/XM%4c
After all, the Namespaces spec [2] states that:
[Definition:] URI references which identify namespaces are considered
identical when they are exactly the same character-for-character.
and there has been discussion of what exactly this means. Just repeating
it won't, IMO, clear up the confusion.
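
To make the point concrete, here is a minimal sketch. It assumes that plain
string equality in Python (code point by code point, with no unescaping and
no case folding) is an acceptable stand-in for "character-by-character"; the
helper name same_identifier is hypothetical and not taken from either spec.

```python
# The three identifiers from the list above.
candidates = [
    "foo://example.com/XML",
    "foo://example.com/XM%4C",
    "foo://example.com/XM%4c",
]


def same_identifier(a: str, b: str) -> bool:
    """Hypothetical helper: equal only if every character matches.

    A %hh escape counts as three separate characters, so no unescaping
    and no case folding of the hex digits takes place.
    """
    return a == b


# Under this reading, all three identifiers are pairwise different.
all_distinct = len(set(candidates)) == len(candidates)
```

Under this reading, no two of the three compare equal.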
[1] http://www.ietf.org/internet-drafts/draft-duerst-iri-01.txt
[2] http://www.w3.org/TR/REC-xml-names
Thanks,
Misha
On 09/07/2002 10:30:45 Martin Duerst wrote:
> Dear TAG,
>
> Misha has already said that there is a new version of the IRI
> draft; this is now also officially available at
> http://www.ietf.org/internet-drafts/draft-duerst-iri-01.txt.
>
> I would like to draw your attention to Section 2.3,
> IRI Equivalence and Normalization, in particular to:
>
> In some scenarios, such as XML Namespaces ([XMLNamespace]), a
> definite answer to the question of IRI equivalence is needed that is
> independent of the scheme used and always can be calculated quickly
> and without accessing a network. In such cases, two IRIs SHOULD be
> defined as equivalent if and only if they are character-by-character
> equivalent (which is the same as byte-by-byte equivalent if the
> character encoding for both IRIs is the same). In such a case, the
> comparison function MUST NOT map the IRIs to URIs.
>
> Please note that this gives an explicit interpretation of
> 'character-by-character', in accordance with what we understand
> to be current practice.
>
> We plan to make some last edits to this document around July 22nd,
> and then plan to send it off to the IESG. We would be glad to
> change the above if the TAG decides that something different
> is needed, but we would need a decision fairly soon.
>
> Many thanks in advance, Martin.
>
>
> At 19:10 02/05/27 +0900, Martin Duerst wrote:
> >Dear TAG,
> >
> >Here is my input on the issue of URI/IRI equivalence, for
> >your consideration. This is a very important issue for IRIs.
> >
> >First and foremost, while it's okay to call the issue
> >'URIEquivalence-15', its resolution should really be a solution
> >both for URI equivalence and for IRI equivalence. While the
> >choices are the same in both cases, IRIs bring in additional
> >considerations.
> >
> >The core choices from the view of IRIs are:
> >
> >a) 'character-by-character equivalence'
> > (taking a %hh-escaping as three characters)
> >b) '%hh-escape equivalence' (equating %hh-escape
> > sequences with the characters (based on US-ASCII/UTF-8)
> > they stand for, except for reserved characters!)
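
The two choices can be contrasted in a rough sketch. This is ASCII-only and
rests on some assumptions: the unreserved set is taken from RFC 2396
section 2.3 (alphanumerics plus marks), the function names are hypothetical,
and the UTF-8 handling that non-ASCII characters would need is left out.

```python
import re
import string

# Unreserved characters per RFC 2396 section 2.3 (alphanumerics plus marks).
UNRESERVED = set(string.ascii_letters + string.digits + "-_.!~*'()")

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def char_equivalent(a: str, b: str) -> bool:
    """Choice (a): a %hh escape is simply three characters."""
    return a == b


def _unescape_unreserved(match: "re.Match[str]") -> str:
    """Replace an escape by its character only if that character is unreserved."""
    ch = chr(int(match.group(1), 16))
    return ch if ch in UNRESERVED else match.group(0)


def escape_equivalent(a: str, b: str) -> bool:
    """Choice (b), ASCII only: escapes of unreserved characters are equated
    with the characters themselves; all other escapes are left untouched."""
    return _ESCAPE.sub(_unescape_unreserved, a) == _ESCAPE.sub(_unescape_unreserved, b)
```

With these definitions, .../XML and .../XM%4C differ under (a) but match
under (b), while a%2Fb and a/b stay different under both because "/" is
reserved.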
> >
> >The difference is more important for IRIs because the mapping
> >from IRIs to URIs is based on (UTF-8 and) %hh-escaping. Because
> >some protocols/formats/APIs will support IRIs whereas others
> >(older/lower level) may not, having both escaped and unescaped
> >versions of the same IRI is probably more frequent than for
> >URIs (where %7E / ~ is the only example I have seen).
> >This is a strong argument for %hh-escape equivalence.
> >
> >Because conversion from a URI to an IRI is not guaranteed to succeed,
> >and even if it succeeds, is not guaranteed to produce the correct
> >result (i.e. the original characters), it is important to convert
> >from IRIs to URIs as late as possible. For %hh-escape equivalence,
> >this means that %hh-escaping is only done for the actual comparison,
> >but that the original IRI is always retained. This would need a
> >certain amount of resources (time or space).
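
A sketch of "convert as late as possible" might look like the following. It
assumes the mapping is only "encode non-ASCII characters as UTF-8, then
%hh-escape each byte"; any further steps the draft may define are not
modelled, and the names iri_to_uri and Identifier are hypothetical.

```python
def iri_to_uri(iri: str) -> str:
    """Assumed mapping: UTF-8 encode each non-ASCII character and
    %hh-escape the resulting bytes; ASCII characters pass through."""
    out = []
    for ch in iri:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.extend("%{:02X}".format(byte) for byte in ch.encode("utf-8"))
    return "".join(out)


class Identifier:
    """Keeps the original IRI; the URI form exists only during comparison."""

    def __init__(self, iri: str) -> None:
        self.iri = iri  # original characters are never overwritten

    def escape_equals(self, other: "Identifier") -> bool:
        # The conversion happens here, at the last possible moment,
        # and its result is discarded afterwards.
        return iri_to_uri(self.iri) == iri_to_uri(other.iri)


a = Identifier("http://example.org/r\u00e9sum\u00e9")
b = Identifier("http://example.org/r%C3%A9sum%C3%A9")
same = a.escape_equals(b)
```

The cost mentioned above shows up directly: either the conversion is
repeated on every comparison (time), or a converted copy is cached next to
the original (space). Note that this sketch emits upper-case hex digits, so
an identifier written with %c3%a9 would not match without additional case
handling.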
> >
> >The argument has been made that using character-by-character equivalence
> >would create strong pressures to not convert from IRIs to URIs prematurely,
> >which would be a good thing. It is difficult to judge whether this will
> >be the case; if things go well, it may indeed provide desirable
> >reinforcement, but if things go wrong, it may create additional confusion.
> >
> >It is conceivable to specify IRI equivalence by specifying character-by-
> >character equivalence for ASCII characters, and %hh-escape equivalence
> >for non-ASCII characters. But the chance that this gets implemented
> >is probably very low.
> >
> >
> >The current version of the IRI draft
> >(http://www.ietf.org/internet-drafts/draft-duerst-iri-00.txt)
> >has been interpreted as prescribing %hh-escape equivalence,
> >because the draft clearly says that an IRI and the URI that it
> >is mapped to identify the same resource:
> >
> > >>>>
> >2.3 Mapping of IRIs to URIs
> >...
> > This mapping has two purposes:
> >...
> > b) Interpretational: URIs identify resources in various ways.
> > IRIs also identify resources. The resource that an IRI
> > identifies is the same as the one identified by the URI
> > obtained after converting the IRI according to the procedure
> > defined here. This means that there is no need to define the
> > association between identifier and resource again on the IRI
> > level.
> > >>>>
> >
> >But there is another interpretation: Because arbitrary URIs
> >can identify the same resource, e.g.
> > http://search.ietf.org/internet-drafts/draft-duerst-iri-00.txt and
> > http://www.ietf.org/internet-drafts/draft-duerst-iri-00.txt and
> > http://www.w3.org/International/2002/draft-duerst-iri-00.txt
> >all identify the same resource, without it being possible to deduce that
> >from their syntax, any artefacts (specs, software) that need to be
> >able to identify two resources as the same will need a mechanism
> >for doing that without relying on URI/IRI syntax anyway.
> >In other words, resource identity and resource identifier equivalence
> >are two different things.
> >
> >So for example, RDF could use character-for-character equivalence,
> >and something such as daml:sameIndividualAs can be used to indicate
> >that two URIs or IRIs refer to the same resource. It becomes then
> >mainly an issue of careful wording, to make sure that readers
> >do not confuse resource identifiers with resources.
> >
> >Anyway, we plan to adapt the wording in the IRI draft after
> >the TAG decision, to reflect the decision and to make the
> >implications clearer.
> >
> >In any case, it should be noted that while some specifications,
> >such as XML Namespaces or RDF, have to choose a single definition
> >of URI/IRI equivalence, other specifications and implementations
> >may choose to exploit additional knowledge. For example, proxies
> >will try to make as many assumptions as they can safely make
> >to reduce misses. Also, specifications that are closely related
> >to URI/IRI resolution may want to make similar assumptions.
> >For an example, see RFC 2616 (HTTP 1.1), section 3.2.3.
> >For another example, which specifically treats IRIs, see the XML
> >Catalogs spec, in particular
> >http://oasis-open.org/committees/entity/spec-2001-08-06.html#sysid-norm.
> >
> >
> >While the question of whether or not to treat %hh sequences as equivalent
> >to the characters they stand for is the most important aspect
> >of URIEquivalence-15 for IRIs, there are other aspects.
> >
> >First, it should be explicitly noted that equivalence on the
> >character level is applied after resolving different notations
> >in a 'carrier' (host) representation. As an example,
> >
> > xlink:href='http://www.w3.org'
> > xlink:href='http://www.w&#x33;.org'
> > must be the same.
> > [assuming xlink refers to the XLink namespace,
> > and knowing that U+0033 is the digit '3']
> >
> >This of course depends on the carrier (host) language;
> >if you put http://www.w&#x33;.org into plain text email,
> >that's not a legal URI, and not the same as http://www.w3.org.
> >
> >Second, in some cases casing equivalences can be relevant.
> >In particular, the I18N WG has discussed whether e.g.
> > http://www.w3.org/XM%4C and
> > http://www.w3.org/XM%4c
> >should be the same identifier, independently of whether this is
> >the same identifier as
> > http://www.w3.org/XML
> >There is an argument for making %4C and %4c the same, because
> >there is no clear convention of using upper-case or lower case
> >(in contrast to http:, where lower-case is dominant). Also, there
> >is never any question of them referring to different resources.
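
The narrower rule discussed here, folding only the case of the hex digits
inside %hh escapes, could be sketched as below. The function name
fold_escape_case is hypothetical, and upper case is an arbitrary choice of
canonical form rather than something the text prescribes.

```python
import re

_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def fold_escape_case(uri: str) -> str:
    """Upper-case the hex digits of every %hh escape; leave all else alone."""
    return _ESCAPE.sub(lambda m: m.group(0).upper(), uri)


lower = fold_escape_case("http://www.w3.org/XM%4c")
upper = fold_escape_case("http://www.w3.org/XM%4C")
plain = fold_escape_case("http://www.w3.org/XML")
```

After folding, the %4c and %4C forms compare equal character by character,
while both remain different from the unescaped XML form, so the two
questions stay independent.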
> >
> >In general, case equivalence for characters outside ASCII is
> >language-dependent, and therefore should be avoided.
> >
> >
> >
> >The following contains collected language from all three URI
> >RFCs showing that %hh-equivalence would be a valid choice:
> >(I collected these quite a while ago, and wanted to make
> >sure they are not missed.)
> >
> > The current URI spec says:
> >
> > http://www.ietf.org/rfc/rfc2396.txt, section 2.3:
> >
> > >>>>
> > Unreserved characters can be escaped without changing the semantics
> > of the URI, but this should not be done unless the URI is being used
> > in a context that does not allow the unescaped character to appear.
> > >>>>
> >
> > (to go directly to the relevant section:
> > http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2396.html#sec-2.3)
> >
> > [2.4.2. "When to Escape and Unescape", the escaping differences
> > for reserved characters are defined as scheme-specific.]
> >
> > Earlier URI/URL specs say:
> >
> > http://www.ietf.org/rfc/rfc1738.txt, section 2.2:
> >
> > Usually a URL has the same interpretation when an octet is
> > represented by a character and when it encoded. However, this is not
> > true for reserved characters: encoding a character reserved for a
> > particular scheme may change the semantics of a URL.
> >
> > (to go directly to the relevant section:
> > http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc1738.html#sec-2.2)
> >
> > And from http://www.ietf.org/rfc/rfc1630.txt:
> >
> > >>>
> > There is a conflict between the need to be able to represent many
> > characters including spaces within a URI directly, and the need to
> > be able to use a URI in environments which have limited character
> > sets or in which certain characters are prone to corruption. This
> > conflict has been resolved by use of an hexadecimal escaping
> > method which may be applied to any characters forbidden in a given
> > context. When URLs are moved between contexts, the set of
> > characters escaped may be enlarged or reduced unambiguously.
> >
> > REDUCED OR INCREASED SAFE CHARACTER SETS
> >
> > The same encoding method may be used for encoding characters whose
> > use, although technically allowed in a URI, would be unwise due to
> > problems of corruption by imperfect gateways or misrepresentation
> > due to the use of variant character sets, or which would simply be
> > awkward in a given environment. Because a % sign always indicates
> > an encoded character, a URI may be made "safer" simply by encoding
> > any characters considered unsafe, while leaving already encoded
> > characters still encoded. Similarly, in cases where a larger set
> > of characters is acceptable, % signs can be selectively and
> > reversibly expanded.
> >
> > Before two URIs can be compared, it is therefore necessary to
> > bring them to the same encoding level.
> >
> > However, the reserved characters mentioned above have a quite
> > different significance when encoded, and so may NEVER be encoded
> > and unencoded in this way.
> >
> > ...
> >
> > Example 1
> >
> > The URIs
> >
> > http://info.cern.ch/albert/bertram/marie-claude
> >
> > and
> >
> > http://info.cern.ch/albert/bertram/marie%2Dclaude
> >
> > are identical, as the %2D encodes a hyphen character.
> > >>>>
> >
> >
> >Regards, Martin.
>
Received on Tuesday, 9 July 2002 07:12:44 UTC