- From: <Misha.Wolf@reuters.com>
- Date: Tue, 09 Jul 2002 12:09:44 +0100
- To: Martin Duerst <duerst@w3.org>
- Cc: w3c-i18n-ig@w3.org, www-tag@w3.org
Hi Martin,

I think the IRI spec [1] should state explicitly that by
"character-by-character equivalent" we mean that all of these
(taken from a para a bit further on) are different:

   - foo://example.com/XML
   - foo://example.com/XM%4C
   - foo://example.com/XM%4c

After all, the Namespaces spec [2] states that:

   [Definition:] URI references which identify namespaces are
   considered identical when they are exactly the same
   character-for-character.

and there has been discussion of what exactly this means.  Just
repeating it won't, IMO, clear up the confusion.

[1] http://www.ietf.org/internet-drafts/draft-duerst-iri-01.txt
[2] http://www.w3.org/TR/REC-xml-names

Thanks,
Misha


On 09/07/2002 10:30:45 Martin Duerst wrote:
> Dear TAG,
>
> Misha has already said that there is a new version of the IRI
> draft; this is now also officially available at
> http://www.ietf.org/internet-drafts/draft-duerst-iri-01.txt.
>
> I would like to draw your attention to Section 2.3,
> IRI Equivalence and Normalization, in particular to:
>
>    In some scenarios, such as XML Namespaces ([XMLNamespace]), a
>    definite answer to the question of IRI equivalence is needed that is
>    independent of the scheme used and always can be calculated quickly
>    and without accessing a network. In such cases, two IRIs SHOULD be
>    defined as equivalent if and only if they are character-by-character
>    equivalent (which is the same as byte-by-byte equivalent if the
>    character encoding for both IRIs is the same). In such a case, the
>    comparison function MUST NOT map the IRIs to URIs.
>
> Please note that this makes an explicit interpretation of
> 'character-by-character', according with what we understand
> to be current practice.
>
> We plan to make some last edits on this document around July 22nd,
> and then plan to send it off to the IESG. We would be glad to
> change the above if the TAG decides that something different
> is needed, but we would need a decision fairly soon.
>
> Many thanks in advance,    Martin.
>
>
> At 19:10 02/05/27 +0900, Martin Duerst wrote:
> >Dear TAG,
> >
> >Here is my input on the issue of URI/IRI equivalence, for
> >your consideration. This is a very important issue for IRIs.
> >
> >First and foremost, while it's okay to call the issue
> >'URIEquivalence-15', its resolution should really be a solution
> >both for URI equivalence and for IRI equivalence. While the
> >choices are the same in both cases, IRIs bring in additional
> >considerations.
> >
> >The core choices from the view of IRIs are:
> >
> >a) 'character-by-character equivalence'
> >   (taking a %hh-escaping as three characters)
> >b) '%hh-escape equivalence' (equivalencing %hh-escape
> >   sequences with the characters (based on US-ASCII/UTF-8)
> >   they stand for (except for reserved characters!)
> >
> >The difference is more important for IRIs because the mapping
> >from IRIs to URIs is based on (UTF-8 and) %hh-escaping. Because
> >some protocols/formats/APIs will support IRIs whereas others
> >(older/lower level) may not, having both escaped and unescaped
> >versions of the same IRI is probably more frequent than for
> >URIs (where %7E / ~ is the only example I have seen).
> >This is a strong argument for %hh-escape equivalence.
> >
> >Because conversion from a URI to an IRI is not guaranteed to succeed,
> >and even if it succeeds, is not guaranteed to produce the correct
> >result (i.e. the original characters), it is important to convert
> >from IRIs to URIs as late as possible. For %hh-escape equivalence,
> >this means that %hh-escaping is only done for the actual comparison,
> >but that the original IRI is always retained. This would need a
> >certain amount of resources (time or space).
> >
> >The argument has been made that using character-by-character equivalence
> >would create strong pressures to not convert from IRIs to URIs prematurely,
> >which would be a good thing.
> >It is difficult to judge whether this will
> >be the case; if things go well, it may indeed provide desirable
> >reinforcement, but if things go wrong, it may create additional confusion.
> >
> >It is thinkable to specify IRI equivalence by specifying
> >character-by-character equivalence for ASCII characters, and
> >%hh-escape equivalence for non-ascii characters. But the chance
> >that this gets implemented is probably very low.
> >
> >
> >The current version of the IRI draft
> >(http://www.ietf.org/internet-drafts/draft-duerst-iri-00.txt)
> >has been interpreted as prescribing %hh-escape equivalence,
> >because the draft clearly says that an IRI and the URI that it
> >is mapped to identify the same resource:
> >
> > >>>>
> >2.3 Mapping of IRIs to URIs
> >...
> >    This mapping has two purposes:
> >...
> >    b) Interpretational: URIs identify resources in various ways.
> >       IRIs also identify resources. The resource that an IRI
> >       identifies is the same as the one identified by the URI
> >       obtained after converting the IRI according to the procedure
> >       defined here. This means that there is no need to define the
> >       association between identifier and resource again on the IRI
> >       level.
> > >>>>
> >
> >But there is another interpretation: Because arbitrary URIs
> >can identify the same resource, e.g.
> >   http://search.ietf.org/internet-drafts/draft-duerst-iri-00.txt and
> >   http://www.ietf.org/internet-drafts/draft-duerst-iri-00.txt and
> >   http://www.w3.org/International/2002/draft-duerst-iri-00.txt
> >all identify the same resource, without allowing to deduce that
> >from their syntax, any artefacts (specs, software) that need to be
> >able to identify two resources as the same will need a mechanism
> >for doing that without relying on URI/IRI syntax anyway.
> >In other words, resource identity and resource identifier equivalence
> >are two different things.
> >
> >So for example, RDF could use character-for-character equivalence,
> >and something such as daml:sameIndividualAs can be used to indicate
> >that two URIs or IRIs refer to the same resource. It becomes then
> >mainly an issue of careful wording, to make sure that readers
> >do not confuse resource identifiers with resources.
> >
> >Anyway, we plan to adapt the wording in the IRI draft after
> >the TAG decision, to reflect the decision and to make the
> >implications clearer.
> >
> >In any case, it should be noted that while some specifications,
> >such as XML Namespaces or RDF, have to choose a single definition
> >of URI/IRI equivalence, other specifications and implementations
> >may choose to exploit additional knowledge. For example, proxies
> >will try to make as many assumptions as they can safely make
> >to reduce misses. Also, specifications that are closely related
> >to URI/IRI resolution may want to make similar assumptions.
> >For an example, see RFC 2616 (HTTP 1.1), section 3.2.3.
> >For another example, which specifically treats IRIs, see the XML
> >Catalogs spec, in particular
> >http://oasis-open.org/committees/entity/spec-2001-08-06.html#sysid-norm.
> >
> >
> >While the question of whether to treat %hh sequences equivalent to
> >the characters they stand for or not is the most important aspect
> >of URIEquivalence-15 for IRIs, there are other aspects.
> >
> >First, it should be explicitly noted that equivalence on the
> >character level is applied after resolving different notations
> >in a 'carrier' (host) representation. As an example,
> >
> >    xlink:href='http://www.w3.org'
> >    xlink:href='http://www.w&#x33;.org'
> >    must be the same.
> >    [assuming xlink refers to the XLink namespace,
> >     and knowing that U+0033 is the digit '3']
> >
> >This of course depends on the carrier (host) language;
> >if you put http://www.w&#x33;.org into plain text email,
> >that's not a legal URI, and not the same as http://www.w3.org.
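
A minimal Python sketch of the carrier-resolution step described above, assuming an XML/HTML carrier in which a numeric character reference such as &#x33; stands for the digit 3. The function names are hypothetical, and the standard library's html.unescape is used only as a convenient stand-in for a real XML attribute parser (it follows HTML rules, which agree with XML for numeric references like this one).

```python
import html


def resolve_carrier(attribute_value: str) -> str:
    """Hypothetical helper: undo the carrier's character references.

    Assumption: the identifier was read from XML/HTML attribute text.
    """
    return html.unescape(attribute_value)


def same_identifier_in_markup(a: str, b: str) -> bool:
    """Compare character by character, but only after carrier resolution."""
    return resolve_carrier(a) == resolve_carrier(b)


if __name__ == "__main__":
    plain = "http://www.w3.org"
    referenced = "http://www.w&#x33;.org"
    # Inside the carrier, both spellings denote the same character string.
    print(same_identifier_in_markup(plain, referenced))
    # Outside the carrier (e.g. plain text email) the raw strings differ.
    print(plain == referenced)
```

The comparison itself stays a plain string comparison; only the decoding step depends on the host language.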
> >
> >Second, in some cases casing equivalences can be relevant.
> >In particular, the I18N WG has discussed whether e.g.
> >    http://www.w3.org/XM%4C and
> >    http://www.w3.org/XM%4c
> >should be the same identifier, independently of whether this is
> >the same identifier as
> >    http://www.w3.org/XML
> >There is an argument for making %4C and %4c the same, because
> >there is no clear convention of using upper-case or lower case
> >(in contrast to http:, where lower-case is dominant). Also, there
> >is never ever any doubt that they would refer to different resources.
> >
> >In general, case equivalence for characters outside ASCII is
> >language-dependent, and therefore should be avoided.
> >
> >
> >
> >The following contains collected language from all three URI
> >RFCs showing that %hh-equivalence would be a valid choice:
> >(I collected these quite a while ago, and wanted to make
> >sure they are not missed.)
> >
> >  The current URI spec says:
> >
> >  http://www.ietf.org/rfc/rfc2396.txt, section 2.3:
> >
> >  >>>>
> >  Unreserved characters can be escaped without changing the semantics
> >  of the URI, but this should not be done unless the URI is being used
> >  in a context that does not allow the unescaped character to appear.
> >  >>>>
> >
> >  (to go directly to the relevant section:
> >  http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2396.html#sec-2.3)
> >
> >  [2.4.2. "When to Escape and Unescape", the escaping differences
> >  for reserved characters are defined as scheme-specific.]
> >
> >  Earlier URI/URL specs say:
> >
> >  http://www.ietf.org/rfc/rfc1738.txt, section 2.2:
> >
> >  Usually a URL has the same interpretation when an octet is
> >  represented by a character and when it encoded. However, this is not
> >  true for reserved characters: encoding a character reserved for a
> >  particular scheme may change the semantics of a URL.
> >
> >  (to go directly to the relevant section:
> >  http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc1738.html#sec-2.2)
> >
> >  And from http://www.ietf.org/rfc/rfc1630.txt:
> >
> >  >>>
> >  There is a conflict between the need to be able to represent many
> >  characters including spaces within a URI directly, and the need to
> >  be able to use a URI in environments which have limited character
> >  sets or in which certain characters are prone to corruption. This
> >  conflict has been resolved by use of an hexadecimal escaping
> >  method which may be applied to any characters forbidden in a given
> >  context. When URLs are moved between contexts, the set of
> >  characters escaped may be enlarged or reduced unambiguously.
> >
> >  REDUCED OR INCREASED SAFE CHARACTER SETS
> >
> >  The same encoding method may be used for encoding characters whose
> >  use, although technically allowed in a URI, would be unwise due to
> >  problems of corruption by imperfect gateways or misrepresentation
> >  due to the use of variant character sets, or which would simply be
> >  awkward in a given environment. Because a % sign always indicates
> >  an encoded character, a URI may be made "safer" simply by encoding
> >  any characters considered unsafe, while leaving already encoded
> >  characters still encoded. Similarly, in cases where a larger set
> >  of characters is acceptable, % signs can be selectively and
> >  reversibly expanded.
> >
> >  Before two URIs can be compared, it is therefore necessary to
> >  bring them to the same encoding level.
> >
> >  However, the reserved characters mentioned above have a quite
> >  different significance when encoded, and so may NEVER be encoded
> >  and unencoded in this way.
> >
> >  ...
> >
> >  Example 1
> >
> >  The URIs
> >
> >  http://info.cern.ch/albert/bertram/marie-claude
> >
> >  and
> >
> >  http://info.cern.ch/albert/bertram/marie%2Dclaude
> >
> >  are identical, as the %2D encodes a hyphen character.
> >  >>>>
> >
> >
> >Regards,    Martin.
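
To make the two choices discussed in this thread concrete, here is a minimal Python sketch, not taken from the IRI draft or any of the RFCs quoted. It contrasts (a) character-by-character equivalence with (b) %hh-escape equivalence. All function names are hypothetical. The sketch assumes the RFC 2396 set of unreserved characters, treats every other escape (including escaped reserved characters) as significant apart from the case of its hex digits, and assumes non-ASCII characters map to UTF-8 %hh sequences.

```python
import re

# Unreserved characters as listed in RFC 2396, section 2.3.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "-_.!~*'()"
)

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def char_by_char_equal(a: str, b: str) -> bool:
    """Choice (a): %hh counts as three characters; no mapping is applied."""
    return a == b


def _escape_non_ascii(iri: str) -> str:
    """Map each non-ASCII character to its UTF-8 %HH sequence."""
    return "".join(
        c if ord(c) < 128
        else "".join(f"%{byte:02X}" for byte in c.encode("utf-8"))
        for c in iri
    )


def _unescape_unreserved(uri: str) -> str:
    """Decode %hh only for unreserved ASCII characters.

    Any other escape is kept, with its hex digits upper-cased so that
    %2f and %2F compare equal.
    """
    def replace(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        return "%" + match.group(1).upper()

    return _ESCAPE.sub(replace, uri)


def escape_equal(a: str, b: str) -> bool:
    """Choice (b): compare after bringing both to the same encoding level.

    The originals are not modified; the mapping is used for comparison only.
    """
    def canonical(s: str) -> str:
        return _unescape_unreserved(_escape_non_ascii(s))

    return canonical(a) == canonical(b)


if __name__ == "__main__":
    uris = [
        "foo://example.com/XML",
        "foo://example.com/XM%4C",
        "foo://example.com/XM%4c",
    ]
    for left in uris:
        for right in uris:
            print(left, right,
                  char_by_char_equal(left, right),
                  escape_equal(left, right))
```

Under choice (a) the three foo://example.com/ spellings are pairwise different, which is the reading Misha asks the draft to state explicitly; under choice (b) they all compare equal, as do the two marie-claude URIs from RFC 1630. Note that choice (a) needs no tables or decoding at all, which is part of why it can always be calculated quickly.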
Received on Tuesday, 9 July 2002 07:12:44 UTC