- From: Martin Duerst <duerst@w3.org>
- Date: Mon, 27 May 2002 19:10:54 +0900
- To: www-tag@w3.org
- Cc: w3c-i18n-ig@w3.org
Dear TAG,

Here is my input on the issue of URI/IRI equivalence, for your consideration. This is a very important issue for IRIs.

First and foremost, while it's okay to call the issue 'URIEquivalence-15', its resolution should really be a solution both for URI equivalence and for IRI equivalence. While the choices are the same in both cases, IRIs bring in additional considerations.

The core choices from the view of IRIs are:

a) 'character-by-character equivalence' (taking a %hh-escape as three characters)

b) '%hh-escape equivalence' (treating %hh-escape sequences as equivalent to the characters, based on US-ASCII/UTF-8, that they stand for, except for reserved characters!)

The difference is more important for IRIs because the mapping from IRIs to URIs is based on (UTF-8 and) %hh-escaping. Because some protocols/formats/APIs will support IRIs whereas others (older/lower level) may not, having both escaped and unescaped versions of the same IRI is probably more frequent than for URIs (where %7E / ~ is the only example I have seen). This is a strong argument for %hh-escape equivalence.

Because conversion from a URI to an IRI is not guaranteed to succeed, and even if it succeeds, is not guaranteed to produce the correct result (i.e. the original characters), it is important to convert from IRIs to URIs as late as possible. For %hh-escape equivalence, this means that %hh-escaping is only done for the actual comparison, but that the original IRI is always retained. This would need a certain amount of resources (time or space).

The argument has been made that using character-by-character equivalence would create strong pressure not to convert from IRIs to URIs prematurely, which would be a good thing. It is difficult to judge whether this will be the case; if things go well, it may indeed provide desirable reinforcement, but if things go wrong, it may create additional confusion.
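As an illustration only, the two choices can be sketched in Python as follows. The function names and the example.org IRI are hypothetical, the unreserved set is the one from RFC 2396 section 2.3, and comparing escapes by octet value (so that the case of the hex digits does not matter) is a simplification made by this sketch, not something the choices above prescribe. The original strings are never modified; the escaped form is computed only for the comparison.

```python
import re

# Unreserved characters per RFC 2396, section 2.3 (alphanumerics plus marks).
UNRESERVED = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "-_.!~*'()"
)

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def char_equivalent(a, b):
    """Choice a): a %hh-escape simply counts as three characters."""
    return a == b


def comparison_key(identifier):
    """Bring an identifier to a common escaping level, for comparison only.

    - an escape of an unreserved ASCII character becomes that character
    - a non-ASCII character becomes its UTF-8 octets
    - every other escape stays an octet, so an escaped reserved character
      never matches the literal reserved character
    """
    key = []
    i = 0
    while i < len(identifier):
        match = _ESCAPE.match(identifier, i)
        if match:
            octet = int(match.group(1), 16)
            if chr(octet) in UNRESERVED:
                key.append(("char", chr(octet)))
            else:
                key.append(("octet", octet))
            i = match.end()
        else:
            ch = identifier[i]
            if ord(ch) < 128:
                key.append(("char", ch))
            else:
                key.extend(("octet", b) for b in ch.encode("utf-8"))
            i += 1
    return tuple(key)


def escape_equivalent(a, b):
    """Choice b): compare after bringing both to the same escaping level."""
    return comparison_key(a) == comparison_key(b)
```

With this sketch, http://www.w3.org/~duerst and http://www.w3.org/%7Eduerst are different under choice a) and the same under choice b), whereas a path containing %2F never matches one containing a literal slash under either choice.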
It is conceivable to specify IRI equivalence as character-by-character equivalence for ASCII characters and %hh-escape equivalence for non-ASCII characters. But the chance that this gets implemented is probably very low.

The current version of the IRI draft (http://www.ietf.org/internet-drafts/draft-duerst-iri-00.txt) has been interpreted as prescribing %hh-escape equivalence, because the draft clearly says that an IRI and the URI that it is mapped to identify the same resource:

>>>>
2.3 Mapping of IRIs to URIs
...
This mapping has two purposes:
...
b) Interpretational: URIs identify resources in various ways. IRIs also identify resources. The resource that an IRI identifies is the same as the one identified by the URI obtained after converting the IRI according to the procedure defined here. This means that there is no need to define the association between identifier and resource again on the IRI level.
>>>>

But there is another interpretation. Arbitrary URIs can identify the same resource; for example,

http://search.ietf.org/internet-drafts/draft-duerst-iri-00.txt
http://www.ietf.org/internet-drafts/draft-duerst-iri-00.txt
http://www.w3.org/International/2002/draft-duerst-iri-00.txt

all identify the same resource, and this cannot be deduced from their syntax. Any artefacts (specs, software) that need to be able to identify two resources as the same will therefore need a mechanism for doing that without relying on URI/IRI syntax anyway. In other words, resource identity and resource identifier equivalence are two different things. So for example, RDF could use character-for-character equivalence, and something such as daml:sameIndividualAs can be used to indicate that two URIs or IRIs refer to the same resource. It then becomes mainly an issue of careful wording, to make sure that readers do not confuse resource identifiers with resources.
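The separation between identifier equivalence and resource identity can be sketched roughly as below. This is only an illustration under assumptions: the CoreferenceTable class and its methods are hypothetical and do not correspond to any DAML or RDF API. Identifiers are compared character by character, and the knowledge that two of them denote the same resource is recorded separately, as explicit assertions.

```python
class CoreferenceTable:
    """Hypothetical store of 'these two identifiers denote the same resource'.

    Identifiers are plain strings compared character by character; the
    assertions are kept in a union-find structure, independent of syntax.
    """

    def __init__(self):
        self._parent = {}

    def _find(self, identifier):
        # Register unseen identifiers as their own representative.
        self._parent.setdefault(identifier, identifier)
        while self._parent[identifier] != identifier:
            # Path halving keeps later lookups short.
            self._parent[identifier] = self._parent[self._parent[identifier]]
            identifier = self._parent[identifier]
        return identifier

    def assert_same(self, a, b):
        """Record an explicit statement that a and b identify one resource."""
        self._parent[self._find(a)] = self._find(b)

    def same_resource(self, a, b):
        """True if a and b are linked by recorded assertions (or are equal)."""
        return self._find(a) == self._find(b)


search = "http://search.ietf.org/internet-drafts/draft-duerst-iri-00.txt"
ietf = "http://www.ietf.org/internet-drafts/draft-duerst-iri-00.txt"
w3c = "http://www.w3.org/International/2002/draft-duerst-iri-00.txt"

table = CoreferenceTable()
table.assert_same(search, ietf)
table.assert_same(ietf, w3c)
```

After these two assertions, table.same_resource(search, w3c) holds, although the three strings remain three different identifiers.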
Anyway, we plan to adapt the wording in the IRI draft after the TAG decision, to reflect the decision and to make the implications clearer.

In any case, it should be noted that while some specifications, such as XML Namespaces or RDF, have to choose a single definition of URI/IRI equivalence, other specifications and implementations may choose to exploit additional knowledge. For example, proxies will try to make as many assumptions as they can safely make to reduce misses. Also, specifications that are closely related to URI/IRI resolution may want to make similar assumptions. For an example, see RFC 2616 (HTTP 1.1), section 3.2.3. For another example, which specifically treats IRIs, see the XML Catalogs spec, in particular http://oasis-open.org/committees/entity/spec-2001-08-06.html#sysid-norm.

While the question of whether or not to treat %hh sequences as equivalent to the characters they stand for is the most important aspect of URIEquivalence-15 for IRIs, there are other aspects.

First, it should be explicitly noted that equivalence on the character level is applied after resolving different notations in a 'carrier' (host) representation. As an example,

xlink:href='http://www.w&#x33;.org'
xlink:href='http://www.w3.org'

must be the same. [assuming xlink refers to the XLink namespace, and knowing that U+0033 is the digit '3'] This of course depends on the carrier (host) language; if you put http://www.w&#x33;.org into plain text email, that's not a legal URI, and not the same as http://www.w3.org.

Second, in some cases casing equivalences can be relevant. In particular, the I18N WG has discussed whether e.g.

http://www.w3.org/XM%4C and
http://www.w3.org/XM%4c

should be the same identifier, independently of whether this is the same identifier as

http://www.w3.org/XML

There is an argument for making %4C and %4c the same, because there is no clear convention of using upper case or lower case (in contrast to http:, where lower case is dominant).
Also, there is never any question of the two referring to different resources. In general, case equivalence for characters outside ASCII is language-dependent, and therefore should be avoided.

The following is language collected from all three URI RFCs, showing that %hh-equivalence would be a valid choice. (I collected these quite a while ago, and wanted to make sure they are not missed.)

The current URI spec says:

http://www.ietf.org/rfc/rfc2396.txt, section 2.3:

>>>>
Unreserved characters can be escaped without changing the semantics of the URI, but this should not be done unless the URI is being used in a context that does not allow the unescaped character to appear.
>>>>

(to go directly to the relevant section: http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2396.html#sec-2.3)

[In 2.4.2, "When to Escape and Unescape", the escaping differences for reserved characters are defined as scheme-specific.]

Earlier URI/URL specs say:

http://www.ietf.org/rfc/rfc1738.txt, section 2.2:

>>>>
Usually a URL has the same interpretation when an octet is represented by a character and when it encoded. However, this is not true for reserved characters: encoding a character reserved for a particular scheme may change the semantics of a URL.
>>>>

(to go directly to the relevant section: http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc1738.html#sec-2.2)

And from http://www.ietf.org/rfc/rfc1630.txt:

>>>>
There is a conflict between the need to be able to represent many characters including spaces within a URI directly, and the need to be able to use a URI in environments which have limited character sets or in which certain characters are prone to corruption. This conflict has been resolved by use of an hexadecimal escaping method which may be applied to any characters forbidden in a given context. When URLs are moved between contexts, the set of characters escaped may be enlarged or reduced unambiguously.
REDUCED OR INCREASED SAFE CHARACTER SETS

The same encoding method may be used for encoding characters whose use, although technically allowed in a URI, would be unwise due to problems of corruption by imperfect gateways or misrepresentation due to the use of variant character sets, or which would simply be awkward in a given environment. Because a % sign always indicates an encoded character, a URI may be made "safer" simply by encoding any characters considered unsafe, while leaving already encoded characters still encoded. Similarly, in cases where a larger set of characters is acceptable, % signs can be selectively and reversibly expanded.

Before two URIs can be compared, it is therefore necessary to bring them to the same encoding level.

However, the reserved characters mentioned above have a quite different significance when encoded, and so may NEVER be encoded and unencoded in this way.

...

Example 1

The URIs

http://info.cern.ch/albert/bertram/marie-claude

and

http://info.cern.ch/albert/bertram/marie%2Dclaude

are identical, as the %2D encodes a hyphen character.
>>>>

Regards,
Martin.
Received on Monday, 27 May 2002 07:46:21 UTC