- From: Chris Lilley <chris@w3.org>
- Date: Tue, 9 Jul 2002 13:15:11 +0200
- To: www-tag@w3.org, Martin Duerst <duerst@w3.org>
- CC: w3c-i18n-ig@w3.org
On Tuesday, July 9, 2002, 11:30:45 AM, Martin wrote:

MD> Dear TAG,

MD> Misha has already said that there is a new version of the IRI
MD> draft; this is now also officially available at
MD> http://www.ietf.org/internet-drafts/draft-duerst-iri-01.txt.

MD> I would like to draw your attention to Section 2.3,
MD> IRI Equivalence and Normalization, in particular to:

MD> In some scenarios, such as XML Namespaces ([XMLNamespace]), a
MD> definite answer to the question of IRI equivalence is needed that is
MD> independent of the scheme used and always can be calculated quickly
MD> and without accessing a network. In such cases, two IRIs SHOULD be
MD> defined as equivalent if and only if they are character-by-character
MD> equivalent (which is the same as byte-by-byte equivalent if the
MD> character encoding for both IRIs is the same). In such a case, the
MD> comparison function MUST NOT map the IRIs to URIs.

MD> Please note that this makes an explicit interpretation of
MD> 'character-by-character', in accordance with what we understand
MD> to be current practice.

So, it means that if normalization has been done, two IRIs that look
the same will compare the same; if it has not, then the two might not
compare as equal and no software is going to fix that for you.

And it (the paragraph quoted) means that ~ and %7E are not the same.
Whereas your text below seems to say that they are (because of late
conversion, just before comparison).

Please clarify which is correct.

MD> We plan to do some last edits on this document around July 22nd,
MD> and then plan to send it off to the IESG. We would be glad to
MD> change the above if the TAG decides that something different
MD> is needed, but we would need a decision fairly soon.

MD> Many thanks in advance, Martin.

MD> At 19:10 02/05/27 +0900, Martin Duerst wrote:
>>Dear TAG,
>>
>>Here is my input on the issue of URI/IRI equivalence, for
>>your consideration. This is a very important issue for IRIs.
>>
>>First and foremost, while it's okay to call the issue
>>'URIEquivalence-15', its resolution should really be a solution
>>both for URI equivalence and for IRI equivalence. While the
>>choices are the same in both cases, IRIs bring in additional
>>considerations.

Noted. I don't think the issue name needs to be changed as long as
that scope is clear.

>>The core choices from the view of IRIs are:
>>
>>a) 'character-by-character equivalence'
>> (taking a %hh-escaping as three characters)
>>b) '%hh-escape equivalence' (equating %hh-escape
>> sequences with the characters (based on US-ASCII/UTF-8)
>> they stand for (except for reserved characters!))
>>
>>The difference is more important for IRIs because the mapping
>>from IRIs to URIs is based on (UTF-8 and) %hh-escaping. Because
>>some protocols/formats/APIs will support IRIs whereas others
>>(older/lower level) may not, having both escaped and unescaped
>>versions of the same IRI is probably more frequent than for
>>URIs (where %7E / ~ is the only example I have seen).
>>This is a strong argument for %hh-escape equivalence.
>>
>>Because conversion from a URI to an IRI is not guaranteed to succeed,
>>and even if it succeeds, is not guaranteed to produce the correct
>>result (i.e. the original characters), it is important to convert
>>from IRIs to URIs as late as possible. For %hh-escape equivalence,
>>this means that %hh-escaping is only done for the actual comparison,
>>but that the original IRI is always retained. This would need a
>>certain amount of resources (time or space).
>>
>>The argument has been made that using character-by-character equivalence
>>would create strong pressures to not convert from IRIs to URIs prematurely,
>>which would be a good thing. It is difficult to judge whether this will
>>be the case; if things go well, it may indeed provide desirable
>>reinforcement, but if things go wrong, it may create additional confusion.
>>
>>It is conceivable to specify IRI equivalence by specifying
>>character-by-character equivalence for ASCII characters, and
>>%hh-escape equivalence for non-ASCII characters. But the chance
>>that this gets implemented is probably very low.
>>
>>
>>The current version of the IRI draft
>>(http://www.ietf.org/internet-drafts/draft-duerst-iri-00.txt)
>>has been interpreted as prescribing %hh-escape equivalence,
>>because the draft clearly says that an IRI and the URI that it
>>is mapped to identify the same resource:
>>
>>
>>>>
>>2.3 Mapping of IRIs to URIs
>>...
>> This mapping has two purposes:
>>...
>> b) Interpretational: URIs identify resources in various ways.
>> IRIs also identify resources. The resource that an IRI
>> identifies is the same as the one identified by the URI
>> obtained after converting the IRI according to the procedure
>> defined here. This means that there is no need to define the
>> association between identifier and resource again on the IRI
>> level.
>>
>>>>
>>
>>But there is another interpretation: Because arbitrary URIs
>>can identify the same resource, e.g.
>> http://search.ietf.org/internet-drafts/draft-duerst-iri-00.txt and
>> http://www.ietf.org/internet-drafts/draft-duerst-iri-00.txt and
>> http://www.w3.org/International/2002/draft-duerst-iri-00.txt
>>all identify the same resource, without allowing one to deduce that
>>from their syntax, any artefacts (specs, software) that need to be
>>able to identify two resources as the same will need a mechanism
>>for doing that without relying on URI/IRI syntax anyway.
>>In other words, resource identity and resource identifier equivalence
>>are two different things.
>>
>>So for example, RDF could use character-for-character equivalence,
>>and something such as daml:sameIndividualAs can be used to indicate
>>that two URIs or IRIs refer to the same resource. It then becomes
>>mainly an issue of careful wording, to make sure that readers
>>do not confuse resource identifiers with resources.
>>
>>Anyway, we plan to adapt the wording in the IRI draft after
>>the TAG decision, to reflect the decision and to make the
>>implications clearer.
>>
>>In any case, it should be noted that while some specifications,
>>such as XML Namespaces or RDF, have to choose a single definition
>>of URI/IRI equivalence, other specifications and implementations
>>may choose to exploit additional knowledge. For example, proxies
>>will try to make as many assumptions as they can safely make
>>to reduce misses. Also, specifications that are closely related
>>to URI/IRI resolution may want to make similar assumptions.
>>For an example, see RFC 2616 (HTTP 1.1), section 3.2.3.
>>For another example, which specifically treats IRIs, see the XML
>>Catalogs spec, in particular
>>http://oasis-open.org/committees/entity/spec-2001-08-06.html#sysid-norm.
>>
>>
>>While the question of whether to treat %hh sequences as equivalent to
>>the characters they stand for or not is the most important aspect
>>of URIEquivalence-15 for IRIs, there are other aspects.
>>
>>First, it should be explicitly noted that equivalence on the
>>character level is applied after resolving different notations
>>in a 'carrier' (host) representation. As an example,
>>
>> xlink:href='http://www.w&#x33;.org'
>> xlink:href='http://www.w3.org'
>> must be the same.
>> [assuming xlink refers to the XLink namespace,
>> and knowing that U+0033 is the digit '3']
>>
>>This of course depends on the carrier (host) language;
>>if you put http://www.w&#x33;.org into plain text email,
>>that's not a legal URI, and not the same as http://www.w3.org.
>>
>>Second, in some cases casing equivalences can be relevant.
>>In particular, the I18N WG has discussed whether e.g.
>> http://www.w3.org/XM%4C and
>> http://www.w3.org/XM%4c
>>should be the same identifier, independently of whether this is
>>the same identifier as
>> http://www.w3.org/XML
>>There is an argument for making %4C and %4c the same, because
>>there is no clear convention of using upper case or lower case
>>(in contrast to http:, where lower case is dominant). Also, there
>>is never any possibility that they would refer to different resources.
>>
>>In general, case equivalence for characters outside ASCII is
>>language-dependent, and therefore should be avoided.
>>
>>
>>
>>The following contains collected language from all three URI
>>RFCs showing that %hh-equivalence would be a valid choice:
>>(I collected these quite a while ago, and wanted to make
>>sure they are not missed.)
>>
>> The current URI spec says:
>>
>> http://www.ietf.org/rfc/rfc2396.txt, section 2.3:
>>
>>
>>>>
>> Unreserved characters can be escaped without changing the semantics
>> of the URI, but this should not be done unless the URI is being used
>> in a context that does not allow the unescaped character to appear.
>>
>>>>
>>
>> (to go directly to the relevant section:
>> http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2396.html#sec-2.3)
>>
>> [2.4.2. "When to Escape and Unescape", the escaping differences
>> for reserved characters are defined as scheme-specific.]
>>
>> Earlier URI/URL specs say:
>>
>> http://www.ietf.org/rfc/rfc1738.txt, section 2.2:
>>
>> Usually a URL has the same interpretation when an octet is
>> represented by a character and when it encoded. However, this is not
>> true for reserved characters: encoding a character reserved for a
>> particular scheme may change the semantics of a URL.
>>
>> (to go directly to the relevant section:
>> http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc1738.html#sec-2.2)
>>
>> And from http://www.ietf.org/rfc/rfc1630.txt:
>>
>>
>>>
>> There is a conflict between the need to be able to represent many
>> characters including spaces within a URI directly, and the need to
>> be able to use a URI in environments which have limited character
>> sets or in which certain characters are prone to corruption. This
>> conflict has been resolved by use of an hexadecimal escaping
>> method which may be applied to any characters forbidden in a given
>> context. When URLs are moved between contexts, the set of
>> characters escaped may be enlarged or reduced unambiguously.
>>
>> REDUCED OR INCREASED SAFE CHARACTER SETS
>>
>> The same encoding method may be used for encoding characters whose
>> use, although technically allowed in a URI, would be unwise due to
>> problems of corruption by imperfect gateways or misrepresentation
>> due to the use of variant character sets, or which would simply be
>> awkward in a given environment. Because a % sign always indicates
>> an encoded character, a URI may be made "safer" simply by encoding
>> any characters considered unsafe, while leaving already encoded
>> characters still encoded. Similarly, in cases where a larger set
>> of characters is acceptable, % signs can be selectively and
>> reversibly expanded.
>>
>> Before two URIs can be compared, it is therefore necessary to
>> bring them to the same encoding level.
>>
>> However, the reserved characters mentioned above have a quite
>> different significance when encoded, and so may NEVER be encoded
>> and unencoded in this way.
>>
>> ...
>>
>> Example 1
>>
>> The URIs
>>
>> http://info.cern.ch/albert/bertram/marie-claude
>>
>> and
>>
>> http://info.cern.ch/albert/bertram/marie%2Dclaude
>>
>> are identical, as the %2D encodes a hyphen character.
>>
>>>>
>>
>>
>>Regards, Martin.

--
 Chris mailto:chris@w3.org
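
To make the two candidate definitions discussed in this thread concrete, here is a minimal Python sketch, not taken from either draft. The function names (char_equivalent, escape_equivalent) and the example.org URIs are made up for illustration. The unreserved set is the one from RFC 2396, section 2.3. The sketch assumes that %hh-escape equivalence means: map non-ASCII characters to UTF-8 %hh-escapes, unescape only unreserved characters, and treat %4C and %4c alike. The comparison key is built only at comparison time, so the original IRI is never modified.

```python
import re

# Unreserved characters per RFC 2396, section 2.3 (alphanumerics plus marks).
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-_.!~*'()"
)

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def char_equivalent(a: str, b: str) -> bool:
    """Choice (a): character-by-character; '%7E' counts as three characters."""
    return a == b


def _comparison_key(iri: str) -> str:
    """Build a throwaway key for comparison; the caller keeps the original IRI."""
    # Step 1: map non-ASCII characters to UTF-8 octets written as %HH.
    parts = []
    for ch in iri:
        if ord(ch) < 128:
            parts.append(ch)
        else:
            parts.extend("%{:02X}".format(octet) for octet in ch.encode("utf-8"))
    uri = "".join(parts)

    # Step 2: unescape only unreserved characters; leave every other escape
    # in place, but upper-case its hex digits so %4c and %4C compare alike.
    def unescape(match: "re.Match[str]") -> str:
        ch = chr(int(match.group(1), 16))
        return ch if ch in UNRESERVED else "%" + match.group(1).upper()

    return _ESCAPE.sub(unescape, uri)


def escape_equivalent(a: str, b: str) -> bool:
    """Choice (b): %hh-escape equivalence, reserved characters excepted."""
    return _comparison_key(a) == _comparison_key(b)
```

Under these assumptions, http://www.w3.org/~user and http://www.w3.org/%7Euser differ under choice (a) but match under choice (b), while an escaped reserved character such as %2F is never treated as the same as a literal "/".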
Received on Tuesday, 9 July 2002 07:15:34 UTC