- From: Martin Duerst <duerst@w3.org>
- Date: Wed, 10 Jul 2002 00:04:26 +0900
- To: Chris Lilley <chris@w3.org>, www-tag@w3.org
- Cc: w3c-i18n-ig@w3.org
At 13:15 02/07/09 +0200, Chris Lilley wrote:
>On Tuesday, July 9, 2002, 11:30:45 AM, Martin wrote:
>
>
>MD> Dear TAG,
>
>MD> Misha has already said that there is a new version of the IRI
>MD> draft; this is now also officially available at
>MD> http://www.ietf.org/internet-drafts/draft-duerst-iri-01.txt.
>
>MD> I would like to draw your attention to Section 2.3,
>MD> IRI Equivalence and Normalization, in particular to:
>
>MD>    In some scenarios, such as XML Namespaces ([XMLNamespace]), a
>MD>    definite answer to the question of IRI equivalence is needed that is
>MD>    independent of the scheme used and always can be calculated quickly
>MD>    and without accessing a network. In such cases, two IRIs SHOULD be
>MD>    defined as equivalent if and only if they are character-by-character
>MD>    equivalent (which is the same as byte-by-byte equivalent if the
>MD>    character encoding for both IRIs is the same). In such a case, the
>MD>    comparison function MUST NOT map the IRIs to URIs.
>
>MD> Please note that this makes an explicit interpretation of
>MD> 'character-by-character', according with what we understand
>MD> to be current practice.
>
>So, it means that if normalization has been done, two that look the
>same will compare the same; if they have not, then the two might not
>compare as equal and no software is going to fix that for you.
>
>And it (the paragraph quoted) means that ~ and %7E are not the same.

Yes. I plan to add explicit examples tomorrow.

>Whereas your text below seems to say that they are (because of late
>conversion, just before conversion).

Which text below? Can you explain/correct 'conversion before conversion'?

>Please clarify which is correct.
>
>MD> We plan to make some last edits on this document around July 22nd,
>MD> and then plan to send it off to the IESG. We would be glad to
>MD> change the above if the TAG decides that something different
>MD> is needed, but we would need a decision fairly soon.
>
>MD> Many thanks in advance, Martin.
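
The character-by-character rule quoted above from Section 2.3 can be sketched as follows. This is only an illustration, not text from the draft: the function name iri_equal_charwise and the example.org identifiers are assumptions made up for the example. It shows that ~ and %7E compare as different, and that no normalization is applied on the caller's behalf.

```python
def iri_equal_charwise(a: str, b: str) -> bool:
    """Character-by-character equivalence (hypothetical helper).

    No %hh-unescaping, no case folding, no Unicode normalization,
    and no mapping of the IRI to a URI before comparing.
    """
    # Python compares str objects code point by code point.
    return a == b


# "~" and "%7E" are different character sequences, so these differ.
tilde_literal = "http://example.org/~user"
tilde_escaped = "http://example.org/%7Euser"
print(iri_equal_charwise(tilde_literal, tilde_escaped))

# Precomposed and decomposed forms look the same but are not equal
# unless both sides were normalized beforehand.
precomposed = "http://example.org/caf\u00e9"
decomposed = "http://example.org/cafe\u0301"
print(iri_equal_charwise(precomposed, decomposed))
```

Both calls report a mismatch, which matches the reading given in the exchange above: the comparison itself fixes nothing.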
>
>
>MD> At 19:10 02/05/27 +0900, Martin Duerst wrote:
> >>Dear TAG,
> >>
> >>Here is my input on the issue of URI/IRI equivalence, for
> >>your consideration. This is a very important issue for IRIs.
> >>
> >>First and foremost, while it's okay to call the issue
> >>'URIEquivalence-15', its resolution should really be a solution
> >>both for URI equivalence and for IRI equivalence. While the
> >>choices are the same in both cases, IRIs bring in additional
> >>considerations.
>
>Noted. I don't think the issue name needs to be changed as long as
>that scope is clear.

As I said: "while it's okay to call the issue 'URIEquivalence-15',"
In full agreement here.

Regards, Martin.

> >>The core choices from the view of IRIs are:
> >>
> >>a) 'character-by-character equivalence'
> >>   (taking a %hh-escaping as three characters)
> >>b) '%hh-escape equivalence' (equivalencing %hh-escape
> >>   sequences with the characters (based on US-ASCII/UTF-8)
> >>   they stand for (except for reserved characters!)
> >>
> >>The difference is more important for IRIs because the mapping
> >>from IRIs to URIs is based on (UTF-8 and) %hh-escaping. Because
> >>some protocols/formats/APIs will support IRIs whereas others
> >>(older/lower level) may not, having both escaped and unescaped
> >>versions of the same IRI is probably more frequent than for
> >>URIs (where %7E / ~ is the only example I have seen).
> >>This is a strong argument for %hh-escape equivalence.
> >>
> >>Because conversion from a URI to an IRI is not guaranteed to succeed,
> >>and even if it succeeds, is not guaranteed to produce the correct
> >>result (i.e. the original characters), it is important to convert
> >>from IRIs to URIs as late as possible. For %hh-escape equivalence,
> >>this means that %hh-escaping is only done for the actual comparison,
> >>but that the original IRI is always retained. This would need a
> >>certain amount of resources (time or space).
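
Option b) combined with late conversion can be sketched roughly as below. This is a minimal sketch under stated assumptions, not an algorithm from the IRI draft: the names comparison_key and iri_equal_escaped are hypothetical, the unreserved set is taken from RFC 2396 section 2.3, and the case of hex digits in remaining escapes is deliberately left untouched. The original IRI strings are never modified; the escaped form exists only for the duration of the comparison.

```python
import re

# Unreserved characters per RFC 2396 section 2.3 (alphanum plus "mark").
UNRESERVED = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "-_.!~*'()"
)

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def comparison_key(iri: str) -> str:
    """Build a throwaway key for comparison (hypothetical helper)."""
    # Step 1: map each non-ASCII character to UTF-8 and %hh-escape the bytes.
    escaped = "".join(
        ch if ord(ch) < 128
        else "".join(f"%{byte:02X}" for byte in ch.encode("utf-8"))
        for ch in iri
    )

    # Step 2: unescape %hh only where it stands for an unreserved ASCII
    # character; reserved characters and non-ASCII bytes stay escaped.
    def unescape_unreserved(match: re.Match) -> str:
        ch = chr(int(match.group(1), 16))
        return ch if ch in UNRESERVED else match.group(0)

    return _ESCAPE.sub(unescape_unreserved, escaped)


def iri_equal_escaped(a: str, b: str) -> bool:
    """%hh-escape equivalence; callers keep their original IRIs."""
    return comparison_key(a) == comparison_key(b)


print(iri_equal_escaped("http://example.org/~user",
                        "http://example.org/%7Euser"))
print(iri_equal_escaped("http://example.org/r\u00e9sum\u00e9",
                        "http://example.org/r%C3%A9sum%C3%A9"))
print(iri_equal_escaped("http://example.org/a/b",
                        "http://example.org/a%2Fb"))
```

The key is recomputed on each call, which is the time side of the time-or-space cost mentioned above; caching the key next to the retained IRI would trade that for space.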
> >>
> >>The argument has been made that using character-by-character equivalence
> >>would create strong pressures to not convert from IRIs to URIs prematurely,
> >>which would be a good thing. It is difficult to judge whether this will
> >>be the case; if things go well, it may indeed provide desirable
> >>reinforcement, but if things go wrong, it may create additional confusion.
> >>
> >>It is thinkable to specify IRI equivalence by specifying character-by-
> >>character equivalence for ASCII characters, and %hh-escape equivalence
> >>for non-ascii characters. But the chance that this gets implemented
> >>is probably very low.
> >>
> >>
> >>The current version of the IRI draft
> >>(http://www.ietf.org/internet-drafts/draft-duerst-iri-00.txt)
> >>has been interpreted as prescribing %hh-escape equivalence,
> >>because the draft clearly says that an IRI and the URI that it
> >>is mapped to identify the same resource:
> >>
> >> >>>>
> >>2.3 Mapping of IRIs to URIs
> >>...
> >>   This mapping has two purposes:
> >>...
> >>      b) Interpretational: URIs identify resources in various ways.
> >>         IRIs also identify resources. The resource that an IRI
> >>         identifies is the same as the one identified by the URI
> >>         obtained after converting the IRI according to the procedure
> >>         defined here. This means that there is no need to define the
> >>         association between identifier and resource again on the IRI
> >>         level.
> >> >>>>
> >>
> >>But there is another interpretation: Because arbitrary URIs
> >>can identify the same resource, e.g.
> >>   http://search.ietf.org/internet-drafts/draft-duerst-iri-00.txt and
> >>   http://www.ietf.org/internet-drafts/draft-duerst-iri-00.txt and
> >>   http://www.w3.org/International/2002/draft-duerst-iri-00.txt
> >>all identify the same resource, without allowing to deduce that
> >>from their syntax, any artefacts (specs, software) that need to be
> >>able to identify two resources as the same will need a mechanism
> >>for doing that without relying on URI/IRI syntax anyway.
> >>In other words, resource identity and resource identifier equivalence
> >>are two different things.
> >>
> >>So for example, RDF could use character-for-character equivalence,
> >>and something such as daml:sameIndividualAs can be used to indicate
> >>that two URIs or IRIs refer to the same resource. It becomes then
> >>mainly an issue of careful wording, to make sure that readers
> >>do not confuse resource identifiers with resources.
> >>
> >>Anyway, we plan to adapt the wording in the IRI draft after
> >>the TAG decision, to reflect the decision and to make the
> >>implications clearer.
> >>
> >>In any case, it should be noted that while some specifications,
> >>such as XML Namespaces or RDF, have to choose a single definition
> >>of URI/IRI equivalence, other specifications and implementations
> >>may choose to exploit additional knowledge. For example, proxies
> >>will try to make as many assumptions as they can safely make
> >>to reduce misses. Also, specifications that are closely related
> >>to URI/IRI resolution may want to make similar assumptions.
> >>For an example, see RFC 2616 (HTTP 1.1), section 3.2.3.
> >>For another example, which specifically treats IRIs, see the XML
> >>Catalogs spec, in particular
> >>http://oasis-open.org/committees/entity/spec-2001-08-06.html#sysid-norm.
> >>
> >>
> >>While the question of whether to treat %hh sequences equivalent to
> >>the characters they stand for or not is the most important aspect
> >>of URIEquivalence-15 for IRIs, there are other aspects.
> >>
> >>First, it should be explicitly noted that equivalence on the
> >>character level is applied after resolving different notations
> >>in a 'carrier' (host) representation. As an example,
> >>
> >>     xlink:href='http://www.w&#x33;.org'
> >>     xlink:href='http://www.w3.org'
> >>   must be the same.
> >>   [assuming xlink refers to the XLink namespace,
> >>    and knowing that U+0033 is the digit '3']
> >>
> >>This of course depends on the carrier (host) language;
> >>if you put http://www.w&#x33;.org into plain text email,
> >>that's not a legal URI, and not the same as http://www.w3.org.
> >>
> >>Second, in some cases casing equivalences can be relevant.
> >>In particular, the I18N WG has discussed whether e.g.
> >>   http://www.w3.org/XM%4C and
> >>   http://www.w3.org/XM%4c
> >>should be the same identifier, independently of whether this is
> >>the same identifier as
> >>   http://www.w3.org/XML
> >>There is an argument for making %4C and %4c the same, because
> >>there is no clear convention of using upper-case or lower case
> >>(in contrast to http:, where lower-case is dominant). Also, there
> >>is never any question of their referring to different resources.
> >>
> >>In general, case equivalence for characters outside ASCII is
> >>language-dependent, and therefore should be avoided.
> >>
> >>
> >>
> >>The following contains collected language from all three URI
> >>RFCs showing that %hh-equivalence would be a valid choice:
> >>(I collected these quite a while ago, and wanted to make
> >>sure they are not missed.)
> >>
> >> The current URI spec says:
> >>
> >> http://www.ietf.org/rfc/rfc2396.txt, section 2.3:
> >>
> >> >>>>
> >>    Unreserved characters can be escaped without changing the semantics
> >>    of the URI, but this should not be done unless the URI is being used
> >>    in a context that does not allow the unescaped character to appear.
> >> >>>>
> >>
> >> (to go directly to the relevant section:
> >> http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2396.html#sec-2.3)
> >>
> >> [2.4.2. "When to Escape and Unescape", the escaping differences
> >>  for reserved characters are defined as scheme-specific.]
> >>
> >> Earlier URI/URL specs say:
> >>
> >> http://www.ietf.org/rfc/rfc1738.txt, section 2.2:
> >>
> >>    Usually a URL has the same interpretation when an octet is
> >>    represented by a character and when it encoded. However, this is not
> >>    true for reserved characters: encoding a character reserved for a
> >>    particular scheme may change the semantics of a URL.
> >>
> >> (to go directly to the relevant section:
> >> http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc1738.html#sec-2.2)
> >>
> >> And from http://www.ietf.org/rfc/rfc1630.txt:
> >>
> >> >>>
> >>    There is a conflict between the need to be able to represent many
> >>    characters including spaces within a URI directly, and the need to
> >>    be able to use a URI in environments which have limited character
> >>    sets or in which certain characters are prone to corruption. This
> >>    conflict has been resolved by use of an hexadecimal escaping
> >>    method which may be applied to any characters forbidden in a given
> >>    context. When URLs are moved between contexts, the set of
> >>    characters escaped may be enlarged or reduced unambiguously.
> >>
> >>    REDUCED OR INCREASED SAFE CHARACTER SETS
> >>
> >>    The same encoding method may be used for encoding characters whose
> >>    use, although technically allowed in a URI, would be unwise due to
> >>    problems of corruption by imperfect gateways or misrepresentation
> >>    due to the use of variant character sets, or which would simply be
> >>    awkward in a given environment. Because a % sign always indicates
> >>    an encoded character, a URI may be made "safer" simply by encoding
> >>    any characters considered unsafe, while leaving already encoded
> >>    characters still encoded. Similarly, in cases where a larger set
> >>    of characters is acceptable, % signs can be selectively and
> >>    reversibly expanded.
> >>
> >>    Before two URIs can be compared, it is therefore necessary to
> >>    bring them to the same encoding level.
> >>
> >>    However, the reserved characters mentioned above have a quite
> >>    different significance when encoded, and so may NEVER be encoded
> >>    and unencoded in this way.
> >>
> >>    ...
> >>
> >>    Example 1
> >>
> >>    The URIs
> >>
> >>       http://info.cern.ch/albert/bertram/marie-claude
> >>
> >>    and
> >>
> >>       http://info.cern.ch/albert/bertram/marie%2Dclaude
> >>
> >>    are identical, as the %2D encodes a hyphen character.
> >> >>>>
> >>
> >>
> >>Regards, Martin.
>
>
>
>--
> Chris mailto:chris@w3.org
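
Two of the secondary points in the quoted message, resolving the carrier (host) notation before comparing and treating %4C and %4c alike, can be sketched as below. This is a rough illustration only: the helper names href_from_xml and uppercase_escapes and the element name a are hypothetical, and Python's xml.etree.ElementTree stands in for whatever XML processor the carrier language would really use.

```python
import re
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"


def href_from_xml(fragment: str) -> str:
    """Return xlink:href after the XML parser has resolved character
    references such as &#x33; (hypothetical helper)."""
    element = ET.fromstring(fragment)
    # ElementTree exposes namespaced attributes as "{namespace}localname".
    return element.attrib[f"{{{XLINK_NS}}}href"]


def uppercase_escapes(identifier: str) -> str:
    """Upper-case the hex digits of every %hh escape so that %4c and
    %4C compare equal; nothing is unescaped."""
    return re.sub(
        r"%[0-9A-Fa-f]{2}",
        lambda match: match.group(0).upper(),
        identifier,
    )


with_reference = href_from_xml(
    f'<a xmlns:xlink="{XLINK_NS}" xlink:href="http://www.w&#x33;.org"/>'
)
plain = href_from_xml(
    f'<a xmlns:xlink="{XLINK_NS}" xlink:href="http://www.w3.org"/>'
)
# Both values are the same string once the carrier notation is resolved.
print(with_reference == plain)

# Hex-digit case is folded, but %4C is still not the letter L.
print(uppercase_escapes("http://www.w3.org/XM%4c"))
```

Keeping the two steps separate reflects the message's point that hex-digit case can be settled independently of whether %4C is also equated with L.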
Received on Tuesday, 9 July 2002 11:07:43 UTC