- From: Williams, Stuart <skw@hplb.hpl.hp.com>
- Date: Mon, 6 Jan 2003 17:38:09 -0000
- To: "'Tim Bray'" <tbray@textuality.com>, Stefan Eissing <stefan.eissing@greenbytes.de>
- Cc: WWW-Tag <www-tag@w3.org>
Hi Tim,

I have finally gotten round to reading your draft [1]. As Stefan said... "Excellent read".

A few thoughts/comments:

1) I wondered whether the introduction should say more about equivalence. It very quickly gets into talking about comparisons and refers to the outcome of a comparison as "equivalent" or "different". I find myself thinking of equivalence as a type of relation (reflexive, symmetric, transitive) between URI (URI References?), and that given some set of URI, different equivalence relations would partition the set differently.

Maybe something like:

"URI are equivalent with respect to some purpose. The strongest equivalence relation is identity and arises between URI that are the same, character-by-character. Other equivalence relations arise in a context-dependent way, e.g. two URI may be equivalent for the purpose of retrieving representations of a resource, e.g. http://example.com and http://example.com:80, but not for the purpose of naming a namespace. These other equivalence relations respect the identity relation in that if two URI are identical they remain equivalent under these other equivalence relations."

2) On the topic of %-escape encoding, which I continue to find confusing despite the opening sentence in RFC 2396 section 2.1.

RFC 2396 appears to delegate the 'URI character -> octet' mapping to the URI scheme definition. The 4th paragraph of Sec 2.1 begins:

"A URI scheme may define a mapping from URI characters to octets; whether this is done depends on the scheme."

Then, regarding the second mapping, RFC 2396 speaks of 'octets -> original characters':

"A charset defines this mapping."

RFC 2396 states "However, there is currently no provision within the generic URI syntax to accomplish this identification." It then offers possible options, including delegating the charset default and/or selection mechanism to the URI scheme definition.

The URI scheme registration template in RFC 2717 includes a field for "character encoding consideration".
However, on a quick scan of the scheme registrations referenced from http://www.iana.org/assignments/uri-schemes I couldn't find any that offered any "character encoding consideration" :-)

However, I think that there is an upside. Even if the first mapping (URI character -> octet) is scheme dependent, I think that one can be confident that for all %xx, for http://example.com/%xx and http://example.com/%xx, the octet sequences arising from the first mapping will be identical because the same scheme is in use. It's less clear that the second mapping, the charset which maps octets to original characters, is going to be the same in all contexts (like some of the forms examples)... however, in a given context, http://example.com/%xx will be equivalent to itself (surely!).

3) The three levels of 'URI characters', 'octets' and 'original characters' discussed in 2396 seem to suggest that an octet-by-octet comparison and an 'original character-by-original character' comparison (modulo charset selection issues) of http://example.com/%61 and http://example.com/a would each make them equivalent, whereas a 'URI character-by-URI character' comparison would make them different. This leaves me confused about whether, when we speak of 'character-by-character' comparison, we are speaking of 'URI characters' or 'original characters'. That said, I also struggled with the terms 'URI character' and 'original character' and may be confused about them too.

Hmmm... not sure any of this is helpful. It (URI equivalence, that is) all seems much more complicated than it ought to be. I kind of like the operational notion of equivalence, which I think is where Larry has been coming from in the past, such that in some context of use two URI are equivalent if one can be substituted for the other and give rise to equivalent results (effects and side effects).
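To make the distinction between the comparison levels concrete, here is a rough Python sketch. It is only an illustration under stated assumptions, not anything taken from the draft or from RFC 2396: the function names are made up, the first mapping is approximated by simply undoing %xx escapes over the whole URI (which also decodes escaped reserved characters, something a real comparison would have to treat more carefully), and the default-port table covers http only.

```python
from urllib.parse import unquote_to_bytes, urlsplit

# Assumption: default ports for the one scheme used in the examples.
DEFAULT_PORTS = {"http": 80}


def same_uri_characters(a: str, b: str) -> bool:
    """Identity: the two URIs match, URI character by URI character."""
    return a == b


def octets(uri: str) -> bytes:
    """First mapping (URI characters -> octets), simplified: undo %xx escapes."""
    return unquote_to_bytes(uri)


def original_characters(uri: str, charset: str) -> str:
    """Second mapping (octets -> original characters) under a chosen charset."""
    return octets(uri).decode(charset)


def retrieval_key(uri: str) -> tuple:
    """Key for a 'retrieve from the same place' relation: fill in the default port."""
    parts = urlsplit(uri)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port, parts.path)


if __name__ == "__main__":
    a, b = "http://example.com/%61", "http://example.com/a"
    print(same_uri_characters(a, b))   # different as URI characters
    print(octets(a) == octets(b))      # same octets

    # Same original characters, different octets: the charset matters.
    c, d = "http://example.com/%E9", "http://example.com/%C3%A9"
    print(octets(c) == octets(d))
    print(original_characters(c, "latin-1") == original_characters(d, "utf-8"))

    # Equivalent for retrieval, not identical.
    e, f = "http://example.com", "http://example.com:80"
    print(retrieval_key(e) == retrieval_key(f), same_uri_characters(e, f))
```

Each function induces its own equivalence relation over the same set of URIs, and all of them respect identity: two URIs that are the same URI character by URI character also compare equal under every other function here.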
Regards

Stuart

> -----Original Message-----
> From: Tim Bray [mailto:tbray@textuality.com]
> Sent: 13 December 2002 15:28
> To: Stefan Eissing
> Cc: WWW-Tag
> Subject: Re: Draft 2 of "How to Compare URIs"
>
> Stefan Eissing wrote:
>
> > RFC 2396 Ch. 2.1
> >
> > " In the simplest case, the original character sequence contains only
> > characters that are defined in US-ASCII, and the two levels of mapping
> > are simple and easily invertible: each 'original character' is
> > represented as the octet for the US-ASCII code for it, which is, in
> > turn, represented as either the US-ASCII character, or else the "%"
> > escape sequence for that octet."
>
> You're saying you read this as "all characters in the ASCII range must
> use the ASCII codepoints for character->octet"? I guess that's
> plausible, but I had read 2.1 to say "there are many character->octet
> mappings, one of the simplest being that for ASCII characters". And
> assuming you're right, it still seems like there's a window open here,
> if you're operating in a non-ASCII environment then the char->octet
> mapping is left 100% undefined, so you can't know whether %xx == %xx for
> all %xx > 0x7f.
>
> -Tim
Received on Monday, 6 January 2003 12:42:37 UTC