- From: Tim Bray <tbray@textuality.com>
- Date: Wed, 11 Dec 2002 09:30:00 -0800
- To: Martin Duerst <duerst@w3.org>
- Cc: WWW-Tag <www-tag@w3.org>
Martin Duerst wrote:

> - "Since the world contains many characters useful in identifying resources
>   beyond those in US-ASCII, and since the special characters such as ':' and
>   '/' are also often useful, RFC2396 provides a mechanism for "%-escaping" such
>   characters; they are represented as a sequence of 2-digit hexadecimal codes,
>   each representing the value of one byte and preceded by the percent
>   sign '%'."
>
>   This assumes 1 character == 1 byte, and a direct character -> %hh mapping,
>   which is clearly not the case. See section 2.1 of
>   http://www.ietf.org/rfc/rfc2396.txt.

I just did (for the 87th time). I will reword slightly to point out that
octets represent characters and %-escapes represent octets.

> - "It would seem almost wilfully perverse to consider the characters represented
>   respectively by %7A and %7a in the example above as different."
>
>   One can certainly argue about the stylistic merit of 'almost willfully (spelling)
>   perverse'. But that's not my point. The sentence assumes that %7A and %7a
>   represent a character,

Right, I'll fix that.

> - "Another example:
>
>   * http://a/b/
>   * http://%61/b/
>
>   Such software might consider these equivalent, since %61 encodes the
>   character 'a' in both ASCII and UTF-8, but context becomes significant.
>   RFC2396 does not constrain the character encoding scheme of URIs; if the
>   original document were encoded in EBCDIC, or the URIs were sourced from two
>   different documents whose original encoding was not known, there is a (slim)
>   chance of a false-positive in finding these equivalent."
>
>   This is very clearly and completely wrong. %61 and 'a' in an URI are
>   ALWAYS equivalent (when looking at %hh-escaping-equivalence).

I'm having trouble here. Section 2.1 is terribly fuzzy on this, and says
essentially nothing useful about the character->octet mapping, giving
UTF-8 as an interesting example.

By my reading, if that 'a' were encoded in EBCDIC in my instance, then
RFC2396 wouldn't stop me from encoding that as %81. Now, I've never
seen this happen, but the point is that assuming.

>   There are
>   two places where EBCDIC can come into play:
>   1) the URI is represented as EBCDIC (e.g. if you read this mail on
>      an IBM mainframe). In that case, both 'a' and '%61' would be
>      represented in EBCDIC, but they would still be equivalent.

Why couldn't 'a' be represented as %81? And if I imported the URI with
this encoding from that system, it's quite possible that the EBCDIC and
ASCII versions of http://example.com/%81/ are in fact different. -Tim

>   2) The resource is e.g. actually on an EBCDIC-based file system,
>      and the server exposes EBCDIC-based resource names directly.
>      Then both the 'a' and the '%61' would stand for a '/' (*)
>      (see e.g. http://www.egrannie.com/cheatsheets/asciiebcdic.html
>      for the actual table), or if there is an actual 'a' in the
>      resource name, it would have to be represented as %81.
>      [(*) that / would be a non-reserved one, i.e. a part of a
>      path component]

Right, so I can't be sure that %81 is the same as %81, depending on
where they come from. Or what am I missing?

> - "This is reasonable behavior based on the rules provided by RFC 2616,
>   which defines HTTP.": It may be worth mentioning that rfc 2616 also
>   defines the http: URI scheme, please see
>   http://www.ietf.org/rfc/rfc2616.txt,
>   section 3.2.2

Right.

> - A point which is very important to mention is that software
>   transporting URIs should avoid any changes in URIs, unless it has
>   very, very good and specific reasons to do so. This will avoid
>   false negatives under any kind of equivalence.

Right.

> - "Web Robots, which are at pains to reduce the incidence of false
>   negatives"
>   'are at pains' sounds colloquial and therefore difficult to understand
>   world-wide. Maybe 'try very hard'?

Sigh. "At pains" is formal and perhaps a bit old-fashioned rather than
colloquial. But OK. -Tim
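
For readers following the octet question in the message above, here is a minimal Python sketch of the two-step mapping (character -> octets -> %hh). It is only an illustration, not anything RFC2396 prescribes: the helper name `escape_all` is hypothetical, and Python's `cp037` codec is assumed as a stand-in for "EBCDIC" (other EBCDIC code pages differ in places).

```python
from urllib.parse import unquote_to_bytes


def escape_all(text, encoding):
    """Hypothetical helper: encode text to octets, then %-escape every octet."""
    return "".join("%{:02X}".format(octet) for octet in text.encode(encoding))


# The same character yields different escapes under different encodings.
ascii_escape = escape_all("a", "ascii")    # '%61'
utf8_escape = escape_all("a", "utf-8")     # '%61'
ebcdic_escape = escape_all("a", "cp037")   # '%81' (assumed EBCDIC code page)

# The escape itself only names an octet; which character that octet
# stands for depends on the encoding the reader assumes.
octet_61 = unquote_to_bytes("%61")
as_ascii = octet_61.decode("ascii")        # 'a'
as_ebcdic = octet_61.decode("cp037")       # '/'

# Hex digit case does not change the octet that an escape denotes.
same_octet = unquote_to_bytes("%7A") == unquote_to_bytes("%7a")

print(ascii_escape, utf8_escape, ebcdic_escape, as_ascii, as_ebcdic, same_octet)
```

The point of the sketch is narrow: `%61` and `%81` unambiguously identify octets, while the character behind each octet is only known once the encoding is known.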
Received on Wednesday, 11 December 2002 15:10:52 UTC