- From: Martin Duerst <duerst@w3.org>
- Date: Tue, 04 Feb 2003 17:52:54 -0500
- To: "Williams, Stuart" <skw@hplb.hpl.hp.com>, "Ian B. Jacobs" <ij@w3.org>, www-tag@w3.org
- Cc: www-international@w3.org, Michel Suignard <michelsu@microsoft.com>
Hello Stuart,

At 13:27 03/02/04 +0000, Williams, Stuart wrote:

>I am prone to think of the "URI character sequence" as the sequence of
>characters, constrainted by URI syntax, that I might write on a piece of
>paper, or paint on the side of the bus. An "original character sequences"
>seems to be more about the character sequence I might have wanted to paint
>on the side of a bus, or present in a user interface (eg. kanji, ) that are
>prohibited from direct by the constraints of generic URI syntax.

Based on my long experience and repeated reading of RFC 2396, I think your interpretation comes very close. There is one caveat: "original character sequences" refers not only to characters that are prohibited from direct representation by the constraints of the (generic or opaque) URI syntax, but to any kind of character. It may well be that you also wanted to have this sequence on the side of a bus or in a user interface, but the important point is that it is what you originally had: for example, a file name or directory name if that is how the URI was made up, or the characters that you actually wanted to query for in the query part.

>To come back to the one character or three question... '%7e' might be viewed
>as 3 "URI Characters"; one "octet"; and one "original character" '~'
>(maybe).

Yes, exactly. The 'maybe' for '~' is quite appropriate. If somebody ran an http server on a computer where people still used e.g. the German version of ISO 646 (see http://www.itscj.ipsj.or.jp/ISO-IR/021.pdf), then the original character would be a sharp-s.

As another example, '%7c' would be three URI characters, which correspond to one octet, which usually corresponds to '|' (vertical line) as an original character, but which may also correspond to o-umlaut in the German version of ISO 646, as well as to many other characters in other versions of ISO 646. (Fortunately, most ISO 646 versions except US-ASCII are pretty much dead these days.)
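
The three levels discussed above (URI characters, octets, original characters) can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: Python has no built-in codec for the German version of ISO 646, so the table `GERMAN_ISO646_OVERRIDES` and the helpers `decode_german_iso646` and `three_views` are hypothetical names made up here, and the table lists only the two positions mentioned in this message, not the full national variant.

```python
from urllib.parse import unquote_to_bytes

# Hypothetical, partial table: only the two positions discussed in this
# message where the German version of ISO 646 differs from US-ASCII.
GERMAN_ISO646_OVERRIDES = {
    0x7C: "\u00f6",  # o-umlaut instead of '|'
    0x7E: "\u00df",  # sharp-s instead of '~'
}


def decode_german_iso646(octets: bytes) -> str:
    """Decode 7-bit octets, applying the partial German overrides above."""
    return "".join(GERMAN_ISO646_OVERRIDES.get(b, chr(b)) for b in octets)


def three_views(escaped: str) -> dict:
    """Show one escaped sequence at each of the three levels."""
    octets = unquote_to_bytes(escaped)  # undo the %HH escaping
    return {
        "uri_characters": len(escaped),
        "octets": len(octets),
        "original_if_us_ascii": octets.decode("ascii"),
        "original_if_german_iso646": decode_german_iso646(octets),
    }


if __name__ == "__main__":
    for escaped in ("%7e", "%7c"):
        print(escaped, three_views(escaped))
```

The point of the sketch is that the step from URI characters to octets is fixed by the escaping rule, while the step from octets to original characters depends on an encoding that the URI itself does not state.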

The general problem with all this language in RFC 2396 is that it's not easy for everybody to imagine characters being represented as octets being in turn represented as characters (and so on). But that's very difficult to fix.

Regards,
Martin.
Received on Tuesday, 4 February 2003 18:40:51 UTC