- From: Patrik Fältström <paf@swip.net>
- Date: Wed, 07 Jan 1998 07:21:12 +0100
- To: Larry Masinter <masinter@parc.xerox.com>
- Cc: uri-i18n@unicode.org, fielding@ics.uci.edu, uri@bunyip.com, Jacob Palme <jpalme@dsv.su.se>
At 22:09 1998-01-04 PST, Larry Masinter wrote:
>Jacob Palme pointed out that the second (one line) paragraph
>of the current Section 2.1 of the URI/URL/whatever draft was
>hard to understand. With some amount of trepidation, I propose
>the following (alas lengthy) rewrite:

This is a good start, but I definitely think that the part talking
about UTF-8 has to say more about multibyte character sets, which give
the best example of the difference between a "URI character sequence"
and an "original character sequence".

Ultimately I would want to have different words for the character in
US-ASCII which the octet in the URI represents and the character in
the URI (which can be represented by more than one character in
US-ASCII).

>-----------------------------------------------
>2.1 URIs and non-ASCII characters
>
> The relationship between URIs and characters (for characters that
> are not part of ASCII) has been a source of confusion. To describe
> the relationship, it is useful to distinguish between a "character"
> (as a distinguishable semantic entity) and an "octet" (an 8-bit
> byte). There are two mappings, one from URI characters to octets,
> and a second from octets to original characters:
>
> URI character sequence->octet sequence->original character sequence
>
> A URI is represented as a sequence of characters, not as a sequence
> of octets. That is because URIs might be "transported" by means that
> are not through a computer network, e.g., printed on paper, read
> over the radio.

An example here would help. For example (I think this is what you are
saying?):

   http://foo.com/%41.html    ->   http://foo.com/A.html
   URI character sequence          Original characters

(0x41 is the US-ASCII code for 'A'.)

> URI schemes may define a mapping from URI characters to octets;
> whether this is done depends on the scheme. Commonly, within a
> delimited section of a URI a sequence of characters may be
> used to represent a sequence of octets.
> For example, the character
> "a" represents the octet 97 (decimal), while the character sequence
> "%", "0", "a" represents the octet 10 (decimal).
>
> Secondarily, for some schemes and protocols, there is a second
> translation: the sequence of octets defined by a component of the URI
> is subsequently used to represent a sequence of characters. A 'charset'
> defines this mapping. There are many charsets in use in Internet
> Protocols. For example, UTF8 [UTF8] defines a mapping from sequences
> of octets to sequences of characters in the repertoire of ISO 10646.
>
> In the simplest case, the original character sequence contains
> only characters that are defined in US-ASCII, and the two levels
> of mapping are simple and easily invertible: each 'original character'
> is represented as the octet for the US-ASCII code for it, which is,
> in turn, represented as either the US-ASCII character, or else the
> "%" escape sequence for that octet.
>
> For original character sequences that contain non-ASCII characters,
> however, the situation is more difficult. Internet protocols which
> transmit octet sequences intended to represent character sequences
> are expected to provide some way of identifying the charset to be used,
> if there might be more than one [RFC-char-standard]. However,
> there is currently no provision within the generic URI syntax to
> accomplish this identification. Of course, individual URI schemes
> may provide a way to indicate the charset used or define a default
> charset. In addition, there is no definition of the meaning of
> characters outside of a limited repertoire for interpretation of
> non-ASCII URI characters.
>
> It is expected that a systematic treatment of character encoding
> within URIs will be developed as a future modification of this
> specification.
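
As an illustration of the two mappings in the quoted text (URI
characters -> octets -> original characters), here is a minimal Python
sketch. The helper names are hypothetical, and the choice of UTF-8 as
the default charset is an assumption made only for the example, since
the generic syntax does not identify one.

```python
from urllib.parse import unquote_to_bytes

def uri_chars_to_octets(component: str) -> bytes:
    """First mapping: URI characters -> octets (undo the "%" escapes)."""
    return unquote_to_bytes(component)

def octets_to_original(octets: bytes, charset: str = "utf-8") -> str:
    """Second mapping: octets -> original characters (charset is assumed)."""
    return octets.decode(charset)

# The character "a" stands for octet 97; "%", "0", "a" stand for octet 10.
assert list(uri_chars_to_octets("a")) == [97]
assert list(uri_chars_to_octets("%0a")) == [10]

# The same two escaped octets give different original characters
# depending on which charset the reader assumes.
octets = uri_chars_to_octets("%C3%A4")
assert octets_to_original(octets) == "\u00e4"          # one character
assert len(octets_to_original(octets, "latin-1")) == 2  # two characters
```

The last two lines show why an unidentified charset is a problem: the
octet sequence is unambiguous, but the original characters are not.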

To conclude, we have a three level mapping, which is as follows:

   Original characters -> Transliterated string -> URI sequence

What the URI scheme papers should talk about are the "Original
characters" and how the mappings to the transliterated strings should
be done (i.e. from what is printed on paper, what is equality between
two such strings...), while the URI syntax paper should only talk
about the mappings from the transliterated string into the sequence
of octets which makes up the URI sequence, which is passed on the wire
and seen as the only "safe" format for printing of URIs.

Example one:

One of the "Original characters" is 'A', which is represented by a
sequence of bytes, where the value of one of the bytes is the same as
the character '#' in US-ASCII. The URI syntax paper should say that
the byte value represented by the US-ASCII character '#' is not to be
allowed in the "transliterated string". The same thing should be valid
for other "specials".

Example two:

One of the "original characters" is '#', which is represented by one
byte, where the value of it is NOT the same as the character '#' in
US-ASCII.

The URI syntax paper should also talk about equivalences between URI
sequences, i.e. which sequences map to the same transliterated string.
I would like to have the transliterated string without the
percent-quoting, which means that the URI sequences with '%41' and 'A'
are mapped to the same transliterated string.

The URL syntax paper, and the URN syntax paper, should talk about the
mapping from the original characters to the transliterated string.

In the URN syntax paper for example, we have said that we only allow
UNICODE in the sequence of original characters, which in turn means
that the mapping to the transliterated string is defined by the UTF-8
encoding, to make it simpler to see that no character in the string of
original characters maps to one of the forbidden octets in the
transliterated string (according to the URI syntax paper).
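
The three-level mapping and the two examples above can be sketched in
Python roughly as follows. This is only an illustration under stated
assumptions: the function names are hypothetical, the set of "special"
octets is limited to the two characters named in this message, and
U+0123 with UTF-16 and the EBCDIC code page cp037 are merely
convenient charsets for showing the effect.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# Assumed for illustration: only '%' and '#' are treated as specials here.
SPECIALS = b"%#"

def transliterate(original: str, charset: str = "utf-8") -> bytes:
    """Original characters -> transliterated string (scheme-defined charset)."""
    return original.encode(charset)

def to_uri_sequence(transliterated: bytes) -> str:
    """Transliterated string -> URI sequence, percent-escaping where needed."""
    return quote_from_bytes(transliterated, safe="/")

def same_transliterated(uri_a: str, uri_b: str) -> bool:
    """URI sequences are equivalent if they unescape to the same octets."""
    return unquote_to_bytes(uri_a) == unquote_to_bytes(uri_b)

def hidden_specials(ch: str, charset: str) -> bytes:
    """Special octets that appear inside the encoding of character `ch`."""
    return bytes(b for b in ch.encode(charset) if b in SPECIALS)

# '%41' and 'A' map to the same transliterated string.
assert same_transliterated("%41", "A")

# Example one: under UTF-16 (big endian), U+0123 contains the octet for '#';
# under UTF-8 no non-ASCII character produces a special octet.
assert hidden_specials("\u0123", "utf-16-be") == b"#"
assert hidden_specials("\u0123", "utf-8") == b""

# Example two: in EBCDIC (cp037) the character '#' is not the octet 0x23.
assert "#".encode("cp037") != b"#"

# A special octet in the transliterated string is escaped in the URI sequence.
assert to_uri_sequence(transliterate("A#")) == "A%23"
```

The UTF-8 case is the reason the URN choice is convenient: the check
for forbidden octets only ever fires on characters that are themselves
specials.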
This also has some implications regarding equivalence, as the UNICODE
character set defines that some sequences of UNICODE characters are to
be treated as the same!!

We did it this way (i.e. said that for URNs, it is the UNICODE string
which can be printed, and not only the URI sequence) in an attempt to
make it possible for people to print not only the URI sequence in
newspapers, but also the UNICODE string.

But, it is even more complicated than this. The user interface might
not use the character set defined in the URI scheme (in this case for
URNs a user interface might not use UNICODE natively). In this case,
there must be a third mapping from the user interface character set
into the string of original characters.

Geee....this is not fun....but it is ugly like this...

So, my suggestion is to take the text Larry suggests above, put that
one in the URI syntax paper together with simple examples like the
ones I have above. Then, add one paragraph which can be like:

"A sequence of characters in the "original characters" in the URI is
to first be transliterated to a sequence of bytes in something called
the "transliterated string". This mapping is to be defined by the URI
scheme, and is dependent on the character set allowed in the "original
characters" string. The "transliterated string" must in turn be
converted into the "URI sequence", which is the sequence of bytes on
which all operations on URIs occur. In the "transliterated string" and
the "URI sequence", some octets are forbidden, namely all octets which
in US-ASCII have the representation of the following characters: '%',
'#', ... Any such specials (and any other octets in the transliterated
string) can be represented in the URI sequence by one percent sign and
the hexadecimal value of the octet. Note that some characters (such as
'#') are forbidden in both the transliterated string and the URI
sequence.
It is up to the syntax definition of a URI scheme to define how the
mappings from the "original characters" string to the transliterated
string are to be made, to minimize the problems with these special
octets."

   Patrik

Email: paf@swip.net
URL: http://www.tele2.se
PGP: 4D38 91A4 27D9 C8B2 6975 D6BB 21D0 4C57 BD23 6602
Received on Wednesday, 7 January 1998 08:12:51 UTC