Re: proposed rewrite of 2.1 of draft-fielding-uri-syntax ... from Patrik Fältström on 1998-01-07 (uri@w3.org from January 1998)

From: Patrik Fältström <paf@swip.net>
Date: Wed, 07 Jan 1998 07:21:12 +0100
To: Larry Masinter <masinter@parc.xerox.com>
Cc: uri-i18n@unicode.org, fielding@ics.uci.edu, uri@bunyip.com, Jacob Palme <jpalme@dsv.su.se>
Message-Id: <3.0.3.32.19980107072112.006e75f8@nix.swip.net>
At 22:09 1998-01-04 PST, Larry Masinter wrote:
>Jacob Palme pointed out that the second (one line) paragraph
>of the current Section 2.1 of the URI/URL/whatever draft was
>hard to understand. With some amount of trepidation, I propose
>the following (alas lengthy) rewrite:

This is a good start, but I definitely think that the part talking about
UTF-8 has to say more about multibyte character sets, which will give the
best example of what the difference is between a "URI character sequence"
and an "original character sequence".

Ultimately I would want to have different words for the character in
US-ASCII which the octet in the URI represents and the character in the URI
(which can be represented by more than one character in US-ASCII).

>-----------------------------------------------
>2.1 URIs and non-ASCII characters   
>
>   The relationship between URIs and characters (for characters that
>   are not part of ASCII) has been a source of confusion. To describe
>   the relationship, it is useful to distinguish between a "character"
>   (as a distinguishable semantic entity) and an "octet" (an 8-bit
>   byte). There are two mappings, one from URI characters to octets,
>   and a second from octets to original characters:
>
>   URI character sequence->octet sequence->original character sequence
>
>   A URI is represented as a sequence of characters, not as a sequence
>   of octets. That is because URIs might be "transported" by means that
>   are not through a computer network, e.g., printed on paper, read
>   over the radio.

An example here would help. For example (I think this is what you are
saying?):

   http://foo.com/%41.html -> http://foo.com/A.html

   URI character sequence     Original characters

(0x41 is the hex value of 'A' in US-ASCII.)
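
As a quick check of the hex value, here is a minimal sketch in Python using
the standard library's urllib.parse.unquote. The host foo.com is only the
placeholder from the example above, and the variable names are made up for
illustration.

```python
from urllib.parse import unquote

# 'A' is code 65 decimal, i.e. 41 hex, in US-ASCII.
hex_for_A = format(ord("A"), "02X")

# Percent-decoding the URI character sequence gives the original characters.
decoded_41 = unquote("http://foo.com/%41.html")

# %31, by contrast, decodes to the digit '1'.
decoded_31 = unquote("http://foo.com/%31.html")
```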

>   URI schemes may define a mapping from URI characters to octets;
>   whether this is done depends on the scheme. Commonly, within a
>   delimited section of a URI a sequence of characters may be
>   used to represent a sequence of octets. For example, the character
>   "a" represents the octet 97 (decimal), while the character sequence
>   "%", "0", "a" represents the octet 10 (decimal).
>
>   Secondarily, for some schemes and protocols, there is a second
>   translation: the sequence of octets defined by a component of the URI
>   is subsequently used to represent a sequence of characters. A 'charset'
>   defines this mapping. There are many charsets in use in Internet
>   Protocols. For example, UTF8 [UTF8] defines a mapping from sequences
>   of octets to sequences of characters in the repertoire of ISO 10646.
>   
>   In the simplest case, the original character sequence contains
>   only characters that are defined in US-ASCII, and the two levels
>   of mapping are simple and easily invertible: each 'original character'
>   is represented as the octet for the US-ASCII code for it, which is,
>   in turn, represented as either the US-ASCII character, or else the
>   "%" escape sequence for that octet.
>
>   For original character sequences that contain non-ASCII characters,
>   however, the situation is more difficult. Internet protocols which
>   transmit octet sequences intended to represent character sequences
>   are expected to provide some way of identifying the charset to be used,
>   if there might be more than one [RFC-char-standard].  However,
>   there is currently no provision within the generic URI syntax to
>   accomplish this identification. Of course, individual URI schemes
>   may provide a way to indicate the charset used or define a default
>   charset. In addition, there is no definition of the meaning of
>   characters outside of a limited repertoire for interpretation of
>   non-ASCII URI characters.  
>
>   It is expected that a systematic treatment of character encoding
>   within URIs will be developed as a future modification of this
>   specification.

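The two mappings in the quoted text (URI characters to octets, then octets
to original characters) can be sketched roughly as follows in Python. This
is only an illustration, not text from the draft: the function names are
invented here, and the choice of UTF-8 and ISO 8859-1 as the two candidate
charsets is an assumption made to show why the charset has to be known.

```python
from urllib.parse import unquote_to_bytes


def uri_chars_to_octets(component: str) -> bytes:
    """First mapping: URI character sequence -> octet sequence."""
    return unquote_to_bytes(component)


def octets_to_original(octets: bytes, charset: str) -> str:
    """Second mapping: octet sequence -> original characters, per a charset."""
    return octets.decode(charset)


octets_a = uri_chars_to_octets("a")     # the single octet 97
octets_lf = uri_chars_to_octets("%0a")  # the single octet 10

# A non-ASCII case: the same URI characters give different original
# characters depending on which charset the reader assumes.
octets = uri_chars_to_octets("%C3%A4")
as_utf8 = octets_to_original(octets, "utf-8")         # one character, U+00E4
as_latin1 = octets_to_original(octets, "iso-8859-1")  # two characters
```
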
To conclude, we have a three-level mapping, which is as follows:

  Original characters -> Transliterated string -> URI sequence

What the URI scheme papers should talk about are the "Original characters"
and how the mapping to the transliterated string should be done (i.e.
from what is printed on paper, what counts as equality between two such
strings...), while the URI syntax paper should only talk about the mapping
from the transliterated string into the sequence of octets which makes up the
URI sequence, which is passed on the wire and seen as the only "safe" format
for printing URIs.

Example one: One of the "Original characters" is 'A', which is represented
by a sequence of bytes, where the value of one of the bytes is the same as
the character '#' in US-ASCII. The URI syntax paper should say that the
byte value represented by the US-ASCII character '#' is not to be allowed
in the "transliterated string". The same thing should hold for other
"specials".

Example two: One of the "original characters" is '#', which is represented
by one byte, where the value of it is NOT the same as the character '#' in
US-ASCII.
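
Both examples can be reproduced with charsets that ship with Python. The
particular choices below are assumptions made only for illustration:
UTF-16 (big-endian) with the character U+0123 for example one, because its
encoding happens to contain the octet 0x23, and the EBCDIC code page cp037
for example two.

```python
# Example one: a multi-octet encoding in which one octet has the same
# value as '#' in US-ASCII (0x23).
ch = "\u0123"
utf16 = ch.encode("utf-16-be")  # two octets: 0x01 0x23
collides = b"#" in utf16        # the octet for '#' appears in the encoding

# Example two: '#' itself, encoded as a single octet whose value is not
# the US-ASCII value of '#'.
hash_ebcdic = "#".encode("cp037")
differs = hash_ebcdic != b"#"
```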

The URI syntax paper should also talk about equivalences between URI
sequences, i.e. which sequences map to the same transliterated string. I
would like to have the transliterated string without the percent-quoting,
which means that the URI sequences with '%41' and 'A' are mapped to the
same transliterated string.
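
A naive sketch of that equivalence in Python: two URI sequences are
compared after the percent-quoting is removed. The function name is
invented here. Note the assumption that every escape is decoded, which also
makes an escaped special such as %2F equal to a literal '/'; a real
definition would probably have to exclude the specials.

```python
from urllib.parse import unquote_to_bytes


def same_transliterated_string(uri_a: str, uri_b: str) -> bool:
    """Compare two URI sequences after removing percent-quoting."""
    return unquote_to_bytes(uri_a) == unquote_to_bytes(uri_b)


equal = same_transliterated_string("http://foo.com/%41.html",
                                   "http://foo.com/A.html")
unequal = same_transliterated_string("http://foo.com/%41.html",
                                     "http://foo.com/B.html")
```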

The URL syntax paper, and the URN syntax paper, should talk about the
mapping from the original characters to the transliterated string. In the
URN syntax paper, for example, we have said that we only allow UNICODE in
the sequence of original characters, which in turn means that the mapping
to the transliterated string is defined by the UTF-8 encoding. This makes it
simpler to see that no character in the string of original characters maps
to one of the forbidden octets in the transliterated string (according to
the URI syntax paper). It also has some implications regarding
equivalence, as the UNICODE character set defines that some sequences of
UNICODE characters are to be treated as the same!! We did it this way (i.e.
said that for URNs, it is the UNICODE string which can be printed, and not
only the URI sequence) in an attempt to make it possible for people to
print not only the URI sequence in newspapers, but also the UNICODE string.
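
Two properties mentioned here can be checked with a short Python sketch.
The sample characters are arbitrary, and the use of normalization form NFC
as the notion of "treated as the same" is an assumption for illustration,
not something the URN paper is claimed to prescribe.

```python
import unicodedata

# Every octet of a UTF-8 multi-octet sequence is 0x80 or above, so a
# non-ASCII character can never produce the octet for '#' or '%'.
sample = "\u0123\u00e4\u30bd"
all_high = all(octet >= 0x80 for octet in sample.encode("utf-8"))

# Unicode equivalence: precomposed U+00E4 versus 'a' followed by the
# combining diaeresis U+0308.
composed = "\u00e4"
decomposed = "a\u0308"
differ_raw = composed != decomposed
equal_nfc = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))
```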

But, it is even more complicated than this. The user interface might not
use the character set defined in the URI scheme (in this case for URNs a
user interface might not use UNICODE natively). In this case, there must be
a third mapping from the user interface character set into the string of
original characters.
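
A minimal sketch of such a third mapping in Python, assuming (only for the
sake of the example) a user interface that works in ISO 8859-1 and a scheme
whose original characters are UNICODE encoded as UTF-8. The function name
is invented here.

```python
def ui_to_scheme_octets(ui_bytes: bytes, ui_charset: str) -> bytes:
    """UI octets -> original characters (Unicode) -> UTF-8 octets."""
    original = ui_bytes.decode(ui_charset)  # third mapping: UI -> original
    return original.encode("utf-8")         # mapping defined by the scheme


latin1_input = b"\xe4"  # U+00E4 as entered in an ISO 8859-1 interface
utf8_octets = ui_to_scheme_octets(latin1_input, "iso-8859-1")
```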

Geee....this is not fun....but it is ugly like this...


So, my suggestion is to take the text Larry suggests above, put that one in
the URI syntax paper together with simple examples like the ones I have
above. Then, add one paragraph which can be like:

"A sequence of characters in the "original characters" in the URI is
first to be transliterated to a sequence of bytes in something called the
"transliterated string". This mapping is to be defined by the URI scheme,
and is dependent on the character set allowed in the "original characters"
string. The "transliterated string" must in turn be converted into the
"URI sequence", which is the sequence of bytes on which all operations on
URIs occur. In the "transliterated string" and the "URI sequence", some octets
are forbidden, namely all octets which in US-ASCII represent
the following characters:

 '%', '#', ...

Any such specials (and any other octets in the transliterated string) can
be represented in the URI sequence by a percent sign followed by the
hexadecimal value of the octet. Note that some characters (such as '#') are
forbidden in both the transliterated string and the URI sequence.

It is up to the syntax definition of a URI scheme to define how the
mapping from the "original characters" string to the transliterated
string is to be made, so as to minimize the problems with these special
octets."
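
The whole chain in the suggested paragraph could be sketched in Python
roughly like this. Several things are assumptions made for the sketch: the
set of specials contains only the two characters actually named ('%' and
'#', the rest of the list being left open), the default charset is UTF-8,
and octets outside printable US-ASCII are also escaped. The function name
is invented.

```python
# Assumed set of specials: only the two that are named explicitly.
SPECIALS = b"%#"


def to_uri_sequence(original: str, charset: str = "utf-8") -> str:
    """Original characters -> transliterated string -> URI sequence."""
    transliterated = original.encode(charset)  # mapping defined by the scheme
    out = []
    for octet in transliterated:
        needs_escape = (octet in SPECIALS      # special octets
                        or octet >= 0x80       # non-ASCII octets
                        or octet <= 0x20       # controls and space
                        or octet == 0x7F)      # DEL
        if needs_escape:
            out.append("%{:02X}".format(octet))
        else:
            out.append(chr(octet))
    return "".join(out)


escaped_hash = to_uri_sequence("a#b")
escaped_umlaut = to_uri_sequence("\u00e4")
escaped_percent = to_uri_sequence("100%")
```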

    Patrik


Email: paf@swip.net            URL: http://www.tele2.se
PGP: 4D38 91A4 27D9 C8B2 6975  D6BB 21D0 4C57 BD23 6602
Received on Wednesday, 7 January 1998 08:12:51 UTC