Re: URLs and internationalization

Martin J. Duerst
Thu, 26 Dec 1996 19:46:03 +0100 (MET)

Date: Thu, 26 Dec 1996 19:46:03 +0100 (MET)
From: "Martin J. Duerst" <>
To: Larry Masinter <>
Subject: Re: URLs and internationalization
In-Reply-To: <96Dec26.011608pst."2694">
Message-Id: <Pine.SUN.3.95.961226192254.245N-100000@enoshima>

Hello Larry - Many thanks for your great work.

I have not had time to go through the resulting text in full detail
(I am just in the office between two weeks of vacation and some
other urgent work, and with a heavy cold), but I am very glad to
see the changes you have made. Below are some small comments.

As for "omission", I think it can be a good idea in many cases,
but it can also be very dangerous. It lets different people believe
different things without becoming aware of their differing
interpretations. So I hope we don't omit too much.

> I think I've managed, with only a little circumlocution, to
> reintroduce the 'octet' terminology.

Very good. Makes things much clearer.

> >    The 8-bit coded character set of the octet must be a superset of the
> >    US-ASCII coded character set, such that the US-ASCII characters have
> >    the same escaped encoding regardless of the larger octet character
> >    set.
> I dropped this entire section; I agree that there are some URL schemes
> where there is no coded character set at all.


> I dropped section 6 since 'adding new URL schemes' will be a separate
> document.


> I didn't change "URL Reference" to "URL" and "URL" to something else,
> since that would be too extensive a change. I'm still willing to
> consider doing so.

Without any support from others in the group, changing terminology
is clearly too much work. As I said earlier, there is an easier way
to deal with the issue. I think the forward references you have put
in are good enough.

>   2. URL Characters and Character Escaping
> !    All URLs consist of a restricted set of characters, chosen
> !    primarily to aid transcribability and usability both in computer
> !    systems and in non-computer communications. In addition, characters
> !    used conventionally as delimiters around URLs were excluded.  The
> !    restricted set of characters consists of digits, letters, and a few
> !    graphic symbols corresponding to a subset of the graphic printable
> !    characters of the US-ASCII coded character set [11]; they are
> !    common to most of the character encodings and typing systems
> !    available to Internet users.

Small suggestion: Change "typing systems" to "input systems" or
"input facilities" or "input mechanisms".

> ! 1.5. Characters, octets, and encodings
> ! 
> !    URLs are sequences of characters. Parts of those sequences of
> !    characters are then used to represent sequences of octets. In turn,
> !    sequences of octets are (frequently) used (with a character
> !    encoding scheme) to represent characters. This means that when
> !    dealing with URLs it's necessary to work at three levels:
> ! 
> !                      represented characters
> !                                 ^
> !                                 |
> !                                 v
> !                               octets
> !                                 ^
> !                                 |
> !                                 v
> !                          URL characters
> ! 
> !    This looks more complicated than necessary if all one is dealing
> !    with is file names in ASCII, but is necessary when dealing with the
> !    wide variety of systems in use. URL characters may represent octets
> !    directly or with escape sequences (Section 2.3). Octets may
> !    sometimes represent characters in ASCII, or in other character
> !    encodings, or sometimes be used to represent data that does not
> !    correspond to characters at all.

Very nice! Great work!
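
To make the three levels concrete, here is a minimal sketch in Python.
It is only an illustration, not text proposed for the draft: the
function name url_chars_to_octets is hypothetical, the input is
assumed to be well-formed (every "%" is followed by two hexadecimal
digits, and all other characters are US-ASCII), and ISO-8859-1 is
just one example of a character encoding that might apply.

```python
def url_chars_to_octets(url_part: str) -> bytes:
    """URL characters -> octets (assumes well-formed, ASCII-only input)."""
    octets = bytearray()
    i = 0
    while i < len(url_part):
        ch = url_part[i]
        if ch == "%":
            # Escape sequence: "%" followed by two hexadecimal digits.
            octets.append(int(url_part[i + 1:i + 3], 16))
            i += 3
        else:
            # An unescaped URL character stands for its US-ASCII octet.
            octets.append(ord(ch))
            i += 1
    return bytes(octets)


octets = url_chars_to_octets("D%FCrst")
print(octets)  # b'D\xfcrst'

# Octets -> represented characters: only possible once a character
# encoding is known or assumed. ISO-8859-1 is an assumption here.
print(octets.decode("iso-8859-1"))  # Dürst
```

The first step is defined by the URL syntax alone; the second step
depends on information that the URL itself does not carry, and for
some octet sequences there may be no such step at all.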

> --- 340,377 ----
>      of the URL, but this should not be done unless the URL is being used
>      in a context which does not allow the unescaped character to appear.
> ! 2.3. Escaped "Characters"
> !    Data must be escaped if it does not have a representation using an
> !    unreserved character; this includes data that does not correspond
> !    to a printable character of the US-ASCII coded character set, and
> !    also data that corresponds to characters used to delimit a URL from
> !    its context.

Looks to me as if the title could be changed to "Escaped Octets",
without the quotation marks, of course.

>   2.3.1. Escaped Encoding
> !    An escaped character is encoded as a character triplet, consisting
> !    of the percent character "%" followed by the two hexadecimal digits
>      representing the character's octet code in an 8-bit coded character
> !    set.  For example, "%20" is the escaped encoding for the US-ASCII
> !    space character.

It would probably be better to use "escaped octet" here as well.
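
A short Python sketch may show why "octet" is the more accurate word:
what gets escaped is an octet value, and the same character can
produce different escapes under different character encodings. The
helper name escape_octets is hypothetical, and the UNRESERVED set
below (letters and digits only) is a simplifying assumption, not the
exact set of unreserved characters in the draft.

```python
import string

# Simplifying assumption: only US-ASCII letters and digits are left
# unescaped; the draft's unreserved set also has a few graphic symbols.
UNRESERVED = frozenset((string.ascii_letters + string.digits).encode("ascii"))


def escape_octets(data: bytes) -> str:
    """Represent each octet as a URL character or a "%XX" triplet."""
    out = []
    for octet in data:
        if octet in UNRESERVED:
            out.append(chr(octet))
        else:
            # "%" followed by the two hexadecimal digits of the octet.
            out.append("%%%02X" % octet)
    return "".join(out)


print(escape_octets(b" "))                          # %20
print(escape_octets("Dürst".encode("iso-8859-1")))  # D%FCrst
print(escape_octets("Dürst".encode("utf-8")))       # D%C3%BCrst
```

The escaping step never looks at characters, only at octets, which
is the reason for preferring "escaped octet" in the text.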

Happy New Year,		Martin.