Re: draft-fielding-url-syntax-05.txt
Chris Newman (Chris.Newman@innosoft.com)
Fri, 02 May 1997 11:07:04 -0700 (PDT)
Date: Fri, 02 May 1997 11:07:04 -0700 (PDT)
From: Chris Newman <Chris.Newman@innosoft.com>
Subject: Re: draft-fielding-url-syntax-05.txt
In-reply-to: <3369C320.5258@parc.xerox.com>
To: Larry Masinter <masinter@parc.xerox.com>
Cc: IETF URI list <uri@bunyip.com>
Message-id: <Pine.SOL.3.95.970502110058.18890J-100000@eleanor.innosoft.com>
On Fri, 2 May 1997, Larry Masinter wrote:
> 2. URL Characters and Escape Sequences
>
> URLs consist of a restricted set of characters, primarily chosen to
> aid transcribability and usability both in computer systems and in
> non-computer communications. Characters used conventionally as
> delimiters around URLs were excluded. The restricted set of
> characters consists of digits, letters, and a few graphic symbols
> chosen from those common to most of the character encodings
> and input facilities available to Internet users.
>
> Within a URL, characters are either used as delimiters, or to
> represent strings of data (octets) within the delimited portions.
> Octets are either represented directly by a character (using the
> US-ASCII character for that octet) or by an escape encoding. This
> representation is elaborated below.
>
> 2.1 URLs and non-ASCII characters
>
> While URLs are sequences of characters and those characters are
> used (within delimited sections) to represent sequences of octets,
> in some cases those sequences of octets are used (via a 'charset'
> or character encoding scheme) to represent sequences of characters:
>
> URL char. sequence <-> octet sequence <-> original char. sequence
>
> In cases where the original character sequence contains characters
> that are strictly within the set of characters defined in the
> US-ASCII character set, the mapping is simple: each original
> character is translated into the US-ASCII code for it, and
> subsequently represented either as the same character, or as an
> escape sequence.
>
> In general practice, many different character encoding schemes are
> used in the second mapping (between sequences of represented
> characters and sequences of octets) and there is generally no
> representation in the URL itself of which mapping was used. While
> there is a strong desire to provide for a general and uniform
> mapping between more general scripts and URLs, the standard for
> such use is outside of the scope of this document.
I find this much too wishy-washy. I think we should explicitly forbid the
use of 8-bit characters and hex-encoded 8-bit characters, except as
defined by the future I18N URL standard. We need to make it very clear
that programs sending 8-bit URLs over the wire are broken (unless they use
UTF-8 according to the future standard).
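
As a rough sketch of what such a rule would mean for an implementation,
the Python below flags both cases: raw octets above 0x7F and hex escapes
that decode to octets above 0x7F. The helper name find_8bit_problems and
the example values are made up for illustration and do not come from the
draft. The sketch also ignores the UTF-8 exception, since the standard
that would define it does not exist yet. The last few lines show why the
receiver cannot guess the sender's intent: the same octet decodes to
different characters under different charsets.

```python
import re

# Matches a %XX escape whose value is 0x80 or above (first hex digit 8-F).
_HIGH_ESCAPE = re.compile(r"%[89A-Fa-f][0-9A-Fa-f]")


def find_8bit_problems(url: bytes) -> list[str]:
    """Return the reasons a URL would be rejected under the proposed rule.

    Hypothetical helper, not part of any draft: it flags raw octets above
    0x7F and hex escapes that decode to octets above 0x7F.
    """
    problems = []
    if any(octet > 0x7F for octet in url):
        problems.append("raw 8-bit octet")
    # Latin-1 maps every octet to exactly one character, so this cannot fail.
    if _HIGH_ESCAPE.search(url.decode("latin-1")):
        problems.append("hex-encoded 8-bit octet")
    return problems


# The same octet names different characters under different charsets,
# and nothing in the URL says which charset the sender used.
ambiguous = bytes([0xE9])
as_latin1 = ambiguous.decode("iso-8859-1")  # LATIN SMALL LETTER E WITH ACUTE
as_koi8r = ambiguous.decode("koi8_r")       # a Cyrillic letter
```

A URL containing only US-ASCII characters and escapes below %80 yields an
empty list; anything else names the offending form so the sender can be
told exactly what is broken.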