Re: draft-fielding-url-syntax-05.txt

Chris Newman (Chris.Newman@innosoft.com)
Fri, 02 May 1997 11:07:04 -0700 (PDT)


Date: Fri, 02 May 1997 11:07:04 -0700 (PDT)
From: Chris Newman <Chris.Newman@innosoft.com>
Subject: Re: draft-fielding-url-syntax-05.txt
In-reply-to: <3369C320.5258@parc.xerox.com>
To: Larry Masinter <masinter@parc.xerox.com>
Cc: IETF URI list <uri@bunyip.com>
Message-id: <Pine.SOL.3.95.970502110058.18890J-100000@eleanor.innosoft.com>

On Fri, 2 May 1997, Larry Masinter wrote:

> 2. URL Characters and Escape Sequences
> 
>    URLs consist of a restricted set of characters, primarily chosen to
>    aid transcribability and usability both in computer systems and in
>    non-computer communications. Characters used conventionally as
>    delimiters around URLs were excluded.  The restricted set of
>    characters consists of digits, letters, and a few graphic symbols
>    chosen from those common to most of the character encodings
>    and input facilities available to Internet users.
> 
>    Within a URL, characters are either used as delimiters, or to
>    represent strings of data (octets) within the delimited portions.
>    Octets are either represented directly by a character (using the
>    US-ASCII character for that octet) or by an escape encoding.  This
>    representation is elaborated below.
>    
> 2.1 URLs and non-ASCII characters   
>    
>    While URLs are sequences of characters and those characters are
>    used (within delimited sections) to represent sequences of octets,
>    in some cases those sequences of octets are used (via a 'charset'
>    or character encoding scheme) to represent sequences of characters:
>    
>    URL char. sequence <-> octet sequence <-> original char. sequence
>    
>    In cases where the original character sequence contains characters
>    that are strictly within the set of characters defined in the
>    US-ASCII character set, the mapping is simple: each original
>    character is translated into the US-ASCII code for it, and
>    subsequently represented either as the same character, or as an
>    escape sequence.
> 
>    In general practice, many different character encoding schemes are
>    used in the second mapping (between sequences of represented
>    characters and sequences of octets) and there is generally no
>    representation in the URL itself of which mapping was used. While
>    there is a strong desire to provide for a general and uniform
>    mapping between more general scripts and URLs, the standard for
>    such use is outside of the scope of this document.

I find this much too wishy-washy.  I think we should explicitly forbid the
use of 8-bit characters and hex-encoded 8-bit characters, except as
defined by the future I18N URL standard.  We need to make it very clear
that programs sending 8-bit URLs over the wire are broken (unless they use
UTF-8 according to the future standard).
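
To illustrate why the unlabeled mapping matters, here is a minimal sketch
using Python's standard urllib.parse module.  It is an illustration only,
not text from the draft: the helper name uses_8bit and the example.com URLs
are hypothetical.  It shows the same original character producing two
different escape sequences depending on the charset, and one possible way
to flag URLs that contain raw 8-bit characters or hex-encoded 8-bit octets.

```python
import re
from urllib.parse import quote, unquote_to_bytes

# The same original character yields different octets, and therefore
# different URL escape sequences, depending on the charset used.
latin1_form = quote("é", encoding="latin-1")  # octet E9
utf8_form = quote("é", encoding="utf-8")      # octets C3 A9

# Going the other way, the URL alone does not say which charset applies:
# the escape sequence only recovers the octets, not the characters.
octets = unquote_to_bytes(latin1_form)

# Matches %XX escapes whose first hex digit is 8-F, i.e. octets 0x80-0xFF.
_HIGH_ESCAPE = re.compile(r"%[89A-Fa-f][0-9A-Fa-f]")


def uses_8bit(url: str) -> bool:
    """Hypothetical check: True if the URL contains a raw non-ASCII
    character or a hex-encoded octet with the high bit set."""
    has_raw = any(ord(ch) > 0x7F for ch in url)
    has_escaped = _HIGH_ESCAPE.search(url) is not None
    return has_raw or has_escaped


if __name__ == "__main__":
    print(latin1_form, utf8_form, octets)
    for candidate in (
        "http://example.com/a%20b",
        "http://example.com/%E9",
        "http://example.com/\u00e9",
    ):
        print(candidate.encode("unicode_escape").decode("ascii"),
              uses_8bit(candidate))
```

A receiver that sees only the escaped octet E9 cannot tell whether it was
meant as Latin-1 or as part of some other encoding, which is the ambiguity
the proposed restriction would rule out.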