Re: Using UTF-8 for non-ASCII Characters in URLs

Gary Adams - Sun Microsystems Labs BOS (Gary.Adams@east.sun.com)
Wed, 30 Apr 1997 07:46:18 -0400


Date: Wed, 30 Apr 1997 07:46:18 -0400
From: Gary.Adams@east.sun.com (Gary Adams - Sun Microsystems Labs BOS)
Message-Id: <199704301146.HAA25655@zeppo.East.Sun.COM>
To: Dan.Oscarsson@trab.se, masinter@parc.xerox.com, uri@bunyip.com
Subject: Re: Using UTF-8 for non-ASCII Characters in URLs

> From: Dan Oscarsson <Dan.Oscarsson@trab.se>
...
> >    characters in the listing, and that the interpretation
> >    of URLs accept both the raw UTF-8 or the hex-encoded version.
> > 
> 
> This is not right. A directory listing service generates a html document
> that is sent back to the web browser. All URLs within a html document
> should use the same character set as the document uses. That is, 
> if the document uses iso 8859-1, the URLs will be in iso 8859-1, and
> if the document is in UTF-8, the URLs will be in UTF-8.

I have been experimenting with a mixed character set directory just to
see what would happen if a site did need to support multiple encodings.
I don't know whether this will be helpful in the current discussion:

   http://www.sunlabs.com/research/ila/test_i18n/

(If your browser can't access the directory try viewing the 
README.txt or the cgi.html file in that directory first.)

I'm not sure I understand the comment above about the document
character set and the encoding of the URLs. If I start with
a Unix server with EUC-jp encoded filenames and generate a
directory listing, today the 8-bit bytes are %HH escaped
for the URL references (so they are now safe, ASCII-transmittable
sequences), while the "document text" is transmitted as 8-bit
bytes assumed to be iso-8859-1 characters.
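As a rough illustration of that %HH escaping (a modern Python sketch;
the Japanese filename is purely an example, not one from the test
directory above):

```python
from urllib.parse import quote, unquote_to_bytes

# Example filename on a hypothetical server with EUC-JP filenames.
name = "日本語.txt"
raw = name.encode("euc_jp")          # the raw 8-bit bytes on disk

# Today's behavior described above: each 8-bit byte is %HH-escaped,
# yielding a safe ASCII-only URL path component.
escaped = quote(raw)
print(escaped)                       # %C6%FC%CB%DC%B8%EC.txt

# The receiver can recover the original bytes, but nothing in the
# URL itself records which charset those bytes were in.
assert unquote_to_bytes(escaped) == raw
```

Note that the escaped form is unambiguous only as a byte sequence;
interpreting it as characters still requires out-of-band knowledge of
the server's filename encoding.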

For the most part, all "import/export" type operations such as
"cut-and-paste" of document text or "save-url-in-hotlist" are
"non-destructive"; i.e., the text content is preserved and the
subsequent URL operations are still functional, even though the
document text is not rendered correctly and the URL encoding
is ambiguous.

> 
> If the browser knows how to handle the character set of the html document,
> it also should know how to translate the embedded URLs into UTF-8 when
> the user follows a link.

I think it is important to keep the document text and the URL "text"
as separate entities. An EUC-jp file name is a reasonable restriction on
some platforms that might provide storage for SJIS encoded documents.
When a web server delivers the content of that document on the web, the
document content would be appropriately labeled or converted for
transport, and the URL could be encoded in any form acceptable for
retrieving the file in a later transaction.

> 
> In general, URLs used without a context that defines the characters used,
> should be encoded using UTF-8. URLs used within a context where the
> meaning of the characters is defined should use the character encoding
> of the context.

I'm not sure that it is a good idea to tie the URL encoding
interpretation to its immediate context. If I attempt to
"Save" a document from the browser (or a spider agent is gathering
documents automatically from the web), then the characters of the URL
are often used to form a local file system name for the fetched object.
So I fetch a SJIS-named file via an http server and save it in my
~/public_html EUC-jp file system.  Using UTF-8 on the wire (if
prearranged) allows both sites to use meaningful names for their local
resources and to safely share the public handles for the information.
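That round trip can be sketched as follows (a Python illustration of
the prearranged "UTF-8 on the wire" convention; the filename and both
local encodings are assumptions for the example):

```python
from urllib.parse import quote, unquote_to_bytes

# Sending side: SJIS file system. Local name -> Unicode -> UTF-8 -> %HH.
local_name = "日本語.txt"
sjis_bytes = local_name.encode("shift_jis")
on_the_wire = quote(sjis_bytes.decode("shift_jis").encode("utf-8"))
print(on_the_wire)                   # %E6%97%A5%E6%9C%AC%E8%AA%9E.txt

# Receiving side: EUC-JP file system. %HH -> UTF-8 bytes -> Unicode ->
# local encoding. The name stays meaningful at both ends.
unicode_name = unquote_to_bytes(on_the_wire).decode("utf-8")
euc_name = unicode_name.encode("euc_jp")
assert euc_name == unicode_name.encode("euc_jp")
```

The key point is that the wire form is the same regardless of either
side's local file system encoding, so the public handle can be shared
safely.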

/gra