Re: Using UTF-8 for non-ASCII Characters in URLs

Dan Oscarsson
Wed, 30 Apr 1997 08:52:17 +0200 (MET DST)

> Since no one else has, here's a rough draft of a UTF-8 URL
> internet-draft, which I intend to submit in a few days time,
> after taking another pass on it.
> -----
> INTERNET-DRAFT			    Larry Masinter, Xerox Corporation
> draft-masinter-url-i18n-00xx	                       April 27, 1997
> Expires: October 27, 1997

> 3.2 Requirements for URL generation and interpretation
>    Systems that are offering resources through the internet
>    where those resources have logical names sometimes offer
>    the ability to generate URLs for the resources they offer.
>    For example, some HTTP servers offer the ability to
>    generate a 'directory listing' for file directories
>    under their purview, and then to respond to the generated
>    URLs with the files. If the names of the files consist
>    solely of US-ASCII characters, the transcription is
>    simple, but other file systems offer a wider variety
>    of characters. It is recommended that the generation
>    of directories result in hex-encoded UTF-8 for non-USASCII
>    characters in the listing, and that the interpretation
>    of URLs accept both the raw UTF-8 or the hex-encoded version.

This is not right. A directory listing service generates an HTML document
that is sent back to the web browser. All URLs within an HTML document
should use the same character set as the document itself. That is,
if the document uses ISO 8859-1, the URLs will be in ISO 8859-1, and
if the document is in UTF-8, the URLs will be in UTF-8.
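
As a rough illustration of what I mean, here is a minimal Python sketch.
The function name directory_listing and the file name are made up for
the example; they are not taken from the draft or from any real server.
The listing is built as text and encoded once, so the link targets end
up in the same character set as the rest of the document.

```python
from html import escape

def directory_listing(names, charset):
    """Build a minimal HTML directory listing and encode it in `charset`.

    Hypothetical helper: the link targets are written as plain
    characters, so after encoding they share the document's character set.
    """
    rows = "\n".join(
        '<li><a href="{0}">{0}</a></li>'.format(escape(name, quote=True))
        for name in names
    )
    page = "<html><body><ul>\n" + rows + "\n</ul></body></html>\n"
    return page.encode(charset)

# The same listing, served in two different document character sets.
latin1_page = directory_listing(["smörgås.html"], "iso-8859-1")
utf8_page = directory_listing(["smörgås.html"], "utf-8")
```

In the ISO 8859-1 page each accented letter in the link is one byte; in
the UTF-8 page it is two. Neither page needs hex encoding in the link.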

If the browser knows how to handle the character set of the HTML document,
it should also know how to translate the embedded URLs into UTF-8 when
the user follows a link.
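
The translation step could look something like the following Python
sketch. The name link_to_utf8_url and the sample path are assumptions
for illustration only, and a real browser would have more cases to
handle. The link bytes are decoded using the document's character set,
and the non-ASCII characters are then hex-encoded as UTF-8, using
urllib.parse.quote from the standard library.

```python
from urllib.parse import quote

def link_to_utf8_url(raw_link: bytes, document_charset: str) -> str:
    """Translate a link found in a document into a hex-encoded UTF-8 URL.

    Hypothetical helper: `raw_link` is the link exactly as it appears
    in the document's bytes, `document_charset` is the document's
    declared character set.
    """
    # Recover the characters the author meant.
    text = raw_link.decode(document_charset)
    # quote() encodes as UTF-8 by default; reserved characters and
    # existing %XX escapes are left untouched via `safe`.
    return quote(text, safe="/:?#[]@!$&'()*+,;=%")

# The same link, found in an ISO 8859-1 page and in a UTF-8 page.
from_latin1 = link_to_utf8_url(b"/files/sm\xf6rg\xe5s.html", "iso-8859-1")
from_utf8 = link_to_utf8_url(b"/files/sm\xc3\xb6rg\xc3\xa5s.html", "utf-8")
```

Both calls give the same request URL, so the server only ever has to
interpret UTF-8, whatever character set the listing was served in.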

In general, URLs used without a context that defines the characters used
should be encoded using UTF-8. URLs used within a context where the
meaning of the characters is defined should use the character encoding
of that context.