Using UTF-8 for non-ASCII Characters in URLs

Larry Masinter (masinter@parc.xerox.com)
Sun, 27 Apr 1997 17:33:57 PDT


Message-Id: <3363F075.195B@parc.xerox.com>
Date: Sun, 27 Apr 1997 17:33:57 PDT
From: Larry Masinter <masinter@parc.xerox.com>
To: uri@bunyip.com
Subject: Using UTF-8 for non-ASCII Characters in URLs

Since no one else has, here's a rough draft of a UTF-8 URL
internet-draft, which I intend to submit in a few days time,
after taking another pass on it.


-----
INTERNET-DRAFT			    Larry Masinter, Xerox Corporation
draft-masinter-url-i18n-00xx	                       April 27, 1997
Expires: October 27, 1997


              Using UTF-8 for non-ASCII Characters in URLs

Status of this Memo

   This document is an Internet-Draft.  Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups.  Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as ``work in
   progress.''

   To learn the current status of any Internet-Draft, please check the
   ``1id-abstracts.txt'' listing contained in the Internet-Drafts
   Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net
   (Europe), munnari.oz.au (Pacific Rim), ds.internic.net (US East
   Coast), or ftp.isi.edu (US West Coast).

Abstract

   Traditionally, URLs have been written in ASCII and used both
   as a method of transcription and identification, but also in
   advertising, magazines and newspapers. This document specifies
   a uniform way of representing non-ASCII scripts in URLs so
   that they can be used for the world's languages.

1. Introduction

   URLs were defined as ASCII-only. This leaves out most of the
   world of people, who do not write merely with the letters A-Z.

2. Syntax

   This memo defines two kinds of URLs: hex-encoded URLs, and
   8-bit URLs. For traditional URLs (those defined in
   [RFC-URL-SYNTAX]), characters should be translated into
   UTF8 and then any octet not allowed in RFC-URL-SYNTAX
   should be hex-encoded.

   For 8-bit-URLs, it is only necessary to hex-encode reserved
   characters, delims, unwise special characters, and white space.

3. Software Requirements

   Supporting UTF-8 URLs requires cooperation from the providers
   of three different components of URL software.

3.1 Requirements for URL entry

   One component of software that deals with URLs allows
   users to type in the URLs. That is, a human transcribes
   a visual representation of a URL (as a sequence of
   glyphs, in some order, in some visual display) using
   some entry method that will result in a URL.

   If the visual representation contains only those
   characters that were defined as part of the [RFC-URL-SYNTAX]
   standard syntax of URLs, the transcription would be
   simple. However, for all other sequences of characters,
   entry should result in characters, in logical order,
   from the ISO 10646 character repertoire, encoded using
   the UTF-8 method [RFC 2044], and then subsequently
   encoded using the URL escape mechanism [RFC-URL-SYNTAX].

   Some care must be taken in this conversion, for example,
   that all accented characters should be translated into
   their combined form, that no extraneous BIDI marks be
   left in the resulting stream, that any characters that
   _look_ like ASCII characters be transcribed into their
   ascii equivalents and not, for example, as double-wide
   characters.

3.2 Requirements for URL generation and interpretation
   
   Systems that are offering resources through the internet
   where those resources have logical names sometimes offer
   the ability to generate URLs for the resources they offer.
   For example, some HTTP servers offer the ability to
   generate a 'directory listing' for file directories
   under their purvue, and then to respond to the generated
   URLs with the files. If the names of the files consist
   solely of US-ASCII characters, the transcription is
   simple, but other file systems offer a wider variety
   of characters. It is recommended that the generation
   of directories result in hex-encoded UTF-8 for non-USASCII
   characters in the listing, and that the interpretation
   of URLs accept both the raw UTF-8 or the hex-encoded version.

   This recommendation applies to HTTP servers, and also
   to FTP servers, gopher servers, etc.

3.3 Requirements for display of URLs

   Software that displays URLs to users (or other kinds of
   transcription, e.g., deciding what to print in your
   magazine) should of course be robust: don't tell users
   about a URL they can't type!

    1) any 'forbidden' octets should be displayed as if
      they were UTF-8 encoded characters. That is, those
      octets are currently disallowed in URLs, but if you
      see them, display them in a standard way.
    2) Any sequences of %HH-encoded octets should be displayed
       EITHER as <%><H><H>, e.g., just show the encoding
       in ASCII, OR by assuming that they're hex-encoded
       UTF-8. The latter assumption is likely to be wrong
       for now, but might change later.

Summary

These three recommendations, when taken together, will allow
for localization of URLs; users may use their native scripts
for writing URLs that are "uniform" for their context.

Acknowledgements

References

[RFC 2044] 
[RFC-URL-SYNTAX]