UTF-8 and URLs

Larry Masinter (masinter@parc.xerox.com)
Thu, 24 Apr 1997 09:56:56 PDT

Message-Id: <335F90D8.6EDB@parc.xerox.com>
Date: Thu, 24 Apr 1997 09:56:56 PDT
From: Larry Masinter <masinter@parc.xerox.com>
To: John C Klensin <klensin@mci.net>
Cc: uri@bunyip.com
Subject: UTF-8 and URLs


Your clarification didn't help me. And the sticking point
for me is that "as a sequence of glyphs" is an important 
part of the transport of URLs, whether those glyphs are
on paper or on the screen, and that the octet->glyph
and glyph->octet route is really error-prone.

I think that to actually solve the problem of internationalization
of URLs we need three recommendations:

a) If you're writing software that displays URLs to users,
    1) any 'forbidden' octets should be displayed as if
      they were UTF-8-encoded characters. That is, those
      octets are currently disallowed in URLs, but if you
      see them, display them in a standard way.
    2) Any sequences of %HH-encoded octets should be displayed
       EITHER as <%><H><H>, e.g., just show the encoding
       in ASCII, OR by assuming that they're hex-encoded
       UTF-8. The latter assumption is likely to be wrong
       for now, but might change later.
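As a rough sketch of what a.2 might look like in code (this is my
illustration, not part of the proposal; the function name and the use
of Python are assumptions): decode runs of %HH octets as UTF-8 where
they form a valid sequence, and fall back to showing the ASCII escapes
otherwise.

```python
import re

def display_form(url):
    """Display form of a URL: interpret runs of %HH escapes as
    hex-encoded UTF-8 when the octets decode cleanly, otherwise
    just show the encoding in ASCII (recommendation a.2)."""
    def decode_run(match):
        octets = bytes(int(h, 16)
                       for h in re.findall(r"%([0-9A-Fa-f]{2})", match.group(0)))
        try:
            return octets.decode("utf-8")  # strict: reject non-UTF-8 octets
        except UnicodeDecodeError:
            return match.group(0)          # fall back to the raw ASCII escapes
    return re.sub(r"(?:%[0-9A-Fa-f]{2})+", decode_run, url)
```

With this sketch, "%E6%97%A5" would display as the character U+65E5,
while a lone "%FF" (not valid UTF-8) would stay as "%FF".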

b) If you're writing software that lets users type in URLs,
   then if the user types in any character that isn't legal
   in a URL, encode the character as hex-encoded UTF-8. For
   Japanese, avoid using double-wide characters. For RTL
   scripts such as Hebrew or Arabic, leave out any direction
   changes and encode the characters in logical, not presentation,
   order.

   Since there haven't been any standards for non-ASCII character
   representations in URLs, this is as good a choice as any.
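The encoding step in b might be sketched as follows (again my
illustration; the set of characters treated as 'legal in a URL' here
is an assumption, roughly the RFC 1738 safe set):

```python
def encode_user_input(s):
    """Hex-encode, as UTF-8, any typed character that isn't legal
    in a URL (recommendation b). The 'safe' set below is an
    assumption, approximately the RFC 1738 unreserved and
    reserved characters."""
    safe = set("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
               "0123456789-_.~!*'();:@&=+$,/?#[]%")
    out = []
    for ch in s:
        if ch in safe:
            out.append(ch)
        else:
            # each UTF-8 octet of the character becomes one %HH escape
            out.extend("%%%02X" % b for b in ch.encode("utf-8"))
    return "".join(out)
```

So a user typing U+65E5 into a path would end up sending
"%E6%97%A5" on the wire.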

c) If you're writing software that generates URLs to be
   interpreted later, then generate hex-encoded UTF-8, and
   accept either the raw UTF-8 or the hex-encoded version as
   identifying the same resource. This is a recommendation
   for HTTP servers and FTP servers and a variety of other
   implementations.
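On the accepting side, treating raw UTF-8 and hex-encoded UTF-8 as
the same resource amounts to normalizing both to raw octets before
lookup. A sketch, assuming the server sees the path as octets (the
function name is hypothetical):

```python
def canonical_key(path_octets):
    """Normalize an incoming path to raw octets, undoing any %HH
    escapes, so that raw UTF-8 and hex-encoded UTF-8 map to the
    same lookup key (recommendation c)."""
    out = bytearray()
    i = 0
    while i < len(path_octets):
        b = path_octets[i]
        if b == 0x25 and i + 2 < len(path_octets):  # '%' with two octets after it
            try:
                out.append(int(path_octets[i + 1:i + 3].decode("ascii"), 16))
                i += 3
                continue
            except ValueError:  # not two hex digits; keep the '%' literally
                pass
        out.append(b)
        i += 1
    return bytes(out)
```

A server using this would look up "%E6%97%A5" and the corresponding
three raw UTF-8 octets under the same key.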

These three recommendations affect software from a large number
of different producers. To make progress in the community, those
implementors will need to agree that this is the best route to
international interoperability of URLs.

Given its likely controversial nature, I think we should make
these recommendations in a separate RFC, and perhaps under a
new working group.

I'm willing to put this all down in a separate internet draft,
if it will help focus the process on actually making progress.
Some of the examples that have been sent out to the mailing list
will be useful to guide the recommendations in the RFC.