Re: revised "generic syntax" internet draft

Roy T. Fielding (fielding@kiwi.ICS.UCI.EDU)
Tue, 15 Apr 1997 17:10:48 -0700


To: Gary Adams - Sun Microsystems Labs BOS <Gary.Adams@east.sun.com>
Cc: uri@bunyip.com
Subject: Re: revised "generic syntax" internet draft 
In-Reply-To: Your message of "Tue, 15 Apr 1997 15:05:53 EDT."
             <libSDtMail.9704151505.29976.gra@zeppo> 
Date: Tue, 15 Apr 1997 17:10:48 -0700
From: "Roy T. Fielding" <fielding@kiwi.ICS.UCI.EDU>
Message-Id:  <9704151710.aa29231@paris.ics.uci.edu>

>> d) what does the browser do with what the user typed
>>    in order to turn it into the URL that was generated in (a).
>
>Today the only alternative is to say the platform specific encoding
>of the server system must be %HH encoded as raw octets and published
>in the magazine, which the user enters as raw ascii strings, which is
>transmitted to the server where it is %HH decoded and handed to the local
>data store. i.e., it is only meaningful to the local server and is
>opaque to the magazine, the end user and the browser.

Right -- that is also the case if the author decides that the most
interoperable URL is represented in ASCII, even though the underlying
characters are non-ASCII.  [Note that it is always possible to include
both URLs in the magazine.]

>If the encoding is labeled (or known to be UTF8), then the magazine
>could publish either native character representation or a %HH escaped
>URL. Similarly the browser could support input of native characters
>or a %HH escaped URL. Finally, the %HH escaped UTF8 URL is transmitted
>to the server and converted for use in accessing the local resource.

The magazine could also just publish the native character representation
and assume that the reader's browser is set up to use the same charset
encoding as the server.  OTOH, the standard could say that when a URL
is entered from a source that has no charset, use UTF-8.  The real
question is which charset the server is most likely to be using.
This is the crux of the problem.

If a browser assumes that the server is using UTF-8 and transcodes the
non-ASCII octets before submission to the server, then bad things happen
if the server is not using UTF-8.  The "bad things" range
from disallowed access to invalid form data entry.  Since it is not
possible for us to require all servers to be upgraded, it is not safe
for browsers to perform transcoding of URLs, and therefore it is impossible
to deploy a solution that requires UTF-8 transcoding UNLESS that decision
is based on the URL scheme.
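
To make the failure mode concrete, here is a minimal sketch in Python.
It assumes a hypothetical server that names its resources in ISO-8859-1
and a browser that transcodes typed input to UTF-8 before escaping; the
resource name and the function server_lookup are invented for
illustration only.

```python
from urllib.parse import quote, unquote_to_bytes

# Hypothetical resource name containing one non-ASCII character.
name = "caf\u00e9"

# URL as the (assumed) ISO-8859-1 server would generate it.
server_url = quote(name.encode("iso-8859-1"))

# URL as a browser would send it if it assumed UTF-8 and transcoded.
browser_url = quote(name.encode("utf-8"))

def server_lookup(path: str, server_charset: str = "iso-8859-1") -> str:
    """Unescape %HH to raw octets, then read them in the server's charset."""
    return unquote_to_bytes(path).decode(server_charset)

# The server recognizes its own URL but not the transcoded one.
own_ok = server_lookup(server_url) == name
transcoded_ok = server_lookup(browser_url) == name
```

Under these assumptions own_ok is true and transcoded_ok is false: the
two URLs carry different octets for the same typed characters, so the
transcoded request names a resource the server does not have.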

Likewise, a server often acts as a gateway for some parts of its namespace,
as is the case for CGI scripts and API modules like mod_php, and other
parts of its namespace are derived from filesystem names.  On a server
like Apache, the filesystem-based URLs are generated by url-encoding all
non-urlc bytes without concern for the filesystem charset.  While it is
theoretically possible for the server to edit all served content such
that URLs are identified and transcoded to UTF-8, that would assume that
the server knows what charset is used to generate those URLs in the
first place.  It can't use a single configuration table for all transcoding,
since the URLs may be generated from sources with varying charsets.
The bottom line is that a server cannot enforce UTF-8 encoding unless
it knows that all of its URLs and gateways use a common charset, and if
that were the case we wouldn't need a UTF-8 solution.
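
A rough sketch of why the server cannot simply transcode, again in
Python.  The helper names opaque_url and utf8_url are hypothetical, and
the two candidate charsets are just examples; the point is only that the
same filesystem octets yield different UTF-8 URLs depending on a charset
the server has to guess.

```python
from urllib.parse import quote

def opaque_url(fs_name: bytes) -> str:
    """Escape the raw octets as-is, with no knowledge of any charset."""
    return quote(fs_name)

def utf8_url(fs_name: bytes, assumed_charset: str) -> str:
    """Transcode to UTF-8 first, which requires guessing the source charset."""
    return quote(fs_name.decode(assumed_charset).encode("utf-8"))

# The same three octets, as they might appear in a filesystem name.
raw = b"\xe9t\xe9"

as_octets = opaque_url(raw)               # charset-independent
as_latin1 = utf8_url(raw, "iso-8859-1")   # one possible guess
as_greek = utf8_url(raw, "iso-8859-7")    # another possible guess
```

The opaque form round-trips regardless of charset, while the two
transcoded forms disagree with each other, so a wrong guess produces a
URL for a name that was never on the filesystem.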

I listed out the solution space in the hope that people would see the
trade-offs.  We know that all-ASCII URLs *interoperate* well on the
Internet, but we also know that they can be ugly.  We know that existing
systems will accept non-ASCII URLs if the charset matches that used by
the URL generator/interpreter on the server.  We also know that most
existing, deployed servers are not restricted to generating UTF-8
encoded URLs.

In a perfect world, requiring UTF-8 would be a valid solution.  But this
is not a perfect world!  The purpose of an Internet standard is to define
the requirements for interoperability between implementations of the
applicable protocol.  A solution that requires UTF-8 will fail to interoperate
with systems that do not require UTF-8, and the latter is the case for
most URL-based systems on the Internet today.

 ...Roy T. Fielding
    Department of Information & Computer Science    (fielding@ics.uci.edu)
    University of California, Irvine, CA 92697-3425    fax:+1(714)824-1715
    http://www.ics.uci.edu/~fielding/