Re: revised "generic syntax" internet draft from Roy T. Fielding on 1997-04-16 (uri@w3.org from April 1997)

From: Roy T. Fielding <fielding@kiwi.ICS.UCI.EDU>
Date: Tue, 15 Apr 1997 19:19:03 -0700
To: "Martin J. Duerst" <mduerst@ifi.unizh.ch>
Cc: uri@bunyip.com
Message-Id: <9704151919.aa12134@paris.ics.uci.edu>
>> I personally would
>> approve of systems using Unicode, but I will not standardize solutions
>> which are fundamentally incompatible with existing practice.
>
>What "fundamental incompatibility"? Is a recommendation suggesting
>the use of a particularly well suited character encoding a
>"fundamental incompatibility" when at present we don't know
>the character encoding anyway?

Yes, because at present we don't tell the client to transcode the URL.
Any transcoding is guaranteed to fail on some systems, because the
URL namespace has always been private to the generator (the server in
"http" or "ftp" or "gopher" URLs, the filesystem in "file" URLs, etc.).

>> Proposal 1b: Allow such characters, provided that they are encoded using
>>              a charset which is a superset of ASCII.  Clients may display
>>              such URLs in the same charset of their retrieval context,
>>              in the data-entry charset of a user's dialog, as %xx encoded
>>              bytes, or in the specific charset defined for a particular
>>              URL scheme (if that is the case).  Authors must be aware that
>>              their URL will not be widely accessible, and may not be safely
>>              transportable via 7-bit protocols, but that is a reasonable
>>              trade-off that only the author can decide.
>
>The problem here is that it's not display or 7-bit channels or whatever
>that makes this proposal fail. It is that URLs are transferred from
>paper to the computer and back, and that may happen many times.
>In a recent message to Francois, you seem to have completely ignored
>this fact, you spoke about an URL only being an URL after it is
>input into the browser. Yet the draft says:
>
>                                 A URL may be represented in a variety
>   of ways: e.g., ink on paper, pixels on a screen, or a sequence of
>   octets in a coded character set.

I didn't ignore that fact -- I wrote it.  You wanted a method to localize
URLs in spite of the interoperability problem.  Well, this is just one
aspect of the interoperability problem.  What is more likely: the client
knows how to transcode from the data-entry dialog charset to UTF-8,
or the user is using the same charset as the server?  On my system, the
latter is more likely.  I suspect that this will remain an interoperability
problem for some time, regardless of what the URL standard says.

Proposal 1b allows cooperating systems to have localized URLs that work
(at least locally) on systems deployed today.

>> Proposal 1c: Allow such characters, but only when encoded as UTF-8.
>>              Clients may only display such characters if they have a
>>              UTF-8 font or a translation table.
>
>There are no UTF-8 fonts. And the new browsers actually have such
>translation tables already, and know how to deal with the fonts
>they have on their system. And those that don't, they won't be worse
>off than up to now.

Unless they are currently using iso-8859-1 characters in URLs, on pages
encoded using iso-8859-1, which are also displayed correctly by far more
browsers than just the ones you refer to.  Likewise for EUC URLs on
EUC-encoded pages, and iso-2022-kr URLs on iso-2022-kr-encoded pages.
The fact is, these browsers treat the URL as part of the HTML data
stream and, for the most part, display it according to that charset
and not any universal charset.
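
To illustrate the point above, here is a minimal Python sketch of how the
same visible name yields different %xx octets depending on the charset of
the page it was authored in. The helper name percent_encode and the sample
name "café" are assumptions made for illustration; they do not come from
any browser or server.

```python
from urllib.parse import quote


def percent_encode(text: str, charset: str) -> str:
    """Encode text in the given charset, then %xx-escape the non-ASCII octets."""
    return quote(text.encode(charset), safe="/")


name = "caf\u00e9"  # "café", a hypothetical resource name

# As a browser would send it from an iso-8859-1 page.
latin1_url = percent_encode(name, "iso-8859-1")

# As a UTF-8-only rule would require it to be sent.
utf8_url = percent_encode(name, "utf-8")

print(latin1_url)  # caf%E9
print(utf8_url)    # caf%C3%A9
```

A server that stored the name as the single octet 0xE9 matches the first
form and not the second, unless something in between transcodes.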

>>	Servers are required to
>>              filter all generated URLs through a translation table, even
>>              when none of their URLs use non-Latin characters.
>
>Servers don't really generate URLs. They accept URLs in requests
>and try to match them with the resources they have. The URLs get
>created, implicitly, by the users who name resources and enter data.

The Apache source code is readily available and includes, as distributed,
five different mechanisms that generate URLs: directory listings,
configuration files (<Location>), request rewrite modules
(redirect/alias/rewrite), request handling modules (imap), and CGI scripts.
The first two are part of the server core and related to the filesystem,
and thus could be mapped to a specific charset and thereby to a translation
table per filesystem, with some overhead.  The next two (modules) can be
plugged-in or out based on user preference and there is no means for the
server to discover whether or not they are generating UTF-8 encoded URLs,
so we would have to assume that all modules would be upgraded as well.
CGI scripts suffer the same problem, but exacerbated by the fact that CGI
authors don't read protocol specs.

I am not talking theoretically here -- the above describes
approximately 42% of the installed base of publicly accessible
Internet HTTP servers.  It would be nice to have a standard that
doesn't make them non-compliant.

>Anyway, it is extremely easy and simple for a server to test
>whether an URL contains only ASCII, and in that case not do any
>kind of transcoding. And this test will be extremely efficient
>and cheap.

It is extremely easy to not solve the problem, yes.

>>	Browsers
>>              are required to translate all FORM-based GET request data
>>              to UTF-8, even when the browser is incapable of using UTF-8
>>              for data entry.
>
>Let's come back to forms later. They are a special case.

They are not a special case if UTF-8 encoding is required, which is
the nature of this proposal (the one that is supposed to fix the problem).

>>	Authors must be aware that their
>>              URL will not be widely accessible, and may not be safely
>>              transportable via 7-bit protocols, but that is a reasonable
>>              trade-off that only the author can decide.
>
>What 7-bit protocols? The internet is 8-bit throughout. Mail is
>7-bit, and that might be your concern.

Yes, E-mail is the concern.  It was (and is) one of the requirements
for transmitting URLs, which is one reason we utilize the %xx encoding
instead of just binary.
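
As a small sketch of why %xx encoding satisfies the 7-bit requirement,
the Python below escapes a path containing 8-bit octets and checks that
every character of the result fits in 7 bits, and that the original
octets are recovered unchanged. The sample path is an assumption chosen
for illustration.

```python
from urllib.parse import quote, unquote_to_bytes

# "/~résumé" as iso-8859-1 octets; two of them (0xE9) are above 0x7F.
raw = bytes([0x2F, 0x7E, 0x72, 0xE9, 0x73, 0x75, 0x6D, 0xE9])

encoded = quote(raw, safe="/~")

# Every character of the escaped form is 7-bit clean.
seven_bit_clean = all(ord(ch) < 0x80 for ch in encoded)

# Unescaping gives back exactly the original octets.
decoded = unquote_to_bytes(encoded)

print(encoded)          # /~r%E9sum%E9
print(seven_bit_clean)  # True
print(decoded == raw)   # True
```

Note that the escaping preserves octets, not characters: nothing in the
escaped form says which charset those octets were written in.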

>>	Implementers
>>              must also be aware that no current browsers and servers
>>              work in this manner (for obvious reasons of efficiency),
>>              and thus recipients of a message would need to maintain two
>>              possible translations for every non-ASCII URL accessed.
>
>With exception of very dense namespaces such as with FORMs, it is
>much easier to do transcoding on the server. This keeps upgrading
>in one spot (i.e. a server can decide to switch on transcoding and
>other things if its authors are giving out beyond-ASCII URLs).

As a server implementer, I say that claim is bogus.  It isn't even
possible to do that in Apache.  Show me the code before trying to
standardize it.

>>              In addition, all GET-based CGI scripts would need to be
>>              rewritten to perform charset translation on data entry, since
>>              the server is incapable of knowing what charset (if any)
>>              is expected by the CGI.  Likewise for all other forms of
>>              server-side API.
>
>Again, this is forms. See later.
>
>
>> Proposal 1a is represented in the current RFCs and the draft,
>> since it was the only one that had broad agreement among the
>> implementations of URLs.  I proposed Proposal 1b as a means
>> to satisfy Martin's original requests without breaking all
>> existing systems, but he rejected it in favor of Proposal 1c.
>
>I showed above why Proposal 1b doesn't work. The computer->
>paper->computer roundtrip that is so crucial for URLs is
>completely broken.

Not completely.  It only breaks down when you transmit localized
URLs outside the local environment.  That is the price you pay.

>My proposal is not identical to Proposal 1c. It leaves
>everybody the freedom to create URLs with arbitrary
>octets. It's a recommendation.

Then it doesn't solve the problem.

>> Proposal 1b achieves the same result, but without requiring
>> changes to systems that have never used Unicode in the past.
>> If Unicode becomes the accepted charset on all systems, then
>> Unicode will be the most likely choice of all systems, for the
>> same reason that systems currently use whatever charset is present
>> on their own system.
>
>There is a strong need for interoperability. We don't want to
>force anybody to cooperate, but there are quite some users
>that want to use their local encoding on their boxes, but still
>want to be able to exchange URLs the way English speakers do
>with basic ASCII. The only way to do this is to specify some
>character encoding for interoperability, and the only such
>available encoding is Unicode/ISO 10646/JIS 221/KS 5700/...
>It's not "just another character standard", it's THE international
>standard to which all other character standards are aligned.

If you want interoperability, you must use US-ASCII.  Getting
interoperability via Unicode won't be possible until all systems
support Unicode, which is not the case today.  Setting out goals
for eventual standardization is a completely different task than
actually writing what IS the standard, which is why Larry asked
that it be in a separate specification.  If you are convinced that
Unicode is the only acceptable solution for FUTURE URLs, then
write a proposed standard for FUTURE URLs.  To make things easier,
I have proposed changing the generic syntax to allow more octets
in the urlc set, and thus not even requiring the %xx encoding.
In that way, if UTF-8 is someday accepted as the only standard,
then it won't conflict with the existing URL standard.

>> Martin is mistaken when he claims that Proposal 1c can be implemented
>> on the server-side by a few simple changes, or even a module in Apache.
>> It would require a Unicode translation table for all cases where the
>> server generates URLs, including those parts of the server which are
>> not controlled by the Apache Group (CGIs, optional modules, etc.).
>
>There are ways that a CGI script can tell the server which encoding
>it wants the same way that there are ways to identify the filenames
>in a particular directory to be in a certain encoding. Anyway,
>for FORM/query part, please see below.

I know of no such "ways".  Perhaps you are talking about inventing one.
Keep in mind that it has to be known prior to the server exec'ing the
script, and the server has to know what charset was intended by the
client (i.e., the charset of the form to which the client is sending
in the action).  In other words, it is impossible.

>> We cannot simply translate URLs upon receipt, since the server has no
>> way of knowing whether the characters correspond to "language" or
>> raw bits.  The server would be required to interpret all URL characters
>> as characters, rather than the current situation in which the server's
>> namespace is distributed amongst its interpreting components, each of which
>> may have its own charset (or no charset).
>
>There is indeed the possibility that there is some raw data in an
>URL. But I have to admit that I never yet came across one. The data:
>URL by Larry actually translates raw data to BASE64 for efficiency
>and readability reasons.
>And if you study the Japanese example above, you will also very well
>see that assuming that some "raw bits" get conserved is a silly idea.
>Both HTML and paper, as the main carriers of URLs, don't conserve
>bit identity; they conserve character identity. That's why the
>draft says:
>
>                                     The interpretation of a URL
>   depends only on the characters used and not how those characters
>   are represented on the wire.
>
>This doesn't just magically stop at 0x7F!

Transcoding a URL changes the bits.  If those bits were not characters,
how do you transcode them?  Moreover, how does the transcoder differentiate
between %xx as data and %xx as character needing to be transcoded to UTF-8?
It can't, so requiring UTF-8 breaks in practice.
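
A minimal Python sketch of that failure, under stated assumptions: the
function naive_transcode_to_utf8, the /lookup path, and the three-octet
key are all hypothetical. The transcoder assumes every %xx octet is an
iso-8859-1 character, which is exactly the assumption it has no way to
verify.

```python
from urllib.parse import quote, unquote_to_bytes


def naive_transcode_to_utf8(url: str, assumed_charset: str) -> str:
    """Treat every octet of the URL as text in assumed_charset and re-escape as UTF-8."""
    octets = unquote_to_bytes(url)
    text = octets.decode(assumed_charset)
    return quote(text.encode("utf-8"), safe="/?=&")


# Raw data, not characters: for example a binary key placed in a query.
binary_key = bytes([0xE9, 0x00, 0xFF])
url = "/lookup?key=" + quote(binary_key, safe="")

transcoded = naive_transcode_to_utf8(url, "iso-8859-1")

# What the receiving script now sees after unescaping the key.
recovered = unquote_to_bytes(transcoded.split("key=", 1)[1])

print(url)                      # /lookup?key=%E9%00%FF
print(transcoded)               # /lookup?key=%C3%A9%00%C3%BF
print(recovered == binary_key)  # False
```

The octet 0xE9 looks the same whether it was meant as a letter or as
data, so the transcoder rewrites it either way and the key is corrupted.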

>> Even if we were to make such
>> a change, it would be a disaster since we would have to find a way to
>> distinguish between clients that send UTF-8 encoded URLs and all of those
>> currently in existence that send the same charset as is used by the HTML
>> (or other media type) page in which the FORM was obtained and entered
>> by the user.
>
>I have shown how this can work easily for sparse namespaces. The solution
>is to test both raw and after conversion from UTF-8 to the legacy encoding.
>This won't need many more accesses to the file system, because if a
>string looks like correct UTF-8, it's extremely rare that it is something
>else, and if it doesn't look like correct UTF-8, there is no need to
>transcode.

You keep waving around this "easily" remark without understanding the
internals of a server.  If I thought it was easy to do those things,
they would have been implemented two years ago.
What do you do if you have two legacy resources, one of which has the
same octet encoding as the UTF-8 transcoding of the other?  How do you
justify the increased time to access a resource due to the failed
(usually filesystem) access on every request?  Why should a busy server
do these things when they are not necessary to support existing web
services?
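
The first question can be made concrete with a short Python sketch. It
is not Apache code: the dictionary legacy_files stands in for a
filesystem, and the function candidates is a hypothetical rendering of
the "test both raw and after conversion" lookup described in the quoted
text.

```python
from urllib.parse import unquote_to_bytes

# Two legacy resources whose names are stored as iso-8859-1 octets.
legacy_files = {
    b"\xe9.html": "named with the single letter e-acute",
    b"\xc3\xa9.html": "named with A-tilde followed by the copyright sign",
}


def candidates(request_path: str, legacy_charset: str = "iso-8859-1") -> list:
    """Return every stored name matched by the raw octets or by their UTF-8 reading."""
    raw = unquote_to_bytes(request_path)
    found = []
    if raw in legacy_files:
        found.append(raw)
    try:
        converted = raw.decode("utf-8").encode(legacy_charset)
    except (UnicodeDecodeError, UnicodeEncodeError):
        converted = None  # not valid UTF-8, or not representable: skip second lookup
    if converted is not None and converted != raw and converted in legacy_files:
        found.append(converted)
    return found


matches = candidates("%C3%A9.html")
print(len(matches))  # 2
```

Both resources match the one request, and the sketch has no basis for
choosing between them. Every non-ASCII request that is valid UTF-8 also
costs a second lookup whenever the converted name differs from the raw
one.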

>> The compromise that Martin proposed was not to require UTF-8, but
>> merely recommend it on such systems.  But that doesn't solve the problem,
>> so why bother?
>
>The idea of recommending UTF-8 has various reasons:
>
>- Don't want to force anybody to use Unicode/ISO 10646.
>- Don't want to make old URLs illegal, just have a smooth
>	transition strategy.
>- Fits better into a Draft Standard.

That's all fine and good, but it doesn't solve the problem.  If we don't
need to solve the problem, then the draft should progress as it stands.
After all, there are at least a hundred other problems that the draft
*does* solve, and you are holding it up.

>> PROBLEM 2:  When a browser uses the HTTP GET method to submit HTML
>>             form data, it url-encodes the data within the query part
>>             of the requested URL as ASCII and/or %xx encoded bytes.
>>             However, it does not include any indication of the charset
>>             in which the data was entered by the user, leading to the
>>             potential ambiguity as to what "characters" are represented
>>             by the bytes in any text-entry fields of that FORM.
>
>This is indeed the FORMs/Query part problem. It's harder because
>there is more actual use of beyond-ASCII characters, and because
>the namespace is dense. But it is also easier because it is mainly
>a problem between server and browser, with a full roundtrip to paper
>and email only in rare cases.
>
>I like the list of proposals that Roy has made below, because we
>came up with a very similar list at the last Unicode conference
>in Mainz.
>
>> Proposal 1a:  Let the form interpreter decide what the charset is, based
>>               on what it knows about its users.  Obviously, this leads to
>>               problems when non-Latin charset users encounter a form script
>>               developed by an internationally-challenged programmer.
>> 
>> Proposal 1b:  Assume that the form includes fields for selecting the data
>>               entry charset, which is passed to the interpreter, and thus
>>               removing any possible ambiguity.  The only problem is that
>>               users don't want to manually select charsets.
>> 
>> Proposal 1c:  Require that the browser submit the form data in the same
>>               charset as that used by the HTML form.  Since the form
>>               includes the interpreter resource's URL, this removes all
>>               ambiguity without changing current practice.  In fact,
>>               this should already be current practice.  Forms cannot allow
>>               data entry in multiple charsets, but that isn't needed if the
>>               form uses a reasonably complete charset like UTF-8.
>
>This is mostly current practice, and it is definitely a practice
>that should be pushed. At the moment, it should work rather well,
>but the problems appear with transcoding servers and proxies.
>For transcoding servers (there are a few out there already),
>the transcoding logic or whatever has to add some field (usually
>a hidden field in the FORM) that indicates which encoding was
>sent out. This requires a close interaction of the transcoding
>part and the CGI logic, and may not fit well into a clean
>server architecture.

Isn't that what I've been saying?  Requiring UTF-8 would require all
servers to be transcoding servers, which is a bad idea.  I am certainly
not going to implement one, which should at least be a cause for concern
amongst those proposing to make it the required solution.

>For a transcoding proxy (none out there
>yet to my knowledge, but perfectly possible with HTTP 1.1),
>the problem gets even worse.

Whoa!  A transcoding proxy is non-compliant with HTTP/1.1.
See the last part of section 5.1.2 in RFC 2068.

>> Either agree to
>> a solution that works in practice, and thus is supported by actual
>> implementations, or we will stick with the status quo, which at least
>> prevents us from breaking things that are not already broken.
>
>Well, I repeat my offer. If you help me along getting into Apache,
>or tell me whom to contact, I would like to implement what I have
>described above. The deadline for submitting abstracts to the
>Unicode conference in San Jose in September is at the end of this
>week. I wouldn't mind submitting an abstract there with a title
>such as "UTF-8 URLs and their implementation in Apache".

Sorry, I'd like to finish my Ph.D. sometime this century.
How can I help you do something that is already known to be impossible?

 ...Roy T. Fielding
    Department of Information & Computer Science    (fielding@ics.uci.edu)
    University of California, Irvine, CA 92697-3425    fax:+1(714)824-1715
    http://www.ics.uci.edu/~fielding/
Received on Tuesday, 15 April 1997 22:20:39 UTC