- From: Roy T. Fielding <fielding@kiwi.ICS.UCI.EDU>
- Date: Mon, 14 Apr 1997 19:32:32 -0700
- To: uri@bunyip.com
- Cc: Harald.T.Alvestrand@uninett.no
I am going to try this once more, and then end this discussion. I have already repeated these arguments several times, on several lists, over the past two years, and their substance has been repeatedly ignored by Martin and Francois. That is why I get pissed-off when Martin sends out statements of consensus which are not true, whether he realizes them to be false or not. The html-i18n RFC is a travesty because actual solutions to the problems were IGNORED in favor of promoting a single charset standard (Unicode). I personally would approve of systems using Unicode, but I will not standardize solutions which are fundamentally incompatible with existing practice. In my world, it is more important to have a deployable standard than a perfect standard.

PROBLEM 1: Users in network environments where non-ASCII characters are the norm would prefer to use language-specific characters in their URLs, rather than ASCII translations.

Proposal 1a: Do not allow such characters, since the URL is an address and not a user-friendly string. Obviously, this solution causes non-Latin character users to suffer more than people who normally use Latin characters, but is known to interoperate on all Internet systems.

Proposal 1b: Allow such characters, provided that they are encoded using a charset which is a superset of ASCII. Clients may display such URLs in the same charset of their retrieval context, in the data-entry charset of a user's dialog, as %xx encoded bytes, or in the specific charset defined for a particular URL scheme (if that is the case). Authors must be aware that their URL will not be widely accessible, and may not be safely transportable via 7-bit protocols, but that is a reasonable trade-off that only the author can decide.

Proposal 1c: Allow such characters, but only when encoded as UTF-8. Clients may only display such characters if they have a UTF-8 font or a translation table.
Servers are required to filter all generated URLs through a translation table, even when none of their URLs use non-Latin characters. Browsers are required to translate all FORM-based GET request data to UTF-8, even when the browser is incapable of using UTF-8 for data entry. Authors must be aware that their URL will not be widely accessible, and may not be safely transportable via 7-bit protocols, but that is a reasonable trade-off that only the author can decide. Implementers must also be aware that no current browsers and servers work in this manner (for obvious reasons of efficiency), and thus recipients of a message would need to maintain two possible translations for every non-ASCII URL accessed. In addition, all GET-based CGI scripts would need to be rewritten to perform charset translation on data entry, since the server is incapable of knowing what charset (if any) is expected by the CGI. Likewise for all other forms of server-side API.

Proposal 1a is represented in the current RFCs and the draft, since it was the only one that had broad agreement among the implementations of URLs. I proposed Proposal 1b as a means to satisfy Martin's original requests without breaking all existing systems, but he rejected it in favor of Proposal 1c. I still claim that Proposal 1c cannot be deployed and will not be implemented, for the reasons given above.

The only advantage of Proposal 1c is that it represents the Unicode-uber-alles method of software standardization. Proposal 1b achieves the same result, but without requiring changes to systems that have never used Unicode in the past. If Unicode becomes the accepted charset on all systems, then Unicode will be the most likely choice of all systems, for the same reason that systems currently use whatever charset is present on their own system.

Martin is mistaken when he claims that Proposal 1c can be implemented on the server-side by a few simple changes, or even a module in Apache.
It would require a Unicode translation table for all cases where the server generates URLs, including those parts of the server which are not controlled by the Apache Group (CGIs, optional modules, etc.). We cannot simply translate URLs upon receipt, since the server has no way of knowing whether the characters correspond to "language" or raw bits. The server would be required to interpret all URL characters as characters, rather than the current situation in which the server's namespace is distributed amongst its interpreting components, each of which may have its own charset (or no charset). Even if we were to make such a change, it would be a disaster since we would have to find a way to distinguish between clients that send UTF-8 encoded URLs and all of those currently in existence that send the same charset as is used by the HTML (or other media type) page in which the FORM was obtained and entered by the user.

The compromise that Martin proposed was not to require UTF-8, but merely recommend it on such systems. But that doesn't solve the problem, so why bother?

PROBLEM 2: When a browser uses the HTTP GET method to submit HTML form data, it url-encodes the data within the query part of the requested URL as ASCII and/or %xx encoded bytes. However, it does not include any indication of the charset in which the data was entered by the user, leading to the potential ambiguity as to what "characters" are represented by the bytes in any text-entry fields of that FORM.

Proposal 2a: Let the form interpreter decide what the charset is, based on what it knows about its users. Obviously, this leads to problems when non-Latin charset users encounter a form script developed by an internationally-challenged programmer.

Proposal 2b: Assume that the form includes fields for selecting the data entry charset, which is passed to the interpreter, thus removing any possible ambiguity. The only problem is that users don't want to manually select charsets.

Proposal 2c: Require that the browser submit the form data in the same charset as that used by the HTML form. Since the form includes the interpreter resource's URL, this removes all ambiguity without changing current practice. In fact, this should already be current practice. Forms cannot allow data entry in multiple charsets, but that isn't needed if the form uses a reasonably complete charset like UTF-8.

Proposal 2d: Require that the browser include a <name>-charset entry along with any field that uses a charset other than the one used by the HTML form. This is a mix of 2b and 2c, but isn't necessary given the comprehensive solution of 2c unless there is some need for multi-charset forms.

Proposal 2e: Require all form data to be UTF-8. That removes the ambiguity for new systems, but does nothing for existing systems since there are no browsers that do this. Of course they don't, because the CGI and module scripts that interpret existing form data DO NOT USE UTF-8, and therefore would break if browsers were required to use UTF-8 in URLs.

The real question here is whether or not the problem is real. Proposal 2c solves the ambiguity in a way which satisfies both current and any future practice, and is in fact the way browsers are supposed to be designed. Yes, early browsers were broken in this regard, but we can't fix early browsers by standardizing a proposal that breaks ALL browsers. The only reasonable solution is provided by Proposal 2c and matches what non-broken browsers do in current practice. Furthermore, IT HAS NOTHING TO DO WITH THE GENERIC URL SYNTAX, which is what we are supposed to be discussing.

The above are the only two problems that I've heard expressed by Martin and Francois -- if you have others to mention, do so now. Neither of the above problems is solved by requiring the use of UTF-8, which is what Larry was saying over-and-over-and-over again, apparently to deaf ears.
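
To make the ambiguity in PROBLEM 2 concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the helper names encode_field and decode_field are hypothetical, and the sample word and the two charsets (ISO-8859-1 and UTF-8) were picked for the example. It shows that the %xx bytes in a query carry no charset label, so an interpreter recovers the intended characters only when it decodes with the same charset the browser used for the form.

```python
from urllib.parse import quote, unquote_to_bytes


def encode_field(text: str, charset: str) -> str:
    """Hypothetical helper: percent-encode a form field's text in a given charset."""
    return quote(text.encode(charset), safe="")


def decode_field(encoded: str, charset: str) -> str:
    """Hypothetical helper: turn %xx bytes back into text, assuming a charset."""
    return unquote_to_bytes(encoded).decode(charset, errors="replace")


word = "café"

# The same user input yields different bytes depending on the form's charset.
as_latin1 = encode_field(word, "iso-8859-1")  # 'caf%E9'
as_utf8 = encode_field(word, "utf-8")         # 'caf%C3%A9'

# Nothing in the encoded string says which charset produced it, so an
# interpreter that guesses wrong silently gets different characters.
wrong_guess = decode_field(as_utf8, "iso-8859-1")   # 'cafÃ©'
right_guess = decode_field(as_latin1, "iso-8859-1")  # 'café'

print(as_latin1, as_utf8, wrong_guess, right_guess)
```

Decoding with the charset of the page that carried the form, as in the last call, corresponds to the rule that the browser submits data in the form's own charset; the wrong guess above is what an interpreter sees when that rule is not followed.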
I think both of us are tired of the accusations of being ASCII-bigots simply because we don't agree with your non-solutions. Either agree to a solution that works in practice, and thus is supported by actual implementations, or we will stick with the status quo, which at least prevents us from breaking things that are not already broken.

BTW, I am not the editor -- I am one of the authors. Larry is the only editor of the document right now.

...Roy T. Fielding
   Department of Information & Computer Science    (fielding@ics.uci.edu)
   University of California, Irvine, CA 92697-3425    fax:+1(714)824-1715
   http://www.ics.uci.edu/~fielding/
Received on Monday, 14 April 1997 22:48:21 UTC