From: Martin J. Duerst <mduerst@ifi.unizh.ch>
Date: Tue, 15 Apr 1997 16:31:51 +0200 (MET DST)
To: "Roy T. Fielding" <fielding@kiwi.ICS.UCI.EDU>
Cc: uri@bunyip.com, Harald.T.Alvestrand@uninett.no
On Mon, 14 Apr 1997, Roy T. Fielding wrote:

> I am going to try this once more, and then end this discussion.
> I have already repeated these arguments several times, on several
> lists, over the past two years, and their substance has been repeatedly
> ignored by Martin and Francois.

Thanks for entering into serious discussion.

It is true that it was about two years ago that I first contacted the uri
group and asked about internationalization and URLs. I quickly saw at that
time that there were rather fixed opinions about what an URL had to be
(something like a telephone number) and that typability on ASCII keyboards
seemed more important than anything else. Also, I didn't have any idea of
what the solution would have to look like. But I was left with the
impression that denying the benefits of natural-language URLs to people
outside the basic Latin world was neither fair nor technically necessary.

I also found that there were many other people interested in a solution,
for various reasons, and often with much more direct needs. I had many
occasions to discuss with them, and with others who raised doubts or
questions. I repeatedly presented the state of the discussion, the
alternatives available, and the issues involved. I discovered that some of
the great concerns people raised were not really that important, and that
this could be explained rather easily. I and many others did a lot of
homework, and a lot of work in other groups.

Also, within the past two years, technology has changed a lot. Java was
barely visible two years ago. Unicode was one solution among many, with
almost no applications available. After these two years, I had finally
come to the conclusion that I had a working solution, an upgrade path, a
lot of good arguments, and a lot of other people who also cared. It was
only then that I started to push strongly for what I and others have come
to understand is the right way to go.

> That is why I get pissed-off when
> Martin sends out statements of consensus which are not true, whether he
> realizes them to be false or not.

I get pissed off when we need about two months of piling argument on
argument to finally get a clear response from you, and when you seem to
ignore all the changes that have been going on in the past two years.

> The html-i18n RFC is a travesty
> because actual solutions to the problems were IGNORED in favor of
> promoting a single charset standard (Unicode).

What "actual solutions"? If you think you could have done that work
better, why didn't you do it? Why is everybody in the non-English
community happy with RFC 2070, and why have its solutions been adopted by
the IAB (as Larry has told me), the W3C, and ISO (and of course the
browser makers)? As for the most crucial part of RFC 2070, namely the
definition of ISO 10646 as the document character set, which was already
announced in RFC 1866: why does a developer from a major software company
say in a public workshop that his company originally did things
differently, but that now they know better and are doing it as we
proposed?

> I personally would
> approve of systems using Unicode, but I will not standardize solutions
> which are fundamentally incompatible with existing practice.

What "fundamental incompatibility"? Is a recommendation suggesting the use
of a particularly well suited character encoding a "fundamental
incompatibility" when at present we don't know the character encoding
anyway?

I very much like the division into PROBLEM 1 and PROBLEM 2 below.
PROBLEM 1 is URLs in general, such as domain names, paths, and resource
names. PROBLEM 2 is FORMs. They are indeed different: for PROBLEM 1 we
have very sparse namespaces and not very much beyond ASCII yet, whereas
for PROBLEM 2 we have very dense namespaces and already a lot of use (and
chaos). Of course there are interactions, because an URL with a # or ?
part can also be used as a primary entry point.

> PROBLEM 1: Users in network environments where non-ASCII characters
> are the norm would prefer to use language-specific characters
> in their URLs, rather than ASCII translations.
>
> Proposal 1a: Do not allow such characters, since the URL is an address
> and not a user-friendly string. Obviously, this solution
> causes non-Latin character users to suffer more than people
> who normally use Latin characters, but is known to interoperate
> on all Internet systems.

The URL may have been designed as a non-user-friendly address, but to say
that it IS not a user-friendly string ignores actual practice. I have just
had a look at your web page, and you use meaningful URLs, like everybody
else.

> Proposal 1b: Allow such characters, provided that they are encoded using
> a charset which is a superset of ASCII. Clients may display
> such URLs in the same charset of their retrieval context,
> in the data-entry charset of a user's dialog, as %xx encoded
> bytes, or in the specific charset defined for a particular
> URL scheme (if that is the case). Authors must be aware that
> their URL will not be widely accessible, and may not be safely
> transportable via 7-bit protocols, but that is a reasonable
> trade-off that only the author can decide.

The problem here is that it is not display or 7-bit channels or whatever
that makes this proposal fail. It is that URLs are transferred from paper
to the computer and back, and that may happen many times. In a recent
message to Francois, you seem to have completely ignored this fact; you
spoke about an URL only being an URL after it is input into the browser.
Yet the draft says:

   A URL may be represented in a variety of ways: e.g., ink on paper,
   pixels on a screen, or a sequence of octets in a coded character set.

Let's take an example. Assume somebody constructs an URL using KOI-8, one
of the more popular encodings for Cyrillic. She writes that URL down (in
Cyrillic) on paper and passes it to a friend. The friend types it in, but
has no idea (and wouldn't want to care) what encoding it was. Maybe he
happens to be on a machine that uses iso-8859-5. He won't be able to find
the URL. Obviously, things only work for those URLs for which we have a
defined mapping.

We can define that mapping in several ways. Currently it is undefined.
Possible solutions include a global definition (or at least a
recommendation), a solution per protocol or per server, or some kind of
tagging (as in RFC 1522/2047). Obviously, all but the first solution are
very clumsy. Would you like to have an URL such as
http://[us-ascii]www.ics.uci.edu/~fielding or so? Probably not. Also,
making the encoding of URLs depend on protocols and schemes would make
generic URL software very difficult to write and not extensible. And
having to contact a server every time a transition is made from the binary
form (%HH,...) to a visible form with the actual characters (or back)
would be a true waste of connection bandwidth.
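To make the broken roundtrip concrete, here is a small Python sketch
(purely illustrative; the path component is made up, and KOI8-R stands in
for "KOI-8"). The very same Cyrillic characters turn into three different
%HH sequences depending on which encoding the transcribing machine happens
to use; only a single agreed-upon mapping makes the paper-to-computer
roundtrip deterministic:

    from urllib.parse import quote

    # The same Cyrillic word, as characters, used as a path component.
    path = "книга"

    # Percent-encode those characters under three different encodings.
    for enc in ("koi8-r", "iso-8859-5", "utf-8"):
        print(enc, quote(path, safe="", encoding=enc))

    # koi8-r      %CB%CE%C9%C7%C1
    # iso-8859-5  %DA%DD%D8%D3%D0
    # utf-8       %D0%BA%D0%BD%D0%B8%D0%B3%D0%B0

The friend typing the characters back in recovers the original octets only
if both sides agree on one of these mappings.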
> Proposal 1c: Allow such characters, but only when encoded as UTF-8.
> Clients may only display such characters if they have a
> UTF-8 font or a translation table.

There are no UTF-8 fonts. The new browsers actually have such translation
tables already, and know how to deal with the fonts they have on their
system. And those that don't won't be any worse off than they are now.

> Servers are required to
> filter all generated URLs through a translation table, even
> when none of their URLs use non-Latin characters.

Servers don't really generate URLs. They accept URLs in requests and try
to match them with the resources they have. The URLs get created,
implicitly, by the users who name resources and enter data. Anyway, it is
extremely easy for a server to test whether an URL contains only ASCII and
in that case not do any kind of transcoding. And this test is extremely
efficient and cheap.

> Browsers
> are required to translate all FORM-based GET request data
> to UTF-8, even when the browser is incapable of using UTF-8
> for data entry.

Let's come back to forms later. They are a special case.

> Authors must be aware that their
> URL will not be widely accessible, and may not be safely
> transportable via 7-bit protocols, but that is a reasonable
> trade-off that only the author can decide.

What 7-bit protocols? The Internet is 8-bit throughout. Mail is 7-bit, and
that might be your concern. If you are worrying about something else,
please tell us.

Now let's have a look at mail. Assume a user finds a cute URL in a
Japanese web page, and this web page is written in EUC (used on Unix
boxes) and comes down to the browser in that encoding. Now let's assume
the user is on a Mac. When he copies the URL into the clipboard (or maybe
earlier), this URL is transcoded to Shift-JIS, because the Mac internally
uses Shift-JIS for Japanese. (The last time I checked this was quite some
time ago, probably with Netscape 2.) The user might then copy this
Japanese URL into a mail he writes to a Japanese friend. When the mail is
sent off, it is translated to JIS, because that's the way Japanese mail is
sent around. The user might not have set up his mail software correctly,
in which case this might not be done, but then he won't be able to send
Japanese mail at all. Now JIS, in MIME called iso-2022-jp, is 7-bit, and
that's why these characters will pass through a 7-bit channel nicely. [The
PC also uses Shift-JIS internally, and everything should be pretty much
the same, but I don't have a box here to test it.]
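A small Python sketch of that path (illustrative only; the host name and
the kanji are made up): the characters of the URL survive every
transcoding step, the octets do not, and the final JIS form is pure 7-bit:

    # Hypothetical Japanese URL, as characters.
    url = "http://www.example.jp/本"

    euc  = url.encode("euc-jp")                             # in the web page
    sjis = euc.decode("euc-jp").encode("shift_jis")         # Mac clipboard
    jis  = sjis.decode("shift_jis").encode("iso2022_jp")    # in the mail

    # Character identity is preserved at every step ...
    assert jis.decode("iso2022_jp") == url
    # ... octet identity is not ...
    assert euc != sjis and sjis != jis
    # ... and the iso-2022-jp form passes a 7-bit channel unharmed.
    assert all(b < 0x80 for b in jis)

So what has to be conserved across all these hops is the characters, not
any particular octets.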
> Implementers
> must also be aware that no current browsers and servers
> work in this manner (for obvious reasons of efficiency),
> and thus recipients of a message would need to maintain two
> possible translations for every non-ASCII URL accessed.

With the exception of very dense namespaces such as with FORMs, it is much
easier to do the transcoding on the server. This keeps the upgrade in one
spot (i.e. a server can decide to switch on transcoding and other things
if its authors are giving out beyond-ASCII URLs).

> In addition, all GET-based CGI scripts would need to be
> rewritten to perform charset translation on data entry, since
> the server is incapable of knowing what charset (if any)
> is expected by the CGI. Likewise for all other forms of
> server-side API.

Again, this is forms. See later.

> Proposal 1a is represented in the current RFCs and the draft,
> since it was the only one that had broad agreement among the
> implementations of URLs. I proposed Proposal 1b as a means
> to satisfy Martin's original requests without breaking all
> existing systems, but he rejected it in favor of Proposal 1c.

I showed above why Proposal 1b doesn't work: the
computer->paper->computer roundtrip that is so crucial for URLs is
completely broken. Also, my proposal is not identical to Proposal 1c. It
leaves everybody the freedom to create URLs with arbitrary octets; it is a
recommendation.

> I still claim that Proposal 1c cannot be deployed and will
> not be implemented, for the reasons given above. The only
> advantage of Proposal 1c is that it represents the
> Unicode-uber-alles method of software standardization.

For document content, there is no problem adding a header, as in email or
HTTP. And you can even let the user guess the encoding. For URLs, as
explained above, none of that works.

> Proposal 1b achieves the same result, but without requiring
> changes to systems that have never used Unicode in the past.
> If Unicode becomes the accepted charset on all systems, then
> Unicode will be the most likely choice of all systems, for the
> same reason that systems currently use whatever charset is present
> on their own system.

There is a strong need for interoperability. We don't want to force
anybody to cooperate, but there are quite a few users who want to use
their local encoding on their boxes and still be able to exchange URLs the
way English speakers do with basic ASCII. The only way to do this is to
specify some character encoding for interoperability, and the only such
encoding available is Unicode/ISO 10646/JIS 221/KS 5700/... It is not
"just another character standard"; it is THE international standard to
which all other character standards are aligned.

> Martin is mistaken when he claims that Proposal 1c can be implemented
> on the server-side by a few simple changes, or even a module in Apache.
> It would require a Unicode translation table for all cases where the
> server generates URLs, including those parts of the server which are
> not controlled by the Apache Group (CGIs, optional modules, etc.).

There are ways for a CGI script to tell the server which encoding it
wants, just as there are ways to identify the filenames in a particular
directory as being in a certain encoding. Anyway, for the FORM/query part,
please see below.

> We cannot simply translate URLs upon receipt, since the server has no
> way of knowing whether the characters correspond to "language" or
> raw bits. The server would be required to interpret all URL characters
> as characters, rather than the current situation in which the server's
> namespace is distributed amongst its interpreting components, each of which
> may have its own charset (or no charset).

There is indeed the possibility that there is some raw data in an URL, but
I have to admit that I have never yet come across one. The data: URL by
Larry actually translates raw data to BASE64, for efficiency and
readability reasons. And if you study the Japanese example above, you will
see very well that assuming that some "raw bits" get preserved is a silly
idea. Both HTML and paper, as the main carriers of URLs, don't preserve
bit identity; they preserve character identity. That's why the draft says:

   The interpretation of a URL depends only on the characters used and
   not how those characters are represented on the wire.

This doesn't just magically stop at 0x7F!
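What the recommendation amounts to is therefore a single fixed mapping
from characters to octets to %HH, and back, for the beyond-ASCII part as
well. A minimal Python sketch, assuming UTF-8 as that recommended mapping
(the path is made up):

    from urllib.parse import quote, unquote

    # One defined mapping: characters -> UTF-8 octets -> %HH, and back.
    path = "/räume/übersicht.html"
    wire = quote(path, safe="/", encoding="utf-8")
    print(wire)            # /r%C3%A4ume/%C3%BCbersicht.html

    # Anybody who knows the single recommended mapping recovers exactly
    # the same characters, no matter where the URL has travelled.
    assert unquote(wire, encoding="utf-8") == path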
> Even if we were to make such
> a change, it would be a disaster since we would have to find a way to
> distinguish between clients that send UTF-8 encoded URLs and all of those
> currently in existence that send the same charset as is used by the HTML
> (or other media type) page in which the FORM was obtained and entered
> by the user.

I have shown how this can work easily for sparse namespaces. The solution
is to test both the raw form and the form converted from UTF-8 to the
legacy encoding. This won't need many more accesses to the file system:
if a string looks like correct UTF-8, it is extremely rare that it is
actually something else, and if it doesn't look like correct UTF-8, there
is no need to transcode. For dense namespaces such as forms, see below.
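A rough sketch of that server-side strategy (Python, purely illustrative;
the lookup function and the legacy encoding are placeholders, not a
description of any existing server):

    def find_resource(raw: bytes, lookup, legacy="iso-8859-1"):
        # Pure ASCII: the cheap common case, nothing to transcode.
        if all(b < 0x80 for b in raw):
            return lookup(raw)
        # First try the octets exactly as they were sent.
        resource = lookup(raw)
        if resource is not None:
            return resource
        # Octets that form correct UTF-8 are very unlikely to be anything
        # else, so only then is a second lookup attempted, with the same
        # characters re-encoded in the server's legacy encoding.
        try:
            chars = raw.decode("utf-8")
        except UnicodeDecodeError:
            return None
        try:
            return lookup(chars.encode(legacy))
        except UnicodeEncodeError:
            return None

The extra lookup happens only for beyond-ASCII URLs whose raw form did not
match, which should be rare.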
> The compromise that Martin proposed was not to require UTF-8, but
> merely recommend it on such systems. But that doesn't solve the problem,
> so why bother?

There are various reasons for recommending UTF-8 rather than requiring it:

- We don't want to force anybody to use Unicode/ISO 10646.
- We don't want to make old URLs illegal; we just want a smooth
  transition strategy.
- It fits better into a Draft Standard.

> PROBLEM 2: When a browser uses the HTTP GET method to submit HTML
> form data, it url-encodes the data within the query part
> of the requested URL as ASCII and/or %xx encoded bytes.
> However, it does not include any indication of the charset
> in which the data was entered by the user, leading to the
> potential ambiguity as to what "characters" are represented
> by the bytes in any text-entry fields of that FORM.

This is indeed the FORMs/query part problem. It is harder because there is
more actual use of beyond-ASCII characters and because the namespace is
dense, but it is also easier because it is mainly a problem between server
and browser, with a full roundtrip to paper and email only in rare cases.
I like the list of proposals that Roy has made below, because we came up
with a very similar list at the last Unicode conference in Mainz.

> Proposal 1a: Let the form interpreter decide what the charset is, based
> on what it knows about its users. Obviously, this leads to
> problems when non-Latin charset users encounter a form script
> developed by an internationally-challenged programmer.
>
> Proposal 1b: Assume that the form includes fields for selecting the data
> entry charset, which is passed to the interpreter, and thus
> removing any possible ambiguity. The only problem is that
> users don't want to manually select charsets.
>
> Proposal 1c: Require that the browser submit the form data in the same
> charset as that used by the HTML form. Since the form
> includes the interpreter resource's URL, this removes all
> ambiguity without changing current practice. In fact,
> this should already be current practice. Forms cannot allow
> data entry in multiple charsets, but that isn't needed if the
> form uses a reasonably complete charset like UTF-8.

This is mostly current practice, and it is definitely a practice that
should be pushed. At the moment it works rather well, but problems appear
with transcoding servers and proxies. For a transcoding server (there are
a few out there already), the transcoding logic has to add some field
(usually a hidden field in the FORM) that indicates which encoding was
sent out. This requires close interaction between the transcoding part and
the CGI logic, and may not fit well into a clean server architecture. For
a transcoding proxy (none out there yet as far as I know, but perfectly
possible with HTTP 1.1), the problem gets even worse.

> Proposal 1d: Require that the browser include a <name>-charset entry
> along with any field that uses a charset other than the one
> used by the HTML form. This is a mix of 1b and 1c, but
> isn't necessary given the comprehensive solution of 1c
> unless there is some need for multi-charset forms.
>
> Proposal 1e: Require all form data to be UTF-8. That removes the
> ambiguity for new systems, but does nothing for existing
> systems since there are no browsers that do this. Of course
> they don't, because the CGI and module scripts that interpret
> existing form data DO NOT USE UTF-8, and therefore would
> break if browsers were required to use UTF-8 in URLs.
>
> The real question here is whether or not the problem is real. Proposal 1c
> solves the ambiguity in a way which satisfies both current and any future
> practice, and is in fact the way browsers are supposed to be designed.

Your analysis is close to perfect, but you forget transcoding proxies. As
it is terribly clumsy to have all these tables in all browsers and so on,
we may very soon see browsers that request UTF-8 only and rely on
transcoding proxies to deal with servers using legacy encodings. As long
as conversion happens downstream but URLs are treated as raw data
upstream, the server cannot know what it is getting back.

When we discussed this FORMs/query part in Mainz in March, we also seemed
to get stuck on this problem. But we found a way out. It works as follows:
we adopt Proposal 2c (Roy's Proposal 1c under PROBLEM 2) plus an upgrade
path to UTF-8. The upgrade consists of a token of information from the
FORM to the client, and a token of information from the client back.

For the information from the server to the client, we could recycle the
proposed ACCEPT-CHARSET *attribute* on INPUT and such from RFC 2070. The
idea of this attribute was to be able to indicate, for each input field,
the character encodings that would be acceptable to the server. As Roy
correctly says, this is unnecessary overkill. A study cited by Peter
Edberg of Apple in his Mainz Unicode conference paper did not show a
single use of this attribute, but the study is from November 1995, so it
may be dated. Anyway, recycling the ACCEPT-CHARSET attribute would mean
that it is used only as ACCEPT-CHARSET="UTF-8", and that it applies to all
relevant fields of a FORM when it appears. Because it is rather long and
may already be in use somewhere, an alternative is to define another
attribute or an HTTP header for the same purpose.

To send information back from the browser, an HTTP header would probably
also be the best solution. Ideally, it would be

   FORM-UTF-8: Yes

This has the advantage of being short and of having an easily
implementable opposite, namely

   FORM-UTF-8: No

so that after a few years of transition, we can phase out this header. It
would be sent back only if UTF-8 is requested, of course. The result would
be Proposal 2c, plus UTF-8 whenever both server and client can handle it.
On the server side, especially if we make the information from the server
to the client an HTTP header, it can be handled without bothering the CGI
script (other than that it has to tell us which encoding it wants the data
in).
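To make the proposed handshake concrete, here is a Python sketch of what
the server-side handling might look like (illustrative only; FORM-UTF-8 is
the header proposed above, not an existing standard, and the function and
parameter names are made up):

    from urllib.parse import parse_qsl

    def decode_form_query(query_string, headers, page_charset="iso-8859-1"):
        # If the browser answered the form's UTF-8 request affirmatively,
        # the %HH octets in the query part are UTF-8; otherwise fall back
        # to the charset the form page itself was sent in (Proposal 2c).
        if headers.get("FORM-UTF-8", "No").lower() == "yes":
            enc = "utf-8"
        else:
            enc = page_charset
        return parse_qsl(query_string, encoding=enc)

    # A browser that opted in sends UTF-8 octets for "grün":
    print(decode_form_query("color=gr%C3%BCn", {"FORM-UTF-8": "Yes"}))
    # [('color', 'grün')]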
> Yes, early browsers were broken in this regard, but we can't fix early
> browsers by standardizing a proposal that breaks ALL browsers.

Current practice is broken. We have to fix it. And we can do it without
breaking anything.

> The only
> reasonable solution is provided by Proposal 1c and matches what non-broken
> browsers do in current practice. Furthermore, IT HAS NOTHING TO DO WITH
> THE GENERIC URL SYNTAX, which is what we are supposed to be discussing.

The generic URL syntax assumes that URLs are handled as characters in the
ASCII range and as raw octets for the rest. This doesn't work, and it can
be fixed.

> The above are the only two problems that I've heard expressed by Martin
> and Francois -- if you have others to mention, do so now.

You have that right in principle. But what you propose as solutions
doesn't work the way you think it does.

> Neither of the
> above problems are solved by requiring the use of UTF-8, which is what
> Larry was saying over-and-over-and-over again, apparently to deaf ears.

The problems can be solved by recommending the use of UTF-8 and taking
care of the details. And for a truly WORLD Wide Web, the problems should
be solved. It is clear that Latin and English have a bit of a head start
when it comes to international communication, but there is no longer any
serious technical limit to letting people use their own language and
script.

> I think both of us are tired of the accusations of being ASCII-bigots
> simply because we don't agree with your non-solutions.

Neither you nor Larry said exactly where you saw the problems. Your
behaviour was very difficult for me to explain, and this led to certain
suspicions and accusations. I hope you will study what I wrote above in
detail and see where your ideas for solutions don't work, and where our
proposals may work better than you thought. I am looking forward to
discussing the details.

> Either agree to
> a solution that works in practice, and thus is supported by actual
> implementations, or we will stick with the status quo, which at least
> prevents us from breaking things that are not already broken.

Well, I repeat my offer. If you help me get started with Apache, or tell
me whom to contact, I would like to implement what I have described above.
The deadline for submitting abstracts to the Unicode conference in San
Jose in September is at the end of this week. I wouldn't mind submitting
an abstract there with a title such as "UTF-8 URLs and their
implementation in Apache".

Regards,    Martin.