- From: Martin J. Duerst <mduerst@ifi.unizh.ch>
- Date: Wed, 23 Apr 1997 18:06:07 +0200 (MET DST)
- To: "Roy T. Fielding" <fielding@kiwi.ICS.UCI.EDU>
- Cc: uri@bunyip.com
The message I am answering here dates back a while, but it raises some important issues and I don't want to leave it unanswered (and I am very interested in getting many of the points below answered). If necessary, please split your messages up by topic.

On Tue, 15 Apr 1997, Roy T. Fielding wrote:

> >What "fundamental incompatibility"? Is a recommendation suggesting
> >the use of a particularly well suited character encoding a
> >"fundamental incompatibility" when at present we don't know
> >the character encoding anyway?
>
> Yes, because at present we don't tell the client to transcode the URL.
> Any transcoding is guaranteed to fail on some systems, because the
> URL namespace has always been private to the generator (the server in
> "http" or "ftp" or "gopher" URLs, the filesystem in "file" URLs, etc.).

As long as current pages contain only URLs with %HH for "dangerous" octets, there is no transcoding (except for ASCII<->EBCDIC and the like, which we will ignore here). And this is currently the only legal use. After we have firmly established UTF-8 as a recommendation for URLs, we can then go on and allow URLs in native encodings. These will be transcoded wherever transcoding of the carrying document happens, and will finally be transcoded into UTF-8 (and converted to %HH if necessary) before being sent to the server. This covers all currently legal "moved around" URLs.

For the currently non-legal "moved around" URLs and for the URLs generated at the browser (FORMs), the solution works as follows:

For non-legal "moved around" URLs (note that, given Roy's attitude to standards, we wouldn't be required to take care of them, but if we can, why shouldn't we), after trying with transcoding to UTF-8 as described above, we try without transcoding. This covers the case where we received the document from its original source without any intermediate transcoding (which cannot be guaranteed, but should be fairly common at present). We pay a second network round-trip, but as this only happens to recover an illegal case, it's not too bad.

For URLs generated at the browser (FORMs), we have to exchange some information between server and browser (FORM-UTF8: Yes). This again covers two cases, namely the case that the server can handle UTF-8 and the case that the server and the browser use the same charset.

> What is more likely: the client
> knows how to transcode from the data-entry dialog charset to UTF-8,
> or the user is using the same charset as the server? On my system, the
> latter is more likely. I suspect that this will remain an interoperability
> problem for some time, regardless of what the URL standard says.

What kind of system are you using? And what kind of characters on that system? Anyway, let's have a look at some cases. For Western Europe, the Mac, DOS, Windows, and Unix boxes all use their own code pages. Unix is mostly Latin-1 now, but there are some legacy systems (you can have the old HP encoding on an HP box, ...). Windows CP 1252 is almost, but not quite, equal to 8859-1. For Eastern Europe and for Cyrillic, the situation is worse. For Japanese, you have EUC on Unix and SJIS on PC/Mac. I could go on and on. And this won't improve very quickly in the next few years. Deploying UTF-8 conversion capabilities where they are not yet available is much easier.

The situation is actually a little bit better because Latin-1 is very well established for the Web (thanks, TBL, for this one!).
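(As an aside, and purely as an illustration that is not part of any draft text: converting between Latin-1 and UTF-8, for example, is completely mechanical, which is part of why deploying such conversion capabilities is cheap. A rough sketch in C, with a made-up function name:)

    /*
     * Illustration only: convert a Latin-1 string (e.g. a URL taken from
     * a Latin-1 page) to UTF-8.  It is the characters that are preserved,
     * not the octets.
     */
    #include <stddef.h>

    static size_t latin1_to_utf8(const unsigned char *in,
                                 char *out, size_t outlen)
    {
        size_t o = 0;

        for (; *in != '\0'; in++) {
            if (*in < 0x80) {                       /* ASCII: one octet   */
                if (o + 1 >= outlen) break;
                out[o++] = (char)*in;
            } else {                                /* U+0080..U+00FF:    */
                if (o + 2 >= outlen) break;         /* two octets         */
                out[o++] = (char)(0xC0 | (*in >> 6));   /* 0xC2 or 0xC3   */
                out[o++] = (char)(0x80 | (*in & 0x3F)); /* 0x80..0xBF     */
            }
        }
        out[o] = '\0';                              /* assumes outlen > 0 */
        return o;
    }

Going the other way is just as mechanical, and conversion tables for the other charsets mentioned above exist as well; none of this is rocket science.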
For Western Europe, things therefore work despite different charsets on different machines, because somebody realized that characters, and not octets, have to be preserved. The only thing we have to do now is to realize that this was actually a very good idea, and to apply it to the whole world, while guaranteeing a smooth transition from the mess we have now.

> Proposal 1b allows cooperating systems to have localized URLs that work
> (at least locally) on systems deployed today.

The web never was local, and never will be. Something that only works locally (where "locally" means a given language/script *and* a given kind of computer) is a dead end for the web. What we need is technology that can be made to work wherever there are users who want to use it.

> >> Proposal 1c: Allow such characters, but only when encoded as UTF-8.
> >> Clients may only display such characters if they have a
> >> UTF-8 font or a translation table.
> >
> >There are no UTF-8 fonts. And the new browsers actually have such
> >translation tables already, and know how to deal with the fonts
> >they have on their system. And those that don't, they won't be worse
> >off than up to now.
>
> Unless they are currently using iso-8859-1 characters in URLs, on pages
> encoded using iso-8859-1, which are also displayed correctly by far more
> browsers than just the ones you refer to. Likewise for EUC URLs on
> EUC-encoded pages, and iso-2022-kr URLs on iso-2022-kr-encoded pages.
> The fact is, these browsers treat the URL as part of the HTML data
> stream and, for the most part, display it according to that charset
> and not any universal charset.

It is very clear to all of us that such URLs should be treated in the charset of the document *as long as they are part of the document*. The recent discussion with Keld just confirmed this; anything else would give big headaches everywhere. The question is what happens when the URLs are passed from the HTML document to the URL machinery in the browser. If they are interpreted "as is", i.e. seen as octets, that works as long as the HTML document came straight from the server and was set up carefully. However, if there is a transcoding proxy, or transcoding already happened on the server, or the URL took some other steps from the "point of generation" (the original filename or whatever) to the HTML page, for example by having been cut-and-pasted, transcribed on paper, or sent by email, then nothing is guaranteed. And that's why these URLs are currently illegal :-).

> >> Servers are required to
> >> filter all generated URLs through a translation table, even
> >> when none of their URLs use non-Latin characters.
> >
> >Servers don't really generate URLs. They accept URLs in requests
> >and try to match them with the resources they have. The URLs get
> >created, implicitly, by the users who name resources and enter data.
>
> The Apache source code is readily available and includes, as distributed,
> five different mechanisms that generate URLs: directory listings,
> configuration files (<Location>), request rewrite modules
> (redirect/alias/rewrite), request handling modules (imap), and CGI scripts.
> The first two are part of the server core and related to the filesystem,
> and thus could be mapped to a specific charset and thereby to a translation
> table per filesystem, with some overhead.
> The next two (modules) can be
> plugged-in or out based on user preference and there is no means for the
> server to discover whether or not they are generating UTF-8 encoded URLs,
> so we would have to assume that all modules would be upgraded as well.
> CGI scripts suffer the same problem, but exacerbated by the fact that CGI
> authors don't read protocol specs.

Many thanks for these details. One can indeed say that in these cases, the server is generating URLs. Another way to see it is that the server sends out URLs that already exist somewhere. But this is a theoretical discussion. For the implementation, I think there are three important aspects we have to consider. The first question is: what can we do between the original point where we get the URL (or whatever we call it) and the point where it is sent to the browser? This is nicely discussed above by Roy. The second question is: where do these URLs actually come from? The third question is: where do these URLs point to?

For the second question, we have several possibilities. In the case of directory listings, it's the file system itself that gives us the data; in the case of rewrite and handling modules, the data comes from files on the server. In both cases, we can assume that we know (per server, per directory, or per file) what encoding is used, so that we can make the necessary transformations. In the case of CGI scripts, we can make the assumption that each CGI script likewise has its "charset". Scripts that deal with several "charset"s (such as those in Japan that have to guess what they get sent in a query part) are not that much of a problem, because their authors know the issues, and will be happy if their job gets easier.

The question of where these URLs point to is important because, for example, if we get a URL in Latin-2 from a file, we could convert it to UTF-8 if we know that it points to our server, because we know that we accept UTF-8. On the other hand, if it points to another server, we should be careful with converting it, because we don't know whether that other server accepts UTF-8 or not. This distinction is not too difficult if we do a full parse of an HTML document, but that is probably too much work, and for some other kinds of documents it may not be that easy (and we don't want the server to have to know all document types).

One important point is of course how the server currently handles these things. For example, are non-ASCII URLs converted to %HH, are they checked for and complained about, or what is being done? In the case of redirects, the URL is escaped; in the case of mod_imap, I failed to find the place where this is done.

Anyway, what can we do? As Roy has said, for the internal stuff we can definitely do more than for CGIs. For CGIs, and this is the last line of defense for the rest as well, we can probably do two things:

- Either leave the CGI alone, in its own world, with its own "charset", and not do any transcoding. We may call this the raw mode.

- Or have the CGI "register" with some Apache settings to tell us what "charset" it is working in, so that we can do the appropriate translations. This registration might be done separately for incoming (query part) and outgoing (result) stuff (see the rough sketch below).

We also have to distinguish between old and new CGIs. Old CGIs should be left alone (whatever URLs they generate for our server will be taken care of by our backwards compatibility measures), new CGIs should be correctly registered and should only work with the new URLs.
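To make the registration idea a little more concrete, here is a very rough sketch of the incoming (query part) direction. Nothing like this exists in Apache today; the struct, the hypothetical directive it would be filled from, and the choice of iconv(3) as the converter are all just assumptions for illustration, and the query string is assumed to have its %HH escapes already undone.

    #include <iconv.h>
    #include <stdlib.h>
    #include <string.h>

    struct cgi_charset_info {
        const char *charset;   /* e.g. "EUC-JP", registered per script    */
        int raw;               /* non-zero = "raw mode": never touch data */
    };

    /*
     * Convert an incoming query string from UTF-8 to the charset the
     * script registered for.  On any failure, hand the octets through
     * unchanged -- which is exactly the old behaviour.
     */
    static char *query_for_script(const struct cgi_charset_info *ci,
                                  const char *query)
    {
        size_t inleft = strlen(query);
        size_t outleft = 4 * inleft + 1;            /* generous buffer    */
        char *in = (char *)query;
        char *out = malloc(outleft);
        char *outp = out;
        iconv_t cd;

        if (out == NULL)
            return NULL;
        if (ci->raw)
            return strcpy(out, query);              /* raw mode: pass on  */

        cd = iconv_open(ci->charset, "UTF-8");
        if (cd == (iconv_t)-1)
            return strcpy(out, query);              /* no converter       */

        if (iconv(cd, &in, &inleft, &outp, &outleft) == (size_t)-1)
            strcpy(out, query);                     /* not UTF-8 after all */
        else
            *outp = '\0';

        iconv_close(cd);
        return out;
    }

The important property is the fall-through: whenever anything goes wrong, the script sees exactly the octets it would have seen without any of this, so old CGIs keep working.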
Another thing the Apache group should try to think about (I would definitely be willing to help) is how to write out all those pages that contain redirects and similar messages, and that in some cases are seen by the user, in different languages. This would be a great service to the various users, and could probably be implemented with very few additional utility functions. It would then also determine the "charset" of the outgoing message, and that in turn would help us know exactly what to do with URLs for which we know what characters they represent.

> I am not talking theoretically here -- the above describes
> approximately 42% of the installed base of publically accessible
> Internet HTTP servers. It would be nice to have a standard that
> doesn't make them non-compliant.

We definitely agree here. And as there is no requirement, just a *recommendation*, that won't happen. The exercise in backwards compatibility and upgrading strategy we are doing above is a very good thing to do in my eyes, but it is not necessary for the UTF-8-for-URLs recommendation to be workable. That a server stays exactly the way it is now is something that is completely accounted for in our proposal. And for new server installations, we don't have to worry about CGI scripts making assumptions that were never guaranteed by the standard.

> >> Implementers
> >> must also be aware that no current browsers and servers
> >> work in this manner (for obvious reasons of efficiency),
> >> and thus recipients of a message would need to maintain two
> >> possible translations for every non-ASCII URL accessed.
> >
> >With exception of very dense namespaces such as with FORMs, it is
> >much easier to do transcoding on the server. This keeps upgrading
> >in one spot (i.e. a server can decide to switch on transcoding and
> >other things if its authors are giving out beyond-ASCII URLs).
>
> As a server implementer, I say that claim is bogus. It isn't even
> possible to do that in Apache. Show me the code before trying to
> standardize it.

Well, I don't have the code ready. But our basic problem is the following: we have a file system in encoding X, and we want to be prepared to receive URLs in encoding X (for backwards compatibility) and in UTF-8. We can easily do this with a module that combines rewrite and subrequests. Here is a sketch of an algorithm that returns a (possibly) valid local URL:

    input: url as it came in

    if (url is ASCII-only)
        return url;
    if (url is valid UTF-8) {
        url2 = convert-from-UTF-8-to-X(url);
        if (subrequest with url2 is successful)
            return url2;
        else
            return url;
    } else {
        /* url can only be in encoding X, or wrong */
        return url;
    }

> >I showed above why Proposal 1b doesn't work. The computer->
> >paper->computer roundtrip that is so crucial for URLs is
> >completely broken.
>
> Not completely. It only breaks down when you transmit localized
> URLs outside the local environment. That is the price you pay.

We are not ready to pay this price. It strongly discriminates against non-English and non-Latin users. It is not worthy of a real WORLD-wide web. If we have to pay it, there is something wrong with the design of URLs. We know that the solution won't just happen magically, but that's why we are working on it.

> >My proposal is not identical to Proposal 1c. It leaves
> >everybody the freedom to create URLs with arbitrary
> >octets. It's a recommendation.
>
> Then it doesn't solve the problem.

Well, it solves the problem for those who want the problem solved. That's what counts.
Currently, even those who want the problem solved, and who know how it can be done, can't solve it.

> If you want interoperability, you must use US-ASCII. Getting
> interoperability via Unicode won't be possible until all systems
> support Unicode, which is not the case today. Setting out goals
> for eventual standardization is a completely different task than
> actually writing what IS the standard, which is why Larry asked
> that it be in a separate specification. If you are convinced that
> Unicode is the only acceptable solution for FUTURE URLs, then
> write a proposed standard for FUTURE URLs. To make things easier,
> I have proposed changing the generic syntax to allow more octets
> in the urlc set, and thus not even requiring the %xx encoding.
> In that way, if UTF-8 is someday accepted as the only standard,
> then it won't conflict with the existing URL standard.

If the standard gets changed so that %HH is no longer necessary, this has to be done very carefully. If it is done carelessly, it will do more harm than good. Actually, if we take the position that octets beyond 0x7F without %HH-encoding are illegal and therefore don't need backwards compatibility, and if we change the standard as you propose to allow more characters (not octets), then we can deal with everything nicely, and don't even need all those backwards compatibility measures we were speaking about. We just declare the following:

- Whatever is in %HH form should be handled as octets and passed along as such, in %HH form.

- Whatever is characters (outside ASCII) should be treated as characters and passed along as such. When such a URL is submitted to a server, the characters should be encoded as UTF-8.

> >> We cannot simply translate URLs upon receipt, since the server has no
> >> way of knowing whether the characters correspond to "language" or
> >> raw bits. The server would be required to interpret all URL characters
> >> as characters, rather than the current situation in which the server's
> >> namespace is distributed amongst its interpreting components, each of which
> >> may have its own charset (or no charset).
> >
> >There is indeed the possibility that there is some raw data in an
> >URL. But I have to admit that I never yet came across one. The data:
> >URL by Larry actually translates raw data to BASE64 for efficiency
> >and readability reasons.
> >And if you study the Japanese example above, you will also very well
> >see that assuming that some "raw bits" get conserved is a silly idea.
> >Both HTML and paper, as the main carriers of URLs, don't conserve
> >bit identity; they conserve character identity. That's why the
> >draft says:
> >
> >   The interpretation of a URL
> >   depends only on the characters used and not how those characters
> >   are represented on the wire.
> >
> >This doesn't just magically stop at 0x7F!
>
> Transcoding a URL changes the bits. If those bits were not characters,
> how do you transcode them? Moreover, how does the transcoder differentiate
> between %xx as data and %xx as character needing to be transcoded to UTF-8?
> It can't, so requiring UTF-8 breaks in practice.

No. If something is in %HH form, it is always octets, and never has to be transcoded (it would be rather difficult to change current transcoders to do that). This is part of our original proposal, and it seems you didn't understand it until now. What is transcoded, however, are the URLs that are (currently illegally) sent as actual characters.
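To spell this out: once a client has the URL as characters and has converted those characters to UTF-8, submitting it is just a matter of %HH-escaping every octet beyond ASCII; octets that were already given in %HH form pass through untouched. A small sketch, illustration only, with a made-up function name and an example URL that doesn't exist; the usual escaping of reserved ASCII characters is not shown:

    #include <stdio.h>

    /* %HH-escape every octet >= 0x80 of a UTF-8 string; ASCII octets,
     * including already-present %HH sequences, are passed through. */
    static void escape_non_ascii(const char *utf8, char *out, size_t outlen)
    {
        static const char hex[] = "0123456789ABCDEF";
        size_t o = 0;

        for (; *utf8 != '\0' && o + 4 < outlen; utf8++) {
            unsigned char c = (unsigned char)*utf8;

            if (c >= 0x80) {                /* UTF-8 lead or continuation */
                out[o++] = '%';
                out[o++] = hex[c >> 4];
                out[o++] = hex[c & 0x0F];
            } else {
                out[o++] = (char)c;
            }
        }
        out[o] = '\0';
    }

    int main(void)
    {
        char buf[256];

        /* "Zuerich" written with a real u-umlaut, UTF-8 encoded as C3 BC */
        escape_non_ascii("http://www.example.org/Z\xC3\xBCrich.html",
                         buf, sizeof buf);
        puts(buf);    /* prints http://www.example.org/Z%C3%BCrich.html */
        return 0;
    }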
The assumption is of course that %HH escaping is always used for those URLs that are actual data, and that actual characters outside ASCII are used only for those that represent characters (the big majority).

> >> Even if we were to make such
> >> a change, it would be a disaster since we would have to find a way to
> >> distinguish between clients that send UTF-8 encoded URLs and all of those
> >> currently in existence that send the same charset as is used by the HTML
> >> (or other media type) page in which the FORM was obtained and entered
> >> by the user.
> >
> >I have shown how this can work easily for sparse namespaces. The solution
> >is to test both raw and after conversion from UTF-8 to the legacy encoding.
> >This won't need many more accesses to the file system, because if a
> >string looks like correct UTF-8, it's extremely rare that it is something
> >else, and if it doesn't look like correct UTF-8, there is no need to
> >transcode.
>
> You keep waving around this "easily" remark without understanding the
> internals of a server. If I thought it was easy to do those things,
> they would have been implemented two years ago.

Well, the whole thing was discussed in extenso in the ftp-wg, and it was decided that accepting filenames both in a legacy encoding and in UTF-8, and figuring out which one it was, is definitely implementable for an FTP server. Please read the newest FTP internationalization draft or the archives of that group. Of course, HTTP and FTP are not the same, but in many ways they are similar.

> What do you do if you have two legacy resources, one of which has the
> same octet encoding as the UTF-8 transcoding of the other?

This can indeed happen, but it is extremely rare. For some legacy encodings (e.g. KOI-8, which is very popular in Russia), the probability is zero. For Latin-1, the situation is the following: the legal Latin-1 sequences that are also legal UTF-8 sequences, and that when interpreted as UTF-8 contain only characters from the Latin-1 character set, are two-letter sequences of the following form: the first letter is an A with circumflex or an A with tilde (0xC2 or 0xC3), and the second letter is an octet in the 0xAx or 0xBx range, i.e. not a letter but a symbol such as the inverted question mark, 1/4, +-, the cent sign, and so on. Note that any single beyond-ASCII Latin-1 letter on its own (which is the most frequent way these letters appear in Latin-1) makes it impossible for the string to be UTF-8.

Even in cases where something can theoretically be UTF-8, there are a lot of things that in practice make the probability of clashes EXTREMELY low. For example, I asked a friend of mine who maintains a large Japanese<->English dictionary to cull all the Japanese entries in his dictionary that, when encoded as EUC or SJIS, could possibly be UTF-8. For EUC, he found some 2.7%, or about 3000 entries. For SJIS, he found just two entries. Having a look at the EUC entries in a UTF-8 editor, I find a lot of ASCII characters (wrongly coded as two or three bytes, but not culled), a lot of accented characters and characters from all kinds of alphabets including Greek, Arabic, and Hebrew, and a lot of undefined codepoints. In the whole file, I found one single Kanji (a simplified Chinese one that would never appear in Japanese) and no Hiragana or Katakana at all.
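(For reference, and again just a sketch with a made-up name: the "is it legal UTF-8 at all" test that all of this hinges on is only a few lines. It checks the octet structure only -- a lead octet followed by the right number of continuation octets -- which is all a server needs before it even considers a second lookup.)

    /* Sketch: does this octet string follow the UTF-8 byte pattern? */
    static int looks_like_utf8(const unsigned char *s)
    {
        while (*s != '\0') {
            unsigned char c = *s++;
            int follow;

            if (c < 0x80)       continue;   /* plain ASCII              */
            else if (c < 0xC0)  return 0;   /* stray continuation octet */
            else if (c < 0xE0)  follow = 1; /* two-octet sequence       */
            else if (c < 0xF0)  follow = 2; /* three-octet sequence     */
            else if (c < 0xF8)  follow = 3; /* four-octet sequence      */
            else                return 0;   /* longer forms: not used   */

            while (follow-- > 0) {
                if ((*s & 0xC0) != 0x80)    /* must be 10xxxxxx         */
                    return 0;
                s++;
            }
        }
        return 1;
    }

A string that passes this test and, in addition, decodes to reasonable characters of the language in question is, as argued above, overwhelmingly likely to really be UTF-8.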
Of course, this sample may not be representative of Japanese filenames, but if the chance that something is legal UTF-8 is so low, and the chance that it is reasonable UTF-8 is much, much lower still, how low do you think is the chance that somebody will have exactly those two filenames that produce a clash? If it weren't for the fact that this thing could be tweaked and hacked, I would happily volunteer to include a check in Apache for such a case (*this* would come at significant file system access costs) and send the first person who encounters one something really nice, worth 100$ or even more :-). And I bet I could keep those 100$ for quite a long time.

> How do you
> justify the increased time to access a resource due to the failed
> (usually filesystem) access on every request? Why should a busy server
> do these things when they are not necessary to support existing web
> services?

There is no increased time. Either the URL is legal UTF-8 (in which case the chance that it indeed is UTF-8 is 99% or higher) or it is not UTF-8 (in which case the chance is 100% that it is in the legacy encoding). This point has been mentioned several times by now, and I hope you finally understand it.

> After all, there are at least a hundred other problems that the draft
> *does* solve, and you are holding it up.

We are not holding it up. We have a clear proposal (the currently officially proposed wording was drafted by you), and you just have to include it, or propose something else that meets the same intentions. Many people in this group have clearly stated that the draft as it currently stands is not sufficient.

> >> Proposal 1c: Require that the browser submit the form data in the same
> >> charset as that used by the HTML form. Since the form
> >> includes the interpreter resource's URL, this removes all
> >> ambiguity without changing current practice. In fact,
> >> this should already be current practice. Forms cannot allow
> >> data entry in multiple charsets, but that isn't needed if the
> >> form uses a reasonably complete charset like UTF-8.
> >
> >This is mostly current practice, and it is definitely a practice
> >that should be pushed. At the moment, it should work rather well,
> >but the problems appear with transcoding servers and proxies.
> >For transcoding servers (there are a few out there already),
> >the transcoding logic or whatever has to add some field (usually
> >a hidden field in the FORM) that indicates which encoding was
> >sent out. This requires a close interaction of the transcoding
> >part and the CGI logic, and may not fit well into a clean
> >server architecture.
>
> Isn't that what I've been saying? Requiring UTF-8 would require all
> servers to be transcoding servers, which is a bad idea. I am certainly
> not going to implement one, which should at least be a cause for concern
> amongst those proposing to make it the required solution.

Just a moment. Transcoding happens at different points. What I mean by a transcoding server is a server that transcodes the documents it serves. For example, it would get a request with

   Accept-Charset: iso-8859-2, iso-8859-5, iso-8859-1;q=0.0

for a German document it keeps in iso-8859-1. It would take that document, transcode it to iso-8859-2, and serve it. This is functionality that would be rather good for a server to have, in order to keep clients small.

> >For a transcoding proxy (none out there
> >yet as of my knowledge, but perfectly possible with HTTP 1.1),
> >the problem gets even worse.
>
> Whoa!
> A transcoding proxy is non-compliant with HTTP/1.1.
> See the last part of section 5.1.2 in RFC 2068.

The last part of section 5.1.2 prohibits changing URLs that are sent upstream (from client via proxy to server). There is nothing that prohibits a proxy that gets a request with

   Accept-Charset: iso-8859-2, iso-8859-5, iso-8859-1;q=0.0

from a client and that retrieves a resource tagged charset=iso-8859-1 from doing the necessary translations. Indeed, the design of the whole architecture suggests that such and similar translations are one of the main jobs proxies are made for (besides security and caching). The fact that changing URLs upstream is explicitly prohibited shows very clearly that treating URLs as octets and relying on your Proposal 2c (send the URL back in the same encoding you got the FORM page in) is going to break sooner or later.

> >Well, I repeat my offer. If you help me along getting into Apache,
> >or tell me whom to contact, I would like to implement what I have
> >described above. The deadline for submitting abstracts to the
> >Unicode conference in San Jose in September is at the end of this
> >week. I wouldn't mind submitting an abstract there with a title
> >such as "UTF-8 URLs and their implementation in Apache".
>
> Sorry, I'd like to finish my Ph.D. sometime this century.
> How can I help you do something that is already known to be impossible?

By studying what I and others write, and finding out that it is not as impossible as you thought :-).

Regards,   Martin.