Date: Fri, 7 Mar 1997 14:50:36 +0100 (MET)
From: "Martin J. Duerst" <firstname.lastname@example.org>
To: "Roy T. Fielding" <fielding@kiwi.ICS.UCI.EDU>
Cc: URI List <email@example.com>
Subject: Re: I18N Concensus - Generic Syntax Document
In-Reply-To: <firstname.lastname@example.org>
Message-Id: <Pine.SUN.3.95q.970307134328.245D-100000@enoshima>

Hello Roy,

Many thanks for voicing your concerns and giving me a chance to
answer them.

On Fri, 7 Mar 1997, you wrote:

> >+ It is recommended that UTF-8 [RFC 2044] be used to represent characters
> >+ with octets in URLs, wherever possible.
> >
> >+ For schemes where no single character->octet encoding is specified,
> >+ a gradual transition to UTF-8 can be made by servers making resources
> >+ available with UTF-8 names on their own, on a per-server or a
> >+ per-resource basis. Schemes and mechanisms that use a well-
> >+ defined character->octet encoding which is however not UTF-8 should
> >+ define the mapping between this encoding and UTF-8, because generic
> >+ URL software is unlikely to be aware of and to be able to handle
> >+ such specific conventions.
>
> Here is where you lose me.

Don't worry. I hope we will have you back soon again :-).

> I have no desire to add a UTF-8 character
> mapping table to our server.

There is no need to do so. The above is only a *recommendation*.
For your server, if you:

- Don't have, or think you will have, anything other than ASCII
- Think that the URLs used on your server aren't related to characters
- Think that it's too difficult to find out which resource name is in
  which encoding
- Think that you need a really small server, and the tables you would
  need would be too large
- Think that people using your URLs don't care about knowing what
  characters are behind them anyway
- Think that English (or Swahili) is enough to serve the world's needs,
  and that everybody should learn English for common communication or
  to make life easier for software engineers
- Are just too lazy, have other priorities, don't have the necessary
  expertise, and so on

In all those cases, and probably quite a few more, you don't need to
add UTF-8 character mapping facilities to your server. Of course, it
is still nice if you try to do it.

In addition, for systems that are already Unicode-based, such as
Plan9, the Newton, Windows NT, Java,..., you don't need any tables,
just some really short piece of code.

> An HTTP server doesn't need one -- its URLs are
> either composed by computation (in which case knowing the charset is not
> possible) or by derivation from the filesystem (in which case it will use
> whatever charset the filesystem uses, and in any case has no way of
> determining whether or not that charset is UTF-8).

It's not the HTTP server that causes the need to have characters
encoded in some defined way. It's the users who want to know what's
behind a URL, a facility which is so obviously useful to English users
that they might not even notice it, but which is not consistently
available to others.

Anyway, "computation" can subsume very many things, and quite a few of
them include character manipulation; in those cases, you usually know
(explicitly or implicitly) which character encoding you are dealing
with. If you don't, there are not many useful computations you can
make with characters.
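To illustrate how short the "really short piece of code" for a
Unicode-based system can be, here is a minimal sketch in modern Python
(an illustration of the principle, obviously not 1997 server code):
take a filename that is already available as Unicode characters,
encode it as UTF-8 octets, and %-escape the octets not allowed
literally in a URL.

```python
from urllib.parse import quote

def utf8_url_path(filename):
    """Map a Unicode filename to a URL path segment: encode the
    characters as UTF-8 octets, then %-escape every octet that is
    not allowed literally in a URL."""
    return quote(filename.encode('utf-8'))

# An accented filename becomes a character-identical URL that any
# UTF-8-aware client can interpret consistently:
print(utf8_url_path('Zürich.html'))   # -> Z%C3%BCrich.html
```

The same recipe works from any system whose filenames have a defined
character interpretation; only the first decoding step differs.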
It would help if you could be more specific about what you mean by
"computation".

For filesystems, there are quite different kinds. You are probably
assuming a UNIX-like file system, where the interpretation of
filenames in terms of characters depends on the font settings in your
xterm or the font in your glass tty ROM. There are other systems where
the interpretation of filenames in terms of characters is very clearly
defined (see above), so your argument is not general. Anyway, there
are various ways to determine the character encoding of a filename on
a UNIX-like system. The problem is quite similar to determining the
character encoding of the resources themselves: we know that it's not
easy, but we know that it's the right thing to do, and that we have to
find means to make this easier on such systems.

Also, there are probably servers available on IBM hosts, where
filenames are in EBCDIC. What do those servers do? Do they accept URLs
based on octet identity of filenames? Or do they do conversion, so
that users get what they expect, namely character identity? For an
ASCII URL such as:

   http://www.ibmmain.com/Fielding.html

do you expect this to look as above, or do you think it is (or should
be)

   http://www.ibmmain.com/%C6%89%85%93%84%89%95%87K%88%A3%94%93

because the server is too lazy to convert from/to EBCDIC? Or would
you, as a data provider, like to calculate the names in EBCDIC so that
you don't know what they mean, but they appear as meaningful URLs to
outside users?

I guess the only sensible answer here, even for you, is that the
server does conversion. What we are proposing with UTF-8 is just that
not only English/Latin users get this nice and natural service, but
that there is at least the *possibility* that others can establish it
too, even if you yourself don't want to get involved in it.
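The two alternatives above can be reproduced with a short sketch
(using Python's 'cp037' codec as a stand-in for whichever EBCDIC code
page a given host actually uses): %-escaping the raw EBCDIC octets of
"Fielding.html" yields exactly the opaque URL shown, while converting
the characters first yields the readable one.

```python
from urllib.parse import quote

name = 'Fielding.html'

# Octet identity: %-escape the raw EBCDIC bytes of the filename.
# (cp037 is one common EBCDIC code page; an assumption here.)
ebcdic_url = quote(name.encode('cp037'))
print(ebcdic_url)   # -> %C6%89%85%93%84%89%95%87K%88%A3%94%93
# (the lone 'K' is octet 0x4B, EBCDIC '.', which happens to be
#  a character allowed literally in URLs)

# Character identity: convert to a defined character encoding first,
# so users see the characters the data provider intended.
ascii_url = quote(name.encode('ascii'))
print(ascii_url)    # -> Fielding.html
```

The octet-identity URL is not merely ugly: it silently changes
whenever the host's code page does, which is exactly the transcoding
instability discussed below.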
Of course, there is some danger that after some time, with enough
UTF-8 servers and clients around, users will just expect that it works
on all servers and clients, and that you might get forced by your user
base to do some implementation. But that's probably a long time ahead,
and it would just be the ultimate proof of the desirability of a
consistent character encoding in URLs, not an argument to try to
avoid it.

> The server doesn't care
> and should not care.

The users care, and that's why the servers probably should care. Or do
you just serve random data, because the server doesn't care anyway
whether the users get something reasonable?

> It is therefore inappropriate to suggest that it should
> add such a table when doing so would only bloat the server and slow-down
> the URL<->resource mapping process.

There is no suggestion to add a table. Implementation is up to you. A
table-based mapping can be extremely fast; it won't slow down the
process if it's done correctly. Depending on the character encodings
you have on your server, the tables don't have to be very large, and
there are numerous very efficient techniques for sparsely populated
tables.

Also, there is no need to have the conversion inside the server. For
example, if your server is file-based, you can have a small program
running once a day that, for every filename in your legacy encoding,
creates a link with the corresponding UTF-8 name. Here, the
disadvantage of UNIX-like file systems that don't have a defined
character encoding for filenames turns into an advantage. This makes
the resources available under both the legacy-encoded URLs and the
UTF-8 encoded URLs, which provides a smooth upgrade path (as discussed
in my original outline).

> >> Data corresponding to excluded characters must be escaped in order
> >> to be properly represented within a URL.
> >> However, there do exist
> >> some systems that allow characters from the "unwise" and "national"
> >> sets to be used in URL references (section 3); a robust
> >> implementation should be prepared to handle those characters when
> >> it is possible to do so.
> >
> >Change to:
> >
> >There exist some systems that allow characters/octets from the
> >"unwise" and "others" sets to be used in URL references (section 3).
> >Until a uniform representation for characters within URLs is firmly
> >established, such practice is not stable with respect to transcoding
> >and therefore should be avoided.
> >However, robust implementations should be prepared to handle those
> >octet values when it is possible to do so.
>
> No thanks -- the existing paragraph is far better. Transcoding is
> not an issue unless they are already violating the specification,
> in which case they are prepared to suffer the consequences.
> The purpose of the paragraph is to prevent an implementer from
> interpreting the spec too literally and crashing on a non-urlc
> character.

The problem is that a lot of them are currently prepared to "suffer"
the consequences because it just works; there are no visible
consequences. And as long as it works, people will continue to use it,
because it provides some very convenient features to them; just
disallowing it officially won't keep them from using it. Telling them
where and why it will stop working will hopefully make some of them
understand, and will have them (for the time being, at least)
discontinue this practice.

Regards,
Martin.