Date: Mon, 14 Apr 1997 21:55:15 +0200 (MET DST)
From: "Martin J. Duerst" <email@example.com>
To: Larry Masinter <firstname.lastname@example.org>
Cc: Francois Yergeau <email@example.com>, firstname.lastname@example.org
Subject: Re: UTF-8 URL for testing
In-Reply-To: <334DADDC.5CBC@parc.xerox.com>
Message-Id: <Pine.SUN.3.96.970411161247.245p-100000@enoshima>

On Thu, 10 Apr 1997, Larry Masinter wrote:

> Is there any software anywhere in the world that actually generates
> URLs like these? All of the examples seem to be carefully
> hand-constructed.

> Since these URLs are compatible with existing browsers, as you say,
> there should not be any difficulty in people running their web servers
> this way. Do any web servers in Japan use hex-encoded UTF-8-encoded
> Unicode for URLs?

> The problem with recommending this method for "Draft Standard" is
> not the "six month delay" it takes in getting to draft standard,
> it's that we should not recommend something that people aren't
> actually going to do. This is not some kind of nit-picky technical
> objection, it's fundamental to the process of Internet standards.

Thanks for expressing your concerns so that they don't look like
procedural nit-picking.

Implementors can be split into browser side and server side.

On the server side, implementation is trivial for some systems
(Plan 9, Windows NT), and more difficult for others (Unix). It is
also difficult if you already have (strictly speaking illegal)
natively encoded URLs out on your pages or on people's napkins.
But for most cases, there are very nice upgrade paths, and fast
"workarounds" until there is fully configurable support in a server.
For Jigsaw, one of the servers mentioned, implementation should be
really easy, as characters are Unicode anyway in Java, and all the
conversion support is there with Java 1.1.

For the browser side, display is not really much of an issue
anymore. Both Netscape and MS Explorer can handle UTF-8 in their
newest versions.
They may have to streamline their interfaces so that UTF-8 also
works in the status bar, but that should be a minor issue. Input
is a little bit more of a problem; in the case of Netscape, for
example, I have found that form fields and such usually lag behind
in internationalization by a browser version or two.

My judgement of the situation, after several recent conferences
in the multilingual/internet/unicode field, is that we are in
some kind of oversaturated-fluid situation (my English for
physics is not very good, sorry). There are quite a few people,
in particular browser makers, who are very aware of the problems
and would like to know what the solution will be. They are
waiting for that small seed crystal to drop in. After that,
change might be slower or faster, but it will happen.

There is absolutely no question for me that once the Japanese
realize that they can use their characters in URLs, they will
do so. Currently, they can't, because they have to deal with
two or three encodings, and autodetect or "let the user guess"
doesn't work well on short strings.

> I am eager to actually support internationalization.

Many thanks for reassuring me of that.

> However,
> I think it is counter-productive to foist hex-encoded UTF-8-encoded
> URLs (12 bytes to represent one 16-bit Kanji)

It is *9* bytes for a 16-bit Kanji. But I agree that this is
still rather long. Up to the Unicode conference last September,
I was against using UTF-8 for exactly this reason, while many
others already pushed for UTF-8. I came to the conclusion that
the escaping problems with UTF-7 were more serious, and that the
length issue was to a large part a temporary one, because with
native encoding or 8-bit UTF-8 it would not be an issue.
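
The count of nine can be checked with a few lines of code. What
follows is only a minimal sketch, assuming Python 3 and its standard
urllib.parse.quote; the sample kanji and the helper name
escaped_length are arbitrary choices made for illustration.

```python
from urllib.parse import quote

def escaped_length(ch: str) -> int:
    """Number of characters in ch after UTF-8 encoding and %HH escaping."""
    return len(quote(ch, safe="", encoding="utf-8"))

kanji = "\u6f22"  # arbitrary sample kanji from the 16-bit range (assumption)

raw = kanji.encode("utf-8")                        # 3 bytes on the wire as 8-bit UTF-8
escaped = quote(kanji, safe="", encoding="utf-8")  # each byte becomes %HH

print(len(raw))                 # 3
print(escaped, len(escaped))    # %E6%BC%A2 9
```

Three UTF-8 bytes, each written as a three-character %HH escape,
give 3 x 3 = 9 characters.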
Even for languages such as Georgian, which are suited for native
8-bit encodings but need three bytes in UTF-8, representing such
a character with %HH as we do now and representing it with three
bytes in the future will use exactly the same amount of space.

I would of course have liked to push directly for 8-bit URLs
with UTF-8, but the requirement of having a lowest-common-denominator
form, the discussion on URNs, and also the discussion on the URL
and URI lists seemed to indicate that it was advisable to first
clearly establish the mapping mechanisms for 7-bit URLs, and then
go on and start to introduce native URLs from the user interface
side. If there should suddenly be consensus for going all the way
in one step, I definitely wouldn't mind.

> on the rest of the
> world merely because a western European and a Canadian like
> the idea. Surely we can find a site in Japan, China, Israel, or
> Russia that would support exporting their URLs with hex-encoded
> UTF-8-encoded URLs, before believing that this isn't yet
> another form of Unicode imperialism.

The proposal is explicitly formulated as a recommendation
to avoid any impression of imperialism. We don't want to
force anybody.

> Otherwise, we would just
> have a pretend solution to a real problem.

I understand that you are worrying about Unicode acceptance
in Japan. I have been studying this phenomenon, participating
in discussions, reading articles, collecting material, and so
on, since at least 1993. I have done all this both in Japanese
and in English. And I have discussed this with many others,
many of them more involved (with actual products) in Japan than
myself. The bottom line of all of it is that:

- The "arguments" against Unicode are rumors that quickly
  cease in actual discussion.
- The Unicode opponents in Japan are very vocal, and use
  whatever tactics and contortions they can to spread
  their stuff.
- There are many people clearly in favor of Unicode, but they
  are less vocal.
- In serious technical discussion, for example in Japanese
  magazines oriented towards computer professionals, the
  merits of Unicode are clearly stressed.
- Unicode is already used by many Japanese users without their
  being aware of it. The main text processing software packages
  on the market (Ichitarou and MSWord) use Unicode. They don't
  advertise with stickers saying "Unicode inside", so you may
  very well have the situation that somebody spreads rumors
  against Unicode while taking advantage of it :-).

What people in Japan and elsewhere will do is not "exporting
their URLs with hex-encoded UTF-8". In a few months or years,
webmasters will just configure their servers to tell them
which character encoding the local filesystem or other
resources are using. Users will just type in their characters.
They won't care whether it's Unicode or SJIS or EUC or whatever,
and they won't even be surprised that it works, just as US
users currently are not surprised that they can use the
characters they know in URLs, and don't care whether they
use ASCII or EBCDIC or CPXYZ or whatever.

Regards, Martin.