- From: Edward Cherlin <cherlin@newbie.net>
- Date: Thu, 24 Apr 1997 00:09:11 -0700
- To: uri@bunyip.com
Larry Masinter <masinter@parc.xerox.com> wrote:

>> Well, they print something like http://WEB.SANYO.CO.JP/FOODSHOP,
>> where upper case is Japanese characters.

> Actually, this is unsatisfactory. What, exactly, would they
> print? Would they print "http://" too? Will Japanese users
> find that familiar and comfortable? I'm afraid "something
> like" isn't useful as a specification.

OK. For a hypothetical Isshin Depaato/Produce (because I don't have the kanji for Sanyoo handy), they use three switchable input methods (an ASCII [romaji] keyboard; kana-to-kanji or, for foreigners, romaji-to-kanji conversion; a kana keyboard) to enter a text which we can represent with ASCII and Unicode as:

  h t t p : / /
  U+4E00  i    [kanji] one
  U+5FC3  shin [kanji] heart
  U+30C7  de   [katakana]
  U+30D1  pa   [katakana]
  U+30FC  --   [katakana] vowel extension
  U+30C8  to   [katakana]
  /
  U+516B  ya   [kanji] eight
  U+767E  o    [kanji] hundred
  U+5C4B  ya   [kanji] shop
  /

In practice the software will most likely use one of the double-byte Japanese codes that includes places for the ASCII characters, and we are supposing that this will be translated to pure Unicode, then to UTF-8, and so on. I am not giving the full set of encodings that this URL goes through, because the translation is algorithmic. You can look up the Unicode code points for the ASCII characters yourself, if you insist on having them.

If absolutely necessary, I can specify the keystrokes and screen display for the whole process, using the multilingual tools available to me. Since they are not the same as the tools used in Japan, I don't think this would add anything to the demonstration. If anyone objects that this is still not definite enough, I invite them to try out a Mac with a kanji OS, or a PC with kanji Windows, such as would appear in the Japanese market. The details differ, but the process of entering text is basically the same: multiple input methods which together handle ASCII, symbol and dingbat fonts, kanji, and both types of kana.
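Since the translation from code points to UTF-8 is algorithmic, it can be shown mechanically. Here is a minimal sketch in modern Python, purely for illustration (1997-era software would have worked from Shift-JIS or EUC-JP internally); the name is the hypothetical Isshin Depaato / yaoya path from the text, not a real host:

```python
# Hypothetical path spelled with the code points listed above.
chars = "\u4e00\u5fc3\u30c7\u30d1\u30fc\u30c8/\u516b\u767e\u5c4b"

# Print each character's code point and its UTF-8 byte sequence.
for ch in chars:
    if ch == "/":
        print("/")
        continue
    utf8 = " ".join(f"{b:02X}" for b in ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} -> {utf8}")
# e.g. U+4E00 -> E4 B8 80, and U+30C7 -> E3 83 87
```

Each kanji and katakana character above becomes a three-byte UTF-8 sequence, which is what the %HH escaping discussed later operates on.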
Japanese schoolchildren are taught all four scripts in school. Japanese word-processing software supports all four scripts. Books, magazines, and newspapers in Japan routinely mix all four scripts. Even street, train, and subway signs in Tokyo and some other cities mix all four scripts.

>> Of course, for this we have
>> to assume that DNS works with characters beyond ASCII, but that's
>> a separate problem that can be solved (see draft-duerst-dns-i18n-00.txt).

> I fundamentally disagree with your idea that we can
> promote the solution to a problem in pieces, where the
> pieces, just by themselves, don't actually solve a
> problem and, in fact, introduce interoperability
> difficulty. So I'm unwilling to "assume" that other
> pieces of the solution will be introduced in order
> to make a whole.

As far as I have heard in this discussion, the only places where interoperability would be a problem with this proposal are already an insoluble problem without it. Do we object so strongly to breaking usage that violates the existing standards and is already broken?

>> This is entered as such into a browser. We assume that those users
>> that are the target of the Sanyoo depaato food shop page can read
>> Japanese and have equipment that allows them to input Japanese.
>> I won't go into the details of entering the corresponding characters;
>> it's a process the Japanese computer users are very familiar with.

> No, I'm sorry, this is completely inadequate. I'm vaguely familiar
> with a number of Japanese typing methods, and I believe
> that you've not been specific enough. What happens with the
> codes for "http://", for example, since these are not 'Japanese
> characters'? What about unusual names which seem to be printed
> with furigana in Japanese newspapers?

I see you know enough about Japanese writing to be dangerous, but not enough to be helpful. :-) I repeat: the ASCII characters are typed on a romaji keyboard layout, which students must learn in school.
The double-byte Japanese character codes include ASCII. Unusual names using characters not in the standard code sets cannot be used in URLs either in current non-standard practice or in our Unicode-based proposal. This is not an issue, since no page designer or Webmaster would try to use characters that cannot be represented in the computer. Names using characters in the code set that require a pronunciation guide when printed are a little harder to enter than more common characters, but they can be entered using kana conversion if the pronunciation is in the conversion dictionary, or otherwise by radical, stroke count, menu selection, or one of the other indexing methods. Perhaps the code point number would have to be provided alongside the pronunciation when such a character is used in a URL.

>> The browser then would convert the Japanese characters into UTF-8
>> (and add %HH encoding) and pass the URL to the resolver machinery,
>> where the host part would be resolved with DNS, and then the machine
>> at the corresponding IP number would be contacted with HTTP.

> This discussion applies only to HTTP URLs, though. You're
> proposing that the recommendation be put into place for
> all existing URL schemes and new versions of them, too.

This is merely an illustration. It is obvious that we would have to do the same process to support ftp:, gopher:, telnet:, and other URLs using non-ASCII Unicode characters. Again, the DNS would need the ASCII-encoded domain name, and the server at that site would deal with the rest.

I have said at least ten times in this discussion, with no acknowledgement from anyone, that we are to assume that people will not publish Unicode URLs without knowing that their servers support them. If I am going to create an ftp: site and I don't check what version of what ftp server I'm using, I'm a fool, and likewise for gopher: and telnet: and the others. If I put out an https: URL and I don't have a secure server to receive it, I'm a fool.
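The browser-side conversion quoted above (Japanese characters to UTF-8, then %HH encoding) can be sketched as follows. This is an illustration in modern Python, not the 1997 machinery; the host name and path are hypothetical, and the host part is left in ASCII here since DNS internationalization is treated as a separate problem:

```python
from urllib.parse import quote

# Hypothetical department-store URL from the discussion.
host = "web.sanyo.co.jp"
path = "/\u4e00\u5fc3\u30c7\u30d1\u30fc\u30c8/\u516b\u767e\u5c4b/"

# Convert the path to UTF-8 and apply %HH escaping; the result is a
# plain ASCII URL that existing resolver machinery can carry unchanged.
encoded_path = quote(path)  # quote() UTF-8-encodes non-ASCII by default
url = f"http://{host}{encoded_path}"
print(url)
```

The point of the sketch is that everything past the browser sees only ASCII: the escaped path travels through DNS resolution and HTTP untouched.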
If I intend to accept encoded UTF-8, I need to find out how my server can deal with it. If I don't intend to accept it, I can regard encoded UTF-8 in URLs as plain ASCII, without breaking any process that is not already broken.

>> That
>> machine would of course have been set up so that the correct page
>> is returned.

> How, please, is the machine set up? What has to be done at
> the server & system administration level? What's the transition
> strategy for a server that wants to serve current clients
> as well as these new browsers that can deal with the proposal
> you're promoting?

Strategy 1. Use the ASCII-encoded URL as is to find the requested page (i.e., use the encoded directory and file names, or whatever), or pass the encoded data to the application for processing, including determining the need for character set conversion. In other words, make no changes to the server whatsoever, and still have full support for UTF-8 URLs.

Strategy 2. Put in as many bells and whistles as you like. Let the server convert, where possible, from ASCII to UTF-8 to Unicode to the local character set. Let the server convert other %HH-encoded data to whatever data format the page designer requests.

Neither of these strategies will work as well as we want in all cases. The cases that fail, however, are broken today. We are not introducing new points of failure. The full solution will require detailed design in order to succeed in all cases, but the interim solution only needs to permit success *for those motivated to get the software they need*.

>> I hope this explanation is detailed enough. If you don't understand
>> some part of it, please tell us.
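The two server strategies above can be sketched briefly. This is a hedged illustration in modern Python; the request path, file layout, and character sets are assumptions, and a real 1997 Japanese server might convert further to EUC-JP or Shift-JIS rather than stopping at UTF-8:

```python
from urllib.parse import unquote_to_bytes

# A %HH-encoded request path, as it would arrive over the wire in ASCII.
request_path = "/%E4%B8%80%E5%BF%83%E3%83%87%E3%83%91%E3%83%BC%E3%83%88/index.html"

# Strategy 1: no server changes. Treat the encoded path as an opaque
# ASCII string and use it directly as the lookup key (e.g. a directory
# created on disk under the same encoded spelling).
key = request_path

# Strategy 2: decode %HH to raw bytes, then interpret the bytes as
# UTF-8 to recover the local-script name.
decoded = unquote_to_bytes(request_path).decode("utf-8")
print(decoded)  # /一心デパート/index.html
```

Strategy 1 is what makes the interim proposal deployable with unmodified servers: an encoded name is just ASCII, so nothing that handles ASCII today needs to change.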
> As you see, it was inadequate for the purposes of
> being a stand-in for 'running code': there are
> a number of unresolved design issues in your plan,
> those design issues must be resolved before interoperable
> implementations can be deployed, and I'm uncertain
> as to whether the results, when taken in total,
> actually solve the problem you set out to solve,
> or even improve the situation significantly. And
> given the difficult transition strategy and lack
> of interoperability with currently deployed systems,
> I doubt that a proposal will actually be adopted
> unless that's so.

Problems to be solved:

- Provide for URLs which can be (but are not required to be) displayed in a non-Latin or extended-Latin script meaningful to the user.
- Provide for receipt of non-ASCII text data in URLs, including multilingual, multiscript form input.

We have proposed an ASCII encoding of Unicode to be recommended, but not required, in both of these cases. We have shown how such URLs can be generated using simple filter programs. We have demonstrated the use of such URLs in existing browsers. We have heard from major implementors that this is their intended future direction. We have heard from users that they need this capability now. We have heard that numerous other standards will incorporate Unicode and provide for the necessary character set encodings and conversions.

On the other hand, it may not matter much whether this gets into this standard. We have begun the process of implementation, and those of us who want to do it are agreed that this very simple and obviously incomplete proposal solves many of our problems, so we will go do it no matter what anybody else says. Then we will come back and ask for a detailed standard based on what we have done.

> That's why "something like" is inadequate above.
> If someone had running code, they could just run it
> and show us what the results were.
> Regards,
>
> Larry
> --
> http://www.parc.xerox.com/masinter

Very well, we have had several people post patches for servers. Can we have the servers themselves set up somewhere, with a selection of pages in various scripts and with assorted URL formats? We'll want at least Japanese kanji and kana URLs, and a multilingual form of some sort.

"I'll bet you can't turn *that* into a pumpkin." --Witches Abroad, by Terry Pratchett

--
Edward Cherlin            cherlin@newbie.net       Everything should be made
Vice President            Ask. Someone knows.      as simple as possible,
NewbieNet, Inc.                                    __but no simpler__.
http://www.newbie.net/                             Attributed to Albert Einstein
Received on Thursday, 24 April 1997 04:06:10 UTC