Date: Mon, 21 Apr 1997 17:46:36 +0200 (MET DST)
From: "Martin J. Duerst" <firstname.lastname@example.org>
To: Larry Masinter <email@example.com>
Cc: firstname.lastname@example.org, email@example.com, Harald.T.Alvestrand@uninett.no
Subject: Re: revised "generic syntax" internet draft
In-Reply-To: <335B6488.1BD1@parc.xerox.com>
Message-Id: <Pine.SUN.3.96.970421153537.245K-100000@enoshima>

On Mon, 21 Apr 1997, Larry Masinter wrote:

> > Well, they print something like http://WEB.SANYO.CO.JP/FOODSHOP,
> > where upper case is Japanese characters.
>
> Actually, this is unsatisfactory. What, exactly, would they
> print? Would they print "http://" too? Will Japanese users
> find that familiar and comfortable? I'm afraid "something
> like" isn't useful as a specification.

I put in the "something like" because I don't assume that you, or everybody else in the group, can view or read actual Japanese characters. As for the "http://", I guess neither Japanese nor US users find that very comfortable. And in many browsers, you can just leave it out.

> > Of course, for this we have
> > to assume that DNS works with characters beyond ASCII, but that's
> > a separate problem that can be solved (see draft-duerst-dns-i18n-00.txt).
>
> I fundamentally disagree with your idea that we can
> promote the solution to a problem in pieces, where the
> pieces, just by themselves, don't actually solve a
> problem and, in fact, introduce interoperability
> difficulty. So I'm unwilling to "assume" that other
> pieces of the solution will be introduced in order
> to make a whole.

We don't have to "assume". Indeed, we don't know whether the DNS guys will be convinced to adopt a solution for i18n URLs, or what this solution will look like. And as long as that work hasn't been done, we can't use beyond-ASCII characters in domain names, and so we won't.
But what if they decided to work on a solution and to deploy it, only to conclude "well, we can't, because character encoding in URLs is not working"?

The basic problem currently is that URLs have an extremely well-defined way to represent ASCII characters, but are useless when it comes to representing any other characters. The IMAP case has shown this very well. The IMAP protocol has a very specific way of encoding beyond-ASCII characters (called "modified UTF-7"). The IMAP people would very much like to know how such characters are encoded in URLs, so that they can define the respective URLs and can tell implementors how to convert from URLs to IMAP protocol calls. When they turn to the URL definition (as it stands with the current draft), the only answer they get is "it's undefined". What this translates to is "it's a mess, and we didn't care to solve it".

> > This is entered as such into a browser. We assume that those users
> > that are the target of the Sanyoo depaato food shop page can read
> > Japanese and have equipment that allows them to input Japanese.
> > I won't go into the details of entering the corresponding characters,
> > it's a process the Japanese computer users are very familiar with.
>
> No, I'm sorry, this is completely inadequate. I'm vaguely familiar
> with a number of Japanese typing methods, and I believe
> that you've not been specific enough. What happens with the
> codes for "http://", for example, since these are not 'Japanese
> characters'? What about unusual names which seem to be printed
> with furigana in Japanese newspapers?

Short answer: Japanese computer users know how to enter Japanese. For URLs, we can safely assume this, and don't have to make it part of the spec. The layout of the QWERTY keyboard is not part of the URL spec either, and if it were, it wouldn't help me, because I'm using a different keyboard layout.

Long answer: Some solutions for "http://" have been discussed above.
If the user really wants to type these letters, Japanese input methods have a way to change to "half-width English letters" input, as they call it. This can be a menu item, a button on a floating window, a keyboard shortcut, or usually several of them. I hope you don't ask me how a Japanese user selects a menu item or pushes a button, or what exactly the shortcuts are.

General syntax characters such as "/", ".", ":", and so on are usually accessible even in Japanese input mode. In some configurations, they produce the "wide" variant of these characters, but that would have to be set differently for the URL input field. This is a matter of careful UI design and a "quality of implementation" issue, nothing for an internet standard to worry about.

Unusual names and such are printed in newspapers with kana following in parentheses; only magazines print them as furigana. Because most input methods are based on phonetic input, the furigana actually help to input the characters (as long as the input method dictionary has the name in question available). If not, there are other ways to enter the characters. In many cases, the constituent characters of the name are very basic and familiar characters, so inputting them one by one is very easy. The names Igarashi or Hasegawa may serve as examples. If that also fails, one usually has other ways to look up and input a character, for example by radical or by actual code number. There are/were at least 25 different versions on the market of what is called a "Wordprocessor Dictionary", which lets you find a character with traditional lookup methods and gives its JIS code. I hope we don't have to go into the details of traditional character dictionary lookup to complete our work on URL internationalization :-).

Anyway, people interested in users being able to easily input the URLs they have published would most probably avoid such difficult cases wherever possible.
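
To make the point about "wide" variants of syntax characters concrete, here is a minimal sketch of what a URL input field might do with them. It is only an illustration under stated assumptions: the function name narrow_syntax_chars and the sample input are hypothetical, and nothing in the proposal prescribes this approach. It relies on the fact that the Unicode full-width forms U+FF01 to U+FF5E mirror ASCII U+0021 to U+007E at a fixed offset.

```python
# Full-width forms U+FF01..U+FF5E mirror ASCII U+0021..U+007E at a fixed
# offset of 0xFEE0, so a lookup table is enough to fold them back.
FULLWIDTH_TO_ASCII = {cp: cp - 0xFEE0 for cp in range(0xFF01, 0xFF5F)}


def narrow_syntax_chars(typed: str) -> str:
    """Replace full-width ASCII variants with their ordinary ASCII forms.

    Hypothetical helper: kanji and kana are left untouched, only the
    "wide" letters, digits and punctuation are folded.
    """
    return typed.translate(FULLWIDTH_TO_ASCII)


# "http://" typed in wide form, written with escapes to make the code
# points explicit: h, t, t, p, colon, slash, slash.
wide_prefix = "\uff48\uff54\uff54\uff50\uff1a\uff0f\uff0f"
print(narrow_syntax_chars(wide_prefix))  # → http://
```

Whether a browser folds these characters itself or switches the input method into half-width mode for the URL field is exactly the kind of implementation choice that stays outside the standard.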
Conclusion: We are discussing URL internationalization here. URLs lack a consistent and relevant mechanism for encoding beyond-ASCII characters, but there is a proposal to remedy this, which should be adopted. In the process of adopting this proposal, we can safely assume that the Japanese have done whatever they feel is necessary and possible to handle their writing on the computer, and that they know how to do this. Even if some aspects of it might look extremely strange, clumsy, or unnecessary to us, it's not our job to judge this, or to try to convert them to using only ASCII and English. If they had seen an advantage in doing so, they would have done so 100 years ago or 50 years ago. We need to solve the URL problem, not the Japanese input problem!

> > The browser then would convert the Japanese characters into UTF-8
> > and (add %HH encoding) and pass the URL to the resolver machinery,
> > where the host part would be resolved with DNS, and then the machine
> > at the corresponding IP number would be contacted with HTTP.
>
> This discussion applies only to HTTP URLs, though. You're
> proposing that the recommendation be put into place for
> all existing URL schemes and new versions of them, too.

Yes, of course. Let's have a look at them (as of RFC 1738):

ftp: Is transitioning towards UTF-8 all by itself. No problem here.

mailto: Does not allow beyond-ASCII characters. If it ever comes to going i18n, it will be happy to know that it can rely on beyond-ASCII characters being safely encoded in URLs.

telnet: Nothing much to be internationalized except for the domain name.

file: Haven't seen that in a newspaper or magazine, for obvious reasons. It is local, and so the conversion to and from UTF-8 can very well be short-circuited. If you want to include a comment to that effect, nobody will mind. That's what the "recommendation" is for (among other things).

gopher: I am not very familiar with this.
From RFC 1738, it becomes apparent that Gopher+ allows for the equivalent of "Accept-Language:", but it is not clear how beyond-ASCII characters get encoded. From reference  in RFC 1738, it becomes evident that gopher+ is about as ignorant of the issues of internationalization as some other protocols; it just says that "undisplayable characters should be avoided, but if necessary, ISO-Latin1 should be used" :-(.

news: Newsgroup names and message IDs are currently ASCII only. So the same comments as for "mailto" apply.

nntp: Same as for "news:".

wais: Wais is based on Z39.50, which as far as I know has a background in libraries and the like, which in turn have a strong track record of solving character encoding problems upfront. Reference  in RFC 1738 is not available anymore. If you know where the currently relevant spec is, please give me a hint.

prospero: The protocol spec says things such as that HSOnames (host specific object names) are defined to be ASCII only, but that <hsoname-type> exists because on some file systems, filenames might not be ASCII only.

The conclusion is: some protocols don't currently need i18n of URLs, but UTF-8 will come in handy once they do. Some protocols are already going the right way. Some protocols are badly designed (solving things for ASCII and ignoring the rest), and it may be difficult to improve them. If a protocol has some explicit reason why it can't use UTF-8, or why it can't transition the current URLs, then that's its own problem of bad design.

> > That
> > machine would of course have been set up so that the correct page
> > is returned.
>
> How, please, is the machine set up? What has to be done at
> the server & system administration level?

Well, for the domain name, that depends on the solution that will be taken for that problem. For the path/file component in the HTTP case, some solutions have already been shown, and others have been discussed in detail. I don't think that I need to rehash them here.
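
For readers who want the conversion quoted above (characters to UTF-8, then %HH) spelled out for the path component, here is a minimal sketch. It is an illustration only: the function names and the sample path are hypothetical, and Python's urllib.parse is used as a convenient stand-in rather than anything the proposal prescribes. The domain name part is deliberately left out, since that depends on the separate DNS question.

```python
from urllib.parse import quote, unquote


def to_wire_path(path: str) -> str:
    """Client side: encode the path as UTF-8, then escape each
    non-ASCII byte as %HH. The "/" separators are kept as they are."""
    return quote(path, safe="/", encoding="utf-8")


def from_wire_path(wire: str) -> str:
    """Server side: undo the %HH escapes and read the bytes as UTF-8.
    errors="strict" makes a non-UTF-8 request fail loudly instead of
    being silently replaced."""
    return unquote(wire, encoding="utf-8", errors="strict")


# Hypothetical path consisting of two kanji (U+98DF, U+54C1).
path = "/\u98df\u54c1"
wire = to_wire_path(path)
print(wire)  # → /%E9%A3%9F%E5%93%81
assert from_wire_path(wire) == path
```

The escaped form is pure ASCII, so it passes through existing URL machinery unchanged; only the two ends need to agree that the bytes are UTF-8.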
> What's the transition
> strategy for a server that wants to serve current clients
> as well as these new browsers that can deal with the proposal
> you're promoting?

Well, the current clients fall into two categories. The first is the kind that doesn't handle kanji at all. Then we are out of luck, but that may not be too bad, as we couldn't view the retrieved document anyway. The second category will try to send the characters as some natively encoded URL. For the case above (sparsely populated namespace), heuristically guided trial-and-error on the server side will cover even cases where there is more than one frequently used native encoding.

But maybe this is too much work. Then for a transitional period, we can just publish two URLs, one with full kanji and the other with an ASCII alternative or with the same thing in %HH. Or some third party can offer a service for old browsers (e.g. a form where you can type in the kanji as you see them, with the traditional method (Roy's 2c+tricks), do the conversion to UTF-8, and then do a redirect). The web is extremely flexible; one just has to be ready to use this flexibility before claiming that things "won't work".

Anyway, I would definitely advise people to wait a little bit (one browser generation or so) before they start to put out URLs with native characters.

> > I hope this explanation is detailed enough. If you don't understand
> > some part of it, please tell us.
>
> As you see, it was inadequate for the purposes of
> being a stand-in for 'running code': there are
> a number of unresolved design issues in your plan,
> those design issues must be resolved before interoperable
> implementations can be deployed, and I'm uncertain
> as to whether the results, when taken in total,
> actually solve the problem you set out to solve,
> or even improve the situation significantly.
> And given the difficult transition strategy and lack
> of interoperability with currently deployed systems,
> I doubt that a proposal will actually be adopted
> unless that's so.

I am very sure that the general design is well developed, and that the web is flexible enough to have solutions available for the transition.

Regards, Martin.