Message-Id: <9508171323.AA21577@mocha.bunyip.com>
Subject: Re: Globalizing URIs
To: email@example.com (Keith Moore)
Date: Thu, 17 Aug 1995 15:22:18 +0200 (MET DST)
Cc: firstname.lastname@example.org, email@example.com, firstname.lastname@example.org
In-Reply-To: <199508162232.SAA25273@wilma.cs.utk.edu> from "Keith Moore" at Aug 16, 95 06:32:42 pm
From: Martin J Duerst <email@example.com>

>That's fine. HTTP servers, at least, are fairly free to work out
>whatever mapping they wish between URLs and filenames. (It's less
>clear about whether FTP servers can do this).

I guess that as far as most of my proposals are concerned, this is
not a problem with FTP, either. FTP servers are not forbidden to
serve an additional file telling the client how to reasonably
interpret the filenames they serve. Also, I think the FTP protocol
doesn't specify that exactly the same binary representation has to
be used for filenames locally as over the protocol connection.
And if we create a new scheme such as HFTP, then we have even
fewer restrictions.

>> I do not assume that anybody writing an English document is giving
>> it a Japanese document name (and thus, for some protocols, a
>> HRI containing some Japanese). In some posts, I assumed this
>> implicitly, and in other I have said it explicitly. And I am going
>> a little further into details here.
>> I assume that anybody in Japan writing an English document for
>> a worldwide public will, with just a little bit of common sense,
>> and at least on his/her second try if not on the first, understand
>> that that document should have a HRI that can be typed by somebody
>> that understands only English.
>
>If you want an HRI that is only available to others in the same
>country, this is fine. I'm thinking in terms of worldwide
>interoperability.

When I said "Japan" above, this was just an example; Japanese HRIs
are in no way restricted to Japan, although they are restricted to
Japanese speaking/reading people, which makes a lot of sense if they
refer to documents that only these people are able to understand.

>My understanding is that there is a bit more
>uniformity in Japan about which character sets are in use, than there
>are in other parts of the world.

In this respect, I would say that Japan is just about an average
case. What is important is not so much how many variants there are
around, but that there are different variants (which clearly applies
to Japan, as PCs and workstations use different encodings, which
both differ from the encoding used in email and such), and that this
will prevent us from making uniformity assumptions that are too
optimistic.

>But just taking Japanese as an
>example, what happens if a Japanese-fluent person in the US wants to
>read a Japanese document that has a Japanese HRL? He can read
>Japanese but perhaps doesn't have a Japanese keyboard. How is he
>going to type in the HRL?

A "Japanese keyboard" is nowadays a software issue. How easy it is
to install additional software for reading and entering foreign
languages differs from system to system, but it is already quite
easy and steadily improving. It is still the case that additional
software has to be bought or installed, and this is easier on e.g.
a Mac than on a Unix system, but this just reflects the general
difference in ease of use between these systems.
On a Mac, you can buy the JLK (Japanese Language Kit; there is also
a Chinese one and probably a Korean one, and for other languages you
don't even need that much) worldwide, and installation takes no more
than an hour (mainly because of the many floppies with the large
fonts).
For a Unix system with X11, the MIT distribution already contains
several items necessary for Japanese, and their installation is no
problem if you have a friendly system administrator (who does not
need to speak Japanese).

>It's my understanding that there are many countries in which there is
>little or no uniformity from one site to another (or even sometimes
>within a particular site) as to what character sets are used. There
>are also groups of countries that share a language, but have different
>conventions for what character set to use within that country.

The proposals I have made can very well take care of non-uniformity.
Some do assume uniformity for a single site, but others don't.
And as I have said, Japan in this respect is a very adequate example.
Also, I can tell you that Japan is not the only country I am familiar
with (although I have to say it is the one, besides western Europe,
I am most familiar with). I have implemented software for Korean
input, for Arabic display, and for conversions for many other places.

>> There are vulnerabilities in all character sets (e.g. the ASCII/EBCDIC
>> problem for present URLs), but the vulnerability may indeed be said
>> to be lower for ASCII than for things that go beyond.
>> Nevertheless, these vulnerabilities are now very well understood,
>> and satisfactory solutions are available and installed. With respect
>> to ISO 8859/1, a HTML web document will display Western European
>> accents correctly on hardware such as Mac, PC, and NeXT, even though
>> they use different native encodings of accented characters.
>
>Yes, they'll display the same, but does that mean that the users know
>how to type them in?

With the increasing distribution of the web, there might indeed
appear users who use the computer more or less like interactive TV,
just clicking with the mouse on a button, and not knowing or using
a keyboard anymore. But this is a problem that may as well appear in
the US as elsewhere.

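To make the point about "different native encodings of accented
characters" concrete, here is a minimal sketch. It assumes Python's
"latin-1", "mac_roman" and "cp437" codecs as stand-ins for ISO 8859/1,
the Mac, and a PC code page; the thread itself names no particular
code page, so that choice is an illustration only.

```python
# Illustrative only: the codec names below are Python's and are assumed
# stand-ins for ISO 8859/1, a Mac charset, and a PC code page.
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

encodings = ("latin-1", "mac_roman", "cp437")

# The same character has a different byte value in each native encoding.
native = {name: ch.encode(name) for name in encodings}

for name, raw in native.items():
    print(f"{name:10} 0x{raw.hex()}")

# Decoding each byte with its own charset gives back the same character,
# which is why the text can display identically on every platform.
decoded = {raw.decode(name) for name, raw in native.items()}
```

The bytes differ, but once each platform interprets its own bytes with
its own table, the character seen by the user is the same.
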
>What happens if you copy a section of text from
>the web browser (which understands 8859/1) to another application that
>understands a Mac charset or a PC code page?

I just had a little try with Netscape on my Mac. On cut/copy/paste,
it does exactly what the average user would expect, which you seem
to have difficulty believing is possible. And this applies not only
to the translation from ISO 8859-1 to Mac-Roman, but also to the
translation from the various Japanese encodings to the one used on
the Mac.

>> Also, in Japan, News and Email is working among a wide variety
>> of platforms although on various platforms, 3 different encodings
>> of the same character repertoire are in use. Apart from issues
>> such as those we have with '1', 'I', and 'l', which Japanese
>> are aware of as well as we are aware of them, there are no more
>> problems today having a Japanese text entered into a system,
>> transmitted to another, copied to paper from the screen by hand
>> or printed out, and entered again. Otherwise, Japanese information
>> processing would indeed by in a very bad state. The same applies
>> for other areas.
>
>I'm glad to hear that things are going so well in Japan, but I'm
>told that things are not so nice in other areas.

Different areas are in different stages of developing their
information infrastructure. And I don't demand that we design a
scheme so that we can have HRIs for a script for which email
conventions are not yet reasonably established. It is very clear
that having native email capabilities, and otherwise the capability
of exchanging document contents reliably, is more important than
document names and identifications.
But what we should be working on is a proposal so that once
conventions for document contents are established (e.g. by MIME
types being defined, or by defining equivalences to the global
ISO 10646, depending on the proposal), this can be extended without
additional work to resource identifiers.

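The translation on cut/copy/paste described above can be sketched as
decoding from the source charset and re-encoding into the target one.
This is only a sketch under assumptions: the helper name transcode is
hypothetical, and Shift_JIS, EUC-JP and ISO-2022-JP are assumed to be
the three Japanese encodings meant; the message does not name them.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Re-encode bytes from charset src to charset dst via Unicode.

    Hypothetical helper for illustration; not taken from any browser.
    """
    return data.decode(src).encode(dst)


# ISO 8859-1 text pasted into an application that expects Mac-Roman.
mac = transcode("caf\u00e9".encode("latin-1"), "latin-1", "mac_roman")

# One Japanese string passed through three encodings (assumed to be the
# ones alluded to: PC, workstation, and email conventions).
text = "\u65e5\u672c\u8a9e"  # the word for "Japanese language"
sjis = text.encode("shift_jis")
euc = transcode(sjis, "shift_jis", "euc_jp")
jis = transcode(euc, "euc_jp", "iso2022_jp")

# Converting back to the first encoding recovers the original bytes.
back = transcode(jis, "iso2022_jp", "shift_jis")
```

Because all three encodings cover the same character repertoire, the
round trip loses nothing; that is the property transcribability
depends on.
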
>I agree with that statement. It's not sufficient to simply say "this
>must be solved by the upper layer". We must BUILD the upper layers.
>On the other hand, it will be up to Japanese speakers to build upper
>layers (search services that map titles to URLs or URNs) that
>understand Japanese. I don't see any reason why these can't be built
>now to link Japanese titles to URLs, and modified later to link
>Japanese titles to URNs.

I think it is safe to say that such upper layers shouldn't be
designed and implemented separately for each language, script, or
country, but that as much as possible a general solution should be
sought. I don't request that the US develop everything for
everybody; Japanese and many others around the world can very well
contribute their share. But as I have said before, I think it is
unfair to pretend that one has a global and abstract solution when
in practice it favors some groups over others, and then just tell
them: well, if you want the same functionality, why not do it
yourselves.

>> The many local encodings besides Unicode/ISO 10646 will
>> most probably be a vanishing breed in the future.
>
>The jury is still out on Unicode. We've got at least one of almost
>every major computer system here, but I don't know of a single one of
>them that supports Unicode in its current release. (pre-released
>products don't count)

Please have a look at http://www.stonehand.com/unicode/products.html,
and that site in general. Windows NT, Penpoint, and the Newton
definitely work with Unicode. And there is a Unicode "locale" for
IBM AIX; you only have to take the trouble to get and install it.
And the jury on Unicode depends as much on companies implementing it
(and I can assure you that most major players definitely are working
on that, although their plans for deployment and distribution may
differ) as it depends on other communities to access it and propose
its use, in adequate forms and places, when they see that it indeed
can simplify things and solve problems.

>I've mentioned the specific problems I see, and I think they're pretty
>serious.

I have given more details on the problems you see. If you still
have some questions, please follow up.

>If you assume that they're transcribable, or you limit the
>domain of applicability of HRLs to environments where they are
>transcribable, you might be able to address the rest of the concerns.

Many thanks for this (at least partially) positive statement.
The fact that HRIs come after mail and such pretty much automatically
assures transcribability (in those environments where it makes sense
to have a look at the documents, anyway).

>Okay, but be careful about assuming that things are as nice everywhere
>as they are in Japan, and make sure you think about speakers of a
>particular language living outside of the nice environment for that
>language -- sometimes there are large numbers of these.

I have addressed these concerns above. I am myself in that situation
(using Japanese outside Japan) and know how it feels.
The one point where Japan (and the Japanese outside Japan) may be
special is that it is a very attractive market financially, but that
is just one more argument for why we should work for general
solutions (and the proposals I have made in no way privilege Japan),
and not wait until somebody in some country comes up with a solution
that may work there some way or another, but will lead to even more
clumsy solutions if it is transferred to other places. That has
happened many times before, unfortunately.

>> First, is it lengthy, which is especially unconvenient for business cards
>> and such.
>> Most of what I have proposed is considerably shorter.
>
>This can be dealt with.
>Format them like so:
>
>http://dom.ain/[document-id]/human-readable-string
>Being realistic, this won't help people to remember a particular URL.

Yes. It gives just half of the benefits that a current URL gives
an English reader or writer. If possible, I would like to do better
than that.

>> Second, the "[unicode]" prefix is not exactly needed. If the
>> "some-random-string" is in the same encoding as the document as a
>> whole, then there is no need to explain the encoding.
>
>Maybe, maybe not. If you want the things to survive being emailed
>around, I'd still recommend that you encode things in ASCII and
>include a charset name (similar to RFC 1522).

If things are mailed around as part of a document, the encoding
(MIME charset parameter) is of course included (but it is RFC 1521
that applies in this case), and sometimes it is implied between
people who know of each other what encodings they use. Whether the
results can be called ASCII is a different question; they may be
7-bit or 8-bit, or whatever, and may further be subjected to a
transfer encoding such as BASE64.

Regards,	Martin.