- From: Martin J. Duerst <mduerst@ifi.unizh.ch>
- Date: Thu, 3 Apr 1997 16:53:33 +0200 (MET DST)
- To: Larry Masinter <masinter@parc.xerox.com>
- Cc: Edward Cherlin <cherlin@newbie.net>, uri@bunyip.com
On Wed, 2 Apr 1997, Larry Masinter wrote:

> In my personal judgement, there was significant controversy
> about adding to a Draft Standard document additional constraints
> that were not part of the Proposed Standard and are not
> implemented in at least two interoperable implementations.

In the current discussion, started by my original proposals in mid February, there was definitely no "significant controversy" about procedural matters such as those you mention above. If you think otherwise, please give the references to the mailing archive. As you can see below, there is no need for such a controversy. If you had brought this subject up earlier, I could have answered as below earlier.

Also, there are in no way any additional constraints. There are only recommendations. I have clearly shown that these don't affect existing (or even future) implementations in any major way. If you want to challenge this, please give the details.

In addition, the requirement of "two interoperable implementations" is rather easy to fulfill, too easy to actually even bother about it except for procedural reasons. For those that have been seriously involved in the discussion, this is quite clear, but I will explain it here in detail (don't blame me, please, if you think this is too detailed!).

Obviously, on the browser side, the only thing we need is the ability to input %HH-escaped UTF-8 URLs. There are dozens of browsers that allow this as of now!

On the server side, we need two sites with some resource names actually in UTF-8. I will provide two here, one http and one ftp, with one resource name each. As these two "implementations" are not personally independent, I hope somebody else can provide another one. I just provided one resource name each in UTF-8, which should be enough, but if you think that this works by chance rather than in general, please tell me what other kinds of names you would like (me) to provide.
This is the first URL:
---------------------

http://www.ifi.unizh.ch/mml/mduerst/%e3%83%ab%e3%83%93.html

which is actually a link to:

http://www.ifi.unizh.ch/mml/mduerst/ruby.jp.html

The part %e3%83%ab%e3%83%93 represents the two katakana characters for "ruby", U+30EB U+30D3, in UTF-8 (please check for yourself). [The content is an attempt at a translation of one of my earlier documents about ruby in HTML into Japanese, never finished :-(. It's not up to date anymore, but that's obviously irrelevant here.] I tested this with Netscape Gold 3.01 and with NCSA Mosaic 2.6, both on a Sun.

This is the second URL:
----------------------

ftp://ftp.ifi.unizh.ch/pub/multilingual/
%E6%9B%B8%E4%BD%93%E7%B5%84%E3%81%BF%E5%90%88%E3%82%8F%E3%81%9B.ps.Z

It is a link to

ftp://ftp.ifi.unizh.ch/pub/multilingual/FontComposition.ps.Z

The long %HH escapes are the Japanese characters for "shotaikumiawase", a translation of "font composition". Unicode codepoints available on request. [The file itself is a prepublication version of a paper written for the 1995 Unicode conference; I can recommend it for those that are interested in the subject.]

If you want to avoid typing in the whole long string, just go to the URL ftp://ftp.ifi.unizh.ch/pub/multilingual/ and click on the one name that is not ASCII. In Netscape, for example, the filename will show as some garbage, unless you have a 4.0 version and set "document encoding" to UTF-8. I guess the same works for Internet Explorer 4.0. Why does anybody claim that we have no interworking applications, when things already work even better than we might like (namely that they strictly use %HH)? In Netscape 3.0, you can verify the URL as above with "view source".

Some people may wonder how I created these file/resource names. Well, I used an editor offering multilingual input facilities and the ability to store a file as UTF-8 to write a shell script with the needed "ln" commands.
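For those who would like to "check for yourself" without a browser, the two escaped names above can be verified mechanically. What follows is only a minimal sketch in Python using the standard urllib.parse module; the helper names escape_utf8 and code_points are made up for this illustration and are not part of any of the software mentioned in this message.

```python
from urllib.parse import quote, unquote

def escape_utf8(name):
    """Return the %HH-escaped UTF-8 form of a resource name (hypothetical helper)."""
    # safe="" leaves only letters, digits and a few marks such as "." and "-" unescaped
    return quote(name, safe="", encoding="utf-8")

def code_points(escaped):
    """Decode %HH-escaped UTF-8 and list the resulting Unicode code points."""
    # errors="strict" makes invalid UTF-8 raise instead of being silently replaced
    text = unquote(escaped, encoding="utf-8", errors="strict")
    return ["U+%04X" % ord(ch) for ch in text]

ruby = "\u30eb\u30d3"  # the two katakana characters for "ruby", U+30EB U+30D3

# quote() emits upper-case hex digits; lower() only matches the spelling used above
print(escape_utf8(ruby + ".html").lower())   # %e3%83%ab%e3%83%93.html
print(code_points("%e3%83%ab%e3%83%93"))     # ['U+30EB', 'U+30D3']

# Code points of the name in the second (ftp) URL
print(code_points("%E6%9B%B8%E4%BD%93%E7%B5%84%E3%81%BF%E5%90%88%E3%82%8F%E3%81%9B"))
```

Upper-case and lower-case hex digits in %HH escapes denote the same octets, so the two URLs above differ only in spelling convention, not in encoding.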
One would not even need an editor capable of UTF-8; a filter converting to UTF-8 would also do the job. The whole thing gets a little bit more difficult if your original filenames are not pure ASCII; you would have to have two different encodings in one and the same shell script file.

As an aside, I found a very nice feature in Apache 1.2 that will probably make it very easy to set up a server that serves UTF-8 names for a site with overall or per-directory fixed legacy encodings of file/resource names. It is the module "rewrite". This can be configured to call a program for rewrites. If you choose tcs as the program and configure it so that it converts from UTF-8 to e.g. Latin-1, it will do a nice job. Tcs has to be changed to work non-buffered, but that's very easy. If defined on a per-site basis, the tcs program is started only once.

This setup will not yet address the upgrade path; for this, a general "retry" facility is needed, i.e. a way to specify "try with this name, if you don't find the resource, retry with another name". Such a facility may exist somewhere in Apache, or could probably be built in with many other uses. The alternative is to use links, as already explained.

The whole story about "reference implementation" shows that we are dealing here not with adding something new to an existing protocol, but with *recommending* clear semantics in an area that was up to now blatantly ignored and needs some fix, the sooner the better.

> As I said, I edited the document to contain those changes that
> I thought were non-controversial.

I hope it is fair to say that there was rough consensus on recommending UTF-8 (with %HH) for character encoding. You have acknowledged this consensus for the process draft, and while the discussion on both lists did not proceed exactly in parallel, there is really nothing much that would in the end distinguish the discussion on both lists.
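Coming back briefly to the rewrite-program aside above: the following is a rough sketch of what a non-buffered conversion filter could look like, written in Python as a stand-in for a modified tcs. It is an illustration under assumptions, not a tested Apache setup. In particular, it assumes that the server hands the program one already-unescaped name per line and reads one answer line back, and that the string "NULL" signals "no mapping"; both should be checked against the rewrite module's documentation. The function names are hypothetical.

```python
import sys

def utf8_to_latin1(name):
    """Re-encode a UTF-8 byte string as Latin-1; return None if that is impossible."""
    try:
        return name.decode("utf-8").encode("latin-1")
    except (UnicodeDecodeError, UnicodeEncodeError):
        # Either the input is not valid UTF-8, or it has characters outside Latin-1
        return None

def serve(stdin=None, stdout=None):
    """Answer one lookup per input line, flushing after every reply."""
    stdin = sys.stdin.buffer if stdin is None else stdin
    stdout = sys.stdout.buffer if stdout is None else stdout
    for line in stdin:
        converted = utf8_to_latin1(line.rstrip(b"\r\n"))
        # "NULL" as the no-mapping reply is an assumption about the server's convention
        stdout.write((converted if converted is not None else b"NULL") + b"\n")
        stdout.flush()  # non-buffered operation: the server waits for each answer
```

To run this as a long-lived filter, one would call serve() at the end of the script. The explicit flush after each line is the "non-buffered" change mentioned above; without it, the server would wait indefinitely for an answer that is still sitting in the program's output buffer.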
> > URL creation mechanisms that generate the URL from a source which
> > is not restricted to a single character->octet encoding are
> > encouraged, but not required, to transition resource names toward
> > using UTF-8 exclusively.
> > URL creation mechanisms that generate the URL from a source which
> > is restricted to a single character->octet encoding should use UTF-8
> > exclusively. If the source encoding is not UTF-8, then a mapping
> > between the source encoding and UTF-8 should be used.
> >
> This is an additional requirement that does not correspond,
> as far as I can tell, to any kind of "implementation experience".
> I know of no URL creation mechanisms that actually do this.

See above. "implementation experience" is obviously trivial.

> Further, I think that the complaints that there is a certain
> amount of ambiguity in practice over exactly how one goes
> about doing this are legitimate, and that not only is there
> no "running code", there is not "rough consensus".

The code that we have is obviously very much sufficient. Rough consensus is there; the word "rough", as I have seen it interpreted in IETF working groups, takes care of the case of a single individual raising the same far-fetched and unrelated complaints over and over, in a rather short and cryptic manner, even after they have been addressed in detail.

I don't know exactly what you intend to refer to with "certain ambiguity". If you mean ambiguities arising from URLs such as http://0oO0Il1.com/IlIl10oO.html, this is obviously a problem that is ignored for ASCII, because of the correct assumption that URL generators learn to avoid such cases by trial and error if not otherwise. I do not think that at the present time, things beyond ASCII need to be specified more explicitly than ASCII itself, in this respect.

I very well acknowledge that for some cases, some more detailed specifications are highly desirable.
I have talked with many people about the issues involved, and I have repeatedly volunteered to work on the necessary documents. However, I do not see any sense in writing such documents in the void, without a clear commitment for a good solution in the central document. Actually, I would like nothing more than finishing the current controversy on the base issue and having some time to work on more documentation. I therefore sincerely hope that we can stop useless "procedural concerns" as above as quickly as possible.

[Also, as long as we are only concerned with %HH (this is the only thing that should go into the current draft; I agree that the transition to using "native" URLs is something more experimental, and that the necessary documents for it will have to be written), the potential ambiguities actually don't arise :-].

> > I'm surprised, too. I thought we had this worked out, and that
> > there was no significant objection or controversy.
>
> I hope that the domain name from which you post ("newbie.net")
> isn't some kind of joke. If you insist, I will forward you
> the three hundred or so email messages discussing the controversy
> around the proposed additions.

I guess there is no need to do that. Edward is very well aware of the discussion that went on. Some of the best contributions to it are from him. He probably followed the discussion more closely than many others. Threatening him with mail flooding is beyond what I want to comment on.

Regards,
Martin.
Received on Thursday, 3 April 1997 09:55:24 UTC