- From: Martin J. Dürst <mduerst@ifi.unizh.ch>
- Date: Fri, 19 Sep 1997 12:14:45 +0200 (MET DST)
- To: Chris Weider <cweider@microsoft.com>
- Cc: 'Ned Freed' <Ned.Freed@INNOSOFT.COM>, ietf-charsets@INNOSOFT.COM
On Mon, 1 Sep 1997, Chris Weider wrote:

> I think Ned is completely correct here. The workshop report thought long
> and hard about requiring language tagging and mandatory UTF-8 and
> realized that this is the only way to make things work with the stupid
> machines we have now :^)

Chris - In general, I very much agree. Pushing UTF-8, and strongly telling people to include language tagging in their protocols, is very much what is needed. I definitely do not want to argue about this.

However, both for UTF-8 and for language tags, as well as for other internationalization issues, I think it is important not to preclude further developments, and not to express requirements in an absolute and unchangeable way that makes protocol developers and implementers think that if only they do X, all their internationalization problems go away.

Specifically, I agree that UTF-8 in most cases is the best solution, especially in text-based protocols, which are the majority of the current application area protocols. But there are also binary protocols (the Internet printing protocol currently under discussion is a recent example), and for such protocols, there may be situations where 16-bit alignment is necessary in general, and so using UTF-16 is a neat choice.

In general, machines have moved from 8-bit to 16-bit to 32-bit to 64-bit architectures, and byte sizes have moved from 5 to 6 to 7 to 8 (with some cases of 9) bits. It may well be that in future we see hardware that works with 128-bit words, but which has to use shift-and-mask techniques for 8-bit bytes, because the smallest size it supports directly is 16-bit bytes. Not that I think that these things will appear very soon in great numbers, but in ten years, they could very well be around. If we can word our policy so that it doesn't read as ridiculously shortsighted if and when this happens, that would not be a bad thing, and wouldn't hurt at all.
Specifically, I also agree that language tags are a big help to current stupid machines. But if we put an absolute requirement for language tags into our policy, a requirement that in the extreme might say: "Every protocol has to be able to language tag all the characters it sends around, with potentially different tags for each character.", and we thereby give implementers the impression that that's all they have to do, and that text-to-speech conversion, machine translation, spelling and grammar checks, hyphenation, high-quality display, the subtle glyph distinctions that may be necessary for names, and so on, will work magically and perfectly, then we clearly create the wrong impression.

Ned Freed gave a very good example in his mail of a case where language information can be very helpful, namely the case where somebody receives email from a user and is not able to find out what language it is written in, and therefore to whom he should forward it for further processing. But for this case, automatic language detection seems to be extremely well suited. The idea has been around for years, but it is interesting to see that in the last few months and weeks (in particular at the recent Unicode conference in San Jose), I have heard from at least four companies that are seriously working on this issue, with good results, and with some of them adding new languages almost every week. In particular, Microsoft is doing such work, and Alis Technologies gave a floppy disk with a test program to selected users/clients. I guess that Ned could greatly benefit e.g. from the Alis program; I would be glad to organize the contact. Please note that this was on a floppy disk, covering something between 20 and 50 languages, so it's something that may well be a standard PC OS component in a few years.

Again, let me make it very clear that I am not at all against UTF-8 or language tagging; I guess I have enough of a track record to make that clear (for UTF-8, have a look e.g.
at the archives of the FTPEXT WG; for language tagging, see RFC 2070). But I think it is crucial that we word the policy in such a way that it lets protocol developers take their own requirements into account, and take care of the specific internationalization needs, and not just follow a few words blindly.

Regards, Martin.
Received on Friday, 19 September 1997 11:35:37 UTC