- From: Gavin Nicol <gtn@ebt.com>
- Date: Wed, 7 Dec 1994 22:09:37 -0500
- To: http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
- Cc: html-wg@oclc.org
>Gavin Nicol (gtn@ebt.com) recently proposed using Unicode as a kind of
>character set "bus". Servers and clients would deal with their own, local

Yes, but a very important part of my proposal is the auto-tagging
performed in the conversion process. Simply using Unicode, with no
disambiguation tags, will cause information loss. Beyond addressing the
Han Unification problem, my proposal could probably also be used to
extend coverage to the languages that Unicode does not cover completely.

I also think it is very important to maintain the presentational
differences of local encoding systems. For example, on certain Japanese
platforms, zenkaku and hankaku characters are used extensively, and in
Japanese typography their kinsoku (line-breaking) rules differ; in
Unicode, I believe they share the same code positions. As you note
toward the end of your document, such "hints" are not generally required
to get legible renderings, but they are *vital* for getting top-quality
output.

>1. Since there is complete round-trip mapping between Unicode and the
>character sets in use today.

I understand that it does not cover Thai completely, and that it covers
only 12 of the 15 Indian languages. I may be wrong though.

>The only disadvantage I see is how to enable such use of Unicode while
>still supporting existing clients who don't understand it (the issue raised
>by Sandra O'Donnell of OSF). In the absence of a character set negotiation
>protocol in HTTP, it's not clear how to do this. It is certainly not the

Yes, and as I pointed out, "text/html; charset=xxxxx" is not sufficient.
We need a proper field in HTTP.

>clients. It has the disadvantage that Asian characters take three bytes
>each, a 50% increase in overhead over the encoding methods used now.

Are all Asian encodings 2 bytes? I believe Thai is 4....

>Straight Unicode results in no overhead for Asian users, but users of
>8859-x will pay double overhead. Since it is very easy to mechanically
>convert between UTF-8 and Unicode, if there were a character set
>negotiation protocol, you could select the appropriate encoding based on
>client preferences.

Yes, and as another optimisation, two systems with the same local
character set and encoding could simply use that (SJIS->SJIS). In the
case of SGML/HTML this buys you little, though: in the parser, it is
simpler to have everything be 16 bits wide. That is, you have:

   Machine #1          Machine #2          Parser
   SJIS--------------->SJIS--(conv)------->Unicode/Wide Characters
   SJIS----(conv)----->UTF---(conv)------->Unicode/Wide Characters
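To make the auto-tagging concrete, here is a purely invented
illustration (neither this tag syntax nor the mechanism exists anywhere
yet): a converter that knows its input is SJIS can record the source
language as it maps codes into Unicode, so information that Han
Unification would otherwise destroy remains recoverable downstream:

   SJIS bytes in:  93 FA                  (the kanji for "day"/"sun")
   Unicode out:    <LANG JA>U+65E5</LANG> (hypothetical tag emitted by
                                           the converter)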
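On the byte-count arithmetic above, a minimal sketch in C of the UTF-8
encoding rule for 16-bit Unicode values (my own illustration, not drawn
from any of the documents under discussion) shows where the figures
come from:

   /* Encode one Unicode value (assumed <= 0xFFFF) into buf, which
      must hold at least 3 bytes.  Returns the number of bytes used. */
   int
   utf8_encode(unsigned int c, unsigned char *buf)
   {
       if (c < 0x80) {                /* US-ASCII: 1 byte, no overhead  */
           buf[0] = c;
           return 1;
       } else if (c < 0x800) {        /* e.g. 8859-x accented letters:  */
           buf[0] = 0xC0 | (c >> 6);  /* 2 bytes, double the local size */
           buf[1] = 0x80 | (c & 0x3F);
           return 2;
       } else {                       /* Han, kana, hangul, etc.:       */
           buf[0] = 0xE0 | (c >> 12); /* 3 bytes against 2 in SJIS or   */
           buf[1] = 0x80 | ((c >> 6) & 0x3F); /* EUC: the 50% increase  */
           buf[2] = 0x80 | (c & 0x3F);
           return 3;
       }
   }

So "A" (U+0041) stays at 1 byte, an 8859-1 letter such as U+00E9 grows
from 1 byte to 2 (the doubling for 8859-x users), and a kanji such as
U+65E5 takes 3 bytes against 2 in the local encodings.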
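As for what negotiation might look like on the wire, something along
these lines, where the field name and charset labels are illustrative
only (nothing of the sort has been agreed): the client lists the
encodings it accepts, and the server either converts or, in the
SJIS->SJIS case, sends the document untouched, labelling the result
either way:

   GET /docs/index.html HTTP/1.0
   Accept-Charset: unicode-1-1-utf-8, x-sjis, iso-8859-1

   HTTP/1.0 200 OK
   Content-Type: text/html; charset=x-sjis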
>Another issue with using straight Unicode is that the MIME spec (at least,
>the new draft version) specifies that all subtypes of the "text" content
>type must use CRLF (0x0D 0x0A) line feed conventions, which rules out any
>character set that does not use ASCII as a base, and certainly rules out a
>16 bit character set like Unicode.

So this suggests UTF-8 or UTF-7, or a change in the MIME standard.

>Also note that this only occurs when the recipient has a multilingual,
>multi-character set operating system with the right fonts installed. This
>capability is still a rarity. In all other cases, the language tagging
>won't make any difference in the end result.

Planning for the future is never a bad thing. The language/presentational
hint auto-tagging can be ignored by current systems (and it *could* be
part of format negotiation), but as Unicode-aware systems become more
popular, the hints will be useful.

>In conclusion, I'm pretty familiar with Unicode but not very familiar
>with HTTP and HTML. I am hoping to work together with the people on
>this list to enable use of Unicode on WWW.

Nice to have you here! I am not a Unicode expert....

Anyway, as I said in a recent posting and again above: "text/html;
charset=xxx" alone will not suffice for all the upcoming document
formats, so we need a specific field in HTTP allocated for charset
negotiation, and I would like to see UTF-7 or UTF-8 strongly recommended
as the preferred encoding for documents for which the MIME and HTML
defaults (US-ASCII and 8859-1) do not suffice. I will do the necessary
editorial work if need be.

Of course, for quick acceptance, we also need *free* libraries (in the
GNU/BSD sense) for Unicode conversion and Unicode font mapping. I would
be *very* interested in hearing about people's experiences implementing
Unicode-based systems. My main experience is with a windowing system I
am writing for an experimental OS (shades of Plan 9?).

---------------
Gavin Nicol
gtn@ebt.com
*Not* speaking for EBT.
Received on Wednesday, 7 December 1994 19:08:30 UTC