- From: Glenn Adams <glenn@stonehand.com>
- Date: Fri, 30 Dec 94 09:27:03 -0500
- To: www@unicode.org
- Cc: http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com, Gavin Nicol <gtn@ebt.com>
    Date: Sun, 25 Dec 1994 10:56:45 -0500
    From: Gavin Nicol <gtn@ebt.com>

    For example, if some Japanese text was sent to a person that cannot
    read Kanji, Hiragana, or Katakana, the browser could conceivably map
    the Japanese text into something like the following:

      Nihongo, tokuni Kanji, wa totemo muzukashii desu.

Unless you store bunsetsu boundaries and yomi (phonetic reading) along
with the original text, you can pretty much forget about automatic
transliteration of Japanese (or Chinese for that matter).

    3. Is Unicode the answer?

    In a word, YES! Though there are a number of issues that need to be
    resolved in order for it to be used effectively.

Unless one has a strong masochistic streak, Unicode (or ISO/IEC 10646)
will certainly be part of the answer. However, Unicode wasn't designed
to solve every problem facing designers of multilingual systems.
Nevertheless, it makes a great foundation upon which to build such
systems. [Like HTML's designers, Unicode's designers knew they couldn't
solve every problem if they wanted an implementable standard; the
emphasis on a plain text encoding which intentionally disregarded
certain higher-level language specific issues was an explicit design
decision to aid in achieving an implementable result.]

Folks should be aware that Unicode (as a profile of ISO/IEC 10646) is
essentially becoming a national standard in many countries: e.g., Japan
expects to publish JIS X 0221 this coming year, which will be a national
standard version of ISO/IEC 10646; China has already published GB 13000,
also a national standard version of 10646. I wouldn't be surprised to
see Korea do the same in the not too distant future.

    Date: Fri, 30 Dec 1994 07:44:26 -0500
    From: Gavin Nicol <gtn@ebt.com>

    >>ISO 8879 also defines some methods for handling things like ISO-2022,
    >>but some encodings for languages such as Thai cannot be handled by
    >>SGML, even if the SGML declaration is altered (though, it is possible
    >>for the application to deal with this within, or before, the entity
    >>manager).

    >Just out of curiosity, what is so special about Thai?

    One encoding is very complicated: variable length, and no canonical
    order for certain bytes. In practise, various groups define a local
    "standard".

I'm afraid I have to disagree with Gavin's original statement. There is
nothing difficult about employing Thai in SGML. The encoding is actually
not complicated at all and is not variable length. The fact that there
is no canonical order for certain combining characters is irrelevant,
since each has a fixed position on the base consonant regardless of its
coding order in a sequence of multiple combining marks after that
consonant. Further, there is a standard Thai encoding, TIS 620, which is
now commonly used. Keep in mind that the Unicode Thai character block is
based on TIS 620.

Gavin, please note that a variable length encoding (in the context you
were speaking of, i.e., vis-a-vis ISO 2022) means that a variable number
of bytes is used to encode one coded character element. In the case of
the Thai character set TIS 620 (and Unicode), a fixed number of bytes is
used to encode each coded character element (1 byte for TIS 620, and 2
bytes for a canonical UCS-2 Unicode character). The fact that multiple
coded character elements in TIS 620 (and Unicode) can combine
graphically to fill a single display cell does not mean that either of
these character sets is a "variable length encoding". The units of
processing for transmission of information are still fixed length
regardless of the number of display cells used to display character
data.

Glenn Adams
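
The fixed-length point above can be checked with a small Python sketch.
It is an illustration only, under stated assumptions: it relies on
Python's built-in "tis-620" codec, and it uses "utf-16-be" as a stand-in
for canonical UCS-2, which matches only for characters in the Basic
Multilingual Plane. The sample string and the helper code_unit_widths
are hypothetical names chosen for this example, and the UTF-8 line is
included purely as a contrast case that the message itself does not
discuss.

```python
# Hypothetical sample: THAI CHARACTER KO KAI (base consonant), followed by
# SARA I (vowel sign) and MAI EK (tone mark). A renderer stacks all three
# coded characters into a single display cell.
sample = "\u0e01\u0e34\u0e48"


def code_unit_widths(text, codec):
    """Return the sorted distinct byte lengths used per character of text."""
    return sorted({len(ch.encode(codec)) for ch in text})


tis_bytes = sample.encode("tis-620")     # 1 byte per coded character
ucs2_bytes = sample.encode("utf-16-be")  # 2 bytes per coded character (BMP)

# Three coded characters, regardless of how many display cells they fill.
print(len(sample), len(tis_bytes), len(ucs2_bytes))

# Every character has the same width in TIS 620 and in UCS-2.
print(code_unit_widths(sample, "tis-620"))
print(code_unit_widths(sample, "utf-16-be"))

# Contrast (an assumption of this sketch): in UTF-8 the width depends on
# the character, which is what "variable length" means for an encoding.
print(code_unit_widths("A" + sample, "utf-8"))
```

The byte counts depend only on the number of coded characters, not on
how the combining marks are drawn over the base consonant.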
Received on Friday, 30 December 1994 06:29:47 UTC