Re: A truly multilingual WWW

  Date: Sun, 25 Dec 1994 10:56:45 -0500
  From: Gavin Nicol <gtn@ebt.com>

   For example, if some Japanese text was
   sent to a person that cannot read Kanji, Hiragana, or Katakana, the
   browser could conceivably map the Japanese text into something like
   the following:

       Nihongo, tokuni Kanji, wa totemo muzukashii desu.

Unless you store bunsetsu boundaries and yomi (phonetic reading) along
with the original text, you can pretty much forget about automatic
transliteration of Japanese (or Chinese for that matter).

   3. Is Unicode the answer?

   In a word, YES! Though there are a number of issues that need to be
   resolved in order for it to be used effectively.

Unless one has a strong masochistic streak, Unicode (or ISO/IEC 10646)
will certainly be part of the answer.  However, Unicode wasn't designed
to solve every problem facing designers of multilingual systems.
Nevertheless, it makes a great foundation upon which to build such systems.
[Like HTML, Unicode's designers knew they couldn't solve every problem
if they wanted an implementable standard; the emphasis on a plain text
encoding which intentionally disregarded certain higher-level language
specific issues was an explicit design decision to aid in achieving
an implementable result.]

Folks should be aware that Unicode (as a profile of ISO/IEC 10646) is
essentially becoming a national standard in many countries: e.g.,
Japan expects to publish JIS X 0221 this coming year which will be
a national standard version of ISO/IEC 10646; China has already
published GB 13000, also a national standard version of 10646. I
wouldn't be surprised to see Korea do the same in the not too distant
future.

  Date: Fri, 30 Dec 1994 07:44:26 -0500
  From: Gavin Nicol <gtn@ebt.com>

  >>ISO 8879 also defines some methods for handling things like ISO-2022,
  >>but some encodings for languages such as Thai cannot be handled by
  >>SGML, even if the SGML declaration is altered (though, it is possible
  >>for the application to deal with this within, or before, the entity
  >>manager).

  >Just out of curiosity, what is so special about Thai?

  One encoding is very complicated: variable length, and no canonical
  order for certain bytes. In practise, various groups define a local
  "standard".

I'm afraid I have to disagree with Gavin's original statement.  There
is nothing difficult about employing Thai in SGML.  The encoding is
actually not complicated at all, is not variable length, and the fact
that there is no canonical order for certain combining characters is
irrelevant, since they have a fixed position on the base consonant
regardless of their coding order in a sequence of multiple combining
marks after a base consonant.  Further, there is a standard Thai encoding,
TIS 620, which is now commonly used. Keep in mind that the Unicode Thai
character block is based on TIS 620.
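
To make the relationship between TIS 620 and the Unicode Thai block
concrete, here is a minimal sketch. It assumes Python 3 and its standard
"tis_620" codec, neither of which is part of this discussion, and the
helper names are hypothetical. It checks that every assigned TIS 620
byte in the Thai range maps to the Unicode Thai block by one constant
offset:

```python
# Sketch only: assumes Python 3 and its built-in "tis_620" codec.
# Helper names are hypothetical, chosen for illustration.

# TIS 620 byte 0xA1 corresponds to U+0E01, so the offset is constant.
OFFSET = 0x0E00 - 0xA0


def tis620_to_code_point(byte_value: int) -> int:
    """Map one TIS 620 byte in the Thai range to its Unicode code point."""
    return byte_value + OFFSET


def check_offset_mapping() -> bool:
    """Compare the constant-offset rule against the codec's own table."""
    # Assigned Thai positions: 0xA1..0xDA and 0xDF..0xFB.
    assigned = list(range(0xA1, 0xDB)) + list(range(0xDF, 0xFC))
    for b in assigned:
        decoded = bytes([b]).decode("tis_620")
        if ord(decoded) != tis620_to_code_point(b):
            return False
    return True


if __name__ == "__main__":
    print(check_offset_mapping())
```

The unassigned positions (0xDB..0xDE and 0xFC..0xFF) are skipped because
the codec rejects them.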

Gavin, please note that a variable length encoding (in the
context you were speaking of, i.e., vis-a-vis ISO 2022) means that
a variable number of bytes are used to encode one coded character element.
In the case of the Thai character set TIS 620 (and Unicode), a fixed
number of bytes is used to encode each coded character element (1 byte
for TIS 620, and 2 bytes for a canonical UCS-2 Unicode character). The
fact that multiple coded character elements in TIS 620 (and Unicode)
can combine graphically to fill a single display cell does not mean
that either of these character sets is a "variable length encoding".
The units of processing for transmission of information are still fixed
length regardless of the number of display cells used to display character
data.
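
The distinction can be sketched in a few lines. This is an illustration
only, assuming Python 3 and its standard "tis_620" and "utf-16-be"
codecs; UTF-16 big-endian is used here as a stand-in for UCS-2, which is
equivalent for these characters. The function name and the sample
syllable are my own choices. The sample is a base consonant followed by
two combining marks, which together occupy one display cell:

```python
# Sketch only: assumes Python 3 and its built-in codecs.
# "utf-16-be" stands in for UCS-2 (identical for these BMP characters).

# KO KAI (base consonant) + SARA II (vowel above) + MAI EK (tone mark):
# three coded character elements that render in one display cell.
SAMPLE = "\u0e01\u0e35\u0e48"


def bytes_per_element(text: str, codec: str) -> list:
    """Return the encoded size in bytes of each coded character element."""
    return [len(ch.encode(codec)) for ch in text]


if __name__ == "__main__":
    # Every element has the same size within a given encoding.
    print(bytes_per_element(SAMPLE, "tis_620"))
    print(bytes_per_element(SAMPLE, "utf-16-be"))
```

Each element is the same size within a given encoding, however many of
them stack into one display cell.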

Glenn Adams

Received on Friday, 30 December 1994 06:29:47 UTC