Re: character sets - a summary and a proposal
Michael Sperberg-McQueen writes:
>Is this compromise workable?
The short answer: Yes
The long answer:
Thanks for the summary of the discussion so far,
although I would characterize my position more as "One Hundred Flowers
are going to bloom, how are we going to deal with it.", and you
propose a good solution.
Once concern I still have is the role of UTF-8 as both the
normalized form, and possibly the only form for network transmission.
UTF-8 is biased toward the efficient encoding of some languages at
the expense of others. Although I understand the pragmatic reasons
that one would want to specify UTF-8 as the standard, I am still
uncomfortable with this bias. Using UTF-16 is one solution to this
problem. One other possible solution is an encoding scheme from
Reuters that was presented at the UNICODE conference a couple of
I was not actually at the conference and don't have the
paper in front of me, so Gavin please correct me if I get this wrong.
Basically the Reuters scheme uses a sliding window across the UNICODE
character space 256 characters wide, with the window's
starting point set by a shift key followed by a couple of bytes to
specify the offset. By default, the window starts at the beginning of
the UNICODE character space and thus ASCII is represented unchanged.
This encoding also allows any other alphabetic language to be
represented efficiently. It is not quite so efficient with languages
that do require multiple bytes such as Chinese, Korean etc., but as
the paper points out each character in these languages carries quite a
bit more information then a single alphabetic character. The main
problem with this encoding is that its not a standard.
B. Todd Bauman
University of Maryland, Baltimore County