RE: non-sgml characters

> Let us say I am building a website today. It is multilingual, and needs
> to cater to people who don't quite know what a "browser" is. They just
> use "the internet", which could be IE 3 or Netscape 4 -- just because
> it did not occur to them that they need to upgrade (and I speak of a
> very big firm here in Tokyo). Currently, let us say we use big5 for
> Chinese and Shift_JIS for Japanese content respectively.
>
> Assuming this is old hat, how should we build websites that allow us to
> be future-proof and support standards? Specifically, how should we
> convert to UTF-8 etc? Is there any suggested guideline we should follow?
> Any tool that allows us to convert big5 formatted text to UTF-8 text?

big5 and Shift_JIS aren't old hat. XML browsers MUST support UTF-8 and
UTF-16, but MAY support any other character set. It is reasonable to assume
that big5 and Shift_JIS will be available to all Chinese and Japanese
browsers for some time to come.

They are also more efficient than UTF-8 for Chinese and Japanese
respectively in terms of the number of octets that have to be sent down the
wire. UTF-8 is a variable-length encoding: the lower a character's UCS
position, the fewer octets it needs. ASCII characters take only one octet
each, and indeed are identical to ASCII transmitted with the high bit of
every octet clear, but Chinese and Japanese characters typically take three
octets each, against two in big5 or Shift_JIS. UTF-16 would be more
efficient for such characters, at two octets each; however, a UTF-16
encoding of mainly ASCII text is twice the size it would be if UTF-8 were
used, making it less suitable for languages that use Roman alphabets,
especially those without diacritical marks.
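
To make the octet counts concrete, here is a rough Python sketch that encodes
the same text in a legacy encoding, UTF-8 and UTF-16 and counts the octets.
The sample strings and the helper names octets() and compare() are made up
for illustration; only the codec names come from Python's standard library.

```python
# Illustrative sample strings (assumptions, not taken from any real site).
SAMPLES = {
    "Japanese": ("日本語のテキスト", "shift_jis"),
    "Chinese": ("中文網站", "big5"),
    "ASCII": ("plain text", "ascii"),
}


def octets(text: str, encoding: str) -> int:
    """Return how many octets `text` occupies once encoded."""
    return len(text.encode(encoding))


def compare(text: str, legacy: str) -> dict:
    """Octet counts for the legacy encoding, UTF-8 and UTF-16."""
    # "utf-16-be" is used so that the two-octet byte order mark which
    # Python's plain "utf-16" codec prepends is not counted.
    return {name: octets(text, name) for name in (legacy, "utf-8", "utf-16-be")}


if __name__ == "__main__":
    for label, (text, legacy) in SAMPLES.items():
        print(label, compare(text, legacy))
```

For these samples the legacy encoding and UTF-16 come out the same size for
the Chinese and Japanese text, UTF-8 comes out half as large again, and the
ASCII text doubles in size under UTF-16.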

As long as there is a way to determine what character encoding was used for
a particular document, you will be able to convert it to UTF-16 in the
future if necessary, either as it is stored or on the fly as part of
content negotiation. If so, I wouldn't worry about it too much now, although
I would worry if I couldn't determine what character encoding was used for
each document in any multilingual system.
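
As for a conversion tool: once the source encoding is known, the conversion
itself is mechanical. A minimal sketch in Python follows, assuming the
encoding label recorded for each document is correct; to_utf8() and
convert_file() are hypothetical helper names, not part of any library.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode `raw` from a known source encoding to UTF-8."""
    # Decoding is strict by default, so a wrong encoding label will often
    # raise UnicodeDecodeError instead of silently producing garbage.
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


def convert_file(src_path: str, dst_path: str, source_encoding: str) -> None:
    """Read a file in `source_encoding` and write a UTF-8 copy of it."""
    with open(src_path, "rb") as src:
        raw = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(to_utf8(raw, source_encoding))
```

For example, convert_file("index.big5.html", "index.utf8.html", "big5")
would write a UTF-8 copy of a big5 page (the file names are invented). Any
charset declared inside the document or in the HTTP headers would have to be
updated separately to match.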

Received on Tuesday, 16 July 2002 08:29:53 UTC