- From: Jon Hanna <jon@spin.ie>
- Date: Tue, 16 Jul 2002 13:29:52 +0100
- To: <w3c-wai-ig@w3.org>
> Let us say I am building a website today. It is multilingual, and needs
> to cater to people who don't quite know what a "browser" is. They just
> use "the internet", which could be IE 3 or Netscape 4 -- just because
> it did not occur to them that they need to upgrade (and I speak of a
> very big firm here in Tokyo). Currently, let us say we use big5 for
> Chinese and Shift_JIS for Japanese content respectively.
>
> Assuming this is old hat, how should we build websites that allow us to
> be future-proof and support standards? Specifically, how should we
> convert to UTF-8 etc? Is there any suggested guideline we should follow?
> Any tool that allows us to convert big5 formatted text to UTF-8 text?

big5 and Shift_JIS aren't old hat. XML processors MUST support UTF-8 and UTF-16, but MAY support any other character encoding. It is reasonable to assume that big5 and Shift_JIS will be available to all Chinese and Japanese browsers for some time to come.

They are also more efficient than UTF-8 for Chinese and Japanese respectively in terms of the number of octets that have to be sent down the wire. UTF-8 varies in octets-per-character in such a manner that the lower the UCS position, the fewer octets are needed. ASCII characters only take one octet each - and indeed are identical to 7-bit ASCII transmitted with the high bit of every octet set to zero - but Chinese and Japanese characters typically take three octets each, against two in big5 or Shift_JIS.

UTF-16 would be more efficient than UTF-8 for such characters; however, a UTF-16 encoding of mainly ASCII text is twice the size it would be if UTF-8 were used, making it less suitable for languages that use Roman alphabets, especially those without diacritical marks.

As long as there is a way to determine what character encoding was used for a particular document, you will be able to convert it to UTF-8 or UTF-16 in the future if necessary, either as it is stored or on the fly as part of content negotiation.
If so, I wouldn't worry about it too much now, although I would worry if I couldn't determine what character encoding was used for each document in any multilingual system.
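
To make the octet counts and the later conversion concrete, here is a minimal sketch in Python, using only the codecs that ship with its standard library. The function names `octet_counts` and `transcode` are mine, chosen for illustration, and the sketch assumes you already know each document's source encoding, which is the precondition stressed above.

```python
def octet_counts(text, encodings=("utf-8", "utf-16-be", "shift_jis", "big5")):
    """Return the number of octets `text` occupies in each encoding.

    "utf-16-be" is used rather than "utf-16" so that the two-octet
    byte-order mark is not included in the count. An encoding that
    cannot represent the text maps to None.
    """
    counts = {}
    for name in encodings:
        try:
            counts[name] = len(text.encode(name))
        except UnicodeEncodeError:
            counts[name] = None
    return counts


def transcode(data, source_encoding, target_encoding="utf-8"):
    """Convert raw bytes from a known source encoding to the target one.

    Hypothetical helper: it raises UnicodeDecodeError if `data` is not
    valid in `source_encoding`, so a wrong guess fails loudly.
    """
    return data.decode(source_encoding).encode(target_encoding)


if __name__ == "__main__":
    print(octet_counts("hello"))
    print(octet_counts("日本語", encodings=("utf-8", "utf-16-be", "shift_jis")))

    big5_bytes = "中文".encode("big5")          # stand-in for a stored big5 document
    utf8_bytes = transcode(big5_bytes, "big5")  # same text, now UTF-8
    print(len(big5_bytes), len(utf8_bytes))
```

The same `transcode` call could be run once over stored files or applied per request; which is preferable depends on the server setup, which this sketch does not assume.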
Received on Tuesday, 16 July 2002 08:29:53 UTC