Re: UTF8/UTF16 from David Woolley on 2002-08-20 (w3c-wai-ig@w3.org from July to September 2002)

From: David Woolley <david@djwhome.demon.co.uk>
Date: Tue, 20 Aug 2002 21:48:41 +0100 (BST)
To: w3c-wai-ig@w3.org
Message-Id: <200208202048.g7KKmfJ04558@djwhome.demon.co.uk>

> Could somebody please explain the difference between UTF8 and UTF16 to me
> and why you would want to use UTF16 over UTF8? 

UTF16 uses two bytes per Unicode character, except for characters in the
extension areas (those beyond the first 65,536 code points), which use
4 bytes, coded as a pair of two byte units; these shouldn't appear often.
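
As a rough illustration of the two byte versus four byte cases, here is
a small Python sketch.  The sample characters are my own choice, not
anything from the question, and the little endian codec is used only so
that no byte order mark is counted.

```python
# Sample characters chosen purely for illustration (an assumption).
samples = {
    "A": "A",               # U+0041, in the first 65,536 code points
    "euro": "\u20ac",       # U+20AC, also in the first 65,536
    "clef": "\U0001d11e",   # U+1D11E, in the extension area
}

def utf16_len(ch: str) -> int:
    # "utf-16-le" does not prepend the 2-byte byte order mark that the
    # plain "utf-16" codec adds, so only the character itself is counted.
    return len(ch.encode("utf-16-le"))

sizes = {name: utf16_len(ch) for name, ch in samples.items()}
print(sizes)  # {'A': 2, 'euro': 2, 'clef': 4}
```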

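UTF8 uses a variable number of bytes per character.  American English
fits in one byte per character; British English occasionally needs two
(for the pound sign); Western European languages need two bytes for
every accented letter; Greek, Cyrillic, Hebrew and Arabic letters take
two bytes each; and most of the rest of the world, including Chinese,
Japanese, Korean and the Indian scripts, needs three.  Only the
extension areas need four.  It codes for the same set of characters as
UTF16.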
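
The variable width is easy to check in Python.  This is only a sketch;
the sample characters are assumptions picked to cover each byte length,
and the last loop just confirms that both encodings round trip the same
characters.

```python
# Illustrative samples (assumed, not from the original question).
samples = [
    "$",            # U+0024, ASCII
    "\u00a3",       # U+00A3, pound sign
    "\u00e9",       # U+00E9, e with acute accent
    "\u0416",       # U+0416, Cyrillic capital Zhe
    "\u4e2d",       # U+4E2D, a common Chinese character
    "\U0001d11e",   # U+1D11E, musical G clef (extension area)
]

utf8_sizes = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_sizes)  # [1, 2, 2, 2, 3, 4]

# Both encodings cover the same characters: each sample survives a round
# trip through UTF-8 and through UTF-16 unchanged.
for ch in samples:
    assert ch.encode("utf-8").decode("utf-8") == ch
    assert ch.encode("utf-16-le").decode("utf-16-le") == ch
```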

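UTF16 is easier for software writers to handle, because almost every
character is the same width, and it is more compact for languages such
as Chinese, Japanese and Korean, which need three bytes per character in
UTF8 but only two in UTF16.  Software that is aware of world languages
often uses UTF16 internally.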

UTF8 encodes all the characters needed for the language structure of
HTML (tags, attribute names and so on) as single bytes, with the same
values as in ASCII.
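
A quick way to see what ASCII compatibility buys you is to look for a
tag at the byte level.  The markup string below is a made up fragment,
and the byte search is only a stand-in for whatever a real parser does
before it knows the encoding.

```python
# Hypothetical page fragment, used only for illustration.
markup = '<meta http-equiv="Content-Type" content="text/html; charset=utf-16">'

as_ascii = markup.encode("ascii")
as_utf8 = markup.encode("utf-8")
as_utf16 = markup.encode("utf-16-le")

# Markup made only of ASCII characters has identical bytes in UTF-8.
same_bytes = as_ascii == as_utf8

# A naive byte scan finds the tag in UTF-8, but not in UTF-16, where
# every ASCII character is followed by a zero byte.
meta_visible_in_utf8 = b"<meta" in as_utf8
meta_visible_in_utf16 = b"<meta" in as_utf16

print(same_bytes, meta_visible_in_utf8, meta_visible_in_utf16)
# True True False
```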

For HTML, you can only legally use UTF16 if you include the charset
parameter in the real HTTP headers, as meta elements can't be detected
unless the character set is ASCII compatible.  I'm not sure about XML;
it might recognize the Unicode byte order marks, used to signal UTF16.
Some browsers may sniff out UTF16, even when the HTTP headers don't
identify it.
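
Sniffing by byte order mark might look something like the sketch below.
The function name sniff_utf16 is hypothetical, and this is not a
description of what any particular browser does; real detection has more
cases to consider.

```python
import codecs

def sniff_utf16(data: bytes):
    """Guess a UTF-16 variant from a leading byte order mark, else None.

    Simplification: the UTF-32 little endian mark begins with the same
    two bytes as the UTF-16 little endian one; that case is ignored here.
    """
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "utf-16-be"
    return None

# Python's plain "utf-16" codec writes a byte order mark first, so its
# output is recognised; a plain ASCII document is not.
print(sniff_utf16("<html>".encode("utf-16")) is not None)  # True
print(sniff_utf16(b"<html>"))                              # None
```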

> _________________________________________________________
> This email is confidential and intended solely for the use of the 

Bogus confidentiality notice deleted.

Received on Tuesday, 20 August 2002 17:04:01 UTC