RE: UTF8/UTF16 from SHARPE, Ian on 2002-08-21 (w3c-wai-ig@w3.org from July to September 2002)

From: SHARPE, Ian <Ian.SHARPE@cambridge.sema.slb.com>
Date: Wed, 21 Aug 2002 08:48:55 +0100
To: "WAI (E-mail)" <w3c-wai-ig@w3.org>
Message-ID: <FA94B04D5981D211B86800A0C9EA2841A34D18@cames1.sema.co.uk>
OK, the mist is clearing. But I'm still a little confused. Here's a section
from:

http://www.ietf.org/rfc/rfc2279.txt

"ISO/IEC 10646-1 [ISO-10646] defines a multi-octet character set
   called the Universal Character Set (UCS), which encompasses most of
   the world's writing systems.  Two multi-octet encodings are defined,
   a four-octet per character encoding called UCS-4 and a two-octet per
   character encoding called UCS-2, able to address only the first 64K
   characters of the UCS (the Basic Multilingual Plane, BMP), outside of
   which there are currently no assignments.

   It is noteworthy that the same set of characters is defined by the
   Unicode standard [UNICODE], which further defines additional
   character properties and other application details of great interest
   to implementors, but does not have the UCS-4 encoding."

So from this I understand that ISO 10646 is the basis for UCS4 and UCS2, and
that Unicode just so happens to use the same values to represent the same
code points as ISO 10646, which is why we maybe use the terms
interchangeably. Not sure what "but does not have the UCS4 encoding" means
though? Also that UCS2 is a subset of UCS4.

Again from the reference:

"UTF-16 is a scheme for transforming a subset of the UCS-4 repertoire
   into pairs of UCS-2 values from a reserved range.  UTF-16 impacts
   UTF-8 in that UCS-2 values from the reserved range must be treated
   specially in the UTF-8 transformation."

Not sure what the first sentence here means. Why only a subset, and which
subset? And what is the reserved range? I read the last sentence to mean that
each UTF16 character representation uses a pair of UTF8 character
representations to represent each code point. But this doesn't make sense if
only 2 bytes are used to represent each character in UTF16, nor does it
explain why UTF16 would be more compact than UTF8.
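
A minimal sketch of what the "reserved range" appears to refer to, assuming
the surrogate ranges U+D800..U+DBFF and U+DC00..U+DFFF that UTF-16 sets aside.
The helper name to_surrogate_pair is made up for illustration and does not
come from the RFC. It suggests that the "pairs" are pairs of 16-bit UCS-2
values, not pairs of UTF-8 sequences: characters inside the BMP still take a
single 16-bit unit, and only code points from U+10000 to U+10FFFF (the
"subset" of UCS-4 that UTF-16 can reach) need two.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into a UTF-16 high/low surrogate pair.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits    -> U+D800..U+DBFF
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> U+DC00..U+DFFF
    return high, low


# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP.
high, low = to_surrogate_pair(0x1D11E)
print(hex(high), hex(low))                        # 0xd834 0xdd1e

# Python's own codecs agree: two 16-bit units in UTF-16 ...
print("\U0001D11E".encode("utf-16-be").hex())     # d834dd1e
# ... but one four-byte sequence in UTF-8, built from the code point itself.
print("\U0001D11E".encode("utf-8").hex())         # f09d849e
```

As for the effect on UTF-8: the two surrogates are combined back into one
code point first, and UTF-8 then encodes that code point as a single
four-byte sequence rather than encoding each surrogate separately.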

I'm sorry if I'm laboring the point (particularly as it only has a rather
tenuous link with accessibility as mentioned earlier - although language
support is clearly an accessibility issue and indeed it is in relation to
accessibility requirements I'm looking at) but I feel I'm so close to
actually understanding what's going on I just want to be absolutely clear
about it. 

Also apologies if I've missed something. I seem to have had some problems
with my subscription because I've been merrily posting away to the list and
receiving replies to my own messages when they have had my address included
but nothing else. Thought things were a bit quiet!! I think I'm sorted again
now though.

Cheers
Ian

-----Original Message-----
From: Jukka Korpela [mailto:jukka.korpela@tieke.fi]
Sent: 21 August 2002 06:36
To: SHARPE, Ian
Subject: FW: UTF8/UTF16




-----Original Message-----
From: David Woolley [mailto:david@djwhome.demon.co.uk]
Sent: Tuesday, August 20, 2002 11:49 PM
To: w3c-wai-ig@w3.org
Subject: Re: UTF8/UTF16



> Could somebody please explain the difference between UTF8 and UTF16 to me
> and why you would want to use UTF16 over UTF8? 

UTF16 uses two bytes per Unicode character (excluding the extension areas,
which use 4 bytes, but these shouldn't appear often).

UTF8 uses a variable number of bytes per character: American can be
represented in one byte, British occasionally requires two bytes, Western
European languages require two bytes a lot of the time, scripts such as
Greek, Cyrillic, Hebrew and Arabic need two bytes nearly all the time, and
most of the rest of the world needs three, or four for the extension areas.
It codes for the same set of characters as UTF16.
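
The size comparison can be checked with a rough sketch like the one below.
The sample strings and the helper name byte_counts are assumptions chosen
for illustration, not taken from this thread, and "utf-16-be" is used so
that no byte order mark is counted.

```python
from typing import Tuple


def byte_counts(text: str) -> Tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for text.

    Hypothetical helper; UTF-16 is measured without a byte order mark.
    """
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))


# Illustrative samples only, written with escapes to avoid any ambiguity.
samples = {
    "US English": "color",
    "British English": "\u00a35",               # pound sign followed by 5
    "French": "\u00e9l\u00e8ve",                # "eleve" with two accents
    "Greek": "\u03b1\u03b2\u03b3",              # alpha beta gamma
    "Japanese": "\u65e5\u672c\u8a9e",           # three kanji
    "Extension area": "\U0001D11E",             # outside the BMP
}

for label, text in samples.items():
    utf8, utf16 = byte_counts(text)
    print(f"{label:16} chars={len(text)} utf8={utf8} utf16={utf16}")
```

On these samples UTF-8 is smaller for ASCII-heavy text, the two are equal
for the Greek sample, and UTF-16 is smaller for the Japanese sample.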

UTF16 is much easier for software writers to handle, since nearly every
character takes exactly two bytes, and it is more compact for languages
whose characters need three bytes in UTF8.  Generally, world language
aware software will use UTF16 internally.

UTF8 encodes all the characters needed for the language structure of
HTML as single 8 bit bytes, which are the same as those in ASCII.

For HTML, you can only legally use UTF16 if you include the charset
parameter in the real HTTP headers, as meta elements can't be detected
unless the character set is ASCII compatible.  I'm not sure about XML;
it might recognize the Unicode byte order marks, used to signal UTF16.
Some browsers may sniff out UTF16, even when the HTTP headers don't
identify it.
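
A rough sketch of what such sniffing might look like, assuming it relies
only on a leading byte order mark. The function name sniff_bom is
hypothetical, and real browsers and XML parsers apply more elaborate rules
than this.

```python
import codecs
from typing import Optional


def sniff_bom(data: bytes) -> Optional[str]:
    """Guess an encoding from a leading byte order mark, if there is one.

    Hypothetical helper. It ignores UTF-32, whose little-endian BOM also
    begins with the bytes FF FE, so it is a sketch rather than a detector.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    return None


page = codecs.BOM_UTF16_LE + "<meta charset>".encode("utf-16-le")
print(sniff_bom(page))                 # utf-16-le
print(sniff_bom(b"<meta charset>"))    # None

# Why a meta element cannot be found without knowing the encoding first:
# in UTF-16 every ASCII character is padded with a zero byte, so the
# ASCII byte sequence "<meta" never appears in the data.
print(b"<meta" in page)                # False
```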

> _________________________________________________________
> This email is confidential and intended solely for the use of the 

Bogus confidentiality notice deleted.


Received on Wednesday, 21 August 2002 03:49:35 UTC