- From: SHARPE, Ian <Ian.SHARPE@cambridge.sema.slb.com>
- Date: Wed, 21 Aug 2002 08:48:55 +0100
- To: "WAI (E-mail)" <w3c-wai-ig@w3.org>
OK, the mist is clearing. But I'm still a little confused. Here's a section from: http://www.ietf.org/rfc/rfc2279.txt

"ISO/IEC 10646-1 [ISO-10646] defines a multi-octet character set called the Universal Character Set (UCS), which encompasses most of the world's writing systems. Two multi-octet encodings are defined, a four-octet per character encoding called UCS-4 and a two-octet per character encoding called UCS-2, able to address only the first 64K characters of the UCS (the Basic Multilingual Plane, BMP), outside of which there are currently no assignments. It is noteworthy that the same set of characters is defined by the Unicode standard [UNICODE], which further defines additional character properties and other application details of great interest to implementors, but does not have the UCS-4 encoding."

So from this I understand that ISO 10646 is the basis for UCS4 and UCS2, and that Unicode assigns the same values to the same characters as ISO 10646, which is maybe why we use the terms interchangeably. Not sure what "but does not have the UCS4 encoding" means though? Also, I take it that UCS2 is a subset of UCS4.

Again from the reference:

"UTF-16 is a scheme for transforming a subset of the UCS-4 repertoire into pairs of UCS-2 values from a reserved range. UTF-16 impacts UTF-8 in that UCS-2 values from the reserved range must be treated specially in the UTF-8 transformation."

Not sure what the first sentence here means. Why only a subset, and which subset? And what is the reserved range? I read the last sentence to mean that UTF16 uses a pair of UTF8 character representations to represent each character. But that doesn't make sense if UTF16 uses only 2 bytes per character, nor does it explain why UTF16 is more compact than UTF8.
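As an illustrative aside on the "reserved range": in UTF-16 as specified in RFC 2781, the values 0xD800 to 0xDFFF are set aside as surrogates, and the "subset" is the code points from 0x10000 to 0x10FFFF, each of which becomes a pair of 16-bit values taken from that range. The pairs are pairs of 16-bit units, not pairs of UTF8 sequences. The Python sketch below shows the arithmetic; the function name `to_utf16_units` is hypothetical and the code is only a minimal sketch, not a full encoder.

```python
# Minimal sketch of the UTF-16 surrogate arithmetic (see RFC 2781).
# The function name is made up for illustration.

HIGH_START = 0xD800  # first high (leading) surrogate value
LOW_START = 0xDC00   # first low (trailing) surrogate value


def to_utf16_units(code_point):
    """Return the 16-bit code units that represent one code point."""
    if 0xD800 <= code_point <= 0xDFFF:
        # The reserved range: these values never stand for characters.
        raise ValueError("surrogate values are reserved")
    if code_point < 0x10000:
        # Inside the BMP: a single unit, identical to the UCS-2 value.
        return [code_point]
    if code_point > 0x10FFFF:
        # UCS-4 values above this cannot be reached by UTF-16.
        raise ValueError("outside the range UTF-16 can represent")
    offset = code_point - 0x10000  # a 20-bit number
    high = HIGH_START + (offset >> 10)   # top 10 bits
    low = LOW_START + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]


print([hex(u) for u in to_utf16_units(0x41)])     # ['0x41']
print([hex(u) for u in to_utf16_units(0x10400)])  # ['0xd801', '0xdc00']
```

The remark about UTF-8 in the quoted passage follows from this: a surrogate pair has to be combined back into its single code point before being converted to UTF-8, rather than converting each half separately.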
I'm sorry if I'm laboring the point (particularly as it only has a rather tenuous link with accessibility, as mentioned earlier, although language support is clearly an accessibility issue, and it is in relation to accessibility requirements that I'm looking at this), but I feel I'm so close to actually understanding what's going on that I just want to be absolutely clear about it.

Also, apologies if I've missed something. I seem to have had some problems with my subscription: I've been merrily posting away to the list and receiving replies to my own messages when they included my address, but nothing else. Thought things were a bit quiet!! I think I'm sorted again now though.

Cheers
Ian

-----Original Message-----
From: Jukka Korpela [mailto:jukka.korpela@tieke.fi]
Sent: 21 August 2002 06:36
To: SHARPE, Ian
Subject: FW: UTF8/UTF16

-----Original Message-----
From: David Woolley [mailto:david@djwhome.demon.co.uk]
Sent: Tuesday, August 20, 2002 11:49 PM
To: w3c-wai-ig@w3.org
Subject: Re: UTF8/UTF16

> Could somebody please explain the difference between UTF8 and UTF16 to me
> and why you would want to use UTF16 over UTF8?

UTF16 uses two bytes per Unicode character (excluding the extension areas, which use 4 bytes, but these shouldn't appear often).

UTF8 uses a variable number of bytes, such that American can be represented in one byte; British occasionally requires two bytes; Western European languages require two bytes a lot of the time; and the rest of the world needs three or four most of the time. It codes for the same set of characters as UTF16.

UTF16 is much easier to handle for software writers and is more efficient for world languages. Generally, world language aware software will use UTF16 internally.

UTF8 contains all the characters needed for the language structure of HTML in 8 bit characters, which are the same as those in ASCII.
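The byte counts described above can be checked with a short Python sketch. The sample characters are chosen here for illustration and are not from the original message (the pound sign is presumably the "British" case), and `encoded_lengths` is a hypothetical helper name. The `utf-16-be` codec is used so that no byte order mark is added to the count.

```python
# Compare how many bytes one character takes in UTF-8 and in UTF-16.
# The sample characters are illustrative assumptions.

samples = {
    "ASCII letter A": "A",
    "pound sign": "\u00a3",
    "e with acute accent": "\u00e9",
    "Greek alpha": "\u03b1",
    "CJK ideograph U+4E2D": "\u4e2d",
    "outside the BMP, U+10400": "\U00010400",
}


def encoded_lengths(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


for label, ch in samples.items():
    utf8_len, utf16_len = encoded_lengths(ch)
    print(f"{label}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

For these samples, UTF-8 takes 1, 2, 2, 2, 3 and 4 bytes respectively, while UTF-16 takes 2 bytes for everything except the last character, which needs 4.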
For HTML, you can only legally use UTF16 if you include the charset parameter in the real HTTP headers, as meta elements can't be detected unless the character set is ASCII compatible. I'm not sure about XML; it might recognize the Unicode byte order marks, used to signal UTF16. Some browsers may sniff out UTF16, even when the HTTP headers don't identify it.

> _________________________________________________________
> This email is confidential and intended solely for the use of the

Bogus confidentiality notice deleted.
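To illustrate the byte order marks mentioned above, here is a minimal Python sketch of how a receiver might "sniff" UTF16 from the first two bytes of a document. The function name `sniff_utf16_bom` is hypothetical, and real browsers and XML parsers apply more rules than this. For what it is worth, the XML 1.0 specification does require entities encoded in UTF-16 to begin with a byte order mark.

```python
import codecs


def sniff_utf16_bom(data):
    """Guess a UTF-16 codec name from a leading byte order mark.

    Returns None when no UTF-16 BOM is present. Simplification: a
    UTF-32 little-endian BOM starts with the same two bytes as the
    UTF-16 little-endian one, and this sketch does not tell them apart.
    """
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "utf-16-le"
    return None


# Build a big-endian UTF-16 document by hand so the result does not
# depend on the byte order of the machine running the sketch.
document = codecs.BOM_UTF16_BE + "<html>".encode("utf-16-be")
codec = sniff_utf16_bom(document)
print(codec)                                            # utf-16-be
print(document[len(codecs.BOM_UTF16_BE):].decode(codec))  # <html>
```

Without the mark, the same markup in plain ASCII or UTF8 gives `None`, which is the case where a meta element can be read directly.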
Received on Wednesday, 21 August 2002 03:49:35 UTC