Re: utf-8

On Wed, 10 Sep 2003 17:42:38 +0200, sigler@bredband.no wrote:

>Okay, I feel stupid. I've purchased the UTF-8 spec from ISO, and they
>explain how to convert from UTF-8 to UCS-4, but I'm afraid we're talking
>past one another. My question is simply: how can 4 bytes be represented
>in 2 bytes? It can't be done. What am I missing?

Unicode character values can range from U+0000 to U+10FFFF. To represent
these characters in an 8-bit encoding, you use UTF-8, which you already
know about. To represent those characters in a 16-bit encoding, you use
UTF-16, which works as follows:

* For character values between U+0000 and U+FFFF, the UTF-16 value is
just the same as the character value.

* For character values greater than U+FFFF, generate a "surrogate pair"
of UTF-16 values:

 H = ((V - 65536) div 1024) + 55296
 L = ((V - 65536) mod 1024) + 56320

or, in hex notation:

 H = ((V - 0x10000) div 0x400) + 0xD800
 L = ((V - 0x10000) mod 0x400) + 0xDC00

where V is the original character value, the "div" operation represents
integer division (throw away the fraction), and the "mod" operation
returns the remainder after integer division.
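
If it helps to see that as code, here is a minimal sketch in C (the
function name encode_surrogate_pair is just mine, not from any spec or
library):

 #include <stdint.h>

 /* Split a character value above U+FFFF into a UTF-16 surrogate pair.
    Assumes 0x10000 <= v <= 0x10FFFF. */
 void encode_surrogate_pair(uint32_t v, uint16_t *h, uint16_t *l)
 {
     v -= 0x10000;
     *h = (uint16_t)(0xD800 + v / 0x400);   /* high (leading) surrogate  */
     *l = (uint16_t)(0xDC00 + v % 0x400);   /* low (trailing) surrogate  */
 }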

In this way, character values above U+FFFF (the ones that don't fit in
two bytes) are represented by a sequence of two 2-byte UTF-16 values,
H and L (H always comes first). Some examples:

 V          H        L
 U+010000   0xD800   0xDC00
 U+010001   0xD800   0xDC01
 ...
 U+0100FF   0xD800   0xDCFF
 U+010100   0xD800   0xDD00
 ...
 U+0103FF   0xD800   0xDFFF
 U+010400   0xD801   0xDC00
 U+010401   0xD801   0xDC01
 ...
 U+10FFFF   0xDBFF   0xDFFF
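
If you want to check those numbers yourself, a small test driver for
the encode_surrogate_pair sketch above (again, just my own throwaway
code) prints the same rows:

 #include <stdio.h>
 #include <stdint.h>

 int main(void)
 {
     uint32_t v[] = { 0x10000, 0x100FF, 0x10400, 0x10FFFF };
     uint16_t h, l;

     for (int i = 0; i < 4; i++) {
         encode_surrogate_pair(v[i], &h, &l);
         printf("U+%06lX   0x%04X   0x%04X\n",
                (unsigned long)v[i], (unsigned)h, (unsigned)l);
     }
     return 0;
 }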

In case you're wondering about the Unicode characters in the range
U+D800 through U+DFFF (which overlaps the range used by the surrogates),
the answer is that there aren't any Unicode characters in that
range--it's reserved for the surrogates.
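
That reservation is what lets a decoder go back the other way without
ambiguity: a 16-bit value in the range 0xD800-0xDBFF can only be a high
surrogate, and one in 0xDC00-0xDFFF can only be a low surrogate. A
sketch of the reverse calculation (same caveats as above; it assumes a
well-formed pair):

 #include <stdint.h>

 /* Recombine a surrogate pair into the original character value.
    Assumes 0xD800 <= h <= 0xDBFF and 0xDC00 <= l <= 0xDFFF. */
 uint32_t decode_surrogate_pair(uint16_t h, uint16_t l)
 {
     return 0x10000 + ((uint32_t)(h - 0xD800) * 0x400) + (l - 0xDC00);
 }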

There's one further complication: UTF-16 doesn't specify which byte
comes first within the pair of bytes that comprises a UTF-16 value. So
there are actually two flavors of UTF-16, UTF-16BE for "big-endian"
systems, in which the more significant byte comes first, and UTF-16LE
for "little-endian" systems, in which the less significant byte comes
first. But this complication only comes into play during byte-oriented
serialization and deserialization.
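
In code, that amounts to nothing more than choosing which byte you
write first. A sketch (my own helper names, not a real API):

 #include <stdint.h>

 /* Write one 16-bit unit as two bytes.  UTF-16BE puts the more
    significant byte first; UTF-16LE puts the less significant one
    first. */
 void put_utf16be(uint16_t u, unsigned char out[2])
 {
     out[0] = (unsigned char)(u >> 8);
     out[1] = (unsigned char)(u & 0xFF);
 }

 void put_utf16le(uint16_t u, unsigned char out[2])
 {
     out[0] = (unsigned char)(u & 0xFF);
     out[1] = (unsigned char)(u >> 8);
 }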

Note that the DOM specifies the use of UTF-16 to represent characters, so
the answer to your question about the DOM using only two bytes per
character is no. Yes, the DOM uses "16-bit units" to represent
characters, but for some characters (i.e., those above U+FFFF), two
16-bit units are required in order to represent one character. This is
explained in Section 1.1.5 of the DOM Core rec.
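
The practical consequence is that counting 16-bit units is not the same
as counting characters. A sketch of what a character count over a
buffer of 16-bit units has to do (assuming well-formed UTF-16; the
function name is mine):

 #include <stddef.h>
 #include <stdint.h>

 /* Count characters in an array of UTF-16 units.  A high surrogate
    followed by a low surrogate counts as one character. */
 size_t count_characters(const uint16_t *units, size_t n)
 {
     size_t count = 0;

     for (size_t i = 0; i < n; i++) {
         if (units[i] >= 0xD800 && units[i] <= 0xDBFF &&
             i + 1 < n &&
             units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF)
             i++;   /* skip the low surrogate; the pair is one character */
         count++;
     }
     return count;
 }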

Steve Schafer
Fenestra Technologies Corp
http://www.fenestra.com/
