Re: utf-8

On Wed, 10 Sep 2003 17:42:38 +0200, wrote:

>Okay, I feel stupid, I've purchased the utf-8 spec from iso, and they
>explain how to convert from utf-8 to ucs4, I'm afraid we're talking past one
>another. My question is simply: How can 4 bytes be represented in 2 bytes,
>it can't be done. what am I missing?

Unicode character values can range from U+0000 to U+10FFFF. To represent
these characters in an 8-bit encoding, you use UTF-8, which you already
know about. To represent those characters in a 16-bit encoding, you use
UTF-16, which works as follows:

* For character values between U+0000 and U+FFFF, the UTF-16 value is
just the same as the character value.

* For character values greater than U+FFFF, generate a "surrogate pair"
of UTF-16 values:

 H = ((V - 65536) div 1024) + 55296
 L = ((V - 65536) mod 1024) + 56320

or, in hex notation:

 H = ((V - 0x10000) div 0x400) + 0xD800
 L = ((V - 0x10000) mod 0x400) + 0xDC00

where V is the original character value, the "div" operation represents
integer division (throw away the fraction), and the "mod" operation
returns the remainder after integer division.

In this way, 4-byte character values are represented by a sequence of
two 2-byte UTF-16 values, H and L (H always comes first). Some examples:

 V          H        L
 U+010000   0xD800   0xDC00
 U+010001   0xD800   0xDC01
 U+0100FF   0xD800   0xDCFF
 U+010100   0xD800   0xDD00
 U+0103FF   0xD800   0xDFFF
 U+010400   0xD801   0xDC00
 U+010401   0xD801   0xDC01
 U+10FFFF   0xDBFF   0xDFFF
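The arithmetic above can be sketched in a few lines of Python (a minimal illustration; the function name is mine, not anything from the DOM or Unicode specs):

```python
def to_surrogate_pair(v):
    """Split a character value above U+FFFF into a UTF-16 surrogate pair (H, L)."""
    if not 0x10000 <= v <= 0x10FFFF:
        raise ValueError("value must be in the range U+10000..U+10FFFF")
    v -= 0x10000
    h = (v // 0x400) + 0xD800   # high (first) surrogate: 0xD800..0xDBFF
    l = (v % 0x400) + 0xDC00    # low (second) surrogate: 0xDC00..0xDFFF
    return h, l

# Reproducing a couple of the table entries:
print([hex(x) for x in to_surrogate_pair(0x10400)])   # ['0xd801', '0xdc00']
print([hex(x) for x in to_surrogate_pair(0x10FFFF)])  # ['0xdbff', '0xdfff']
```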

In case you're wondering about the Unicode characters in the range
U+D800 through U+DFFF (which is exactly the range used by the
surrogates), the answer is that there aren't any Unicode characters in
that range--it's reserved for the surrogates.

There's one further complication: UTF-16 doesn't specify which byte
comes first within the pair of bytes that comprises a UTF-16 value. So
there are actually two flavors of UTF-16, UTF-16BE for "big-endian"
systems, in which the more significant byte comes first, and UTF-16LE
for "little-endian" systems, in which the less significant byte comes
first. But this complication only comes into play during byte-oriented
serialization and deserialization.
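Python's standard codecs make the BE/LE difference easy to see (just an illustration, not part of any spec):

```python
# One character above U+FFFF -> the surrogate pair 0xD801, 0xDC00.
s = "\U00010400"

# Big-endian: more significant byte of each 16-bit value comes first.
print(s.encode("utf-16-be").hex())  # d801dc00

# Little-endian: less significant byte of each 16-bit value comes first.
print(s.encode("utf-16-le").hex())  # 01d800dc
```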

Note that the DOM specifies the use of UTF-16 to represent characters,
so the answer to your question about the DOM using only two bytes per
character is no. Yes, the DOM uses "16-bit units" to represent
characters, but for some characters (i.e., those above U+FFFF), two
16-bit units are required in order to represent one character. This is
explained in Section 1.1.5 of the DOM Core rec.
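You can see the character-count/unit-count distinction directly (a quick sanity check in Python, not DOM code):

```python
# One BMP character plus one character above U+FFFF.
s = "A\U00010400"

# Each UTF-16 unit is 2 bytes, so dividing the encoded length by 2
# gives the number of 16-bit units.
units = len(s.encode("utf-16-be")) // 2
print(len(s), units)  # 2 characters, but 3 UTF-16 units
```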

Steve Schafer
Fenestra Technologies Corp

Received on Wednesday, 10 September 2003 13:06:49 UTC