- From: Steve Schafer <steve@fenestra.com>
- Date: Wed, 10 Sep 2003 13:06:34 -0400
- To: www-svg@w3.org
On Wed, 10 Sep 2003 17:42:38 +0200, sigler@bredband.no wrote:

>Okay, I feel stupid, I've purchased the utf-8 spec from iso, and they
>explain how to convert from utf-8 to ucs4, I'm afraid we're talking past one
>another. My question is simply: How can 4 bytes be represented in 2 bytes,
>it can't be done. what am I missing?

Unicode character values can range from U+0000 to U+10FFFF. To represent
these characters in an 8-bit encoding, you use UTF-8, which you already
know about. To represent those characters in a 16-bit encoding, you use
UTF-16, which works as follows:

* For character values between U+0000 and U+FFFF, the UTF-16 value is
  just the same as the character value.

* For character values greater than U+FFFF, generate a "surrogate pair"
  of UTF-16 values:

    H = ((V - 65536) div 1024) + 55296
    L = ((V - 65536) mod 1024) + 56320

  or, in hex notation:

    H = ((V - 0x10000) div 0x400) + 0xD800
    L = ((V - 0x10000) mod 0x400) + 0xDC00

  where V is the original character value, the "div" operation represents
  integer division (throw away the fraction), and the "mod" operation
  returns the remainder after integer division.

In this way, character values above U+FFFF are represented by a sequence
of two 2-byte UTF-16 values, H and L (H always comes first). Some
examples:

       V        H       L
    U+010000  0xD800  0xDC00
    U+010001  0xD800  0xDC01
    ...
    U+0100FF  0xD800  0xDCFF
    U+010100  0xD800  0xDD00
    ...
    U+0103FF  0xD800  0xDFFF
    U+010400  0xD801  0xDC00
    U+010401  0xD801  0xDC01
    ...
    U+10FFFF  0xDBFF  0xDFFF

In case you're wondering about the Unicode characters in the range
U+D800 through U+DFFF (which overlaps the range used by the surrogates),
the answer is that there aren't any Unicode characters in that
range--it's reserved for the surrogates.

There's one further complication: UTF-16 doesn't specify which byte comes
first within the pair of bytes that comprises a UTF-16 value. So there
are actually two flavors of UTF-16: UTF-16BE for "big-endian" systems, in
which the more significant byte comes first, and UTF-16LE for
"little-endian" systems, in which the less significant byte comes first.
But this complication only comes into play during byte-oriented
serialization and deserialization.

Note that the DOM specifies the use of UTF-16 to represent characters, so
the answer to your question about the DOM using only two bytes per
character is no. Yes, the DOM uses "16-bit units" to represent
characters, but for some characters (i.e., those above U+FFFF), two
16-bit units are required in order to represent one character. This is
explained in Section 1.1.5 of the DOM Core rec.

Steve Schafer
Fenestra Technologies Corp
http://www.fenestra.com/
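As a concrete illustration of the surrogate-pair arithmetic above, here is
a minimal Python sketch; the function name to_utf16_units is illustrative
only and not taken from any particular library:

    # Minimal sketch of the surrogate-pair arithmetic described above.
    def to_utf16_units(v):
        """Return the UTF-16 code unit(s) for a Unicode scalar value v."""
        if v < 0 or v > 0x10FFFF or 0xD800 <= v <= 0xDFFF:
            raise ValueError("not a valid Unicode scalar value: %#x" % v)
        if v <= 0xFFFF:
            # Values up to U+FFFF map directly to a single 16-bit unit.
            return [v]
        # Values above U+FFFF become a surrogate pair (H, L).
        h = ((v - 0x10000) // 0x400) + 0xD800   # high (leading) surrogate
        l = ((v - 0x10000) % 0x400) + 0xDC00    # low (trailing) surrogate
        return [h, l]

    # A few of the rows from the examples table above:
    assert to_utf16_units(0x10000) == [0xD800, 0xDC00]
    assert to_utf16_units(0x10401) == [0xD801, 0xDC01]
    assert to_utf16_units(0x10FFFF) == [0xDBFF, 0xDFFF]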
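The byte-order point can be sketched the same way, using Python's standard
utf-16-be and utf-16-le codecs to serialize one character above U+FFFF:

    # Serializing the surrogate pair for U+10000 under the two byte orders.
    s = "\U00010000"                    # one character above U+FFFF

    be = s.encode("utf-16-be")          # more significant byte first
    le = s.encode("utf-16-le")          # less significant byte first

    assert be == b"\xd8\x00\xdc\x00"    # H = 0xD800, L = 0xDC00, big-endian
    assert le == b"\x00\xd8\x00\xdc"    # same code units, bytes swapped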
Received on Wednesday, 10 September 2003 13:06:49 UTC