Re: utf-8 from Sigurd Lerstad on 2003-09-10 (www-svg@w3.org from September 2003)

From: Sigurd Lerstad <sigler@bredband.no>
Date: Wed, 10 Sep 2003 18:36:25 +0200
To: "Bjoern Hoehrmann" <derhoermi@gmx.net>
Cc: <www-svg@w3.org>
Message-ID: <05db01c377b9$ad371830$6e1273d5@mmstudio>

> * Sigurd Lerstad wrote:
> >> >In an XML file that says utf-8 in the xml declaration. There could be
> >> >4 byte characters later in the file. How should those be treated to
> >> >convert them to utf-16?
> >>
> >> Just like any other sequence. U+10000 is F0 90 80 80 in UTF-8 and
> >> D8 00 DC 00 or 00 D8 00 DC (depending on byte order) in UTF-16.
>
> >Okay, I feel stupid, I've purchased the utf-8 spec from iso, and they
> >explain how to convert from utf-8 to ucs4, I'm afraid we're talking past
> >one another. My question is simply: How can 4 bytes be represented in 2
> >bytes, it can't be done. what am I missing?
>
> That UTF-16 does not mean two bytes per character. As I've said,
> characters above U+FFFF are represented using *four* bytes in UTF-16.

But the DOM always uses 2 bytes per character, doesn't it? So how can a
character that needs 4 bytes be represented in the DOM?
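
For reference, the arithmetic behind the U+10000 example quoted above can be
sketched as follows. This is only an illustration of the standard surrogate
pair calculation, not code from any DOM implementation; the function name
utf16_code_units is made up for this sketch.

```python
def utf16_code_units(code_point: int) -> list[int]:
    """Return the 16-bit UTF-16 code units for one Unicode code point.

    Hypothetical helper for illustration only.
    """
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point <= 0xFFFF:
        # Characters up to U+FFFF fit in a single 16-bit unit.
        return [code_point]
    # Characters above U+FFFF are split into a surrogate pair:
    # subtract 0x10000, then spread the remaining 20 bits over two units.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]


# The four UTF-8 bytes from the quoted example decode to one character.
utf8_bytes = bytes([0xF0, 0x90, 0x80, 0x80])
text = utf8_bytes.decode("utf-8")

# That one character becomes two 16-bit units, i.e. four bytes, in UTF-16.
units = utf16_code_units(ord(text))
print([f"{u:04X}" for u in units])
print(text.encode("utf-16-be").hex(" "))
print(text.encode("utf-16-le").hex(" "))
```

If the DOM string type is a sequence of 16-bit units, then such a character
would presumably take up two of those units rather than one, which is the
part I am trying to confirm.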

--
Sigurd Lerstad

Received on Wednesday, 10 September 2003 12:33:59 UTC