
UTF8 vs UTF16

From: Robert O'Callahan <robert@ocallahan.org>
Date: Tue, 3 Feb 2009 23:59:59 +1300
Message-ID: <11e306600902030259r253f560au26c491cbbbdef665@mail.gmail.com>
To: Robert J Burns <rob@robburns.com>
Cc: public-i18n-core@w3.org, jonathan@jfkew.plus.com, W3C Style List <www-style@w3.org>
On Tue, Feb 3, 2009 at 11:04 PM, Robert J Burns <rob@robburns.com> wrote:

> Again, you're making assumptions that simply don't hold water. For
> documents in languages where UTF-8 requires three octets per code point,
> there can certainly be enough one-octet Latin-script markup to offset the
> three octets in the natural-language element content (averaging out to the
> two octets per code point of UTF-16), but such documents would likely be rare.
>
...

> Any script beyond U+07FF will take more octets to encode as UTF-8 than
> UTF-16. Regardless, implementations need to support both, and the savings
> of using UTF-8 for some documents and UTF-16 for others aren't worth the
> trouble. For example, a Chinese website can probably count on most of its
> documents being more efficiently encoded as UTF-16 than UTF-8, and the
> rare exceptions aren't worth looking out for.
>
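
For concreteness, the per-code-point costs described above are easy to check
with a quick Python sketch (the sample code points are chosen just for
illustration):

```python
# Per-code-point storage cost of UTF-8 vs UTF-16 across the ranges
# discussed above. Sample code points are illustrative only.
samples = {
    "U+0041 (Latin A)": "\u0041",
    "U+07FF (last 2-byte UTF-8 code point)": "\u07ff",
    "U+0800 (first 3-byte UTF-8 code point)": "\u0800",
    "U+4E2D (CJK ideograph)": "\u4e2d",
    "U+1F600 (outside the BMP)": "\U0001f600",
}
for label, ch in samples.items():
    u8 = len(ch.encode("utf-8"))
    u16 = len(ch.encode("utf-16-le"))  # -le so no BOM is counted
    print(f"{label}: UTF-8 {u8} bytes, UTF-16 {u16} bytes")
```

The crossover is exactly where Rob says it is: everything from U+0800 up
through the BMP costs three bytes in UTF-8 against two in UTF-16, while
code points beyond the BMP cost four bytes in both.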

Sorry, this is a complete tangent, but I can't resist. At Mozilla we did
some measurements last year which showed that, in fact, UTF-8 was a huge
size win over UTF-16 for storing the DOM text of a set of "top 20" CJK front
pages. Some results are here:
https://bugzilla.mozilla.org/show_bug.cgi?id=416411#c3
The "nsTextFragment" results are measuring the size of character storage for
DOM text nodes. UTF-8 uses just over half the storage of UTF-16. (Dave
repeated the analysis on some CJK Wikipedia articles, just to check, and got
similar results.) "Current" is a scheme where a text node with all
characters in the range 0-255 is stored as one byte per character, and all
other text nodes are stored in UTF-16 --- UTF-8 beats that handily.
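
The markup effect is easy to reproduce: ASCII tags and attributes cost one
byte per character in UTF-8 but two in UTF-16, which more than offsets the
three-versus-two-byte cost of the CJK text itself once markup dominates. A
toy sketch (the HTML fragment below is invented for illustration, not taken
from the Mozilla data):

```python
# Hypothetical markup-heavy CJK fragment (illustrative only, not the
# Mozilla test corpus). The ASCII markup is 1 byte/char in UTF-8 vs
# 2 bytes/char in UTF-16; the CJK text is 3 vs 2.
page = '<div class="article"><p title="news">新聞記事の本文</p></div>'
u8 = len(page.encode("utf-8"))
u16 = len(page.encode("utf-16-le"))  # -le so no BOM is counted
print(f"UTF-8: {u8} bytes, UTF-16: {u16} bytes")
```

Even in this tiny fragment, with 47 ASCII characters of markup around 7 CJK
characters, UTF-8 comes out well ahead, and real pages carry far more
markup per character of visible text.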

I assume that text sent over the wire would be similar to the DOM text,
although in fact unparsed text would be even more Latin-1 heavy than these
results show, since tag and attribute names are not stored using
nsTextFragment.

One interesting observation is that scripts (ECMAScript, that is) skew DOM
text towards Latin-1, since script source is mostly ASCII.

We've seen no data showing that UTF-16 is useful in practice on the real Web
... except as a legacy encoding of course.

Rob
-- 
"He was pierced for our transgressions, he was crushed for our iniquities;
the punishment that brought us peace was upon him, and by his wounds we are
healed. We all, like sheep, have gone astray, each of us has turned to his
own way; and the LORD has laid on him the iniquity of us all." [Isaiah
53:5-6]
Received on Tuesday, 3 February 2009 11:00:36 GMT
