- From: Robert O'Callahan <robert@ocallahan.org>
- Date: Tue, 3 Feb 2009 23:59:59 +1300
- To: Robert J Burns <rob@robburns.com>
- Cc: public-i18n-core@w3.org, jonathan@jfkew.plus.com, W3C Style List <www-style@w3.org>
- Message-ID: <11e306600902030259r253f560au26c491cbbbdef665@mail.gmail.com>
On Tue, Feb 3, 2009 at 11:04 PM, Robert J Burns <rob@robburns.com> wrote:

> Again, you're making assumptions that simply don't hold water. For
> documents in languages where UTF-8 requires 3-octets per code point there
> can certainly be enough 1-octet Latin script markup to offset the 3-octets
> in the natural language element content (averaging out to the 2-octets per
> code point of UTF-16), but those would likely be rare documents.
> ...
> Any script beyond U+07FF will take more octets to encode as UTF-8 than
> UTF-16. Regardless, both need to be supported by implementations and the
> savings of using UTF-8 for some documents and UTF-16 for the others isn't
> worth the trouble. For example, for a Chinese website they probably can
> count on most of their documents being more efficient encoded as UTF-16 than
> UTF-8 and the rare exceptions aren't worth looking out for.
>

Sorry, this is a complete tangent, but I can't resist.

At Mozilla we did some measurements last year which showed that, in fact, UTF-8 was a huge size win over UTF-16 for storing the DOM text of a set of "top 20" CJK front pages. Some results are here:
https://bugzilla.mozilla.org/show_bug.cgi?id=416411#c3

The "nsTextFragment" results are measuring the size of character storage for DOM text nodes. UTF-8 uses just over half the storage of UTF-16. (Dave repeated the analysis on some CJK Wikipedia articles, just to check, and got similar results.) "Current" is a scheme where a text node with all characters in the range 0-255 is stored as one byte per character, and all other text nodes are stored in UTF-16 --- UTF-8 beats that handily.

I assume that text sent over the wire would be similar to the DOM text, although in fact unparsed text would be even more Latin-1 heavy than these results show, since tag and attribute names are not stored using nsTextFragment. One interesting observation is that scripts (ECMAScript, that is) are skewing DOM text towards Latin-1.
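As a rough illustration of the three storage schemes compared above, here is a minimal Python sketch. It is not the Mozilla measurement code: the function names and the sample text nodes are made up for illustration, and the "current" scheme is modelled only from the one-sentence description in this message (one byte per character when every character is in the range 0-255, otherwise UTF-16 for the whole node).

```python
from typing import Dict, List


def utf8_size(text: str) -> int:
    """Bytes needed to store the text as UTF-8."""
    return len(text.encode("utf-8"))


def utf16_size(text: str) -> int:
    """Bytes needed to store the text as UTF-16 (no byte-order mark)."""
    return len(text.encode("utf-16-le"))


def current_size(text: str) -> int:
    """Assumed model of the 'current' scheme: one byte per character if
    every code point is in 0-255, otherwise the whole node is UTF-16."""
    if all(ord(ch) <= 0xFF for ch in text):
        return len(text)
    return utf16_size(text)


def total_sizes(nodes: List[str]) -> Dict[str, int]:
    """Sum the storage cost of a list of text nodes under each scheme."""
    return {
        "utf8": sum(utf8_size(n) for n in nodes),
        "utf16": sum(utf16_size(n) for n in nodes),
        "current": sum(current_size(n) for n in nodes),
    }


# Hypothetical text nodes, NOT taken from the measured pages:
# a script that is mostly ASCII with a short CJK literal,
# a pure CJK node, and a pure ASCII node.
SAMPLE_NODES = [
    "var title = '新闻'; document.title = title;",
    "中文新闻",
    "Copyright 2009",
]

if __name__ == "__main__":
    for scheme, size in total_sizes(SAMPLE_NODES).items():
        print(f"{scheme}: {size} bytes")
```

The script node shows why a per-node scheme can lose to UTF-8: a couple of CJK characters push the whole node into UTF-16, doubling the cost of the surrounding ASCII, whereas UTF-8 only pays three bytes for each of those characters.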
We've seen no data showing that UTF-16 is useful in practice on the real Web ... except as a legacy encoding of course.

Rob
-- 
"He was pierced for our transgressions, he was crushed for our iniquities; the punishment that brought us peace was upon him, and by his wounds we are healed. We all, like sheep, have gone astray, each of us has turned to his own way; and the LORD has laid on him the iniquity of us all." [Isaiah 53:5-6]
Received on Tuesday, 3 February 2009 11:00:34 UTC