Re: Reviewed charmod fundamentals from Jon Hanna on 2004-03-08 (www-tag@w3.org from March 2004)

From: Jon Hanna <jon@hackcraft.net>
Date: Mon, 8 Mar 2004 12:17:13 +0000
To: Julian Reschke <julian.reschke@gmx.de>
Cc: Elliotte Rusty Harold <elharo@metalab.unc.edu>, Tim Bray <tbray@textuality.com>, "www-tag@w3.org" <www-tag@w3.org>
Message-ID: <1078748233.404c6449e624a@82.195.128.192>

Quoting Julian Reschke <julian.reschke@gmx.de>:

> As far as I understand, UTF-16 may perform (in terms of size) much 
> better for asian languages, so it seems that it makes a lot of sense if 
> protocols can choose UTF-8 vs UTF-16 based on what makes most sense for 
> the document content.

That is correct. East Asian and Indic text will typically take about 50% more
octets in UTF-8 than in UTF-16 (three octets per character rather than two).
Text in the Latin script will take somewhere in the region of 90%-100% more
octets in UTF-16 than in UTF-8, since ASCII characters take one octet in UTF-8
and two in UTF-16.
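
As a rough illustration of those ratios, here is a small Python sketch. The
sample strings and the helper name encoded_sizes are arbitrary choices for the
example, and "utf-16-le" is used so that no byte order mark is counted.

```python
def encoded_sizes(text):
    """Return (utf8_octets, utf16_octets) for text, without a byte order mark."""
    utf8 = len(text.encode("utf-8"))
    # "utf-16-le" fixes the byte order, so Python does not prepend a BOM.
    utf16 = len(text.encode("utf-16-le"))
    return utf8, utf16


# Hypothetical sample strings, chosen only for illustration.
samples = {
    "English": "Hello, world",
    "Japanese": "こんにちは世界",
    "Hindi": "नमस्ते",
}

for name, text in samples.items():
    utf8, utf16 = encoded_sizes(text)
    print(f"{name}: UTF-8 {utf8} octets, UTF-16 {utf16} octets")
```

Real documents mix scripts with ASCII spaces, digits and markup, so the
figures for actual content will usually sit somewhere between these extremes.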

Compression tends to act as a leveller here, but not a perfect one.
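
One way to check this for a given body of text is to compress both encodings
and compare the results. The sketch below uses zlib from the Python standard
library; the helper name compressed_sizes and the sample text are assumptions
for the example, and how far the gap narrows depends heavily on the content.

```python
import zlib


def compressed_sizes(text):
    """Return (utf8_octets, utf16_octets) after zlib compression of each encoding."""
    utf8 = zlib.compress(text.encode("utf-8"))
    utf16 = zlib.compress(text.encode("utf-16-le"))
    return len(utf8), len(utf16)


# Hypothetical sample: a short phrase repeated many times. Such repetitive
# input compresses far better than real prose, so try this on real documents.
sample = "こんにちは世界。" * 200

raw_utf8 = len(sample.encode("utf-8"))
raw_utf16 = len(sample.encode("utf-16-le"))
zip_utf8, zip_utf16 = compressed_sizes(sample)

print(f"raw:        UTF-8 {raw_utf8}, UTF-16 {raw_utf16}")
print(f"compressed: UTF-8 {zip_utf8}, UTF-16 {zip_utf16}")
```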

So in the case of very large quantities of text the choice of encoding can have
an appreciable impact on download times, and allowing that choice to be made by
the transmitter seems sensible.

That said, there are other encodings capable of directly encoding the entire
Unicode repertoire that are more efficient in terms of stream size, but they are
quite complicated to process, which I would argue outweighs the benefits they
have for download times.

-- 
Jon Hanna
<http://www.hackcraft.net/>
"…it has been truly said that hackers have even more words for
equipment failures than Yiddish has for obnoxious people." - jargon.txt

Received on Monday, 8 March 2004 07:17:15 UTC