- From: Elliotte Rusty Harold <elharo@metalab.unc.edu>
- Date: Mon, 8 Mar 2004 08:42:00 -0500
- To: Jon Hanna <jon@hackcraft.net>
- Cc: Tim Bray <tbray@textuality.com>, "www-tag@w3.org" <www-tag@w3.org>
At 12:17 PM +0000 3/8/04, Jon Hanna wrote:
>That is correct. East-Asian and Indic languages will typically take 50% more
>octets to encode the text in UTF-8 than in UTF-16.
>Languages that use the Latin script will take somewhere in the region of
>90%-100% more octets to encode the same text in UTF-16 than in UTF-8.

In plain, unilingual text, yes. In practice, when working with real-world XML in Asian languages, the gain is not so dramatic. XML documents in any language tend to be full of characters from the ASCII range, like <, >, =, ", &, ;, and the space. In a record-like document with lots of white space for pretty-printing and small field values (remember that Chinese especially is very compressed to start with, since one character equals one word; Japanese only somewhat less so), easily half the text may be ASCII. If the documents use English tag names (say XHTML or DocBook or SOAP) in conjunction with Asian PCDATA, the difference is even smaller.

At one point I experimented with switching between UTF-8 and UTF-16 depending on language, and was surprised to find it really didn't make a big difference.

For one real-world example, I looked at the Japanese translation of the XML specification included in the W3C XML test suite. The UTF-8 version is 202K. The UTF-16 version is 305K, 50% larger!

Of course, this can be highly dependent on the nature of the documents. An originally Japanese document with Japanese markup and no internal DTD subset might reverse these numbers, or at least bring them into parity.

-- 

   Elliotte Rusty Harold
   elharo@metalab.unc.edu
   Effective XML (Addison-Wesley, 2003)
   http://www.cafeconleche.org/books/effectivexml
   http://www.amazon.com/exec/obidos/ISBN%3D0321150406/ref%3Dnosim/cafeaulaitA
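P.S. For anyone who wants to check this kind of figure against their own
documents, here is a minimal sketch in Python. The sample record below is
made up for illustration (it is not the test-suite document); substitute
your own XML text. Note the use of 'utf-16-be' so the two-byte byte order
mark is not counted.

    # Compare the octet counts of the same XML text in UTF-8 and UTF-16.
    # The markup is ASCII; the element content is five Japanese characters.
    # This sample is hypothetical -- replace it with a real document.
    xml = '<item name="greeting">\u3053\u3093\u306b\u3061\u306f</item>\n' * 1000

    utf8_bytes = len(xml.encode('utf-8'))
    utf16_bytes = len(xml.encode('utf-16-be'))  # -be: no BOM in the count

    print('UTF-8: ', utf8_bytes, 'bytes')
    print('UTF-16:', utf16_bytes, 'bytes')
    print('UTF-16/UTF-8 ratio: %.2f' % (utf16_bytes / float(utf8_bytes)))

On a markup-heavy record like this one, the ASCII overhead dominates and
UTF-8 comes out smaller even though every Japanese character costs three
octets instead of two; with long runs of Japanese prose and little markup,
the ratio tips the other way.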
Received on Monday, 8 March 2004 08:47:22 UTC