Re: Reviewed charmod fundamentals

At 12:17 PM +0000 3/8/04, Jon Hanna wrote:

>That is correct. East-Asian and Indic languages will typically take 50% more
>octets to encode the text in UTF-8 than in UTF-16.
>Languages that use the Latin script will take somewhere in the region of
>90%-100% more octets to encode the same text in UTF-16 than in UTF-8.

In plain, unilingual text, yes. In practice, when working with 
real-world XML in Asian languages, the gain is not so dramatic. XML 
documents in any language tend to be full of characters from the 
ASCII range like <, >, =, ", &, ;, and the space. In a record-like 
document with lots of white space for pretty printing and small field 
values (remember that Chinese especially is very compact to start 
with, since a character equals a word; Japanese only somewhat less 
so), easily half the text may be ASCII.
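
To make that concrete, here is a rough Python sketch that counts the 
octets for the same Japanese field values, bare and wrapped in a small 
pretty-printed record. The sample strings and the encoded_sizes helper 
are made up for illustration, and the counts leave out any byte order 
mark.

```python
def encoded_sizes(text):
    """Return (UTF-8 octets, UTF-16 octets) for text, with no BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Hypothetical sample data: the same Japanese characters, first bare,
# then inside a pretty-printed record with English tag names.
plain = "東京都千代田区"
record = (
    '<address type="home">\n'
    '    <city>東京都</city>\n'
    '    <ward>千代田区</ward>\n'
    '</address>\n'
)

for label, text in (("plain", plain), ("record", record)):
    utf8, utf16 = encoded_sizes(text)
    print(f"{label}: UTF-8 {utf8} octets, UTF-16 {utf16} octets")
```

The bare text comes out smaller in UTF-16, but once the ASCII markup 
and indentation are counted the record comes out smaller in UTF-8.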

If the documents use English tag names (say XHTML or DocBook or SOAP) 
in conjunction with Asian PCDATA, the difference is even smaller. At 
one point I experimented with switching between UTF-8 and UTF-16 
depending on language, and was surprised to find it really didn't 
make a big difference. For one real-world example, I looked at the 
Japanese translation of the XML specification included in the W3C XML 
test suite. The UTF-8 version is 202K. The UTF-16 version is 305K, 
50% larger! Of course, this can be highly dependent on the nature of 
the documents. An originally Japanese document with Japanese markup 
and no internal DTD subset might reverse these numbers, or at least 
bring them into parity.
-- 

   Elliotte Rusty Harold
   elharo@metalab.unc.edu
   Effective XML (Addison-Wesley, 2003)
   http://www.cafeconleche.org/books/effectivexml
   http://www.amazon.com/exec/obidos/ISBN%3D0321150406/ref%3Dnosim/cafeaulaitA

Received on Monday, 8 March 2004 08:47:22 UTC