Re: Requiring UTF-8 for XML (was: RE: google sitemaps and some history of sitemaps [siteData-36])

Ivan Herman wrote:
> As far as I know (though not being a Unicode expert), while UTF-8 is 
> probably the right encoding (in terms of size, processing, etc) for 
> documents that use predominantly Latin characters with occasional, say, 
> Chinese characters, it is definitely not true for documents 
> predominantly Chinese or Korean, for example. Those would prefer a 
> UTF-16 encoding, because it is much more efficient. In other words, 
> choosing UTF-8 over UTF-16 in this respect might clearly reflect a 
> linguistic bias... which is against the spirit of W3C recommendations.
> 

If the document were pure Chinese/Japanese/Korean text, that would be 
true. In practice, however, even a CJK XML document contains plenty of 
ASCII characters: <, >, &, space, ", and so on. Moreover, a lot of 
actual modern CJK text includes the digits 0-9 and often entire words 
written in ASCII. The arithmetic matters here: UTF-8 encodes ASCII in 
one byte and BMP CJK characters in three, while UTF-16 spends two bytes 
on both. So even for pure CJK text the saving tops out at one third, 
and once markup and embedded ASCII are factored in it shrinks further; 
for markup-heavy documents UTF-8 can even come out smaller.
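
A quick back-of-the-envelope check in Python makes the point (the 
element name and sample text here are invented purely for 
illustration):

    # UTF-8: 1 byte per ASCII character, 3 per BMP CJK character.
    # UTF-16: 2 bytes for both.
    sample = '<item id="42">\u4e2d\u6587\u306e\u30c6\u30ad\u30b9\u30c8 2005</item>'
    for encoding in ("utf-8", "utf-16-be"):
        print(encoding, len(sample.encode(encoding)), "bytes")
    # utf-8 47 bytes
    # utf-16-be 66 bytes

For this markup-heavy snippet UTF-8 wins outright. A document with far 
more text than markup would tilt the other way, but never by more than 
a third.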

To my way of thinking, the primary advantage of UTF-8 is not size. 
It's that UTF-8 is more compatible with the many existing tools and 
programs that aren't really Unicode-savvy. Because every ASCII byte in 
a UTF-8 stream still means the same ASCII character, you can open a 
UTF-8 document in an ASCII text editor, or process it with a program 
that treats text as single-byte chars, and it won't be completely 
unintelligible. The same cannot be said for UTF-16, where every ASCII 
character is paired with a NUL byte. Of course, this is nowhere near 
perfect, and a real Unicode-aware program would do better. But it's a 
matter of degree.
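
A minimal sketch of that degree, again in Python: a byte-oriented 
search, which is effectively what grep or an ASCII editor does, still 
finds ASCII markup in a UTF-8 stream but not in a UTF-16 one. The 
element name is invented for illustration:

    text = '<name>\u6771\u4eac</name>'
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-be")
    # ASCII bytes appear verbatim in UTF-8...
    print(b'<name>' in utf8)    # True
    # ...but not in UTF-16, where they are interleaved with 0x00 bytes.
    print(b'<name>' in utf16)   # False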

-- 
Elliotte Rusty Harold  elharo@metalab.unc.edu
XML in a Nutshell 3rd Edition Just Published!
http://www.cafeconleche.org/books/xian3/
http://www.amazon.com/exec/obidos/ISBN=0596007647/cafeaulaitA/ref=nosim

Received on Sunday, 12 June 2005 11:19:45 UTC