Ivan Herman wrote:

> As far as I know (though not being a Unicode expert), while UTF-8 is
> probably the right encoding (in terms of size, processing, etc) for
> documents that use predominantly Latin characters with occasional, say,
> Chinese characters, it is definitely not true for documents
> predominantly Chinese or Korean, for example. Those would prefer a
> UTF-16 encoding, because it is much more efficient. In other words,
> choosing UTF-8 over UTF-16 in this respect might clearly reflect a
> linguistic bias... which is against the spirit of W3C recommendations.
>

If the document were pure Chinese/Japanese/Korean text, that would be
true. However, in practice even a CJK XML document is likely to contain
lots of ASCII characters: <, >, &, space, ", etc. Plus, a lot of actual
modern CJK text includes the digits 0-9 and often even entire words
written in ASCII. How much space UTF-16 saves is likely to vary from one
document to the next, but it's very unlikely to be 50%, and may be
negligible in some cases.

To my way of thinking, the primary advantage of UTF-8 is not size. It's
that UTF-8 is more compatible with many existing tools and programs that
aren't really Unicode-savvy. You can open up UTF-8 in an ASCII text
editor, or process it with a program that treats all text as chars, and
it won't be completely unintelligible. The same cannot be said for
UTF-16. Of course, this isn't even close to perfect. A real
Unicode-aware program would do better. But it's a matter of degree.

-- 
Elliotte Rusty Harold  elharo@metalab.unc.edu
XML in a Nutshell 3rd Edition Just Published!
http://www.cafeconleche.org/books/xian3/
http://www.amazon.com/exec/obidos/ISBN=0596007647/cafeaulaitA/ref=nosim

Received on Sunday, 12 June 2005 11:19:45 GMT
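
The two points in the message above (size depends on how much ASCII is mixed in, and UTF-8 stays partly readable to ASCII-only tools) can be checked with a rough sketch like the one below. The sample strings and the helper names `encoded_sizes` and `ascii_view` are made up for illustration; they are not from the original thread, and real documents will give different ratios.

```python
def encoded_sizes(text: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count without a BOM)."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


def ascii_view(data: bytes) -> str:
    """Roughly what an ASCII-only tool would see: non-ASCII bytes become U+FFFD."""
    return data.decode("ascii", errors="replace")


# Hypothetical samples, invented for this illustration only.
pure_cjk = "統一碼是一種字元編碼標準"
marked_up = '<item id="42" lang="zh">統一碼 2005</item>'

for label, sample in (("pure CJK", pure_cjk), ("CJK in XML", marked_up)):
    utf8, utf16 = encoded_sizes(sample)
    print(f"{label}: UTF-8 = {utf8} bytes, UTF-16 = {utf16} bytes")

# In UTF-8 the markup bytes are plain ASCII, so the tags survive intact.
utf8_seen = ascii_view(marked_up.encode("utf-8"))
# In UTF-16 every ASCII character is followed by a NUL byte, which breaks
# simple byte-oriented matching on the tag names.
utf16_seen = ascii_view(marked_up.encode("utf-16-le"))
print('<item id="42"' in utf8_seen, '<item id="42"' in utf16_seen)
```

With these particular samples UTF-16 is smaller for the pure CJK string, while UTF-8 is smaller for the short marked-up one, which is the "varies from one document to the next" effect described above.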