- From: Elliotte Harold <elharo@metalab.unc.edu>
- Date: Sun, 12 Jun 2005 07:19:40 -0400
- To: Ivan Herman <ivan@w3.org>
- CC: noah_mendelsohn@us.ibm.com, Dare Obasanjo <dareo@microsoft.com>, Dan Connolly <connolly@w3.org>, www-tag@w3.org, Paul Grosso <pgrosso@arbortext.com>, adamb@google.com
Ivan Herman wrote:

> As far as I know (though not being a Unicode expert), while UTF-8 is
> probably the right encoding (in terms of size, processing, etc) for
> documents that use predominantly Latin characters with occasional, say,
> Chinese characters, it is definitely not true for documents
> predominantly Chinese or Korean, for example. Those would prefer a
> UTF-16 encoding, because it is much more efficient. In other words,
> choosing UTF-8 over UTF-16 in this respect might clearly reflects a
> linguistic bias... which is against the spirit of W3C recommendations.
>

If the document were pure Chinese/Japanese/Korean text that would be true. However in practice even a CJK XML document is likely to contain lots of ASCII characters: <, >, &, space, ", etc. Plus a lot of actual modern CJK text includes the digits 0-9 and often even entire words written in ASCII. How much space UTF-16 saves is likely to vary from one document to the next, but it's very unlikely to be 50%, and may be negligible in some cases.

To my way of thinking, the primary advantage of UTF-8 is not size. It's that UTF-8 is more compatible with many existing tools and programs that aren't really Unicode savvy. You can open up UTF-8 in an ASCII text editor or process it with a program that treats all text as chars and it won't be completely unintelligible. The same cannot be said for UTF-16.

Of course, this isn't even close to perfect. A real Unicode aware program would do better. But it's a matter of degree.

--
Elliotte Rusty Harold  elharo@metalab.unc.edu
XML in a Nutshell 3rd Edition Just Published!
http://www.cafeconleche.org/books/xian3/
http://www.amazon.com/exec/obidos/ISBN=0596007647/cafeaulaitA/ref=nosim
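
The size argument above can be checked on any given document by encoding it both ways and comparing byte counts. What follows is only a rough sketch in Python: the sample XML fragment and the function name encoded_sizes are made up for illustration and do not come from the thread. It relies on the fact that a CJK character in the Basic Multilingual Plane takes three bytes in UTF-8 and two in UTF-16, while an ASCII character takes one byte in UTF-8 and two in UTF-16.

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, with no byte order mark."""
    # "utf-16-le" is used so that Python does not prepend a 2-byte BOM,
    # which would otherwise skew the comparison for short strings.
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Hypothetical sample: a tiny XML fragment with Chinese text content.
sample = '<book id="42"><title>红楼梦</title><author>曹雪芹</author></book>'

# Pure CJK text with no markup at all, for comparison.
pure_cjk = "红楼梦"

for label, text in [("XML with CJK content", sample), ("pure CJK", pure_cjk)]:
    utf8_size, utf16_size = encoded_sizes(text)
    print(f"{label}: UTF-8 = {utf8_size} bytes, UTF-16 = {utf16_size} bytes")
```

For the pure CJK string UTF-16 comes out a third smaller (6 bytes against 9), and for the markup-heavy fragment UTF-8 is the smaller of the two (71 bytes against 118). Real documents will fall somewhere in between depending on their ratio of markup to text.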
Received on Sunday, 12 June 2005 11:19:45 UTC