
Re: Requiring UTF-8 for XML (was: RE: google sitemaps and some history of sitemaps [siteData-36])

From: Elliotte Harold <elharo@metalab.unc.edu>
Date: Sun, 12 Jun 2005 07:19:40 -0400
Message-ID: <42AC1A4C.30807@metalab.unc.edu>
To: Ivan Herman <ivan@w3.org>
CC: noah_mendelsohn@us.ibm.com, Dare Obasanjo <dareo@microsoft.com>, Dan Connolly <connolly@w3.org>, www-tag@w3.org, Paul Grosso <pgrosso@arbortext.com>, adamb@google.com

Ivan Herman wrote:
> As far as I know (though not being a Unicode expert), while UTF-8 is 
> probably the right encoding (in terms of size, processing, etc) for 
> documents that use predominantly Latin characters with occasional, say, 
> Chinese characters, it is definitely not true for documents 
> predominantly Chinese or Korean, for example. Those would prefer a 
> UTF-16 encoding, because it is much more efficient. In other words, 
> choosing UTF-8 over UTF-16 in this respect might clearly reflect a 
> linguistic bias... which is against the spirit of W3C recommendations.

If the document were pure Chinese/Japanese/Korean text, that would be 
true. However, in practice even a CJK XML document is likely to contain 
lots of ASCII characters: <, >, &, space, ", etc. Plus, a lot of actual 
modern CJK text includes the digits 0-9 and often even entire words 
written in ASCII. How much space UTF-16 saves is likely to vary from one 
document to the next, but it's very unlikely to be 50%, and may be 
negligible in some cases.
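
A rough way to check this on a given document is to encode it both ways 
and count bytes. The sketch below is only an illustration: the sample 
fragment and the helper name encoded_sizes are made up for this example, 
and real documents will give different ratios. It relies on the fact 
that CJK characters in the Basic Multilingual Plane take three bytes in 
UTF-8 and two in UTF-16, while ASCII takes one byte in UTF-8 and two in 
UTF-16.

```python
# Hypothetical sample: a small XML fragment whose text content is Chinese.
sample = '<book id="42"><title>红楼梦</title><author>曹雪芹</author></book>'


def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, with no byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Pure CJK text: UTF-16 is the smaller encoding.
pure_utf8, pure_utf16 = encoded_sizes("曹雪芹")

# CJK text wrapped in ASCII markup: the markup doubles in size under UTF-16.
xml_utf8, xml_utf16 = encoded_sizes(sample)

print("pure CJK:", pure_utf8, "bytes UTF-8,", pure_utf16, "bytes UTF-16")
print("CJK XML: ", xml_utf8, "bytes UTF-8,", xml_utf16, "bytes UTF-16")
```

For markup-heavy fragments like this one the ASCII overhead can outweigh 
the savings on the CJK characters; a document with long runs of CJK text 
and little markup would tip the other way.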

To my way of thinking, the primary advantage of UTF-8 is not size. It's 
that UTF-8 is more compatible with many existing tools and programs that 
aren't really Unicode-savvy. You can open up UTF-8 in an ASCII text 
editor or process it with a program that treats all text as chars, and it 
won't be completely unintelligible. The same cannot be said for UTF-16. 
Of course, this isn't even close to perfect. A truly Unicode-aware 
program would do better. But it's a matter of degree.
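
To make the compatibility point concrete, here is a minimal sketch that 
mimics a tool with no Unicode support. The function naive_view and the 
sample string are hypothetical, invented for this illustration: the 
function treats every byte as one character and shows anything outside 
printable ASCII as "?".

```python
# Hypothetical sample: one XML element with Chinese text content.
sample = "<title>红楼梦</title>"


def naive_view(data):
    """Render bytes the way a byte-per-char ASCII tool might.

    Printable ASCII bytes are shown as-is; every other byte becomes '?'.
    """
    return "".join(chr(b) if 32 <= b < 127 else "?" for b in data)


utf8_bytes = sample.encode("utf-8")
utf16_bytes = sample.encode("utf-16-le")

# In UTF-8 the markup survives intact; only the CJK bytes are mangled.
print(naive_view(utf8_bytes))

# In UTF-16 every ASCII character is paired with a zero byte, so the
# tags are broken up and the zero bytes may confuse C-style string code.
print(naive_view(utf16_bytes))
print("zero bytes in UTF-16 form:", utf16_bytes.count(0))
```

The UTF-8 view still shows the tags, so a search for "<title>" would 
succeed, whereas the same byte-level search over the UTF-16 form would 
not.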

Elliotte Rusty Harold  elharo@metalab.unc.edu
XML in a Nutshell 3rd Edition Just Published!
Received on Sunday, 12 June 2005 11:19:45 UTC
