Re: Requiring UTF-8 for XML (was: RE: google sitemaps and some history of sitemaps [siteData-36])

From: Elliotte Harold <elharo@metalab.unc.edu>
Date: Sun, 12 Jun 2005 08:21:50 -0400
Message-ID: <42AC28DE.8000400@metalab.unc.edu>
To: www-tag@w3.org

UTF-8 has another important advantage over UTF-16, irrespective of size
issues: in UTF-8 you always know where you are. Given a single byte, you
can immediately determine whether it is a single-byte character, the
first byte of a two-, three-, or four-byte character, or a continuation
byte somewhere inside a multi-byte character. In UTF-16, you don't
always know that the byte 0x41 is indeed the letter A. Sometimes it is
and sometimes it isn't. You have to keep track of enough state to know
where you are in the stream. If a single byte gets lost from a UTF-16
stream, all data from that point forward is corrupted, at least until
another byte is lost. In UTF-8, the damage stops at the next byte that
starts a character.
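
To make the point concrete, here is a minimal Python sketch, not part of
the original message. It classifies UTF-8 bytes by their high bits and
then drops one byte from the same text in each encoding. The function
name classify_utf8_byte, the sample string, and the choice of big-endian
UTF-16 without a byte order mark are all assumptions made for
illustration.

```python
def classify_utf8_byte(b: int) -> str:
    """Classify a single byte using only its high bits (UTF-8 bit layout)."""
    if b < 0x80:
        return "single"        # 0xxxxxxx: a complete one-byte character
    if b < 0xC0:
        return "continuation"  # 10xxxxxx: a non-first byte of a character
    if b < 0xE0:
        return "lead2"         # 110xxxxx: first byte of a two-byte character
    if b < 0xF0:
        return "lead3"         # 1110xxxx: first byte of a three-byte character
    if b < 0xF8:
        return "lead4"         # 11110xxx: first byte of a four-byte character
    return "invalid"           # 11111xxx never appears in UTF-8


# Sample text (hypothetical): A, e-acute, euro sign, B.
text = "A\u00e9\u20acB"

utf8 = text.encode("utf-8")          # 41 | C3 A9 | E2 82 AC | 42
kinds = [classify_utf8_byte(b) for b in utf8]

# Lose one byte in the middle of the e-acute in each encoding.
utf8_damaged = utf8[:2] + utf8[3:]
utf8_recovered = utf8_damaged.decode("utf-8", errors="replace")

utf16 = text.encode("utf-16-be")     # 00 41 | 00 E9 | 20 AC | 00 42
utf16_damaged = utf16[:2] + utf16[3:]
utf16_recovered = utf16_damaged.decode("utf-16-be", errors="replace")

# The byte 0x41 is not always the letter A in UTF-16: U+0141 is 01 41.
stray_41 = "\u0141".encode("utf-16-be")

print(kinds)
print(repr(utf8_recovered))   # only the damaged character is replaced
print(repr(utf16_recovered))  # everything after the lost byte is shifted
```

In the UTF-8 case the decoder resynchronizes at the lead byte of the
euro sign, so only the damaged character is lost. In the UTF-16 case
every code unit after the lost byte is misaligned, so neither the euro
sign nor the trailing B survives.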

-- 
Elliotte Rusty Harold  elharo@metalab.unc.edu
XML in a Nutshell 3rd Edition Just Published!
http://www.cafeconleche.org/books/xian3/
http://www.amazon.com/exec/obidos/ISBN=0596007647/cafeaulaitA/ref=nosim
Received on Sunday, 12 June 2005 12:21:56 GMT
