Re: Requiring UTF-8 for XML (was: RE: google sitemaps and some history of sitemaps [siteData-36]) from Elliotte Harold on 2005-06-12 (www-tag@w3.org from June 2005)

From: Elliotte Harold <elharo@metalab.unc.edu>
Date: Sun, 12 Jun 2005 08:21:50 -0400
To: www-tag@w3.org
Message-ID: <42AC28DE.8000400@metalab.unc.edu>

Another important advantage of UTF-8 over UTF-16, irrespective of size
issues: in UTF-8 you always know where you are. That is, given a single
byte you can immediately determine if it is a single-byte character, the
first byte of a two-, three-, or four-byte character, or a continuation
byte somewhere inside a multi-byte character. (A continuation byte does
not tell you its exact position in the sequence, but you never have to
back up more than three bytes to find the start of the character.) In
UTF-16, you don't always know that the byte 0x41 is indeed the letter A.
Sometimes it is and sometimes it isn't. You have to keep track of enough
state to know where you are in the stream. If a single byte gets lost,
all data from that point forward is corrupted, at least until another
byte is lost. In UTF-8, a lost byte damages only the one character it
belonged to.
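
To make the point concrete, here is a rough Python sketch, not anything
from the XML or Unicode specs. The names classify_utf8_byte and drop_byte
are made up for this illustration, and the sample string is arbitrary.
The classifier looks only at the high bits of one byte; it does not
check for overlong or otherwise illegal sequences.

```python
def classify_utf8_byte(b: int) -> str:
    """Classify one byte by its high bits alone, with no other context."""
    if b < 0x80:
        return "single"        # 0xxxxxxx: a complete one-byte character
    if b < 0xC0:
        return "continuation"  # 10xxxxxx: inside a multi-byte character
    if b < 0xE0:
        return "lead2"         # 110xxxxx: starts a two-byte character
    if b < 0xF0:
        return "lead3"         # 1110xxxx: starts a three-byte character
    if b < 0xF8:
        return "lead4"         # 11110xxx: starts a four-byte character
    return "invalid"           # 11111xxx never appears in UTF-8


def drop_byte(data: bytes, index: int) -> bytes:
    """Simulate losing a single byte in transit (hypothetical helper)."""
    return data[:index] + data[index + 1:]


# The byte 0x41 shows up in the UTF-16 encoding of U+4100, which is not 'A'.
# In UTF-8 the same character never contains 0x41.
byte_41_in_utf16 = 0x41 in "\u4100".encode("utf-16-be")
byte_41_in_utf8 = 0x41 in "\u4100".encode("utf-8")

# Lose one byte from each encoding of the same text and decode what is left.
text = "A\u00e9\u20acBCD"
utf8_result = drop_byte(text.encode("utf-8"), 1).decode("utf-8", errors="replace")
utf16_result = drop_byte(text.encode("utf-16-be"), 1).decode("utf-16-be", errors="replace")

print(repr(utf8_result))
print(repr(utf16_result))
```

In the UTF-8 case only the character that lost its byte is replaced and
the rest of the text decodes normally. In the UTF-16 case every code
unit after the gap is misaligned, so none of the trailing text survives.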

-- 
Elliotte Rusty Harold  elharo@metalab.unc.edu
XML in a Nutshell 3rd Edition Just Published!
http://www.cafeconleche.org/books/xian3/
http://www.amazon.com/exec/obidos/ISBN=0596007647/cafeaulaitA/ref=nosim

Received on Sunday, 12 June 2005 12:21:56 UTC