RE: BOCU-1, SCSU, etc.

Henri Sivonen wrote:
> It is possible, but I think that developing such encodings is 
> the wrong thing to do. UTF-8 can express all Unicode 
> characters, so new encodings will be incompatible with 
> existing software with no improvements in Unicode expressiveness.

UTF-8 is only efficient for European languages; for most non-European
scripts it spends three bytes per character. BOCU-1 and SCSU offer
significant savings there:
http://unicode.org/notes/tn6/tn6-1.html. UTF-8's design forces the
people of the world with the least money to use the most network
bandwidth and storage space.
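
To make that concrete, here is a rough sketch using only Python's
standard library; it has no SCSU or BOCU-1 codec, so the measured
comparison is UTF-8 against UTF-16, and the sample strings are
arbitrary:

# Byte-count comparison per script; sample strings are arbitrary.
samples = {
    "Thai":       "ภาษาไทย",   # every character is in U+0E00..U+0E7F
    "Devanagari": "हिन्दी",     # every character is in U+0900..U+097F
    "English":    "language",
}

for name, text in samples.items():
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-le")  # BOM omitted; raw payload only
    print(f"{name:10s} chars={len(text):2d}  "
          f"utf-8={len(utf8):2d} bytes  utf-16={len(utf16):2d} bytes")

UTF-8 spends three bytes on every Thai or Devanagari letter, and the
UTN #6 figures linked above show SCSU and BOCU-1 bringing runs of such
text down to roughly one byte per character.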

> >> However, encoding proliferation is a problem.

If BOCU-1 and/or SCSU were more widely supported, then legacy encodings
like TIS-620 (a single-byte encoding for Thai) could reasonably fade away.
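
For example (another rough sketch; Python ships a TIS-620 codec, and
the Thai phrase below is an arbitrary sample), the same text costs
three times as many bytes in UTF-8 as in the legacy charset:

# Arbitrary Thai sample; every character is in the U+0E00 block,
# so it round-trips through the single-byte TIS-620 charset.
text = "สวัสดีครับ"

tis620 = text.encode("tis-620")
utf8 = text.encode("utf-8")

print(f"characters: {len(text)}")
print(f"TIS-620:    {len(tis620)} bytes  (1 byte per character)")
print(f"UTF-8:      {len(utf8)} bytes  (3 bytes per character)")

SCSU and BOCU-1 encode a run of Thai at close to the TIS-620 rate while
still covering all of Unicode, which is what would let the legacy
charset retire without tripling the payload.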

> Developers are free to waste their time on encodings when 
> they do things in the RAM space of their own applications. 
> Communications on the public Web affect other people, so 
> developers who implement pointless stuff waste the time of 
> other developers as well when they need to interoperate with 
> the pointlessness.

Encodings that offer savings over UTF-8 are not a waste of time.

> > But for some scripts and applications UTF-32 could be more straight 
> > forward than UTF-16.
> 
> In some cases UTF-32 might be preferable in RAM. UTF-32 is 
> never preferable as an encoding for transferring over the 
> network. HTML5 encoded as UTF-8 is *always* more compact than 
> the same document encoded as UTF-16 or UTF-32 regardless of 
> the script of the content.  

UTF-8 is significantly less compact than SCSU or BOCU-1 for most
people's native languages.

- Brian

Received on Sunday, 27 January 2008 03:36:27 UTC