Re: BOCU-1, SCSU, etc.

Disclaimer: Still not an official WG response.

On Jan 27, 2008, at 05:36, Brian Smith wrote:

> Henri Sivonen wrote:
>> It is possible, but I think that developing such encodings is
>> the wrong thing to do. UTF-8 can express all Unicode
>> characters, so new encodings will be incompatible with
>> existing software with no improvements in Unicode expressiveness.
>
> UTF-8 is only efficient for European languages. For non-European
> languages, BOCU and SCSU offer a significant savings:
> http://unicode.org/notes/tn6/tn6-1.html. UTF-8's design forces the
> people of the world with the least money to use the most network
> bandwidth and storage space.

It's a common mistake to compare the compactness of Unicode-specific
compression schemes against uncompressed UTF-8. The fair comparison is
against compressed UTF-8: encoding HTML5 as UTF-8 and compressing the
result with gzip at the HTTP layer already works and is
backwards-compatible.
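
As a rough illustration of the gzip point, here is a minimal Python
sketch that encodes a small page as UTF-8 and gzips it. The sample
markup and Thai text are made up for this example, and the repetition
flatters gzip, so treat the numbers as an illustration rather than a
measurement. The Python standard library has no BOCU-1 or SCSU codec,
so this only shows the UTF-8 side of the comparison.

```python
import gzip

# Hypothetical sample page: ASCII markup around Thai text, repeated to
# mimic a page with many similar paragraphs (this favors gzip).
paragraph = '<p class="greeting">สวัสดีครับ ภาษาไทย</p>\n'
document = "<!DOCTYPE html>\n<title>ตัวอย่าง</title>\n" + paragraph * 50

raw = document.encode("utf-8")
compressed = gzip.compress(raw)

print("UTF-8, uncompressed:", len(raw), "bytes")
print("UTF-8, gzipped:     ", len(compressed), "bytes")
```

Real pages are less repetitive than this, so the ratio will be smaller
in practice.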

BOCU-1 and SCSU are not supported by current browsers, so even if  
support were added in the future, using them would force the people  
with the least money to upgrade their systems the soonest (presumably  
at non-trivial cost).

>>>> However, encoding proliferation is a problem.
>
> If BOCU and/or SCSU were more widely supported, then legacy encodings
> like TIS-620 (Thai encoded as single bytes) could reasonably fade  
> away.

Legacy content on the Web never fades away enough for support to be
dropped, so UAs will always have to support TIS-620. If you are
concerned that gzipped UTF-8 isn't compact enough for Thai, gzipped
TIS-620 will always be more compatible than BOCU-1 or SCSU (given that
legacy software never fully fades away, either).
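
For anyone who wants to check the Thai case themselves, a sketch along
these lines compares UTF-8 and TIS-620, both raw and gzipped. The
sample content is invented for the example, and the gzipped sizes
depend heavily on the input, so I make no claim here about which
gzipped form comes out smaller on real pages.

```python
import gzip

# Made-up sample content: simple markup around Thai text.
thai_page = "<p>สวัสดีครับ ยินดีต้อนรับ</p>\n" * 50

# Map each encoding name to (raw size, gzipped size) in bytes.
sizes = {}
for encoding in ("utf-8", "tis-620"):
    raw = thai_page.encode(encoding)
    sizes[encoding] = (len(raw), len(gzip.compress(raw)))

for encoding, (raw_size, gz_size) in sizes.items():
    print(f"{encoding}: {raw_size} bytes raw, {gz_size} bytes gzipped")
```

Uncompressed, TIS-620 is smaller because each Thai character takes one
byte instead of three. The interesting number is how much of that gap
remains after gzip.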

>> Developers are free to waste their time on encodings when
>> they do things in the RAM space of their own applications.
>> Communications on the public Web affect other people, so
>> developers who implement pointless stuff waste the time of
>> other developers as well when they need to interoperate with
>> the pointlessness.
>
> Encodings that offer savings over UTF-8 are not a waste of time.

The waste-of-time comment was specifically about UTF-32.
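
To make the UTF-32 point concrete: UTF-32 spends four bytes on every
code point, while UTF-8 spends at most four and only one on ASCII
markup, so UTF-32 can never come out smaller. A small sketch, with a
made-up sample document:

```python
# Hypothetical sample document: ASCII markup plus a little Thai text.
document = "<!DOCTYPE html>\n<title>ตัวอย่าง</title>\n<p>สวัสดีครับ</p>\n"

# The -le variants are used so that no byte order mark is counted.
sizes = {
    name: len(document.encode(name))
    for name in ("utf-8", "utf-16-le", "utf-32-le")
}

for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```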

>>> But for some scripts and applications UTF-32 could be more
>>> straightforward than UTF-16.
>>
>> In some cases UTF-32 might be preferable in RAM. UTF-32 is
>> never preferable as an encoding for transferring over the
>> network. HTML5 encoded as UTF-8 is *always* more compact than
>> the same document encoded as UTF-16 or UTF-32 regardless of
>> the script of the content.
>
> UTF-8 is significantly less compact than SCSU/BOCU for most people's
> native languages.


For such arguments, gzip should always be taken into account, and the
compatibility benefits of UTF-8 + gzip should be appreciated.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Sunday, 27 January 2008 09:45:39 UTC