Re: BOCU-1, SCSU, etc.

Disclaimer: Still not official WG response.

On Jan 27, 2008, at 20:12, Brian Smith wrote:

> Restrictions on the encoding of non-European languages is not something
> that should be decided by people in Europe and the Americas. Since all
> the WHATWG members are European or American, and the W3C HTML 5 working
> group is almost entirely comprised of Westerners, the users of
> non-European languages are not being adequately represented. It seems
> the best that we can do is to avoid making arguments about the
> compactness of UTF-8 that only apply to our languages. In particular, we
> cannot argue that UTF-8 has any compactness advantage because that is
> something that is not generally true.

I strongly reject the notion that being European or American makes one  
unqualified to assess a quantifiable matter such as the compactness of  
an encoding.

The most common claims about the non-compactness of UTF-8 turn out to  
be false when measured.

  * Typical Web pages (news sites, Wikipedia pages) contain so much  
markup that UTF-16 is not more compact than UTF-8 even for languages  
whose characters take 3 bytes in UTF-8.
  * Compared to UTF-8, BOCU-1 cannot even halve the size of such an
Asian-language HTML page, which makes BOCU-1 alone much less compact
than UTF-8 plus gzip.
  * Compared to gzipping UTF-8 (as applied to such an Asian-language
HTML page), encoding as BOCU-1 and then gzipping gains only 0 to 2
percentage points, which is a negligible compactness benefit compared
to the installed-base disadvantage (ubiquity vs. nothingness).
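
For anyone who wants to repeat this kind of measurement, here is a minimal
sketch in Python. It only covers the UTF-8 vs. UTF-16 vs. gzip part, because
the Python standard library has no BOCU-1 or SCSU codec. The function name
encoded_sizes and the sample markup are hypothetical, made up for
illustration. The synthetic page is highly repetitive, so its gzip ratio is
far better than a real page would get; feed in a saved real page for
meaningful numbers.

```python
import gzip


def encoded_sizes(text):
    """Return byte counts for a page under a few encoding choices."""
    utf8 = text.encode("utf-8")
    # "utf-16-le" is used so that no byte order mark is counted.
    utf16 = text.encode("utf-16-le")
    return {
        "utf-8": len(utf8),
        "utf-16": len(utf16),
        "utf-8+gzip": len(gzip.compress(utf8)),
        "utf-16+gzip": len(gzip.compress(utf16)),
    }


# Hypothetical sample: list markup wrapped around short Japanese text.
# Every markup character is ASCII (1 byte in UTF-8, 2 in UTF-16), while
# each Japanese character takes 3 bytes in UTF-8 and 2 in UTF-16.
SAMPLE_TEXT = "日本語のテキスト"
SAMPLE_ROW = '<li class="item"><a href="/wiki/page">' + SAMPLE_TEXT + "</a></li>\n"
SAMPLE_PAGE = (
    "<!DOCTYPE html>\n<html><body><ul>\n"
    + SAMPLE_ROW * 200
    + "</ul></body></html>\n"
)

if __name__ == "__main__":
    for label, size in encoded_sizes(SAMPLE_PAGE).items():
        print(f"{label:12} {size:8d} bytes")
```

To measure a real page, read the saved file into a string and pass it to
encoded_sizes() in place of SAMPLE_PAGE.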

> But, if some group of users prefers to use a Unicode encoding
> optimized for their language, instead of GZIP, then that is their
> prerogative.

The network effects of Web-facing software are global, as are the
implementation concerns. It isn't a private matter.

> Right now, there are a lot of systems where it is cheaper/faster to
> implement SCSU-like encodings than it is to implement UTF-8+gzip,
> because gzip is expensive. J2ME is one example that is currently widely
> deployed.

J2ME HTML5 UAs are most likely to use the Opera Mini architecture, in
which case the origin server doesn't talk to the J2ME thin client, so
the point would be moot even if gzip were prohibitively expensive on
J2ME.

> XML has been very successful in how it has handled encodings. Any
> statements that go restrict what the XML 1.0 specification says are
> unwarranted. The specification just needs to recommend UTF-8 because it
> is the most interoperable Unicode-capable encoding we have today.


It would be inappropriate for HTML 5 to ban what XML 1.0 requires  
(UTF-8 and UTF-16). However, restricting the open-endedness of XML  
encodings can only be good for interop.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Sunday, 27 January 2008 21:09:12 UTC