RE: BOCU-1, SCSU, etc. from Brian Smith on 2008-01-27 (public-html-comments@w3.org from January 2008)

From: Brian Smith <brian@briansmith.org>
Date: Sun, 27 Jan 2008 10:12:06 -0800
To: <public-html-comments@w3.org>
Message-ID: <001801c86110$23a07b30$0401a8c0@T60>
Henri Sivonen wrote:
> Brian Smith wrote:
> > Henri Sivonen wrote:
> > UTF-8 is only efficient for European languages. For non-European 
> > languages, BOCU and SCSU offer a significant savings:
> > http://unicode.org/notes/tn6/tn6-1.html. UTF-8's design forces the 
> > people of the world with the least money to use the most network 
> > bandwidth and storage space.
> 
> It's a common mistake to compare compactness of 
> Unicode-specific compression schemes against uncompressed 
> UTF-8. Encoding HTML5 as
> UTF-8 and compressing the result using gzip on the HTTP layer 
> already works and is backwards-compatible.
> 
> BOCU-1 and SCSU are not supported by current browsers, so 
> even if support were added in the future, using them would 
> force the people with the least money to upgrade their 
> systems the soonest (presumably at non-trivial cost).

Restrictions on the encoding of non-European languages are not something
that should be decided by people in Europe and the Americas. Since all
the WHATWG members are European or American, and the W3C HTML 5 working
group is almost entirely composed of Westerners, the users of
non-European languages are not being adequately represented. It seems
the best that we can do is to avoid making arguments about the
compactness of UTF-8 that only apply to our languages. In particular, we
cannot argue that UTF-8 has any compactness advantage because that is
something that is not generally true.

> >> In some cases UTF-32 might be preferable in RAM. UTF-32 is never 
> >> preferable as an encoding for transferring over the network. HTML5 
> >> encoded as UTF-8 is *always* more compact than the same document 
> >> encoded as UTF-16 or UTF-32 regardless of the script of 
> >> the content.
> >
> > UTF-8 is significantly less compact than SCSU/BOCU for most peoples'
> > native languages.
> 
> For such arguments, gzip should always be considered and the 
> compatibility benefits of UTF-8 + gzip be appreciated.

I agree. But, if some group of users prefers to use a Unicode encoding
optimized for their language, instead of GZIP, then that is their
prerogative. At the very least, the part of the specification that
recommends against using encodings optimized for Asian languages should
be reviewed/written by the people most directly affected by it.
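
For anyone who wants to check size claims like these against their own
text, here is a rough sketch using only the Python standard library. It
compares UTF-8, UTF-16, UTF-32, and gzipped UTF-8. The function name and
the sample strings are made up for illustration, and SCSU and BOCU-1 are
left out because the standard library has no codec for them.

```python
import gzip


def encoded_sizes(text: str) -> dict:
    """Return the byte length of `text` under several encodings.

    Hypothetical helper for illustration only. The "-le" codecs are used
    so that no byte order mark is counted in the totals.
    """
    utf8 = text.encode("utf-8")
    return {
        "utf-8": len(utf8),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
        "utf-8+gzip": len(gzip.compress(utf8)),
    }


# Assumed sample data: a short Devanagari phrase and a short English
# phrase, each repeated. Repetition flatters gzip, so real documents
# should be measured before drawing conclusions.
SAMPLE_DEVANAGARI = "\u0928\u092e\u0938\u094d\u0924\u0947 " * 200
SAMPLE_ENGLISH = "Hello, world " * 200

if __name__ == "__main__":
    for name, text in [("devanagari", SAMPLE_DEVANAGARI),
                       ("english", SAMPLE_ENGLISH)]:
        print(name, encoded_sizes(text))
```

Text without markup shows the script-dependent difference between
UTF-8 and UTF-16 most clearly; adding HTML tags shifts the totals toward
UTF-8 because the markup itself is ASCII.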

Right now, there are a lot of systems where it is cheaper/faster to
implement SCSU-like encodings than it is to implement UTF-8+gzip,
because gzip is expensive. J2ME is one example that is currently widely
deployed.

XML has been very successful in how it has handled encodings. Any
statements that restrict encodings beyond what the XML 1.0 specification says are
unwarranted. The specification just needs to recommend UTF-8 because it
is the most interoperable Unicode-capable encoding we have today. 

- Brian
Received on Sunday, 27 January 2008 18:12:19 UTC