- From: Brian Smith <brian@briansmith.org>
- Date: Sun, 27 Jan 2008 10:12:06 -0800
- To: <public-html-comments@w3.org>
Henri Sivonen wrote:
> Brian Smith wrote:
> > Henri Sivonen wrote:
> > UTF-8 is only efficient for European languages. For non-European
> > languages, BOCU and SCSU offer a significant savings:
> > http://unicode.org/notes/tn6/tn6-1.html. UTF-8's design forces the
> > people of the world with the least money to use the most network
> > bandwidth and storage space.
>
> It's a common mistake to compare compactness of
> Unicode-specific compression schemes against uncompressed
> UTF-8. Encoding HTML5 as UTF-8 and compressing the result
> using gzip on the HTTP layer already works and is
> backwards-compatible.
>
> BOCU-1 and SCSU are not supported by current browsers, so
> even if support were added in the future, using them would
> force the people with the least money to upgrade their
> systems the soonest (presumably at non-trivial cost).

Restrictions on the encoding of non-European languages are not something that should be decided by people in Europe and the Americas. Since all the WHATWG members are European or American, and the W3C HTML 5 working group is composed almost entirely of Westerners, the users of non-European languages are not being adequately represented. It seems the best that we can do is to avoid making arguments about the compactness of UTF-8 that only apply to our languages. In particular, we cannot argue that UTF-8 has a compactness advantage, because that is not generally true.

> >> In some cases UTF-32 might be preferable in RAM. UTF-32 is never
> >> preferable as an encoding for transferring over the network. HTML5
> >> encoded as UTF-8 is *always* more compact than the same document
> >> encoded as UTF-16 or UTF-32 regardless of the script of
> >> the content.
> >
> > UTF-8 is significantly less compact than SCSU/BOCU for most peoples'
> > native languages.
>
> For such arguments, gzip should always be considered and the
> compatibility benefits of UTF-8 + gzip be appreciated.

I agree.
But, if some group of users prefers to use a Unicode encoding optimized for their language instead of gzip, then that is their prerogative. At the very least, the part of the specification that recommends against using encodings optimized for Asian languages should be reviewed or written by the people most directly affected by it.

Right now, there are a lot of systems where it is cheaper and faster to implement SCSU-like encodings than it is to implement UTF-8 + gzip, because gzip is expensive. J2ME is one example that is currently widely deployed.

XML has been very successful in how it has handled encodings. Any statements that are more restrictive than what the XML 1.0 specification says are unwarranted. The specification just needs to recommend UTF-8 because it is the most interoperable Unicode-capable encoding we have today.

- Brian
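
P.S. The size comparisons argued over in this thread are easy to check for any given document. The following is only a rough sketch, assuming Python 3 and its standard gzip module; the function name encoded_sizes and the sample strings are made up for illustration, and SCSU and BOCU-1 are left out because the Python standard library does not implement them. The numbers depend heavily on the input, in particular on the ratio of ASCII markup to non-ASCII text, so treat the output as a measurement of these samples and nothing more.

```python
import gzip


def encoded_sizes(text: str) -> dict:
    """Return the byte length of `text` under several encodings."""
    utf8 = text.encode("utf-8")
    return {
        "utf-8": len(utf8),
        "utf-16": len(text.encode("utf-16-le")),  # little-endian, no BOM
        "utf-32": len(text.encode("utf-32-le")),  # little-endian, no BOM
        "utf-8+gzip": len(gzip.compress(utf8)),   # gzip at default level
    }


# Hypothetical sample inputs, repeated to approximate a page-sized document.
# Repetition flatters gzip, so real documents will compress less well.
samples = {
    "english": "The quick brown fox jumps over the lazy dog. " * 200,
    "hindi": "यह एक परीक्षण वाक्य है। " * 200,
    "html-hindi": "<p class=\"note\">यह एक परीक्षण वाक्य है।</p>\n" * 200,
}

for name, text in samples.items():
    print(name, encoded_sizes(text))
```

Running the same function over real pages in the language of interest would be a fairer test than these repeated sentences.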
Received on Sunday, 27 January 2008 18:12:19 UTC