Re: BOCU-1, SCSU, etc.

Disclaimer: Still not a WG response.

On Jan 28, 2008, at 20:38, Brian Smith wrote:

> Maybe we have a misunderstanding. Does the qualification "If the
> document does not start with a BOM, and if its encoding is not
> explicitly given by Content-Type metadata..." apply to the restriction
> against BOCU-1 and SCSU? If so, then I have no argument with that,
> except that the wording should be clearer. However, right now the
> wording reads as though an HTML 5 document must never be in one of
> these encodings, even if the encoding is denoted in the Content-Type
> header or the BOM.

My understanding is that HTML 5 bans these post-UTF-8 second-system  
Unicode encodings no matter where you might declare the use.

> Henri Sivonen wrote:
>> The most common claims about the non-compactness of UTF-8
>> turn out to be false when measured.
>
> I agree that for Wikipedia and news sites, there isn't a huge advantage
> in using SCSU or BOCU. But, web browsers, Wikipedia, and news sites are
> not the only applications of HTML.

Running wide-scale encoding studies isn't really my job. Since I
happen to be interested in this subject matter, I studied it on a
hobby basis in my own time, without automation and on a small scale.
(I did test more than one language, though: Thai, Malayalam, Tamil,
Japanese and Chinese.) This, together with my even more ad hoc prior
tests measuring UTF-8 vs. UTF-16 on real Japanese pages, was enough
to convince me that in the presence of real-world markup, the
conventional wisdom about the non-compactness of UTF-8 is just a myth
that needs to be busted.

The applications of HTML that need to be considered are the
applications on the public Web. It would be interesting to see a
large-scale study of the compactness of UTF-8 vs. UTF-16 vs. BOCU-1
vs. SCSU vs. well-supported applicable legacy encodings, and all of
them gzipped, as applied to real-world *Web* content. (Running tests
on plain-text novels from Project Gutenberg or on the Bible is wrong
if the goal is to study applicability to general Web content.) That
might be an interesting 20% time project for a Googler.
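
For concreteness, here is a minimal sketch of the kind of per-page
measurement I mean, in Python. The file name and the encoding list are
placeholders, BOCU-1 and SCSU would need a separate codec library since
Python doesn't ship them, and a real study would of course need a large
crawled corpus rather than one saved page:

    import gzip

    # Read one saved real-world page; "page.html" is just a placeholder.
    with open("page.html", encoding="utf-8") as f:
        text = f.read()

    candidates = {
        "UTF-8": text.encode("utf-8"),
        "UTF-16": text.encode("utf-16-le"),
        # A legacy encoding only applies if the page's repertoire fits,
        # e.g. text.encode("tis-620") for Thai-only content.
    }

    for name, data in candidates.items():
        print(name, len(data), "gzipped:", len(gzip.compress(data)))

The gzipped column is the interesting one for the Web case.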

>>> Right now, there are a lot of systems where it is cheaper/faster to
>>> implement SCSU-like encodings than it is to implement UTF-8+gzip,
>>> because gzip is expensive. J2ME is one example that is currently
>>> widely deployed.
>>
>> J2ME HTML5 UAs are most likely to use the Opera Mini
>> architecture in which case the origin server doesn't talk to
>> the J2ME thin client, so the point would be moot even if gzip
>> were prohibitively expensive on J2ME.
>
> Not all applications that use HTML are general purpose web browsers.
> If an HTML document or fragment is not going to be directly processed
> by a general-purpose web browser, then why do we need to restrict its
> encoding?

The purpose of the HTML 5 spec is to improve interoperability between  
Web browsers as used with content and Web apps published on the one  
public Web. The normative language in the spec is concerned with  
publishing and consuming content and apps on the Web. The purpose of  
the spec isn't to lower the R&D cost of private and proprietary  
systems by producing reusable bits.

If people want to reuse HTML 5 as part of private and proprietary  
systems, they are free to do so, but then the conformance requirements  
designed for promoting interop don't matter anyway, so such a system  
doesn't need to conform and the conformance requirements don't need to  
be relaxed to make such a system conform.

> When I was writing Thai language software on J2ME phones, GZIP
> compression was too expensive (in code size, memory, and time), and
> UTF-8 significantly expanded the size of the text. I used TIS-620 for
> the prototype so I could cache more data on the phone, with the
> intention of migrating to SCSU or BOCU later. Since the only encoding I
> can rely on with J2ME is UTF-8, I had to write my own encoders/decoders,
> but that was still easier than implementing gzip compression and
> decompression for severely memory-constrained devices. Once I had the
> encoder and decoder written, I decided to use it for everything, since
> it made everything smaller without requiring compression.

Clearly, such a device cannot host a useful HTML 5-enabled Web browser
(only a thin display client like Opera Mini), so the point is moot as
far as HTML 5 goes.
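
That said, the raw expansion cited above is real and easy to quantify:
Thai letters take one byte each in TIS-620 and three bytes each in
UTF-8. A trivial sketch (Python rather than J2ME, purely for
illustration; the sample string is arbitrary):

    # Arbitrary Thai sample; any Thai string shows the same 3:1 ratio.
    text = "สวัสดีครับ" * 100        # 1000 Thai characters

    utf8 = text.encode("utf-8")      # 3 bytes per Thai character
    tis620 = text.encode("tis-620")  # 1 byte per Thai character

    print(len(utf8), len(tis620))    # 3000 1000

On full pages the effective ratio is far smaller, because the ASCII
markup costs the same in either encoding, which is what the small-scale
measurements above were getting at.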

> Even after Unicode and the UTF encodings, new encodings are still
> being created.

Deploying such encodings on the public network is a colossally bad  
idea. (My own nation has engaged in this folly with ISO-8859-15, so  
I've seen the bad consequences at home, too.)

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
