RE: BOCU-1, SCSU, etc.

Maybe we have a misunderstanding. Does the qualification "If the
document does not start with a BOM, and if its encoding is not
explicitly given by Content-Type metadata..." apply to the restriction
against BOCU-1 and SCSU? If so, then I have no argument with that,
except that the wording should be clearer. However, right now the
wording reads as though an HTML 5 document must never be in one of these
encodings, even if the encoding is denoted in the Content-Type header or
the BOM.

Henri Sivonen wrote:
> The most common claims about the non-compactness of UTF-8 
> turn out to be false when measured.

I agree that for Wikipedia and news sites, there isn't a huge advantage
in using SCSU or BOCU-1. But web browsers, Wikipedia, and news sites are
not the only applications of HTML.
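
For what it's worth, that kind of measurement is easy to reproduce.
Here is a rough Java sketch (placeholder markup, not real measured page
data) of how UTF-8+gzip sizes can be compared using java.util.zip. For
markup-heavy pages the repeated tags compress away, which is why
UTF-8+gzip looks compact in those measurements:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.zip.GZIPOutputStream;

    public class GzipSizeDemo {
        // Return the gzipped size of the given bytes.
        static int gzipSize(byte[] data) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            GZIPOutputStream gz = new GZIPOutputStream(buf);
            gz.write(data);
            gz.close(); // finishes the stream and writes the trailer
            return buf.size();
        }

        public static void main(String[] args) throws IOException {
            // Placeholder markup; substitute a real page to reproduce
            // an actual measurement.
            String page = "<html><head><title>test</title></head><body>"
                        + "<p>lorem ipsum</p><p>lorem ipsum</p></body></html>";
            byte[] utf8 = page.getBytes("UTF-8");
            System.out.println("UTF-8 bytes:      " + utf8.length);
            System.out.println("UTF-8+gzip bytes: " + gzipSize(utf8));
        }
    }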

> > Right now, there are a lot of systems where it is cheaper/faster to 
> > implement SCSU-like encodings than it is to implement UTF-8+gzip, 
> > because gzip is expensive. J2ME is one example that is currently 
> > widely deployed.
> 
> J2ME HTML5 UAs are most likely to use the Opera Mini 
> architecture in which case the origin server doesn't talk to 
> the J2ME thin client, so the point would be moot even if gzip 
> were prohibitively expensive on J2ME.

Not all applications that use HTML are general purpose web browsers. If
an HTML document or fragment is not going to be directly processed by a
general-purpose web browser, then why do we need to restrict its
encoding?

When I was writing Thai language software on J2ME phones, GZIP
compression was too expensive (in code size, memory, and time), and
UTF-8 significantly expanded the size of the text. I used TIS-620 for
the prototype so I could cache more data on the phone, with the
intention of migrating to SCSU or BOCU later. Since the only encoding I
can rely on with J2ME is UTF-8, I had to write my own encoders/decoders,
but that was still easier than implementing gzip compression and
decompression for severely memory-constrained devices. Once I had the
encoder and decoder written, I decided to use them for everything, since
they made everything smaller without requiring compression.
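
To make the expansion concrete: every Thai character (U+0E01..U+0E5B)
takes three bytes in UTF-8 but only one in TIS-620, so pure Thai text
roughly triples in size. A minimal sketch in plain Java SE (not my
actual J2ME code) illustrating the difference:

    import java.io.UnsupportedEncodingException;

    public class ThaiSizeDemo {
        public static void main(String[] args)
                throws UnsupportedEncodingException {
            // "Sawatdi" (the Thai greeting): six Thai characters.
            String thai = "\u0E2A\u0E27\u0E31\u0E2A\u0E14\u0E35";

            byte[] utf8 = thai.getBytes("UTF-8");
            // TIS-620 encodes every Thai character as a single byte,
            // so its size equals the character count.
            int tis620Size = thai.length();

            System.out.println("Characters:    " + thai.length()); // 6
            System.out.println("UTF-8 bytes:   " + utf8.length);   // 18
            System.out.println("TIS-620 bytes: " + tis620Size);    // 6
        }
    }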

Beyond everything I have said, I don't see how it is practical for
HTML 5 to have a blacklist of encodings that should not be supported.
Even after Unicode and the UTF encodings, new encodings are still being
created. A list of encodings that HTML processors are required to
support would make more sense. Then the restriction could be rewritten
as "Don't use encodings that are not supported by your software."

- Brian
