
[whatwg] Internal character encoding declaration

From: Peter Karlsson <peter@opera.com>
Date: Tue, 14 Mar 2006 14:07:48 +0100 (CET)
Message-ID: <Pine.LNX.4.64.0603141351490.27998@peter.oslo.opera.com>
Henri Sivonen on 2006-03-14:

> It appears that the INVARIANT charset is not designed to be invariant 
> under different Web-relevant encodings (e.g. stateful Asian encodings that 
> use ESC and VISCII that assigns printable characters to the control 
> range). Rather, the INVARIANT charset seems to be designed to be invariant 
> under the various national variants of ISO-646, which used to be relevant 
> to email until ten years ago but luckily have never been relevant to the 
> Web.

True, but it is still relevant as a subset that is identical across most 
encodings. If you drop the "+" from it, you can include UTF-7 among the 
covered encodings as well, and since I don't know of any IANA encoding labels 
that contain "+", the subset can be used for <meta> discovery.
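The invariant-subset point is easy to demonstrate with a quick sketch (Python for brevity; the encoding labels are just illustrative examples): an ASCII-only string such as a <meta> declaration encodes to identical bytes under the common ASCII-compatible encodings, which is what makes prescanning for it possible before the real encoding is known.

```python
# A minimal sketch: the ASCII subset used in a <meta> declaration encodes
# to the same bytes under common ASCII-compatible encodings, so a parser
# can scan for it before the document's encoding is known.
candidates = ["ascii", "iso-8859-1", "koi8-r", "shift_jis", "utf-8"]
meta = '<meta http-equiv="Content-Type" content="text/html; charset=koi8-r">'

encoded = {enc: meta.encode(enc) for enc in candidates}

# Every candidate yields byte-identical output for this invariant string.
assert len(set(encoded.values())) == 1
```

UTF-7 is the odd one out, as noted above: its directly-encodable set is narrower, which is why "+" has to be dropped from the subset.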

> (BTW, how Web-relevant is VISCII, really?)

I can't remember having seen any web pages use it.

>> Transcoding is very popular, especially in Russia.
> In *proxies* *today*? What's the point considering that browsers have 
> supported the Cyrillic encoding soup *and* UTF-8 for years?

mod_charset is not a proxy; it operates at the server level. A few years back, 
browsers that supported only a few character encodings were still popular 
(according to the statistics I have seen), but that has likely changed 
lately. mod_charset nevertheless remains in use.

> How could proxies properly transcode form submissions coming back without 
> messing everything up spectacularly?

That's why the "hidden-string" technique was invented: introduce a hidden 
<input> containing a character string that will be encoded differently 
depending on the encoding the browser used. When the submission comes in, 
match that string against its possible encoded forms to determine which 
encoding was used.

> I am aware of the Russian Apache project. A glance at the English docs 
> suggests it is not reading the meta.

I haven't read the documentation, but I have seen pages being served in 
different character encodings in different browsers by Russian Apache 
servers, with the <meta> intact and indicating the original encoding. It is 
quite possible that the <meta> wasn't used anywhere.

> Not a fatal problem if the information on the HTTP layer is right (until 
> saving to disk, that is).

Exactly.

> Easy parse errors are not fatal in browsers. Surely it is OK for a 
> conformance checker to complain that much at server operators whose HTTP 
> layer and meta do not match.

I just reacted to the notion of calling such documents invalid. It is the 
transport layer that defines the encoding; whatever the document itself says, 
or how it looks, is irrelevant, and is just something you can fall back on 
if the transport layer neglects to say anything.

> Is BOCU-1 so much smaller than UTF-8 with deflate compression on the HTTP 
> layer that the gratuitous incompatibility could ever be justified?

I don't know; I haven't compared them (but if you do, you should of course 
apply deflate to the BOCU-1 output as well).
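The deflate half of such a comparison is easy to sketch; BOCU-1 has no encoder in the Python standard library (it lives in ICU), so that side is only noted in a comment here, and the sample text is my own:

```python
import zlib

# Sketch of the UTF-8 + deflate half of the size comparison. A BOCU-1
# encoder would need an external library (e.g. ICU); its output should be
# run through the same deflate settings before comparing byte counts.
sample = "Пример русского текста для сравнения размеров. " * 50
utf8 = sample.encode("utf-8")
deflated = zlib.compress(utf8, level=9)

print(len(utf8), len(deflated))
assert len(deflated) < len(utf8)  # repetitive Cyrillic text compresses well
```

For natural-language pages, deflate typically recovers most of the size UTF-8 loses to two-byte Cyrillic, which is the crux of the question above.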

-- 
\\//
Peter, software engineer, Opera Software

  The opinions expressed are my own, and not those of my employer.
  Please reply only by follow-ups on the mailing list.
Received on Tuesday, 14 March 2006 05:07:48 UTC
