[whatwg] Internal character encoding declaration

From: Henri Sivonen <hsivonen@iki.fi>
Date: Tue, 14 Mar 2006 11:35:52 +0200
Message-ID: <139664D0-2D7C-47A1-94FD-A34DA1671E46@iki.fi>
On Mar 14, 2006, at 10:03, Peter Karlsson wrote:

> Henri Sivonen on 2006-03-11:
>
>>>> I think it would be beneficial to additionally stipulate that
>>>> 1. The meta element-based character encoding information  
>>>> declaration is expected to work only if the Basic Latin range of  
>>>> characters maps to the same bytes as in the US-ASCII encoding.
>>> Is this realistic? I'm not really familiar enough with character  
>>> encodings to say if this is what happens in general.
>> I suppose it is realistic. See below.
>
> Yes, for most encodings, the US-ASCII range is the same, and if you  
> restrict it a bit further (the "INVARIANT" charset in RFC 1345), it  
> covers most of the ambiguous encodings.

It appears that the INVARIANT charset is not designed to be invariant  
under different Web-relevant encodings (e.g. stateful Asian encodings  
that use ESC, or VISCII, which assigns printable characters to the  
control range). Rather, the INVARIANT charset seems to be designed to  
be invariant under the various national variants of ISO-646, which  
used to be relevant to email until about ten years ago but luckily  
have never been relevant to the Web.
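(To make the ASCII-compatibility point from my earlier message concrete, here is a minimal Python sketch -- not the actual algorithm any browser uses -- of why a meta-based declaration presupposes that Basic Latin maps to the same bytes as US-ASCII: the parser must read the bytes spelling "charset=..." before it knows the encoding. The function name and regex are my own illustration.)

```python
import re

def sniff_meta_charset(head_bytes: bytes):
    # Interpret the prefix as if the encoding were ASCII-compatible.
    # In UTF-16 or a stateful encoding like ISO-2022-JP this
    # assumption breaks down and the meta is unreadable.
    text = head_bytes.decode("ascii", errors="replace")
    m = re.search(r'charset\s*=\s*["\']?([A-Za-z0-9._-]+)', text, re.I)
    return m.group(1) if m else None

html = b'<meta http-equiv="Content-Type" content="text/html; charset=koi8-r">'
print(sniff_meta_charset(html))  # koi8-r
```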

(BTW, how Web-relevant is VISCII, really?)

>> Interestingly, transcoding proxies tend to be brought up by  
>> residents of Western Europe, North America or the Commonwealth. I  
>> have never seen a Russian person living in Russia or a Japanese  
>> person living in Japan talk about transcoding proxies in any  
>> online or offline discussion. That's why I doubt the importance of  
>> transcoding proxies.
>
> Transcoding is very popular, especially in Russia.

In *proxies* *today*? What's the point considering that browsers have  
supported the Cyrillic encoding soup *and* UTF-8 for years?

How could proxies properly transcode form submissions coming back  
without messing everything up spectacularly?

> With mod_charset in Apache it will (AFAICT) use the information in  
> the <meta> of the document to determine the source encoding and  
> then transcode it to an encoding it believes the client can handle  
> (based on browser sniffing).

I am aware of the Russian Apache project. A glance at the English  
docs suggests it is not reading the meta. In any case, Russian Apache  
is designed as a transcoding origin server--not a proxy.

> It transcodes on a byte level, so the <meta> remains unchanged,  
> but is overridden by the HTTP header.

Not a fatal problem if the information on the HTTP layer is right  
(until saving to disk, that is).

In my opinion, operators of such servers should take care not to  
send bogus metas.

>>> Character encoding information shouldn't be duplicated, IMHO,  
>>> that's just asking for trouble.
>> I suggest a mismatch be considered an easy parse error and,  
>> therefore, reportable.
>
> That will not work in the mod_charset case above.

Easy parse errors are not fatal in browsers. Surely it is OK for a  
conformance checker to complain that much at server operators whose  
HTTP layer and meta do not match.
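(What I have in mind could be sketched like this -- a hypothetical conformance-checker fragment of my own, not code from any real checker: the mismatch is reported as a non-fatal "easy" parse error, and the HTTP layer wins.)

```python
def check_encoding_declarations(http_charset, meta_charset):
    """Return the effective encoding plus any easy parse errors."""
    errors = []
    if http_charset and meta_charset \
            and http_charset.lower() != meta_charset.lower():
        # Non-fatal: the document is still processed, but the
        # server operator gets told about the mismatch.
        errors.append(
            f"Easy parse error: HTTP layer says {http_charset!r} "
            f"but meta says {meta_charset!r}; HTTP wins."
        )
    return http_charset or meta_charset, errors

enc, errs = check_encoding_declarations("utf-8", "windows-1251")
print(enc, errs)
```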

> BOM-sniffing should be done *after* looking at the transport  
> layer's information. It might know something you don't. It's a part  
> of the "guessing-the-content" step.

Sure. The algorithm I suggested was intended for cases where there  
was no encoding information on the HTTP layer.
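(For the record, BOM sniffing as a fallback is simple enough to sketch in a few lines of Python -- the byte sequences below are the standard BOMs; the function itself is just my illustration, used only when the transport layer is silent.)

```python
def sniff_bom(data: bytes):
    # Check the longer UTF-32 BOMs before UTF-16, since the
    # UTF-16LE BOM is a prefix of the UTF-32LE one.
    if data.startswith(b"\x00\x00\xfe\xff"):
        return "UTF-32BE"
    if data.startswith(b"\xff\xfe\x00\x00"):
        return "UTF-32LE"
    if data.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"
    if data.startswith(b"\xfe\xff"):
        return "UTF-16BE"
    if data.startswith(b"\xff\xfe"):
        return "UTF-16LE"
    return None  # no BOM; fall back to other heuristics

print(sniff_bom("hello".encode("utf-8-sig")))  # UTF-8
```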

>> Documents must specify a character encoding and must use an  
>> IANA-registered encoding and must identify it using its preferred  
>> MIME name or use a BOM (with UTF-8, UTF-16 or UTF-32). UAs must  
>> recognize the preferred MIME name of every encoding they support  
>> that has a preferred MIME name. UAs should recognize  
>> IANA-registered aliases.
>
> That could be useful, the only problem being that the IANA list of  
> encoding labels is a bit difficult to read when you want to try  
> figuring out which name to write.

Authors only need to remember one: UTF-8. :-)

> I don't think forbidding BOCU-1 is a good idea. If a proper  
> specification is ever written for it, it could be very useful as a  
> compression format for documents.

Is BOCU-1 so much smaller than UTF-8 with deflate compression on the  
HTTP layer that the gratuitous incompatibility could ever be justified?
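(The baseline is easy to measure. BOCU-1 is not in Python's standard library, so this sketch of mine only shows the UTF-8 side: deflate on the HTTP layer already removes much of UTF-8's redundancy, and that is the size any BOCU-1 advantage would have to beat. The sample text is made up.)

```python
import zlib

# Repetitive non-ASCII sample text (my own, for illustration).
text = "Tämä on esimerkki tekstistä. " * 50
utf8 = text.encode("utf-8")
deflated = zlib.compress(utf8)

print(len(utf8), len(deflated))  # deflated is far smaller
```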

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/
Received on Tuesday, 14 March 2006 01:35:52 UTC