[whatwg] Internal character encoding declaration

Henri Sivonen on 2006-03-11:

>>> I think it would be beneficial to additionally stipulate that
>>> 1. The meta element-based character encoding information declaration 
>>> is expected to work only if the Basic Latin range of characters maps 
>>> to the same bytes as in the US-ASCII encoding.
>> Is this realistic? I'm not really familiar enough with character 
>> encodings to say if this is what happens in general.
> I suppose it is realistic. See below.

Yes, for most encodings, the US-ASCII range is the same, and if you restrict 
it a bit further (the "INVARIANT" charset in RFC 1345), it covers most of 
the ambiguous encodings. The others can be easily detected as they usually 
have very different bit patterns (EBCDIC) or word lengths (UTF-16, UTF-32).
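
To make that concrete, a quick check (Python sketch; cp500 stands in for 
the EBCDIC family) shows '<' mapping to the same byte in the 
ASCII-compatible encodings but not in EBCDIC, and the UTF-16 word length 
standing out immediately:

  for enc in ("us-ascii", "iso-8859-1", "shift_jis", "koi8-r",
              "cp500",       # an EBCDIC code page: different bit patterns
              "utf-16-le"):  # 16-bit code units: different word length
      print(enc, "<".encode(enc).hex())
  # us-ascii 3c, iso-8859-1 3c, shift_jis 3c, koi8-r 3c,
  # cp500 4c, utf-16-le 3c00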

>>> 2. If there is no external character encoding information nor a BOM 
>>> (see below), there MUST NOT be any non-ASCII bytes in the document 
>>> byte stream before the end of the meta element that declares the 
>>> character encoding. (In practice this would ban unescaped non-ASCII 
>>> class names on the html and head elements and non-ASCII comments at 
>>> the beginning of the document.)
>> Again, can we realistically require this? I need to do some studies of 
>> non-latin pages, I guess.
> As UA behavior, no. As a conformance requirement, maybe.

If you require browsers to switch on-the-fly, they can redo the decoding 
when they find the <meta> anyway, and this is no longer a problem. There are 
a lot of documents with non-ASCII-language comments and <title> tags that 
are positioned before the <meta>.
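
In pseudo-implementation terms, something like the following (Python 
sketch; the regex and the single-pass structure are my own simplification, 
not what any browser actually does):

  import re

  META_CHARSET = re.compile(rb'charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.I)

  def decode_document(data, fallback="windows-1252"):
      # Scan the raw bytes for a charset declaration; this works because
      # the markup around it is ASCII-compatible in practice. A browser
      # decoding eagerly instead just discards its partial result and
      # restarts from the top when it reaches the <meta>, so non-ASCII
      # comments and <title> text before it are harmless.
      m = META_CHARSET.search(data)
      label = m.group(1).decode("ascii") if m else fallback
      try:
          return data.decode(label, errors="replace")
      except LookupError:  # unrecognized label: keep the fallback
          return data.decode(fallback, errors="replace")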

>>>> Authors should avoid including inline character encoding 
>>>> information. Character encoding information should instead be 
>>>> included at the transport level (e.g. using the HTTP Content-Type 
>>>> header).
>>> I disagree.
>>> With HTML with contemporary UAs, there is no real harm in including 
>>> the character encoding information both on the HTTP level and in the 
>>> meta as long as the information is not contradictory. On the contrary, 
>>> the author-provided internal information is actually useful when end 
>>> users save pages to disk using UAs that do not reserialize with 
>>> internal character encoding information.
>> ...and it breaks everything when you have a transcoding proxy, or 
>> similar.
> Well, not until you save to disk, since HTTP takes precedence. However, 
> authors can escape this by using UTF-8. (Assuming here that tampering with 
> UTF-8 would be harmful, wrong and pointless.)
>
> Interestingly, transcoding proxies tend to be brought up by residents of 
> Western Europe, North America or the Commonwealth. I have never seen a 
> Russian person living in Russia or a Japanese person living in Japan talk 
> about transcoding proxies in any online or offline discussion. That's why 
> I doubt the importance of transcoding proxies.

Transcoding is very popular, especially in Russia. With mod_charset in 
Apache it will (AFAICT) use the information in the <meta> of the document to 
determine the source encoding and then transcode it to an encoding it 
believes the client can handle (based on browser sniffing). It transcodes on 
a byte level, so the <meta> remains unchanged, but is overridden by the 
HTTP header.
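
The effect is easy to reproduce (Python sketch; koi8-r to windows-1251 is 
just one plausible proxy pairing):

  src = ('<meta http-equiv="Content-Type" '
         'content="text/html; charset=koi8-r">привет').encode("koi8-r")
  out = src.decode("koi8-r").encode("windows-1251")
  # The <meta> bytes are pure ASCII, so the transcode leaves them
  # untouched and they still claim koi8-r, while the Cyrillic payload
  # is now windows-1251. The proxy must therefore send
  # "Content-Type: text/html; charset=windows-1251" so that the HTTP
  # header overrides the stale <meta>.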

The <meta> tag is really information to the server; it is the server that 
is *supposed* to read it and put the data into the HTTP header. 
Unfortunately not many servers support that, leaving us with having to 
parse it in the browsers instead. Reading the <meta> tag for encoding 
information is basically at the same level as guessing the encoding by 
frequency analysis -- the server didn't say anything, so perhaps you can 
get lucky.
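
A server doing its job would look roughly like this (Python sketch; the 
1024-byte prescan is an arbitrary choice of mine):

  import re

  def content_type_for(path, default="iso-8859-1"):
      # Read the start of the file, find the author's <meta> charset
      # declaration, and promote it into the HTTP Content-Type header.
      with open(path, "rb") as f:
          head = f.read(1024)
      m = re.search(rb'charset\s*=\s*["\']?([A-Za-z0-9._-]+)', head, re.I)
      charset = m.group(1).decode("ascii") if m else default
      return "text/html; charset=" + charset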

>> Character encoding information shouldn't be duplicated, IMHO, that's 
>> just asking for trouble.
> I suggest a mismatch be considered an easy parse error and, therefore, 
> reportable.

That will not work in the mod_charset case above.

>>>> For HTML, user agents must use the following algorithm in determining the
>>>> character encoding of a document:
>>>> 1. If the transport layer specifies an encoding, use that.
>>> Shouldn't there be a BOM-sniffing step here? (UTF-16 and UTF-8 only; 
>>> UTF-32 makes no practical sense for interchange on the Web.)
>> I don't know, should there?
> I believe there should.

BOM-sniffing should be done *after* looking at the transport layer's 
information. It might know something you don't. It's a part of the 
"guessing-the-content" step.

> Requirements I'd like to see:
>
> Documents must specify a character encoding and must use an IANA-registered 
> encoding and must identify it using its preferred MIME name or use a BOM 
> (with UTF-8, UTF-16 or UTF-32). UAs must recognize the preferred MIME name 
> of every encoding they support that has a preferred MIME name. UAs should 
> recognize IANA-registered aliases.

That could be useful; the only problem is that the IANA list of encoding 
labels is a bit difficult to read when you are trying to figure out which 
name to write.
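
For what it's worth, what the requirement boils down to for a UA is a 
label table; the aliases below are real entries from the IANA registry, 
but the table itself is only an illustrative fragment (Python):

  PREFERRED_MIME_NAME = {
      # alias (lowercased) -> preferred MIME name
      "latin1":     "ISO-8859-1",
      "iso_8859-1": "ISO-8859-1",
      "l1":         "ISO-8859-1",
      "csutf8":     "UTF-8",
      "utf-8":      "UTF-8",
  }

  def preferred_name(label):
      return PREFERRED_MIME_NAME.get(label.strip().lower())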

> Documents must not use UTF-EBCDIC, BOCU-1, CESU-8, UTF-7, UTF-16BE (i.e. 
> BOMless), UTF-16LE, UTF-32BE, UTF-32LE or any encodings from the EBCDIC 
> family of encodings. Documents using the UTF-16 or UTF-32 encodings must 
> have a BOM.

I don't think forbidding BOCU-1 is a good idea. If a proper specification 
of it is ever written, it could be very useful as a compression format for 
documents.
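
The BOM requirement for UTF-16/UTF-32 is at least trivial to check 
mechanically (Python sketch):

  def bom_present(data, encoding):
      # Only UTF-16 and UTF-32 documents are required to carry a BOM;
      # for every other encoding the check passes vacuously.
      required = {
          "utf-16": (b"\xff\xfe", b"\xfe\xff"),
          "utf-32": (b"\xff\xfe\x00\x00", b"\x00\x00\xfe\xff"),
      }
      prefixes = required.get(encoding.lower())
      return True if prefixes is None else data.startswith(prefixes)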

> Encoding errors are easy parse errors. (Emit U+FFFD on bogus data.)

Yes, especially since encoding definitions tend to change over time.
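
That is also what replacement-mode decoding already does in practice; 
e.g. (Python):

  # 0xC3 opens a two-byte UTF-8 sequence, but "(" is not a valid
  # continuation byte, so the decoder emits U+FFFD and carries on.
  assert b"\xc3(".decode("utf-8", errors="replace") == "\ufffd("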

> Authors are advised to use the UTF-8 encoding. Authors are advised not 
> to use the UTF-32 encoding or legacy encodings. (Note: I think UTF-32 on 
> the Web is harmful and utterly pointless, but Firefox and Opera support 
> it.)

UTF-32 can be useful as an internal format, but I agree that it's not very 
useful on the web. Not sure about the "harmful" bit, though.

-- 
\\//
Peter, software engineer, Opera Software

  The opinions expressed are my own, and not those of my employer.
  Please reply only by follow-ups on the mailing list.

Received on Tuesday, 14 March 2006 00:03:10 UTC