Re: What is the correct character encoding for http://www.w3.org/TR/REC-html40/html40.txt?

2013-05-28 15:32, Yi, EungJun wrote:
> I just have tried to get plain/text version of HTML 4.01 specification
> from http://www.w3.org/TR/REC-html40/html40.txt, but some characters
> were broken because its Content-Type header did not have charset
> parameter and my web browser tried to decode the document in
> ISO-8859-1 by HTTP/1.1 specification.

HTTP/1.1 is outdated in matters like this. It sets ISO-8859-1 as the 
default encoding for all subtypes of the "text" type, but what browsers 
actually do for "text/html" is more or less what HTML5 CR describes at
http://www.w3.org/TR/html5/syntax.html#determining-the-character-encoding
It is natural to assume that they behave similarly, to the extent 
possible, for "text/plain". My browsers seem to end up using 
windows-1252, the same default they apply to "text/html".
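
As a rough illustration of that fallback, here is a minimal Python sketch. 
The function name pick_encoding and the hard-coded fallback are my own 
assumptions for illustration. It is a simplification and not the HTML5 
algorithm, which also looks at byte order marks, in-document declarations 
and locale-dependent defaults.

```python
from email.message import Message


def pick_encoding(content_type, fallback="windows-1252"):
    """Return the charset named in a Content-Type value, else a fallback.

    Hypothetical helper: it only models "use the charset parameter if
    present, otherwise fall back to windows-1252".
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lower-cased charset parameter,
    # or None when the header carries no charset.
    return msg.get_content_charset() or fallback


# A header without a charset parameter, as in the case discussed here.
encoding = pick_encoding("text/plain")
print(encoding)
print(b"Copyright \xa9".decode(encoding))
```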

> What is the correct character encoding for the document? My editor
> guesses the correct encoding is ISO-8859-15. Is it right?
>

I checked for all bytes outside the ASCII range in the document and got 
this result (the last character showing the byte as interpreted in 
ISO-8859-1):

Line   36 code 251 (octal), character: ©
Line   36 code 256 (octal), character: ®
Line  698 code 374 (octal), character: ü
Line  699 code 351 (octal), character: é
Line  713 code 344 (octal), character: ä
Line  717 code 345 (octal), character: å
Line 14774 code 367 (octal), character: ÷
Line 14787 code 251 (octal), character: ©
Line 15017 code 251 (octal), character: ©
Line 15284 code 251 (octal), character: ©
Line 16221 code 351 (octal), character: é
Line 16221 code 351 (octal), character: é
Line 16221 code 351 (octal), character: é
Line 16221 code 351 (octal), character: é
Line 16572 code 345 (octal), character: å
Line 17640 code 374 (octal), character: ü
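
A listing like the one above can be produced with a few lines of code. 
This is only a sketch in Python, not the tool I actually used. The name 
non_ascii_bytes and the sample data are made up for illustration.

```python
def non_ascii_bytes(data):
    """Yield (line number, octal code, ISO-8859-1 character) per byte >= 0x80."""
    for line_no, line in enumerate(data.split(b"\n"), start=1):
        for byte in line:
            if byte >= 0x80:
                # ISO-8859-1 maps every byte value directly to a code point.
                char = bytes([byte]).decode("iso-8859-1")
                yield line_no, format(byte, "o"), char


# Hypothetical sample; for the real document, read the file in binary mode.
sample = b"Copyright \xa9 W3C\xae\nplain ASCII line\nJ\xfcrgen\n"
for line_no, octal, char in non_ascii_bytes(sample):
    print(f"Line {line_no:4d} code {octal} (octal), character: {char}")
```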

I manually checked the lines, and the ISO-8859-1 interpretation is 
correct in all cases. The characters are copyright sign, registered 
sign, some Latin letters occurring in names and in French phrases, and 
the division sign “÷” (shown when mentioning the &divide; entity).

Interpreting the data as ISO-8859-15 leads to correct results, too, 
since all those characters share the same codes in the two ISO-8859 
standards. The same applies to windows-1252, as well as some other 
encodings, but not e.g. to ISO-8859-2 (which has no “å”).
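
The comparison can be checked mechanically. The following Python sketch 
decodes the seven distinct byte values found in the document under each 
candidate encoding and reports those that differ from ISO-8859-1. The 
names SPEC_BYTES and differing_bytes are my own, chosen for illustration.

```python
# The distinct non-ASCII byte values from the listing (octal 251, 256,
# 344, 345, 351, 367, 374).
SPEC_BYTES = bytes([0xA9, 0xAE, 0xE4, 0xE5, 0xE9, 0xF7, 0xFC])


def differing_bytes(data, encoding, reference="iso-8859-1"):
    """Return the byte values that decode differently from the reference."""
    return [
        b
        for b in data
        if bytes([b]).decode(encoding, errors="replace")
        != bytes([b]).decode(reference)
    ]


for enc in ("iso-8859-15", "windows-1252", "iso-8859-2"):
    diff = differing_bytes(SPEC_BYTES, enc)
    print(enc, [hex(b) for b in diff] or "same as ISO-8859-1")
```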

So the interpretation (guess) will be right in many cases, but not all. 
In most cases, just some personal names will be distorted when the guess 
is wrong. But in the “÷” case, there will be wrong information: for some 
wrong guesses, the text says that &divide; means “ũ” or “ṫ” or “ś” or 
“χ”, for example. And “©” may become “Š” or something else.
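
To see where such wrong characters come from, one can decode the single 
byte F7 (hexadecimal) under a few other ISO-8859 parts. This is a small 
Python sketch. The selection of encodings in GUESSES is just my example 
of possible wrong guesses, and decode_under is a made-up helper name.

```python
# Encodings a program might guess; the names are Python codec names.
GUESSES = ["iso8859_1", "iso8859_7", "iso8859_10", "iso8859_14", "iso8859_16"]


def decode_under(byte_value, encodings):
    """Map each encoding name to the character it assigns to one byte."""
    return {enc: bytes([byte_value]).decode(enc) for enc in encodings}


# The division sign byte, octal 367.
for enc, char in decode_under(0xF7, GUESSES).items():
    print(f"{enc}: {char}")

# The copyright sign byte, octal 251, under ISO-8859-2.
print("iso8859_2:", bytes([0xA9]).decode("iso8859_2"))
```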

So the HTTP headers should really be fixed to specify charset=windows-1252.

-- 
Yucca, http://www.cs.tut.fi/~jkorpela/

Received on Wednesday, 29 May 2013 05:49:44 UTC