[whatwg] Default encoding to UTF-8? from Jukka K. Korpela on 2011-12-06 (public-whatwg-archive@w3.org from December 2011)

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Tue, 06 Dec 2011 23:09:28 +0200
Message-ID: <4EDE8488.7090600@cs.tut.fi>
2011-12-06 15:59, NARUSE, Yui wrote:

> (2011/12/06 17:39), Jukka K. Korpela wrote:
>> 2011-12-06 6:54, Leif Halvard Silli wrote:
>>
>>> Yeah, it would be a pity if it had already become a widespread
>>> cargo-cult to - all at once - use HTML5 doctype without using UTF-8
>>> *and* without using some encoding declaration *and* thus effectively
>>> relying on the default locale encoding ... Who does have a data corpus?
>
> I found it: http://rink77.web.fc2.com/html/metatagu.html

I'm not sure of the intended purpose of that demo page, but it seems to 
illustrate my point.

> It uses the HTML5 doctype and does not declare an encoding, and its encoding is Shift_JIS,
> the default encoding of the Japanese locale.

My Firefox uses the ISO-8859-1 encoding, my IE the windows-1252 
encoding, resulting in a mess of course. But the point is that both 
interpretations mean data errors at the character level - even seen as 
windows-1252, it contains bytes with no assigned meaning (e.g., 0x81 is 
UNDEFINED).
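
The difference between the two interpretations can be checked with a small Python sketch. It assumes that Python's built-in `latin-1` and `cp1252` codecs are fair stand-ins for ISO-8859-1 and windows-1252 (browsers may use different tables), and `undefined_bytes` is a made-up helper name that only makes sense for single-byte encodings:

```python
def undefined_bytes(data: bytes, encoding: str) -> list[int]:
    """Offsets of bytes that have no assigned character in a single-byte encoding."""
    bad = []
    for offset, value in enumerate(data):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            bad.append(offset)
    return bad


# The ideographic full stop in Shift_JIS is the byte pair 0x81 0x42.
sample = "。".encode("shift_jis")

print(undefined_bytes(sample, "cp1252"))   # 0x81 is unassigned in Python's cp1252 table
print(undefined_bytes(sample, "latin-1"))  # every byte maps to something, 0x81 to U+0081
```

Under these assumptions the first call reports offset 0 and the second reports nothing, which is the sense in which the data is wrong at the character level in windows-1252 but merely nonsensical in ISO-8859-1.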

> Current implementations replace such an invalid octet with a replacement character.

No, it varies by implementation.

>> When data that is actually in ISO-8859-1 or some similar encoding has been mislabeled as UTF-8 encoded, then, if the data contains octets outside the ASCII range, character-level errors are likely to occur. Many ISO-8859-1 octets are just not possible in UTF-8 data. The converse error may also cause character-level errors. And these are not uncommon situations - they seem to occur increasingly often, partly due to cargo cult "use of UTF-8" (when it means declaring UTF-8 but not actually using it, or vice versa), partly due to increased use of UTF-8 combined with ISO-8859-1 encoded data creeping in from somewhere into UTF-8 encoded data.
>
> In such a case, the page should fail to display in the author's environment.

An authoring tool should surely indicate the problem. But what should 
user agents do when they face such documents and need to do something 
with them?

>> From the user's point of view, the character-level errors currently result in some gibberish (e.g., some odd box appearing instead of a character, in one place) or in a total mess (e.g. a large number of non-ASCII characters displayed all wrong). In either case, I think an error should be signalled to the user, together with
>> a) automatically trying another encoding, such as the locale default encoding instead of UTF-8 or UTF-8 instead of anything else
>> b) suggesting to the user that he should try to view the page using some other encoding, possibly with a menu of encodings offered as part of the error explanation
>> c) a combination of the above.
>
> This presumes that the user knows the correct encoding.

Alternative b) means that the user can try some encodings. A user agent 
could give a reasonable list of options.

Consider the example document mentioned. When viewed in a Western
environment, it probably looks like complete gibberish. Alternative a) would
probably not help, but alternative b) would have some chance of success. If the
user has some reason to suspect that the page might be in Japanese, he 
would probably try the Japanese encodings in the browser's list of 
encodings, and this would make the document readable after a try or two.
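
As a rough illustration of alternative b), here is a Python sketch of how a user agent might narrow down the menu it offers. Everything in it is an assumption made for the example: the function name `candidate_encodings`, the list of encodings offered, and the use of Python's strict decoders as the test. Decoding without error does not prove that an encoding is the right one; it only removes the choices that cannot be right.

```python
def candidate_encodings(data: bytes, assumed: str, offered: list[str]):
    """Return (error_found, encodings that decode the data without error)."""
    try:
        data.decode(assumed)
        return False, [assumed]
    except UnicodeDecodeError:
        pass  # character-level error: this is what should be signalled

    usable = []
    for name in offered:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        usable.append(name)
    return True, usable


# A Shift_JIS page with no declaration, viewed with a windows-1252 default.
page = "こんにちは。".encode("shift_jis")
error, options = candidate_encodings(
    page, "cp1252", ["utf-8", "shift_jis", "euc_jp", "iso2022_jp"]
)
print(error, options)
```

With this input the windows-1252 assumption fails on the byte 0x81, UTF-8 is ruled out as well, and Shift_JIS survives as something to offer.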

> As a Japanese speaker, I imagine that it is hard to distinguish an ISO-8859-1 page from an ISO-8859-2 page.

Yes, but the idea isn't really meant to apply to such cases, as there is
no way to recognize, _at the character encoding level_,
ISO-8859-1 mislabeled as ISO-8859-2 or vice versa: any sequence of
octets is valid in both.
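
A short Python check of this, assuming Python's `iso8859_1` and `iso8859_2` codecs follow the respective standards: every one of the 256 octet values decodes in both, so a decoder has nothing to complain about, yet the same octet can mean different letters.

```python
# Every possible octet decodes without error in both encodings.
all_bytes = bytes(range(256))
as_latin1 = all_bytes.decode("iso8859_1")
as_latin2 = all_bytes.decode("iso8859_2")

# The same octet nevertheless stands for different characters.
print(b"\xe8".decode("iso8859_1"))  # e with grave accent
print(b"\xe8".decode("iso8859_2"))  # c with caron
```

Telling the two apart would take analysis of the text itself (language, word lists, letter frequencies), which is a different matter from detecting errors in the encoded data.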

> Some browsers alert on scripting issues.
> Why can't they alert on an encoding issue?

Surely they could, though I was not thinking of an alert in the popup sense -
rather, a red error indicator somewhere. There would be many more
reasons to signal encoding issues than to signal scripting issues, as we 
know that web pages generally contain loads of client-side scripting 
errors that do not actually affect page rendering or functionality.

>> The current "Character encoding overrides" rules are questionable because they often mask out data errors that would have helped to detect problems that can be solved constructively. For example, if data labeled as ISO-8859-1 contains an octet in the 80...9F range, then it may well be the case that the data is actually windows-1252 encoded and the "override" helps everyone. But it may also be the case that the data is in a different encoding and that the "override" therefore results in gibberish shown to the user, with no hint of the cause of the problem.
>
> I think such a case doesn't exist.
> In the character encoding overrides, a superset overrides a standard set.

Technically, not quite so (e.g., in ISO-8859-1, 0x81 is U+0081, a 
control character that is not allowed in HTML - I suppose, though I 
cannot really find a statement on this in HTML5 - whereas in 
windows-1252, it is undefined).

More importantly, my point was about errors in data, resulting e.g. from
a faulty code conversion or some malfunctioning software that has
produced, say, a document that contains 0x80 although it is intended to be,
and declared to be, in ISO-8859-1 encoding. Adequate processing, in my
opinion, would consist of signalling the data error suitably, together
with an implicit change to windows-1252 _if_ the document as a whole
consists of bytes defined in windows-1252.
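
The processing I have in mind could be sketched roughly as follows in Python. The function name, the wording of the warnings, and the choice to return no text at all when windows-1252 does not fit are assumptions made for the example, and Python's `cp1252` codec stands in for whatever table a browser would use.

```python
def decode_declared_latin1(data: bytes):
    """Return (text, warnings) for data declared to be ISO-8859-1."""
    suspicious = [i for i, b in enumerate(data) if 0x80 <= b <= 0x9F]
    if not suspicious:
        return data.decode("latin-1"), []

    # Data error: always signalled, whatever recovery follows.
    warnings = [
        f"{len(suspicious)} byte(s) in the range 80...9F, "
        f"first at offset {suspicious[0]}"
    ]
    try:
        # Switch only if the whole document is valid windows-1252.
        text = data.decode("cp1252")
    except UnicodeDecodeError:
        warnings.append("not windows-1252 either, encoding unknown")
        return None, warnings  # assumption: leave the choice to the user
    warnings.append("displayed as windows-1252")
    return text, warnings


print(decode_declared_latin1(b"caf\xe9"))    # clean ISO-8859-1, no warnings
print(decode_declared_latin1(b"\x80 5"))     # euro sign via windows-1252, with warnings
print(decode_declared_latin1(b"\x81\x42"))   # no silent guess
```

The point of the last case is that the override is not applied mechanically: when the bytes cannot be windows-1252, the user is told so instead of being shown gibberish without explanation.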

>> It would therefore be better to signal a problem to the user, display the page using the windows-1252 encoding but with some instruction or hint on changing the encoding. And a browser should in this process really analyze whether the data can be windows-1252 encoded data that contains only characters permitted in HTML.
>
> Such verification should be done by developer tools, not production browsers,
> which are widely used by real users.

Many things should be done by developer tools but won't be done, and we 
have all the existing content on the Web, often violating many standards 
and specifications. We should not punish users for this but make 
reasonable efforts in error recovery - but _not_ mechanical and silent 
error recovery in situations where we know that the guesses made will 
often be wrong.

Yucca
Received on Tuesday, 6 December 2011 13:09:28 UTC