[whatwg] Default encoding to UTF-8? from Jukka K. Korpela on 2011-12-06 (public-whatwg-archive@w3.org from December 2011)

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Tue, 06 Dec 2011 10:39:45 +0200
Message-ID: <4EDDD4D1.4030206@cs.tut.fi>
2011-12-06 6:54, Leif Halvard Silli wrote:

> Yeah, it would be a pity if it had already become a widespread
> cargo-cult to - all at once - use HTML5 doctype without using UTF-8
> *and* without using some encoding declaration *and* thus effectively
> relying on the default locale encoding ... Who does have a data corpus?

I think we would need to ask search engine developers about that, but
what is this proposed change to defaults supposed to achieve? It would
break any old page that does not specify the encoding, as soon as the
doctype is changed to <!doctype html> or this doctype is added to a
page that lacked a doctype.

Since <!doctype html> is the simplest way to put browsers to "standards 
mode", this would punish authors who have realized that their page works 
better in "standards mode" but are unaware of a completely different and 
fairly complex problem. (Basic character encoding issues are of course 
not that complex to you and me or most people around here; but most 
authors are more or less confused with them, and I don't think we should 
add to the confusion.)

There's little point in changing the specs to say something very
different from what previous HTML specs have said and from actual 
browser behavior. If the purpose is to make things more exactly defined 
(a fixed encoding vs. implementation-defined), then I think such 
exactness is a luxury we cannot afford. Things would be all different if 
we were designing a document format from scratch, with no existing 
implementations and no existing usage. If the purpose is UTF-8 
evangelism, then it would be just the kind of evangelism that produces 
angry people, not converts.

If there's something that should be added to or modified in the
algorithm for determining character encoding, then I'd say it's error
processing. I mean user agent behavior when it detects, after running
the algorithm and while processing the document data, that there is a
mismatch between the determined encoding and the data. That is, that
the data contains octets or octet sequences that are not allowed in the
encoding or that denote noncharacters. Such errors are naturally
detected when the user agent processes the octets; the question is what
the browser should do then.
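
As a rough illustration of the kind of check meant here, the following
Python sketch reports the first mismatch between an assumed encoding and
the data. The function name find_encoding_mismatch is made up for this
example, and it relies on Python's strict decoders, which need not agree
in every detail with what a browser does.

```python
def find_encoding_mismatch(data: bytes, encoding: str):
    """Return a description of the first mismatch, or None if none is found.

    Hypothetical helper: it only illustrates the two error classes
    mentioned above (disallowed octet sequences and noncharacters).
    """
    try:
        text = data.decode(encoding)  # strict error handling is the default
    except UnicodeDecodeError as err:
        bad = data[err.start:err.end]
        return f"invalid octet sequence at offset {err.start}: {bad!r}"
    for index, char in enumerate(text):
        code = ord(char)
        # Unicode noncharacters: U+FDD0..U+FDEF, plus the last two code
        # points of every plane (U+FFFE, U+FFFF, U+1FFFE, ...).
        if 0xFDD0 <= code <= 0xFDEF or (code & 0xFFFE) == 0xFFFE:
            return f"noncharacter U+{code:04X} at character index {index}"
    return None


# ISO-8859-1 data ("caf" followed by octet E9) checked against UTF-8.
report = find_encoding_mismatch(b"caf\xe9", "utf-8")
```

A None result only means that no error was detected; it does not prove
that the assumed encoding is the right one.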

When data that is actually in ISO-8859-1 or some similar encoding has
been mislabeled as UTF-8 encoded, then, if the data contains octets
outside the ASCII range, character-level errors are likely to occur.
Many ISO-8859-1 octets are just not possible in UTF-8 data. The converse
error may also cause character-level errors. And these are not uncommon
situations - they seem to occur increasingly often, partly due to cargo
cult "use of UTF-8" (when it means declaring UTF-8 but not actually
using it, or vice versa), partly due to increased use of UTF-8 combined
with ISO-8859-1 encoded data creeping in from somewhere into UTF-8
encoded data.
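
Both directions of mislabeling can be tried out with a small Python
sketch. The helper names and the sample strings are invented for this
example. Note one assumption: Python's ISO-8859-1 decoder accepts every
octet, so in the converse case the problem shows up as C1 control
characters (U+0080..U+009F) in the decoded text rather than as a
decoding error.

```python
def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """True if the data is a valid octet sequence in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


def c1_controls(text: str) -> list:
    """Characters in U+0080..U+009F, which normal page text should not contain."""
    return [c for c in text if 0x80 <= ord(c) <= 0x9F]


latin1_bytes = "Jyväskylä".encode("iso-8859-1")  # each "ä" is the octet E4
utf8_bytes = "Jyväskylä €".encode("utf-8")       # "ä" is C3 A4, "€" is E2 82 AC

# ISO-8859-1 data mislabeled as UTF-8: E4 followed by "s" is not valid UTF-8.
mislabeled_as_utf8_ok = decodes_cleanly(latin1_bytes, "utf-8")

# UTF-8 data mislabeled as ISO-8859-1: decoding "succeeds" but gives
# gibberish, and the octet 82 from the euro sign becomes a C1 control.
misread = utf8_bytes.decode("iso-8859-1")
suspicious = c1_controls(misread)
```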

From the user's point of view, the character-level errors currently
result in some gibberish (e.g., some odd box appearing instead of a
character, in one place) or in total mess (e.g., a large number of
non-ASCII characters displayed all wrong). In either case, I think an
error should
be signalled to the user, together with
a) automatically trying another encoding, such as the locale default 
encoding instead of UTF-8 or UTF-8 instead of anything else
b) suggesting to the user that he should try to view the page using some 
other encoding, possibly with a menu of encodings offered as part of the 
error explanation
c) a combination of the above.
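
Option c) could look roughly like the following Python sketch. It is
only an outline under assumptions of my own: the function name, the
order of fallback encodings and the wording of the warning are all made
up, and "windows-1252" stands in for whatever the locale default
encoding happens to be.

```python
def decode_with_fallback(data: bytes, declared: str,
                         fallbacks=("utf-8", "windows-1252")):
    """Return (text, encoding_used, warning); warning is None if all is well.

    Hypothetical helper: tries the declared encoding first, then the
    fallbacks, and always tells the caller when something went wrong.
    """
    try:
        return data.decode(declared), declared, None
    except UnicodeDecodeError as err:
        problem = f"data is not valid {declared} (first error at offset {err.start})"

    menu = ", ".join(fallbacks)  # encodings to offer to the user
    for candidate in fallbacks:
        if candidate.lower() == declared.lower():
            continue  # already tried
        try:
            text = data.decode(candidate)
        except UnicodeDecodeError:
            continue
        warning = f"{problem}; displayed as {candidate}. You can also try: {menu}"
        return text, candidate, warning

    # Nothing decoded cleanly: show what we can and say so.
    text = data.decode(declared, errors="replace")
    return text, declared, f"{problem}; no fallback fits. You can try: {menu}"


text, used, warning = decode_with_fallback(
    "Jyväskylä".encode("iso-8859-1"), "utf-8")
```

The point of returning the warning separately is that the page is still
displayed, but the user is not left guessing why it may look wrong.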

Although there are good reasons why browsers usually don't give error 
messages, this would be a special case. It's about the primary 
interpretation of the data in the document and about a situation where 
some data has no interpretation in the assumed encoding - but usually 
has an interpretation in some other encoding.

The current "Character encoding overrides" rules are questionable 
because they often mask out data errors that would have helped to detect 
problems that can be solved constructively. For example, if data labeled 
as ISO-8859-1 contains an octet in the 80...9F range, then it may well 
be the case that the data is actually windows-1252 encoded and the 
"override" helps everyone. But it may also be the case that the data is 
in a different encoding and that the "override" therefore results in 
gibberish shown to the user, with no hint of the cause of the problem. 
It would therefore be better to signal a problem to the user, display 
the page using the windows-1252 encoding but with some instruction or 
hint on changing the encoding. And a browser should in this process 
really analyze whether the data can be windows-1252 encoded data that 
contains only characters permitted in HTML.
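
A minimal Python sketch of such an analysis, for data labeled as
ISO-8859-1, might be as follows. The function name and the messages are
invented. It leans on the fact that Python's windows-1252 decoder
rejects the octets 81, 8D, 8F, 90 and 9D, which have no character
assigned in that code page; a browser's decoder may treat those octets
differently, so this is an assumption of the sketch, not a description
of browser behavior.

```python
def check_latin1_override(data: bytes):
    """Return (text, warning) for data labeled as ISO-8859-1.

    Hypothetical helper: applies the windows-1252 "override" only when
    the data really can be windows-1252, and never does so silently.
    """
    if not any(0x80 <= octet <= 0x9F for octet in data):
        # No octets in 80...9F: both encodings give the same result.
        return data.decode("iso-8859-1"), None
    try:
        text = data.decode("windows-1252")
    except UnicodeDecodeError as err:
        bad = data[err.start]
        return (data.decode("iso-8859-1"),
                f"octet {bad:02X} is undefined in windows-1252; "
                "the data is probably in some other encoding")
    return text, ("data labeled ISO-8859-1 contains octets in 80...9F; "
                  "displayed as windows-1252, change the encoding if it looks wrong")


quoted, quoted_warning = check_latin1_override(b"\x93quoted\x94")
broken, broken_warning = check_latin1_override(b"bad\x81")
```

In the first call the override is plausible (curly quotes) and is
applied with a hint; in the second the data cannot be windows-1252 at
all, so the user is told that instead.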

Yucca
Received on Tuesday, 6 December 2011 00:39:45 UTC