[whatwg] Default encoding to UTF-8? from Leif Halvard Silli on 2011-12-05 (public-whatwg-archive@w3.org from December 2011)

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Mon, 5 Dec 2011 18:42:39 +0100
Message-ID: <20111205184239429135.57045eb9@xn--mlform-iua.no>
L. David Baron on Wed Nov 30 18:29:31 PST 2011:
> On Wednesday 2011-11-30 15:28 -0800, Faruk Ates wrote:
>> My understanding is that all browsers* default to Western Latin
>> (ISO-8859-1) encoding by default (for Western-world
>> downloads/OSes) due to legacy content on the web. But how relevant
>> is that still today? Has any browser done any recent research into
>> the need for this?
> 
> The default varies by localization (and within that potentially by
> platform), and unfortunately that variation does matter.  You can
> see Firefox's defaults here:
> http://mxr.mozilla.org/l10n-mozilla-beta/search?string=intl.charset.default
> (The localization and platform are part of the filename.)

Last I checked, some of those locales defaulted to UTF-8. (And HTML5 
defines it the same.) So how is that possible? Don't users of those 
locales travel as much as you do? Or do we consider the English locale 
users as more important? Something is broken in the logic here!

> I changed my Firefox from the ISO-8859-1 default to UTF-8 years ago
> (by changing the "intl.charset.default" preference), and I do see a
> decent amount of broken content as a result (maybe I encounter a new
> broken page once a week? -- though substantially more often if I'm
> looking at non-English pages because of travel).

What kind of trouble are you actually describing here? You are 
describing a problem with using UTF-8 for *your locale*. What is your 
locale? It is probably English. Or do you consider your locale to be 
'the Western world locale'? It sounds like *that* is what Anne has in 
mind when he brings in Dutch: 
http://blog.whatwg.org/weekly-encoding-woes (Quite often it sounds as 
if some see Latin-1 - or Windows-1252 as we now should say - as a 
'super default' rather than a locale default. If that is the case, that 
it is a super default, then we should also spec it like that! Until 
then, I'll treat Latin-1 as it is specced: As a default for certain 
locales.)

Since it is a locale problem, we need to understand which locale you 
have - and/or which locale you - and other debaters - think they have. 
Faruk probably uses a Spanish locale, right? If so, the two of you are 
not speaking from the same context.

However, you also say that your problem is not so much related to pages 
written for *your* locale as it is related to pages written for users 
of *other* locales. So how many times per year do Dutch, Spanish or 
Norwegian pages - and other non-English pages - create trouble for 
you, as an English locale user? I am making an assumption: Almost never. 
You don't read those languages, do you?

This is also an expectation thing: If you visit a Russian page in a 
legacy Cyrillic encoding, and get mojibake because your browser 
defaults to Latin-1, then what does it matter to you whether your 
browser defaults to Latin-1 or UTF-8? Answer: Nothing. 
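
To make the point above concrete, here is a minimal Python sketch. The 
sample word and the choice of Windows-1251 as the "legacy Cyrillic 
encoding" are assumptions made for illustration; the function name is 
hypothetical. It decodes the same bytes with both candidate defaults 
and shows that neither recovers the text.

```python
def decode_with_default(raw: bytes, default_encoding: str) -> str:
    """Decode unlabelled page bytes with a browser-style fallback encoding.

    Undecodable byte sequences become U+FFFD, roughly as a browser
    would render them.
    """
    return raw.decode(default_encoding, errors="replace")


# Assumption: a Russian page saved as Windows-1251 with no charset label.
text = "Привет"
raw = text.encode("windows-1251")

as_latin1 = decode_with_default(raw, "latin-1")  # accented Latin letters
as_utf8 = decode_with_default(raw, "utf-8")      # replacement characters

print(as_latin1)
print(as_utf8)
print(as_latin1 == text, as_utf8 == text)
```

Both results are unreadable to the visitor; only the style of garbage 
differs.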

>> I'm wondering if it might not be good to start encouraging
>> defaulting to UTF-8, and only fallback to Western Latin if it is
>> detected that the content is very old / served by old
>> infrastructure or servers, etc. And of course if the content is
>> served with an explicit encoding of Western Latin.
> 
> The more complex the rules, the harder they are for authors to
> understand / debug.  I wouldn't want to create rules like those.

Agree that that particular idea is probably not the best.
 
> I would, however, like to see movement towards defaulting to UTF-8:
> the current situation makes the Web less world-wide because pages
> that work for one user don't work for another.
> 
> I'm just not quite sure how to get from here to there, though, since
> such changes are likely to make users experience broken content.

I think we should 'attack' the dominating locale first: The English 
locale, in its different incarnations (Australian, American, UK). Thus, 
we should turn things on their head: English users should start to 
expect UTF-8 to be used. Because, as English users, you are more used to 
'mojibake' than the rest of us are: Whenever you see it, you 'know' 
that it is because it is a foreign language you are reading. It is we, 
the users of non-English locales, that need the default-to-legacy 
encoding behavior the most. Or, please, explain to us when and where it 
is important that English language users living in their own, native 
lands so to speak, need their browser to default to Latin-1 so that 
they can correctly read English language pages?

If the English locales start defaulting to UTF-8, then little by 
little, the same expectation etc will start spreading to the other 
locales as well, not least because the 'geeks' of each locale will tend 
to see the English locale as a super default - and they might also use 
the US English locale of their OS and/or browser. We should not 
consider the needs of geeks - they will follow (read: lead) the way, so 
the fact that *they* may see mojibake, should not be a concern.

See? We would have a plan. Or what do you think? Of course, we - or 
rather: the browser vendors - would need to market this as an important 
change. The HTML5 spec already justifies the use of UTF-8 in several 
places - it says that pages might not work as expected e.g. w.r.t. 
URLs, unless UTF-8 is used. So there are enough arguments that can 
be used.

There are other technical ideas I have, such as treating the BOM the 
way WebKit and IE treat it - that would increase the number of pages 
treated as UTF-8 by all browsers a little bit [1]. However, that can 
wait: The most important thing is to *initiate* the default encoding 
change.

[1] https://www.w3.org/Bugs/Public/show_bug.cgi?id=12897
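
As a rough illustration of the BOM idea, the sketch below assumes that 
"the way WebKit and IE treat it" means a UTF-8 byte order mark wins 
over any other declared encoding, and that the locale default is only 
a last resort. The function name, parameters and precedence order are 
hypothetical and are not taken from any browser's source or from the 
spec text.

```python
import codecs
from typing import Optional


def pick_encoding(data: bytes, declared: Optional[str], locale_default: str) -> str:
    """Choose an encoding for a page, in an assumed order of precedence."""
    # Assumption: a UTF-8 BOM overrides everything else.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # Otherwise trust an explicit declaration (HTTP header or meta element).
    if declared:
        return declared
    # Unlabelled page: fall back to the locale default.
    return locale_default


page = codecs.BOM_UTF8 + "blåbær".encode("utf-8")
print(pick_encoding(page, "iso-8859-1", "windows-1252"))          # BOM wins
print(pick_encoding(b"plain bytes", None, "windows-1252"))        # locale default
```

Under this ordering, changing a locale's default only affects pages 
that carry neither a BOM nor a declaration.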

Leif Halvard Silli
Received on Monday, 5 December 2011 09:42:39 UTC