[whatwg] Default encoding to UTF-8?

Boris Zbarsky Mon Dec 5 13:49:45 PST 2011:
> On 12/5/11 12:42 PM, Leif Halvard Silli wrote:
>> Last I checked, some of those locales defaulted to UTF-8. (And HTML5
>> defines it the same.) So how is that possible?
> 
> Because authors authoring pages that users of those locales  
> tend to use use UTF-8 more than anything else?

It is more likely that there is another reason, IMHO: They may have 
tried it, and found that it worked OK. But they of course have the same 
need for reading non-English museum and railway pages as Mozilla 
employees.

>> Don't users of those locales travel as much as you do?

> I think you completely misunderstood his 
> comments about travel and locales.  Keep reading.

I'm pretty sure I haven't misunderstood very much.

>> What kind of trouble are you actually describing here? You are
>> describing a problem with using UTF-8 for *your locale*.
> 
> No.  He's describing a problem using UTF-8 to view pages that are not 
> written in English.

And why is that a problem, in those cases when it is a problem? Does he 
read those languages, anyway? Don't we expect some problems when we 
tread outside our borders?
 
> Now what language are the non-English pages you look at written in? 
> Well, it depends.  In western Europe they tend to be in languages that 
> can be encoded in ISO-8859-1, so authors sometimes use that encoding 
> (without even realizing it).  If you set your browser to default to 
> UTF-8, those pages will be broken.
> 
> In Japan, a number of pages are authored in Shift_JIS.  Those will 
> similarly be broken in a browser defaulting to UTF-8.

The solution I proposed was that English locale browsers should default 
to UTF-8. Of course, such users could then, "when in Japan", run into 
problems on some Japanese pages, which is a small nuisance, 
especially if they read Japanese.

>> What is your locale?
> 
> Why does it matter?  David's default locale is almost certainly en-US, 
> which defaults to ISO-8859-1 (or whatever Windows-??? encoding that 
> actually means on the web) in his browser.  But again, he's changed the 
> default encoding from the locale default, so the locale is irrelevant.

The locale is meant to be used predominantly within a physical locale. 
If he is in another physical locale, or virtually in another locale, he 
should not expect things to work out of the box unless a common 
encoding is used. Even today, if he visits Japan, he has to either 
change his browser settings *or* rely on the pages declaring their 
encodings. So nothing would change, for him, when visiting Japan - with 
his browser or with his computer.

Yes, there would be a change w.r.t. English quotation marks (see 
below) and w.r.t. visiting Western European language pages: a number 
of those pages, which don't fail with Win-1252 as the default, would 
start to fail. But relatively speaking, it is less important that 
non-English pages fail for the English locale.
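
To make that failure concrete, here is a minimal Python sketch (not from 
the thread). The sample word, the helper name decode_with_default and the 
use of errors="replace" to mimic a browser's replacement characters are 
all assumptions made for illustration:

```python
# Illustrative sketch: an undeclared Western European page saved as
# Windows-1252, decoded with two different browser fallback encodings.
text = "Blåbærsyltetøy"               # hypothetical Norwegian sample
legacy_bytes = text.encode("cp1252")  # what the server would send


def decode_with_default(data: bytes, default: str) -> str:
    """Decode undeclared bytes with the given fallback encoding.

    errors="replace" stands in for a browser showing U+FFFD
    instead of refusing to render the page.
    """
    return data.decode(default, errors="replace")


as_cp1252 = decode_with_default(legacy_bytes, "cp1252")
as_utf8 = decode_with_default(legacy_bytes, "utf-8")

print(as_cp1252)  # the original word, intact
print(as_utf8)    # the non-ASCII letters are replaced by U+FFFD
```

Only the three non-ASCII letters break; the rest of the word survives, 
which is why such pages look "mostly readable but wrong".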

>> (Quite often it sounds as
>> if some see Latin-1 - or Windows-1252 as we now should say - as a
>> 'super default' rather than a locale default. If that is the case, that
>> it is a super default, then we should also spec it like that! Until
>> further, I'll treat Latin-1 as it is specced: As a default for certain
>> locales.)
> 
> That's exactly what it is.

A default for certain locales? Right.

>> Since it is a locale problem, we need to understand which locale you
>> have - and/or which locale you - and other debaters - think they have.
> 
> Again, doesn't matter if you change your settings from the default.

I don't think I have misunderstood anything.
 
>> However, you also say that your problem is not so much related to pages
>> written for *your* locale as it is related for pages written for users
>> of *other* locales. So how many times per year do Dutch, Spanish or
>> Norwegian - and other non-English - pages create trouble for
>> you, as an English locale user? I am making an assumption: Almost never.
>> You don't read those languages, do you?
> 
> Did you miss the "travel" part?  Want to look up web pages for museums, 
> airports, etc in a non-English speaking country?  There's a good chance 
> they're not in English!

There is a very good chance, also, that only very few of the Web pages 
for such professional institutions would fail to declare their encoding.

>> This is also an expectation thing: If you visit a Russian page in a
>> legacy Cyrillic encoding, and get mojibake because your browser
>> defaults to Latin-1, then what does it matter to you whether your
>> browser defaults to Latin-1 or UTF-8? Answer: Nothing.
> 
> Yes.  So?

So we can leave Greek, Cyrillic, Japanese, Chinese etc. out of 
this debate. The only remaining benefit, for English locale users, of 
keeping Win-1252 as the default is that they get slightly fewer 
problems when visiting Western European language web pages with 
their computer. (Yes, I saw that you mention smart quotes etc. below - 
so there is that reason too.) 
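
As a rough illustration of why the legacy Cyrillic case is neutral in 
this debate, the following Python sketch (the sample word and the choice 
of KOI8-R are assumptions, not something taken from the thread) decodes 
the same undeclared bytes under both candidate defaults:

```python
# Illustrative sketch: a legacy Cyrillic page is unreadable under either
# candidate default, so it does not favour Win-1252 over UTF-8.
russian = "Привет"                       # hypothetical sample text
koi8_bytes = russian.encode("koi8_r")    # one legacy Cyrillic encoding

under_cp1252 = koi8_bytes.decode("cp1252", errors="replace")
under_utf8 = koi8_bytes.decode("utf-8", errors="replace")

print(under_cp1252)  # Latin letters and symbols, not Cyrillic
print(under_utf8)    # replacement characters only

# Only the correct (declared or manually chosen) encoding recovers it.
recovered = koi8_bytes.decode("koi8_r")
```

Under Win-1252 the result is Latin-looking mojibake, under UTF-8 it is a 
run of replacement characters; neither is readable.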

>> I think we should 'attack' the dominating locale first: The English
>> locale, in its different incarnations (Australian, American, UK). Thus,
>> we should turn things on the head: English users should start to expect
>> UTF-8 to be used. Because, as English users, you are more used to
>> 'mojibake' than the rest of us are: Whenever you see it, you 'know'
>> that it is because it is a foreign language you are reading.
> 
> Modulo smart quotes (and recently unicode ellipsis characters).  These 
> are actually pretty common in English text on the web nowadays, and have 
> a tendency to be in "ISO-8859-1".

If we change the default, they will start to tend to be in UTF-8.
 
>> Or, please, explain to us when and where it
>> is important that English language users living in their own, native
>> lands so to speak, need that their browser default to Latin-1 so that
>> they can correctly read English language pages?
> 
> See above.

OK: Quotation marks. However, in 'old web pages' you also find 
much more use of HTML entities (such as &ldquo;) than you find today. 
We should take advantage of that, no?
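
A small Python sketch of the difference, under the assumption that the 
page is otherwise plain ASCII (the sample markup is hypothetical, and 
html.unescape from the standard library stands in for a browser's 
entity handling):

```python
import html

# Quotes written as entities: the bytes on the wire are pure ASCII.
entity_markup = b"&ldquo;Hello&rdquo;"

# Quotes written as raw Windows-1252 bytes (0x93 and 0x94).
raw_cp1252 = "\u201cHello\u201d".encode("cp1252")

# Entities give the same result whichever default is applied.
entity_results = {
    default: html.unescape(entity_markup.decode(default))
    for default in ("cp1252", "utf-8")
}

# Raw bytes depend on the default.
raw_under_cp1252 = raw_cp1252.decode("cp1252")
raw_under_utf8 = raw_cp1252.decode("utf-8", errors="replace")

print(entity_results)
print(raw_under_cp1252, raw_under_utf8)
```

So pages that spell their quotation marks as entities are unaffected by 
a change of default; only the raw-byte variant breaks under UTF-8.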
 
>> See? We would have a plan. Or what do you think?
> 
> Try it in your browser.  When I set UTF-8 as my default, there were 
> broken quotation marks all over the web for me.  And I'm talking pages in 
> English.

When you mention quotation marks, you mention a real locale-related 
issue. And maybe the Euro sign too?  Nevertheless, the problem 
is smallest for languages that primarily limit their alphabet to those 
letters that are present in the American Standard Code for Information 
Interchange (ASCII). It would be logical, thus, to start the switch to 
UTF-8 for those locales (since the US-ASCII part is unchanged in UTF-8).
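
The claim about the US-ASCII part can be checked directly. This is a 
minimal Python sketch with a hypothetical sample page; the Euro sign 
line shows the kind of character that does differ:

```python
# Every byte in the US-ASCII range decodes identically under both
# defaults, so a purely ASCII page cannot tell them apart.
all_ascii = bytes(range(128))
same_for_ascii = all_ascii.decode("cp1252") == all_ascii.decode("utf-8")

ascii_page = b"Plain English text, straight quotes only."
page_cp1252 = ascii_page.decode("cp1252")
page_utf8 = ascii_page.decode("utf-8")

# The Euro sign is outside ASCII: one byte (0x80) in Windows-1252,
# which is not valid on its own in UTF-8.
euro_cp1252 = "\u20ac".encode("cp1252")
euro_under_utf8 = euro_cp1252.decode("utf-8", errors="replace")

print(same_for_ascii, page_cp1252 == page_utf8, euro_under_utf8)
```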

Perhaps we need a project to measure these problems, instead of 
all these anecdotes? And, yes, I have tried to use UTF-8 as default. It 
tends to work well, in my locale, I'd say. However, it also tends to work 
*differently* in *different browsers*. I think, e.g., that it works 
*better* in Firefox than in WebKit. This is because WebKit applies 
the encoding differently to iframe pages compared to what 
Firefox does - I filed several WebKit bugs relating to that ... 
-- 
Leif Halvard Silli

Received on Monday, 5 December 2011 15:14:01 UTC