
[whatwg] Default encoding to UTF-8?

From: Sergiusz Wolicki <sergiusz@wolicki.com>
Date: Mon, 5 Dec 2011 18:57:45 +0100
Message-ID: <CAALsyKos01wM+cmmn7195DAxCqhKgKQ8zvGF36n4KyVzi_rcxw@mail.gmail.com>
> (And HTML5 defines it the same.)

No. As far as I understand, HTML5 defines US-ASCII to be the default and
requires that any other encoding is explicitly declared. I do like this
approach.

We should also lobby for authoring tools (as recommended by HTML5) to
default their output to UTF-8 and make sure the encoding is declared. Since
so many pages supposedly use an incorrect encoding (I have not researched
this), it makes no sense to try to clean up this mess by changing the
existing defaults: that may fix some pages and break others. Browsers
already let users override an incorrect encoding, and this is a reasonable
workaround.
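
To make the "fix some pages and break others" point concrete, here is a
minimal Python sketch. It is only an illustration: the helper name
decode_with_default and the sample text are my own assumptions, and real
browsers do considerably more sniffing than this.

```python
# Illustrative sketch only: the helper name and sample text are assumptions.
def decode_with_default(raw: bytes, default: str) -> str:
    """Decode bytes that carry no encoding declaration using a fallback default."""
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
    return raw.decode(default, errors="replace")


# Two pages with the same text, saved in different encodings, neither declared.
utf8_page = "café".encode("utf-8")
legacy_page = "café".encode("windows-1252")

for default in ("windows-1252", "utf-8"):
    print(
        default,
        repr(decode_with_default(utf8_page, default)),
        repr(decode_with_default(legacy_page, default)),
    )
```

Whichever default is chosen, one of the two undeclared pages comes out
garbled; only an explicit declaration makes both readable.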


-- Sergiusz


On Mon, Dec 5, 2011 at 6:42 PM, Leif Halvard Silli <
xn--mlform-iua at xn--mlform-iua.no> wrote:

> L. David Baron on Wed Nov 30 18:29:31 PST 2011:
> > On Wednesday 2011-11-30 15:28 -0800, Faruk Ates wrote:
> >> My understanding is that all browsers* default to Western Latin
> >> (ISO-8859-1) encoding (for Western-world
> >> downloads/OSes) due to legacy content on the web. But how relevant
> >> is that still today? Has any browser done any recent research into
> >> the need for this?
> >
> > The default varies by localization (and within that potentially by
> > platform), and unfortunately that variation does matter.  You can
> > see Firefox's defaults here:
> >
> http://mxr.mozilla.org/l10n-mozilla-beta/search?string=intl.charset.default
> > (The localization and platform are part of the filename.)
>
> Last I checked, some of those locales defaulted to UTF-8. (And HTML5
> defines it the same.) So how is that possible? Don't users of those
> locales travel as much as you do? Or do we consider English locale
> users more important? Something is broken in the logic here!
>
> > I changed my Firefox from the ISO-8859-1 default to UTF-8 years ago
> > (by changing the "intl.charset.default" preference), and I do see a
> > decent amount of broken content as a result (maybe I encounter a new
> > broken page once a week? -- though substantially more often if I'm
> > looking at non-English pages because of travel).
>
> What kind of trouble are you actually describing here? You are
> describing a problem with using UTF-8 for *your locale*. What is your
> locale? It is probably English. Or do you consider your locale to be
> 'the Western world locale'? It sounds like *that* is what Anne has in
> mind when he brings in Dutch:
> http://blog.whatwg.org/weekly-encoding-woes (Quite often it sounds as
> if some see Latin-1 - or Windows-1252, as we should now say - as a
> 'super default' rather than a locale default. If it really is a super
> default, then we should also spec it like that! Until then, I'll
> treat Latin-1 as it is specced: as a default for certain
> locales.)
>
> Since it is a locale problem, we need to understand which locale you
> have - and/or which locale you and the other debaters think you have.
> Faruk probably uses a Spanish locale, right? If so, the two of you are
> not speaking from the same context.
>
> However, you also say that your problem is not so much related to pages
> written for *your* locale as it is related to pages written for users
> of *other* locales. So how many times per year do Dutch, Spanish,
> Norwegian and other non-English pages create trouble for
> you, as an English locale user? I am making an assumption: Almost never.
> You don't read those languages, do you?
>
> This is also an expectation thing: If you visit a Russian page in a
> legacy Cyrillic encoding, and get mojibake because your browser
> defaults to Latin-1, then what does it matter to you whether your
> browser defaults to Latin-1 or UTF-8? Answer: Nothing.
>
> >> I'm wondering if it might not be good to start encouraging
> >> defaulting to UTF-8, and only fallback to Western Latin if it is
> >> detected that the content is very old / served by old
> >> infrastructure or servers, etc. And of course if the content is
> >> served with an explicit encoding of Western Latin.
> >
> > The more complex the rules, the harder they are for authors to
> > understand / debug.  I wouldn't want to create rules like those.
>
> Agree that that particular idea is probably not the best.
>
> > I would, however, like to see movement towards defaulting to UTF-8:
> > the current situation makes the Web less world-wide because pages
> > that work for one user don't work for another.
> >
> > I'm just not quite sure how to get from here to there, though, since
> > such changes are likely to make users experience broken content.
>
> I think we should 'attack' the dominating locale first: The English
> locale, in its different incarnations (Australian, American, UK). Thus,
> we should turn things on their head: English users should start to expect
> UTF-8 to be used. Because, as English users, you are more used to
> 'mojibake' than the rest of us are: Whenever you see it, you 'know'
> that it is because it is a foreign language you are reading. It is we,
> the users of non-English locales, that need the default-to-legacy
> encoding behavior the most. Or, please, explain to us when and where it
> is important that English language users living in their own, native
> lands, so to speak, need their browser to default to Latin-1 so that
> they can correctly read English language pages?
>
> If the English locales start defaulting to UTF-8, then little by
> little, the same expectation etc will start spreading to the other
> locales as well, not least because the 'geeks' of each locale will tend
> to see the English locale as a super default - and they might also use
> the US English locale of their OS and/or browser. We should not
> consider the needs of geeks - they will follow (read: lead) the way, so
> the fact that *they* may see mojibake, should not be a concern.
>
> See? We would have a plan. Or what do you think? Of course, we - or
> rather: the browser vendors - would need to market this as an important
> change. The HTML5 spec already justifies the use of UTF-8 in several
> places - it says that pages might not work as expected, e.g. w.r.t.
> URLs, unless UTF-8 is used. So there are enough arguments that can
> be used.
>
> There are other technical ideas I have, such as treating the BOM the
> way WebKit and IE treat it - that would slightly increase the number of
> pages treated as UTF-8 by all browsers [1]. However, that can
> wait: The most important thing is to *initiate* the default
> encoding change.
>
> [1] https://www.w3.org/Bugs/Public/show_bug.cgi?id=12897
>
> Leif Halvard Silli
>
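
One possible reading of the BOM idea in the quoted message, sketched in
Python. This is an assumption-laden illustration, not a description of what
any browser actually does: the function name pick_encoding, its parameters,
and the precedence order (a UTF-8 BOM wins, then the declared encoding, then
the locale default) are hypothetical choices made for the example.

```python
import codecs
from typing import Optional


def pick_encoding(raw: bytes, declared: Optional[str], locale_default: str) -> str:
    """Choose an encoding for a page (hypothetical precedence, for illustration)."""
    # Assumption: a leading UTF-8 byte order mark overrides everything else.
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # Otherwise honor an explicit declaration (HTTP header or meta element).
    if declared:
        return declared
    # With no BOM and no declaration, fall back to the locale's default.
    return locale_default


print(pick_encoding(codecs.BOM_UTF8 + b"<p>hei</p>", None, "windows-1252"))
print(pick_encoding(b"<p>hei</p>", None, "windows-1252"))
```

Under these assumptions, a page that starts with a BOM is treated as UTF-8
regardless of the locale default, while pages without a BOM or declaration
still depend on that default.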
Received on Monday, 5 December 2011 09:57:45 UTC
