[whatwg] Default encoding to UTF-8?

From: Sergiusz Wolicki <sergiusz@wolicki.com>
Date: Thu, 1 Dec 2011 18:48:16 +0100
Message-ID: <CAALsyKr+GM5Kw1m=RaUG0Ene9M6VAfgDCeo-3-y51+tPc-iG_g@mail.gmail.com>
I have read section 4.2.5.5 of the WHATWG HTML spec and I think it is
sufficient.  It requires that any non-US-ASCII document has an explicit
character encoding declaration. It also recommends UTF-8 for all new
documents and for authoring tools' default encoding.  Therefore, any
document conforming to HTML5 should not pose any problem in this area.
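
As a rough illustration of that requirement, here is a minimal Python sketch
that flags a document containing non-US-ASCII bytes but no in-document
encoding declaration. The function name and the regular expression are
hypothetical simplifications, not the spec's own algorithm. The sketch
ignores the HTTP Content-Type header, and the 1024-byte window reflects the
rule in current versions of the HTML standard.

```python
import re

# Simplified, assumed pattern: any <meta ...> tag that mentions "charset=".
# This is NOT the prescan algorithm defined by the HTML standard.
META_CHARSET = re.compile(rb"<meta[^>]*charset\s*=", re.IGNORECASE)


def lacks_needed_declaration(data: bytes) -> bool:
    """Return True if data has non-ASCII bytes but no BOM or <meta> declaration."""
    has_non_ascii = any(b > 0x7F for b in data)
    has_bom = data.startswith(b"\xef\xbb\xbf")  # UTF-8 byte order mark
    # Only the first 1024 bytes are searched for a <meta> declaration.
    has_meta = META_CHARSET.search(data[:1024]) is not None
    return has_non_ascii and not (has_bom or has_meta)


if __name__ == "__main__":
    undeclared = "<p>Zażółć gęślą jaźń</p>".encode("utf-8")
    declared = b'<meta charset="utf-8">' + undeclared
    print(lacks_needed_declaration(b"<p>plain ASCII</p>"))
    print(lacks_needed_declaration(undeclared))
    print(lacks_needed_declaration(declared))
```

A pure US-ASCII page passes this check without any declaration, which is why
only non-ASCII documents are affected by the rule.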

The default encoding issue is therefore for old stuff.  But I have seen a
lot of pages, in browsers and in mail, that were tagged with one encoding
and encoded in another.  Hence, documents without a charset declaration are
only one of the reasons for the garbage we see. Therefore, I see no point in
trying to fix anything in browsers by changing the ancient defaults
(risking compatibility issues). Energy should go into filing bugs against
misbehaving authoring tools and into adding proper recommendations and
education in HTML guidelines and tutorials.
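
To make both failure modes concrete, here is a small Python sketch. The
sample text and the render helper are made up for illustration, and
windows-1250 and windows-1252 merely stand in for two different legacy
defaults. It decodes the same bytes under the right and the wrong encoding,
which is what happens both when a page is mislabeled and when an undeclared
page meets a different locale default.

```python
# Hypothetical sample text; any non-ASCII string would behave similarly.
text = "Zażółć gęślą jaźń"


def render(data: bytes, assumed: str) -> str:
    """Decode bytes the way a browser would if it assumed the given encoding."""
    return data.decode(assumed, errors="replace")


utf8_bytes = text.encode("utf-8")
cp1250_bytes = text.encode("windows-1250")

# Correct label or matching default: the text round-trips.
print(render(utf8_bytes, "utf-8"))
print(render(cp1250_bytes, "windows-1250"))

# Wrong label or different default: garbage (mojibake).
print(render(utf8_bytes, "windows-1252"))
print(render(cp1250_bytes, "windows-1252"))

# An ASCII-only page looks the same under all of these encodings.
print(render(b"plain ASCII", "windows-1252"))
```

Changing the browser default would only affect the undeclared case; a page
tagged with the wrong encoding stays broken regardless of the default.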


Thanks,
Sergiusz


On Thu, Dec 1, 2011 at 7:00 AM, L. David Baron <dbaron at dbaron.org> wrote:

> On Thursday 2011-12-01 14:37 +0900, Mark Callow wrote:
> > On 01/12/2011 11:29, L. David Baron wrote:
> > > The default varies by localization (and within that potentially by
> > > platform), and unfortunately that variation does matter.
> > In my experience this is what causes most of the breakage. It leads
> > people to create pages that do not specify the charset encoding. The
> > page works fine in the creator's locale but shows mojibake (garbage
> > characters) for anyone in a different locale.
> >
> > If the default was ASCII everywhere then all authors would see mojibake,
> > unless it really was an ASCII-only page, which would force them to set
> > the charset encoding correctly.
>
> Sure, if the default were consistent everywhere we'd be fine.  If we
> have a choice in what that default is, UTF-8 is probably a good
> choice unless there's some advantage to another one.  But nobody's
> figured out how to get from here to there.
>
> (I think this is legacy from the pre-Unicode days, when the browser
> simply displayed Web pages using the system character set, which
> led to a legacy of incompatible Web pages in different parts of the
> world.)
>
> -David
>
> --
> L. David Baron                         http://dbaron.org/
> Mozilla                           http://www.mozilla.org/
>
Received on Thursday, 1 December 2011 09:48:16 UTC
