[whatwg] Default encoding to UTF-8? from Henri Sivonen on 2011-12-02 (public-whatwg-archive@w3.org from December 2011)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Fri, 2 Dec 2011 17:46:02 +0200
Message-ID: <CAJQvAucYTL+A2z6EA-3hpmSe5sCOE4wCFw39XaVN8Y_MYGcy_w@mail.gmail.com>
On Thu, Dec 1, 2011 at 1:28 AM, Faruk Ates <farukates at me.com> wrote:
> My understanding is that all browsers* default to Western Latin (ISO-8859-1) encoding by default (for Western-world downloads/OSes) due to legacy content on the web.

As has already been pointed out, the default depends varies by locale.

> But how relevant is that still today?

It's relevant for supporting the long tail of existing content. The
sad part is that the mechanisms that allows existing legacy content to
work within each locale silo also makes it possible for ill-informed
or uncaring authors to develop more locale-siloed content (i.e.
content that doesn't declare the encoding and, therefore, only works
when the user's fallback encoding is the same as the author's).

> I'm wondering if it might not be good to start encouraging defaulting to UTF-8, and only fallback to Western Latin if it is detected that the content is very old / served by old infrastructure or servers, etc. And of course if the content is served with an explicit encoding of Western Latin.

I think this would be a very bad idea. It would make debugging hard.
Moreover, it would be the wrong heuristic, because well-maintained
server infrastructure can host a lot of legacy content. Consider any
shared hosting situation where the administrator of the server
software isn't the content creator.

> We like to think that ?every web developer is surely building things in UTF-8 nowadays? but this is far from true. I still frequently break websites and webapps simply by entering my name (Faruk Ate?).

For things to work, the server-side component needs to deal with what
gets sent to it. ASCII-oriented authors could still mishandle all
non-ASCII even if Web browsers "forced" them to deal with UTF-8 by
sending them UTF-8.

Furthermore, your proposed solution wouldn't work for legacy software
that correctly declares an encoding but declared a non-UTF-8 encoding.

Sadly, getting sites to deal with your name properly requires the
developer of each site to get a clue. :-( Just sending form
submissions in UTF-8 isn't enough if the recipient can't deal. Compare
with http://krijnhoetmer.nl/irc-logs/whatwg/20110906#l-392

> Yes, I understand that that particular issue is something we ought to fix through evangelism, but I think that WHATWG/browser vendors can help with this while at the same time (rightly, smartly) making the case that the web of tomorrow should be a UTF-8 (and 16) based one, not a smorgasbord of different encodings.

Anne has worked on speccing what exactly the smorgasbord should be.
See http://wiki.whatwg.org/wiki/Web_Encodings . I think it's not
realistic to drop encodings that are on the list of encodings you see
in the encoding menu on http://validator.nu/?charset However, I think
browsers should drop support for encodings that aren't already
supported by all the major browsers, because such encodings only serve
to enable browser-specific content and encoding proliferation.

Regarding your "(and 16)" remark, considering my personal happiness at
work, I'd prioritize the eradication of UTF-16 as an interchange
encoding much higher than eradicating ASCII-based non-UTF-8 encodings
that all major browsers support. I think suggesting a solution to the
encoding problem while implying that UTF-16 is not a problem isn't
particularly appropriate. :-)

> So hence my question whether any vendor has done any recent research in this. Mobile browsers seem to have followed desktop browsers in this; perhaps this topic was tested and researched in recent times as part of that, but I couldn't find any such data. The only real relevant thread of discussion around UTF-8 as a default was this one about Web Workers:
> http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-September/023197.html
>
> ?which basically suggested that everyone is hugely in favor of UTF-8 and making it a default wherever possible.
>
> So how 'bout it?

I think in order to comply with the Support Existing Content design
principle (even if it unfortunately means that support is siloed by
locale) and in order to make plans that are game theoretically
reasonable (not taking steps that make users migrate to browsers that
haven't taken the steps), I think we shouldn't change the fallback
encodings from what the HTML5 spec says when it comes to loading
text/html or text/plain content into a browsing context.

> What's going in this area, if anything?

There's the effort to specify a set of encodings and their aliases for
browsers to support. That's moving slowly, since Anne has other more
important specs to work on.

Other than that, there have been efforts to limit new features to
UTF-8 only (consider scripts in Workers and App Cache manifests) and
efforts to make new features not vary by locale-dependent defaults
(consider HTML in XHR). Both these efforts have faced criticism,
unfortunately.

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/
Received on Friday, 2 December 2011 07:46:02 UTC