RE: Guessing the fallback encoding from the top-level domain name before trying to guess from the browser localization from Phillips, Addison on 2013-12-20 (www-international@w3.org from October to December 2013)

From: Phillips, Addison <addison@lab126.com>
Date: Fri, 20 Dec 2013 15:48:28 +0000
To: Henri Sivonen <hsivonen@hsivonen.fi>, "www-international@w3.org" <www-international@w3.org>
Message-ID: <7C0AF84C6D560544A17DDDEB68A9DFB517BF46E4@ex10-mbx-36009.ant.amazon.com>

> Specifically, it's inspired by this case:
> Suppose you are in Hong Kong or Taiwan and run the Traditional Chinese
> localization of Firefox. This means that the fallback encoding (currently without
> this feature I'm proposing) is Big5, which makes sense for unlabeled legacy
> Traditional Chinese content. However, this fallback fails if you want to read
> Simplified Chinese legacy content published in the PRC. However, to the extent
> that content is published under the .cn TLD, you'd have a better experience if
> the .cn TLD made your browser apply the guess that the Simplified Chinese
> localization currently uses.

I think this is a good idea. Looking at the TLD first is at least more useful than relying on a single fixed "guess" derived from the localization.
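
To make the lookup order concrete, here is a minimal Python sketch of "TLD first, localization second". It is only an illustration under stated assumptions: the function name, both tables, and the final default are hypothetical, and the encoding assigned to each TLD or locale is my assumption rather than anything a browser is known to ship (only Big5 for the Traditional Chinese localization comes from the message quoted above).

```python
from urllib.parse import urlsplit

# Hypothetical TLD -> fallback encoding table (illustrative assumptions only).
TLD_FALLBACK = {
    "cn": "gbk",   # assumed guess for Simplified Chinese legacy content
    "tw": "big5",
    "hk": "big5",
}

# Hypothetical localization -> fallback encoding table (today's fixed "guess").
LOCALE_FALLBACK = {
    "zh-TW": "big5",
    "zh-CN": "gbk",
}

# Assumed last-resort default when neither table has an entry.
DEFAULT_FALLBACK = "windows-1252"


def guess_fallback_encoding(url, ui_locale):
    """Return a fallback encoding for an unlabeled page.

    The TLD of the URL is consulted first; the browser localization is
    used only when the TLD gives no hint.
    """
    host = (urlsplit(url).hostname or "").rstrip(".")
    tld = host.rpartition(".")[2].lower()
    if tld in TLD_FALLBACK:
        return TLD_FALLBACK[tld]
    return LOCALE_FALLBACK.get(ui_locale, DEFAULT_FALLBACK)
```

With these assumed tables, a user running the zh-TW localization would get "gbk" for a page under .cn and "big5" for a page under .com, which is the Hong Kong/Taiwan case described in the quoted text.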
> >
> > * When I filed a bug to get Webkit to do some UTF-8 sniffing, I was
> > told, as a negative thing, that users would then rely on it (instead of
> > labeling their code as UTF-8) and that this could decrease
> > interoperability.
> 
> This proposal never ends up guessing UTF-8, so this never gives anyone a
> reason not to declare UTF-8.

While I tend to agree that declaring the encoding (any encoding) should be encouraged, I find it somewhat strange that the one encoding that can be detected fairly reliably from its bytes, and which we want all pages to use, is the one encoding we DON'T detect or guess. Shouldn't getting the page right be more important than punishing the authors?
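
As a rough illustration of why UTF-8 is detectable from its bytes, here is a minimal Python sketch. The function name is hypothetical and this is not how any particular browser sniffs; it simply relies on the fact that a strict UTF-8 decode rejects byte sequences that do not follow the UTF-8 structure, which legacy multi-byte text rarely does by accident.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic check: does this byte string look like UTF-8?

    Pure ASCII input returns False, because it carries no evidence
    either way (ASCII bytes are identical in most legacy encodings).
    """
    try:
        data.decode("utf-8")  # strict decoding raises on malformed input
    except UnicodeDecodeError:
        return False
    # Require at least one non-ASCII byte before claiming UTF-8.
    return any(b >= 0x80 for b in data)
```

For example, the string "中文" encoded as UTF-8 passes this check, while the same string encoded as Big5 fails it, because its first byte is not a valid UTF-8 lead byte. A real sniffer would also have to cope with input that is cut off in the middle of a multi-byte sequence, which this sketch does not attempt.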

Addison

Received on Friday, 20 December 2013 15:49:02 UTC