Re: Locale/default encoding table from Henri Sivonen on 2009-10-15 (public-html@w3.org from October 2009)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Thu, 15 Oct 2009 15:55:14 +0300
To: "Phillips, Addison" <addison@amazon.com>
Cc: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>, Ian Hickson <ian@hixie.ch>, Geoffrey Sneddon <gsneddon@opera.com>, HTML WG <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>
Message-Id: <D71BFDE3-8470-4BB9-9DA1-544E79AFD7A2@iki.fi>

On Oct 14, 2009, at 17:18, Phillips, Addison wrote:

>>
>> I rather suspect that UTF-8 isn't the best default for any locale,
>> since real UTF-8 content is unlikely to rely on the last defaulting
>> step for decoding. I don't know why some Firefox localizations
>> default to UTF-8.
>
> Why do you assume that UTF-8 pages are better labeled than other  
> encodings?

Because most of the global browser installed base (including en-US
browsers deployed around the world) neither defaults to UTF-8 nor
ships with chardet enabled, UTF-8 doesn't work right unless it is
labeled or unless the user takes action.

It seems to me that unlabeled UTF-8 could only work out-of-the-box for  
two reasons:

  1) The browser defaults to UTF-8 in a given locale, which lets
authors in that locale be sloppy and not label their encodings. (A BOM
counts as a label.) In this scenario, it's not about an age-old legacy
but about the locale-specific default generating a new legacy. (For
this reason, I think it's rather questionable to ship
UTF-8-defaulting browsers to any locale.)

  2) A heuristic detector that supports UTF-8 is on by default in the
locale. However, in the locales where a detector is on by default
(Russian, Ukrainian, Japanese), the legacy is well known not to be
predominantly UTF-8. (The Swedish localization of Firefox also ships
with a detector on by default, but that's clearly bogus.)
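
To make the two cases concrete, here is a rough Python sketch of the
fallback order I have in mind. It is a simplification under stated
assumptions, not any browser's actual code: the names pick_encoding and
looks_like_utf8 are hypothetical, the detector is a toy, and real
browsers consult more sources before reaching the last defaulting step.

```python
import codecs

def pick_encoding(data, label=None, detector=None,
                  locale_default="windows-1252"):
    """Return (encoding, reason) for a byte stream.

    Simplified order: BOM, explicit label, optional heuristic
    detector, then the locale-specific default as the last step.
    """
    # A UTF-8 BOM counts as a label.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8", "bom"
    if label:
        return label.lower(), "label"
    # The detector only runs if the localization ships with it on.
    if detector is not None:
        guess = detector(data)
        if guess:
            return guess, "detector"
    return locale_default, "locale default"

def looks_like_utf8(data):
    """Toy detector (assumption): guess UTF-8 only when the bytes
    contain non-ASCII and decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return "utf-8" if any(b >= 0x80 for b in data) else None

# Unlabeled UTF-8 bytes, as a sloppy author might serve them.
page = "päivää".encode("utf-8")

print(pick_encoding(page))                            # typical install
print(pick_encoding(page, locale_default="utf-8"))    # case 1
print(pick_encoding(page, detector=looks_like_utf8))  # case 2
```

With the typical configuration (no detector, a non-UTF-8 default), the
unlabeled page falls through to the locale default and is misdecoded.
It only comes out right when the default itself is UTF-8 or when a
UTF-8-aware detector is enabled.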

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Thursday, 15 October 2009 12:58:05 UTC