Re: Locale/default encoding table from Leif Halvard Silli on 2009-10-14 (public-html@w3.org from October 2009)

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Wed, 14 Oct 2009 20:37:57 +0200
To: Ian Hickson <ian@hixie.ch>
CC: Andrew Cunningham <andrewc@vicnet.net.au>, "Phillips, Addison" <addison@amazon.com>, Geoffrey Sneddon <gsneddon@opera.com>, HTML WG <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>, Henri Sivonen <hsivonen@iki.fi>, Maciej Stachowiak <mjs@apple.com>, Mark Davis ☕ <mark@macchiato.com>, "Martin J. Dürst" <duerst@it.aoyama.ac.jp>, Richard Ishida <ishida@w3.org>, Larry Masinter <masinter@adobe.com>
Message-ID: <4AD61A85.7000304@xn--mlform-iua.no>

Ian Hickson On 09-10-14 07.50:

> On Wed, 14 Oct 2009, Andrew Cunningham wrote:
>> There seems to be two fundamentally different approaches to fall back, 
>>
>> 1) use UTF-8 as the fall back.
>>
>> 2) base selection of fall back legacy encoding on another language 
>> widely used by target user group, i.e. if language is a non-national 
>> language, select a national language and use that to choose the fall 
>> back legacy encoding.

> As far as I can tell there is only one approach that works, and that is 
> setting the default to be whatever encoding is used by the majority of 
> unlabeled documents read by the product's intended users.

As Addison said: the encoding used by the majority of unlabeled 
documents for a user group is typically the same as the encoding 
used by the majority of labeled documents for that user group.

> On Wed, 14 Oct 2009, Leif Halvard Silli wrote:
>> So where does Windows 1252 as default for Bengali, Tamil etc fit in 
>> here?
> 
> At a guess, pages in those languages are mostly correctly labeled or 
> correctly autodetected, and so the fallback is unnecessary; 

If "unnecessary", then why default to Windows 1252?

> or the users 
> use more pages from "Western European" languages (as you put it) than 
> their own. 

Select a Russian, Hindi or Hebrew encoding and go to www.CNN.com. 
It reads just fine! Even the word Beyoncé is readable. (Authors 
apparently never quite learned to stop using HTML entities.) You 
can go to English sites in Russia or India or Israel and have 
similar experiences. Perhaps a quote mark here and there will 
fail. Usually not much more.
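
A small sketch of why such pages survive the "wrong" default. It 
assumes a hypothetical page fragment written in pure ASCII with a 
character reference, and it uses windows-1251, KOI8-R and 
windows-1255 as stand-ins for Russian and Hebrew defaults (picked 
for illustration only, not taken from any browser's table):

```python
import html

# Hypothetical page fragment: pure ASCII bytes, with the non-ASCII
# letter written as a character reference instead of a raw byte.
PAGE = b"<p>Beyonc&eacute; sings</p>"

# Illustrative defaults only (assumed, not taken from any browser).
DEFAULTS = ["windows-1252", "windows-1251", "koi8-r", "windows-1255"]


def render(raw, encoding):
    """Decode bytes with the given default, then resolve references."""
    return html.unescape(raw.decode(encoding))


rendered = {enc: render(PAGE, enc) for enc in DEFAULTS}
same_everywhere = len(set(rendered.values())) == 1

# Curly quotes stored as raw windows-1252 bytes are not ASCII, so a
# different default may turn them into other characters.
QUOTED = b"\x93hello\x94"
quotes_differ = QUOTED.decode("koi8-r") != QUOTED.decode("windows-1252")

print(same_everywhere, quotes_differ)
```

Under these assumptions the ASCII-plus-references fragment renders 
identically under every listed default, while the raw quote bytes 
come out differently under KOI8-R than under Windows 1252.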

The real problem arises when the common, "legacy" orthography for 
your locale (consider " and ' on the Web versus ‘’ and “” on 
paper) needs more than ASCII *and* is covered by Windows 1252. 
*Then* Win 1252 seems reasonable as a fallback.

English locales do not need Win 1252 as fallback for the same 
fundamental reasons that we non-English, non-ASCII Win-1252-ers do.

So if your locale is non-English and non-French but belongs to the 
Francophonie, like some Arabic countries, and the legacy content 
in Arabic is close to nonexistent (I don't know the status), then 
Win-1252 may be a useful fallback.

But if your locale is non-English, yet belongs to the 
"Anglophonic" countries, such as India, then it doesn't seem 
obvious to select Win-1252 as fallback. It seems obvious to 
consider the effect on the locale's native language first, since 
the basic English needs are covered by the ASCII charset anyhow.

> Or, of course, the default Mozilla uses could be wrong.

It doesn't need to be wrong or right. Just not optimal. Optimal 
for a purpose. What purpose? One of the things we have to take 
into account is that within a locale there may be subgroups which 
are not covered by the default encoding of that locale.

If the default is a "wide" default, such as UTF-8, then it is 
easier for minorities to live out their language.

But if UTF-8 is auto-detected, as Addison and Mark suggested 
should be required, then we can more easily care for all parties, 
as this seems to me to give us two defaults: UTF-8 and the legacy 
encoding.
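
A minimal sketch of the "two defaults" idea, assuming that 
detection simply means "the bytes are valid UTF-8". The function 
name decode_unlabeled and its legacy_fallback parameter are 
hypothetical, and a real browser's detection would be more 
involved than this:

```python
from typing import Tuple


def decode_unlabeled(raw: bytes,
                     legacy_fallback: str = "windows-1252") -> Tuple[str, str]:
    """Return (text, encoding used) for a document with no label.

    First default: UTF-8, accepted only if the bytes are valid UTF-8.
    Second default: the locale's legacy encoding (an assumed parameter).
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8, so fall back to the legacy encoding and
        # substitute U+FFFD for any byte that encoding cannot map.
        return raw.decode(legacy_fallback, errors="replace"), legacy_fallback


print(decode_unlabeled("Beyoncé".encode("utf-8")))
print(decode_unlabeled("Beyoncé".encode("windows-1252")))
```

Pure ASCII input is also valid UTF-8, so it takes the first branch 
whatever the legacy fallback is set to.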
-- 
leif halvard silli

Received on Wednesday, 14 October 2009 18:38:44 UTC