Re: Locale/default encoding table from Ian Hickson on 2009-10-14 (public-html@w3.org from October 2009)

From: Ian Hickson <ian@hixie.ch>
Date: Wed, 14 Oct 2009 05:50:42 +0000 (UTC)
To: Andrew Cunningham <andrewc@vicnet.net.au>, Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>, "Phillips, Addison" <addison@amazon.com>
Cc: Geoffrey Sneddon <gsneddon@opera.com>, HTML WG <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>, Henri Sivonen <hsivonen@iki.fi>, Maciej Stachowiak <mjs@apple.com>, Mark Davis ☕ <mark@macchiato.com>, Martin J. Dürst <duerst@it.aoyama.ac.jp>, Richard Ishida <ishida@w3.org>, Larry Masinter <masinter@adobe.com>
Message-ID: <Pine.LNX.4.62.0910140532500.3716@hixie.dreamhostps.com>

On Wed, 14 Oct 2009, Andrew Cunningham wrote:
>
> There seem to be two fundamentally different approaches to fall back, 
> when basing fall back on the UI language
> 
> Selected a legacy encoding that fully supports the language, if the user 
> agent does not support an appropriate encoding,
> 
> 1) use UTF-8 as the fall back.
> 
> 2) base selection of fall back legacy encoding on another language 
> widely used by target user group, i.e. if language is a non-national 
> language, select a national language and use that to choose the fall 
> back legacy encoding.
> 
> Although the second approach in some cases can draw ua developers into 
> political disputes.

As far as I can tell there is only one approach that works, and that is 
setting the default to be whatever encoding is used by the majority of 
unlabeled documents read by the product's intended users.


On Wed, 14 Oct 2009, Leif Halvard Silli wrote:
> 
> So where does Windows 1252 as default for Bengali, Tamil etc fit in 
> here?

At a guess, pages in those languages are mostly correctly labeled or 
correctly autodetected, and so the fallback is unnecessary; or the users 
use more pages from "Western European" languages (as you put it) than 
their own. Or, of course, the default Mozilla uses could be wrong.


On Tue, 13 Oct 2009, Phillips, Addison wrote:
>
> I'm still pretty sure that a table is not the right solution here.
> 
> The text the I18N WG proposed allows the current behavior, which is all 
> that is necessary on a normative level. It uses examples instead of 
> normative language. I'm completely mystified as to why Ian won't discuss 
> that text directly.

Which text?

If you mean the text proposed here:

   http://lists.w3.org/Archives/Public/public-html/2009Aug/1040.html

...then I discussed it here:

   http://lists.w3.org/Archives/Public/public-html/2009Oct/0281.html


> My concern with providing a table is that it preserves, essentially 
> forever, the behavior of browsers in the past.

The behaviour will be preserved whether the spec admits it or not. I see 
no reason to sweep it under the carpet just because we wish the world was 
different.


> Character encoding distribution is and historically has been evolving. 
> As recently as eight years ago, most browsers did not support proper 
> display of UTF-8. Today, the most common encoding on the Web *is* UTF-8. 
> The localization choices of current vendors--whether well- or 
> ill-conceived--should not necessarily be *normative* guidance embedded 
> in the HTML5 spec for future generations of browser vendors.

This issue is not about what the most common encoding might be. This issue 
is about what the most common encoding *in unlabeled content* is.
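To make the distinction concrete, here is a minimal sketch in Python. The 
sample data and the function name are hypothetical, invented purely for 
illustration; they are not measurements of real content.

```python
from collections import Counter

# Hypothetical sample of pages: (declared encoding, actual encoding).
# A declared encoding of None means the page is unlabeled.
SAMPLE_PAGES = [
    ("utf-8", "utf-8"),
    ("utf-8", "utf-8"),
    ("utf-8", "utf-8"),
    (None, "windows-1252"),
    (None, "windows-1252"),
    (None, "utf-8"),
]


def most_common_encoding(pages, unlabeled_only=False):
    """Return the most frequent actual encoding in `pages`.

    With unlabeled_only=True, only pages without a declared encoding
    are counted, which is the population a fallback default applies to.
    Returns None when no page qualifies.
    """
    counts = Counter(
        actual
        for declared, actual in pages
        if not unlabeled_only or declared is None
    )
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

In this made-up sample the most common encoding overall is UTF-8, but the 
most common encoding among the unlabeled pages is windows-1252, and only the 
second figure is relevant to choosing a default.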


> I think that having a table like this is useful information. But it 
> should be "backwards pointing" and separate from HTML. I'd point out: 
> the I18N WG hosts any number of pages documenting information such as 
> this about browsers. I think we'd be very happy to add this to the 
> collection. It could even be referenced from HTML5. Just don't make it 
> part of the spec... because I know many developers who follow exactly 
> what the spec says. And this is *not* appropriate in this case because 
> the encoding environment is still evolving and because many locales have 
> been disadvantaged in the past.

If by developers you mean authors, the spec is very clear that UTF-8 is 
the only recommendation. The table in question, indeed the entire section 
within which the table is found, is in fact not even included in the 
author version of the spec.


On Wed, 14 Oct 2009, Leif Halvard Silli wrote:
> > 
> > For instance, if a particular locale has been using browsers built for 
> > a similar but not identical locale, then it is likely that the content 
> > written by authors in that locale will actually depend on the default 
> > encoding of the legacy surrogate locale. There are a number of 
> > examples of this in the Mozilla localisations (Henri pointed to a few 
> > of them).
> 
> I think I read everything in this thread. Did not see his examples. 
> Where?

http://lists.w3.org/Archives/Public/public-html/2009Oct/0339.html


> > It basically depends on what Ossetian users have been using before 
> > having a dedicated localised product.
> 
> And they are using Russian today. They probably use Ossetian also. They 
> just don't have a localized Firefox browser.

Then they are in the 'ru' locale for the purposes of that table.


> Also: Above you talked about legacy surrogate locales that are similar 
> but not identical. By "similar" you of course at least have in mind 
> "same script". So, could you explain to me why browsers must have the 
> following defaults?
> 
> Default - Locale - Script
> --------|--------|----------------
> win1252 - bn-BD  - Not Latin: Bengali Bangladesh
> win1252 - bn-IN  - Not Latin: Bengali India
> win1252 - el     - Not Latin: Greek
> win1252 - eo     - Win1252 doesn't fully cover Esperanto
> win1252 - mn     - Not Latin: 90% cyrillic users
> win1252 - mr     - Not Latin: Deva script
> win1252 - or     - Not Latin: Orya script
> win1252 - ta     - Not Latin: Tamil script
> win1252 - ta-LK  - Not Latin: (Tamil script?)
> UTF-8   - cy     - Win1252 doesn't fully cover Welsh

All the data in that table is directly derived from the Firefox 
localisations. I do not claim it makes any sense. I only claim that it is 
what is deployed. If there is more reliable data, e.g. IE's deployed 
defaults, or data derived from a study of unlabeled pages most commonly 
visited by users from various locales, I would be more than happy to 
adjust the table accordingly.
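
For illustration only, a lookup over such a table could be sketched as 
follows in Python. The rows are the ones quoted above, with "win1252" 
written out as "windows-1252". The function name, the surrogate mapping, 
and the rule of falling back to the primary language subtag are assumptions 
of this sketch, not something the spec or Firefox defines.

```python
# Rows copied from the table quoted above (derived from the Firefox
# localisations). Only these rows are known here; others are omitted.
DEFAULTS = {
    "bn-BD": "windows-1252",
    "bn-IN": "windows-1252",
    "el": "windows-1252",
    "eo": "windows-1252",
    "mn": "windows-1252",
    "mr": "windows-1252",
    "or": "windows-1252",
    "ta": "windows-1252",
    "ta-LK": "windows-1252",
    "cy": "UTF-8",
}

# Assumption for illustration: users without a dedicated localisation
# are treated as being in the locale of the product they actually use.
SURROGATE_LOCALE = {"os": "ru"}


def default_encoding(locale, table=DEFAULTS, surrogates=SURROGATE_LOCALE):
    """Look up the fallback encoding for a locale.

    Tries the surrogate mapping first, then an exact match, then the
    primary language subtag. Returns None when the table has no row.
    """
    locale = surrogates.get(locale, locale)
    if locale in table:
        return table[locale]
    primary = locale.split("-")[0]
    return table.get(primary)
```

With only the quoted rows, default_encoding("os") resolves to the 'ru' 
locale and returns None, because the 'ru' row is not part of the excerpt 
quoted above.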

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Wednesday, 14 October 2009 05:39:57 UTC