
RE: Locale/default encoding table

From: Larry Masinter <masinter@adobe.com>
Date: Wed, 14 Oct 2009 12:22:27 -0700
To: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>, "Phillips, Addison" <addison@amazon.com>
CC: Henri Sivonen <hsivonen@iki.fi>, Ian Hickson <ian@hixie.ch>, Geoffrey Sneddon <gsneddon@opera.com>, HTML WG <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>
Message-ID: <8B62A039C620904E92F1233570534C9B0118DC469D91@nambx04.corp.adobe.com>
I think the latest editor's draft does a good job of
describing the tables for default encoding as 
suggestions rather than normative requirements. 

I think this is appropriate; there is no normative
requirement to support any charset other than UTF-8
and ISO-8859-1/Windows-1252, so normatively requiring
a more complex auto-detection of charsets that are not
supported doesn't make a lot of sense.

The idea that you might reasonably guess the
charset of retrieved HTML from the locale of the
browser doing the guessing is a very weak and
not particularly accurate heuristic.

And in situations where different browsers will have
different configuration information, the "advantage"
that multiple browsers behave similarly isn't very
strong anyway.

Larry
--
http://larry.masinter.net



-----Original Message-----
From: public-html-request@w3.org [mailto:public-html-request@w3.org] On Behalf Of Leif Halvard Silli
Sent: Wednesday, October 14, 2009 10:24 AM
To: Phillips, Addison
Cc: Henri Sivonen; Ian Hickson; Geoffrey Sneddon; HTML WG; public-i18n-core@w3.org
Subject: Re: Locale/default encoding table

Phillips, Addison wrote on 09-10-14 16.18:

>> I rather suspect that UTF-8 isn't the best default for any
>> locale, since real UTF-8 content is unlikely to rely on the
>> last defaulting step for decoding. I don't know why some
>> Firefox localizations default to UTF-8.
> 
> Why do you assume that UTF-8 pages are better labeled than
> other encodings? Experience suggests otherwise :-).
> 
> Although UTF-8 is positively detectable and several of us (Mark
> Davis and I, at least) have suggested making UTF-8
> auto-detection a requirement, in fact, unless chardet is used,
> nothing causes unannounced UTF-8 to work any better than any
> other encoding.

The effect of a UTF-8 auto-detection requirement would be two 
defaults: UTF-8 as the primary default, and a legacy encoding as 
the secondary default.

This sounds like an excellent idea.

This would - I suppose - make it unnecessary to use UTF-8 as the 
default for any locale for which legacy encodings exist.

This, in turn, would allow us to be more accurate in picking the 
default legacy encoding(s). E.g. for Croatian, it would not be 
necessary to have UTF-8 as the default fallback, I suppose.
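The two-default scheme could be sketched roughly as follows (my own 
illustration, not a normative algorithm from any draft): attempt a 
strict UTF-8 decode first, and fall back to the locale's legacy 
default only if the bytes do not validate as UTF-8. The function name 
and the Windows-1252 default are assumptions for the example.

```python
def decode_with_fallback(data: bytes,
                         legacy_encoding: str = "windows-1252") -> str:
    """Decode HTML bytes: UTF-8 if the bytes validate as UTF-8,
    otherwise the locale-specific legacy default."""
    try:
        # UTF-8 is "positively detectable": its byte-sequence rules
        # are strict enough that legacy-encoded text rarely validates.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode(legacy_encoding)
```

For example, the Windows-1252 byte 0xE9 ("é") is not valid UTF-8, so 
such input takes the fallback path, while well-formed UTF-8 input 
decodes directly.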

> The I18N WG pointed out that for many developing languages and
> locales, the legacy encodings are fragmented and frequently
> font-based, making UTF-8 a better default choice. This is not
> the case for a relatively well-known language such as
> Belarusian or Welsh, but it is the case for many minority and
> developing world languages.

Indeed.
-- 
leif halvard silli

Received on Wednesday, 14 October 2009 19:23:27 UTC
