Re: Locale/default encoding table from Leif Halvard Silli on 2009-10-14 (public-i18n-core@w3.org from October to December 2009)

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Wed, 14 Oct 2009 22:50:09 +0200
To: Larry Masinter <masinter@adobe.com>
CC: "Phillips, Addison" <addison@amazon.com>, Henri Sivonen <hsivonen@iki.fi>, Ian Hickson <ian@hixie.ch>, Geoffrey Sneddon <gsneddon@opera.com>, HTML WG <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>
Message-ID: <4AD63981.8060608@xn--mlform-iua.no>

Larry Masinter On 09-10-14 21.22:

> I think the latest editor's draft does a good job of
> describing the tables for default encoding as 
> suggestions rather than normative requirements. 


Yes, it uses the word "suggest".

 
> I think this is appropriate; there is no normative
> requirement to support any charset other than UTF8
> and ISO-8859-1/Win-1252, so normatively requiring
> a more complex auto-detection of charsets not 
> supported doesn't make a lot of sense.


I thought that the required UTF-8 auto-detection proposed by 
Addison and Mark would only check for UTF-8?

 
> The idea that you might reasonably guess the
> charset of retrieved HTML by looking at the locale
> of the browser doing the guessing, well, it is
> a very weak and not particularly accurate heuristic.


In my mother tongue, the word for "default" very often seems to be 
"automatic". However, a "default" is what we - automatically - get 
when the automatics are either lacking or have already been tried.

Ian's algorithm just tells when in the detection process it's time 
to give up - to default. I think no one proposed that UAs should 
be required to do any guessing w.r.t. legacy encoding. Instead, we 
talked about which encoding default a browser for a particular 
locale should ship with and how accurate Ian's table is.

A required UTF-8 auto-detection would, however, allow us to 
separate the concerns better when deciding on the default 
encoding. If it is reliable, then it could perhaps allow us to 
say - as Henri suggested - that no locale (with legacy encodings) 
should use UTF-8 as the encoding default.
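
A minimal sketch of that separation, in Python. It assumes the 
content is unlabelled and fully available as bytes; the function 
name and the windows-1252 fallback are illustrative only and are 
not taken from Ian's draft or from any browser:

```python
from typing import Tuple


def decode_with_fallback(data: bytes,
                         legacy_default: str = "windows-1252") -> Tuple[str, str]:
    """Return (text, encoding_used) for unlabelled content.

    Step 1: try strict UTF-8, which is positively detectable because
            non-UTF-8 legacy text rarely forms valid UTF-8 sequences.
    Step 2: otherwise give up and use the locale's legacy default
            (hypothetical parameter; a real UA would take it from its
            locale configuration).
    """
    try:
        return data.decode("utf-8", errors="strict"), "utf-8"
    except UnicodeDecodeError:
        # Undefined bytes in the legacy encoding become U+FFFD.
        return data.decode(legacy_default, errors="replace"), legacy_default


if __name__ == "__main__":
    print(decode_with_fallback("blåbær".encode("utf-8")))
    print(decode_with_fallback("blåbær".encode("windows-1252")))
```

With such a first step in place, the per-locale default only has 
to answer the question "which legacy encoding?", and never has to 
be UTF-8 itself. Note that pure ASCII content passes the UTF-8 
check as well, which is harmless.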

 
> And in situations where different browsers will have
> different configuration information, the "advantage"
> that multiple browsers behave similarly isn't very
> strong anyway.

But if we have a table, it should be as correct as possible, and 
not simply "suggest" Win-1252 for "all other locales" just like 
that. That is not merely a suggestion, but a postulate.

> Larry


Leif

> --
> http://larry.masinter.net
> 
> 
> -----Original Message-----
> From: public-html-request@w3.org [mailto:public-html-request@w3.org] On Behalf Of Leif Halvard Silli
> Sent: Wednesday, October 14, 2009 10:24 AM
> To: Phillips, Addison
> Cc: Henri Sivonen; Ian Hickson; Geoffrey Sneddon; HTML WG; public-i18n-core@w3.org
> Subject: Re: Locale/default encoding table
> 
> Phillips, Addison On 09-10-14 16.18:
> 
>>> I rather suspect that UTF-8 isn't the best default for any
>>> locale, since real UTF-8 content is unlikely to rely on the
>>> last defaulting step for decoding. I don't know why some
>>> Firefox localizations default to UTF-8.
>> Why do you assume that UTF-8 pages are better labeled than
>> other encodings? Experience suggests otherwise :-).
>>
>> Although UTF-8 is positively detectable and several of us (Mark
>> Davis and I, at least) have suggested making UTF-8
>> auto-detection a requirement, in fact, unless chardet is used,
>> nothing causes unannounced UTF-8 to work any better than any
>> other encoding.
> 
> A UTF-8 auto-detection requirement would, in effect, lead to two 
> defaults: UTF-8 as the primary default, and legacy encodings as 
> a secondary default.
> 
> This sounds like an excellent idea.
> 
> This would - I suppose - make it unnecessary to operate with 
> UTF-8 as default for any locale for which legacy encodings exist.
> 
> This, in turn, would allow us to be more accurate in picking the 
> default legacy encoding(s). E.g. for Croatian, it would not be 
> necessary to have UTF-8 as default legacy fallback, I suppose.
> 
>> The I18N WG pointed out that for many developing languages and
>> locales, the legacy encodings are fragmented and frequently
>> font-based, making UTF-8 a better default choice. This is not
>> the case for a relatively well-known language such as
>> Belarusian or Welsh, but it is the case for many minority and
>> developing world languages.
> 
> Indeed.

Received on Wednesday, 14 October 2009 20:50:48 UTC