Re: HTML5 Issue 11 (encoding detection): I18N WG response... from Leif Halvard Silli on 2009-10-13 (public-i18n-core@w3.org from October to December 2009)

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Tue, 13 Oct 2009 21:24:36 +0200
To: Andrew Cunningham <andrewc@vicnet.net.au>
CC: Henri Sivonen <hsivonen@iki.fi>, Maciej Stachowiak <mjs@apple.com>, Ian Hickson <ian@hixie.ch>, Mark Davis ☕ <mark@macchiato.com>, "Martin J. Dürst" <duerst@it.aoyama.ac.jp>, "Phillips, Addison" <addison@amazon.com>, Richard Ishida <ishida@w3.org>, "public-html@w3.org" <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>, Larry Masinter <masinter@adobe.com>
Message-ID: <4AD4D3F4.5010708@xn--mlform-iua.no>

Andrew Cunningham On 09-10-12 16.12:

> Thanks Henri, greatly appreciated. Useful data.
> 
> Will be interesting to see what the trend will be in the future as the
> localisation effort builds up steam.

So Ian's table [1] is just a snapshot? And thus "re-engineering" 
is still needed?

[1] 
http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#determining-the-character-encoding

> although begs the question as to what happens with legacy encoded data in
> those languages, and with Vietnamese i'm still seeing bloggers using VNI,
> so still some content being produced in that encoding even today.
> 
> not surprised with russian, japanese and ukrainian, since legacy data may
> be in a few different encodings so heuristics makes sense.

Belarusian has (according to Ian's table and Mozilla) a different 
default (ISO-8859-5) from Ukrainian and Russian (Windows-1251). Can 
that be correct/optimal? I think not.

E.g. the site of the Belarusian Ministry of Foreign Affairs - 
http://www.mfa.gov.by/ - is in Russian and uses Windows-1251. 
Other pages in Belarusian that I found also used Windows-1251. I 
did not find a single page using ISO-8859-5. The Belarusian 
language and people interact with Russian(s) and Ukrainian(s), so 
I think it is more likely that the default should be 
Windows-1251.

And why doesn't Mozilla use charset heuristics for Belarusian? 
(Because people use Russian localizations anyhow?) It is a wonder 
that this can work, if Mozilla operates with the wrong default 
charset!
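
To make concrete what a wrong default means for an unlabelled page, 
here is a minimal Python sketch. It is only an illustration: the 
sample word and the helper name decode_with_default are my own 
assumptions, not taken from Mozilla or from any page mentioned 
above. It encodes a Cyrillic word as Windows-1251 and then decodes 
the same bytes with each of the two candidate defaults.

```python
# Hypothetical sample text; not taken from any page mentioned above.
text = "Беларусь"

# The bytes as a server would send them for a Windows-1251 page
# that carries no charset label.
page_bytes = text.encode("cp1251")


def decode_with_default(data: bytes, default: str) -> str:
    """Decode unlabelled bytes with a given fallback encoding."""
    return data.decode(default)


# Fallback matching the actual encoding: the text survives.
right = decode_with_default(page_bytes, "cp1251")

# Fallback of ISO-8859-5: decoding does not fail, but every byte
# maps to a different character, so the reader sees garbage.
wrong = decode_with_default(page_bytes, "iso8859_5")

print(right)
print(wrong)
print(right == text, wrong == text)
```

Both decodings succeed without an exception, which is why a wrong 
default shows up as unreadable text rather than as an error.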

It is also interesting that all the Slavic languages of Yugoslavia 
default to UTF-8. But that is logical given their bi-alphabetic 
situation.

That said, according to Wikipedia, the preferred encoding for 
Croatian is ISO-8859-2 or UTF-8 [2]. And after a single Google 
search, I found a page which is *labelled* as ISO-8859-1, but 
which *requires* ISO-8859-2 [3]. However, I am "pro UTF-8", so it 
is OK for me to recommend UTF-8.

[2] http://en.wikipedia.org/wiki/Gaj's_Latin_alphabet#Computing
[3] http://www.ifs.hr/
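
The mislabelling effect can be sketched in a few lines of Python. 
This is an illustration under my own assumptions: the sample string 
is simply the five Croatian letters with diacritics, not text 
copied from the page in [3].

```python
# The Croatian letters that ISO-8859-1 cannot represent
# (hypothetical sample, not copied from any real page).
sample = "čćđšž"

# Bytes as stored in a page that was authored in ISO-8859-2.
stored = sample.encode("iso8859_2")

# What a browser shows if it trusts an ISO-8859-1 label.
as_labelled = stored.decode("iso8859_1")

# What it shows if it uses the encoding the page really needs.
as_required = stored.decode("iso8859_2")

print(as_labelled)  # èæð¹¾
print(as_required)  # čćđšž
```

The plain ASCII letters of such a page are identical in both 
encodings, so only the accented letters betray the wrong label.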

> also not surprised by the indian localisations, had to be either utf-8 or
> win-1252. and guess win-1252 is a logical choice since firefox doesn't
> really support legacy encodings for Indian languages, and good percentage
> of legacy content in indian languages is misidentifying itself as
> iso-8859-1 or windows-1252 and relying on styling.

Styling? You mean the good old "font tag considered harmful" 
effect? Is it even possible to get that to work any more? I know 
that Hebrew on the Web used to apply similar tricks - I think they 
used "the default Latin encoding" and then "turned the text". But 
still, Windows-1252 isn't the default encoding of Hebrew?!

Do you have examples of Indian-language pages that are mislabelled 
in this way?

> On Mon, October 12, 2009 23:49, Henri Sivonen wrote:
>> The Vietnamese localization of Firefox defaults to UTF-8 and no
>> heuristic detector:
>> http://mxr.mozilla.org/l10n-mozilla1.9.1/source/vi/toolkit/chrome/global/intl.properties
>>
>> For comparison, Japanese, Russian and Ukrainian have a heuristic
>> detector turned on by default:
>> http://mxr.mozilla.org/l10n-mozilla1.9.1/source/ja/toolkit/chrome/global/intl.properties
>> http://mxr.mozilla.org/l10n-mozilla1.9.1/source/ru/toolkit/chrome/global/intl.properties
>> http://mxr.mozilla.org/l10n-mozilla1.9.1/source/uk/toolkit/chrome/global/intl.properties
>>
>> (Korean, Simplified Chinese and Traditional Chinese don't, BTW.)
>>
>> Query of interest:
>> http://mxr.mozilla.org/l10n-mozilla1.9.1/find?string=global%2Fintl.properties&tree=l10n-mozilla1.9.1&hint=
>>
>> In various Indian locales, the language itself does not use the Latin
>> alphabet but the default is still Windows-1252:
>> http://mxr.mozilla.org/l10n-mozilla1.9.1/source/hi-IN/toolkit/chrome/global/intl.properties
>> http://mxr.mozilla.org/l10n-mozilla1.9.1/source/bn-IN/toolkit/chrome/global/intl.properties
>> http://mxr.mozilla.org/l10n-mozilla1.9.1/source/gu-IN/toolkit/chrome/global/intl.properties
>> http://mxr.mozilla.org/l10n-mozilla1.9.1/source/pa-IN/toolkit/chrome/global/intl.properties

-- 
leif halvard silli

Received on Tuesday, 13 October 2009 19:25:23 UTC