
Re: HTML5 Issue 11 (encoding detection): I18N WG response...

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Tue, 13 Oct 2009 21:24:36 +0200
Message-ID: <4AD4D3F4.5010708@xn--mlform-iua.no>
To: Andrew Cunningham <andrewc@vicnet.net.au>
CC: Henri Sivonen <hsivonen@iki.fi>, Maciej Stachowiak <mjs@apple.com>, Ian Hickson <ian@hixie.ch>, Mark Davis ☕ <mark@macchiato.com>, Martin J. Dürst <duerst@it.aoyama.ac.jp>, "Phillips, Addison" <addison@amazon.com>, Richard Ishida <ishida@w3.org>, "public-html@w3.org" <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>, Larry Masinter <masinter@adobe.com>
Andrew Cunningham On 09-10-12 16.12:

> Thanks Henri, greatly appreciated. Useful data.
> Will be interesting to see what the trend will be in the future as the
> localisation effort builds up steam.

So Ian's table [1] is just a snapshot? And thus "re-engineering" 
is still needed?


> although begs the question as to what happens with legacy encoded data in
> those languages, and with Vietnamese i'm still seeing bloggers using VNI,
> so still some content being produced in that encoding even today.
> not surprised with russian, japanese and ukrainian, since legacy data may
> be in a few different encodings so heuristics makes sense.

Belarusian has (according to Ian's table and Mozilla) a different 
default (ISO-8859-5) than Ukrainian and Russian (Windows-1251). Can 
that be correct or optimal? I think not.
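
To illustrate why the default matters, here is a minimal Python sketch. The sample word and the variable names are my own assumptions, not taken from Ian's table or from Mozilla. It encodes a Belarusian word as Windows-1251 and then decodes the same bytes with ISO-8859-5 as the fallback. Both are single-byte Cyrillic encodings, so no error is raised; the reader simply gets different letters.

```python
# Sample text is an assumption chosen for illustration only.
text = "Беларусь"

# Bytes as an unlabeled Windows-1251 page would serve them.
raw = text.encode("cp1251")

# What a browser falling back to ISO-8859-5 would display.
wrong = raw.decode("iso8859_5")

# What a browser falling back to Windows-1251 would display.
right = raw.decode("cp1251")

print(repr(wrong))
print(repr(right))
```

Because the wrong decoding fails silently, only the reader (or a heuristic detector) can notice the mismatch.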

E.g. the Belarusian Department of Foreign Affairs - 
http://www.mfa.gov.by/ - is in Russian and uses Windows 1251. 
Other Belarusian-language pages that I found also used Windows 
1251. I did not find a single page in ISO-8859-5. The Belarusian 
language and people interact with Russian(s) and Ukrainian(s), so 
I think that it is more likely that the default should be windows 

And why doesn't Mozilla use charset heuristics for Belarusian? 
(Because people use Russian localizations anyhow?) It is a wonder 
that this can work, if Mozilla operates with the wrong default 

Also interesting that all the Slavic languages of Yugoslavia 
default to UTF-8. But that is logical in their bi-alphabetic situation.

That said, according to Wikipedia, the preferred encoding for 
Croatian is ISO 8859-2 or UTF-8 [2]. And after a single Google 
search, I found a page that is *labeled* as ISO-8859-1 but 
*requires* ISO-8859-2 [3]. However, I am "pro UTF-8", so it 
is OK for me to recommend UTF-8.

[2] http://en.wikipedia.org/wiki/Gaj's_Latin_alphabet#Computing
[3] http://www.ifs.hr/
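
A rough Python sketch of what such a mislabeled page does to Croatian text. The sample string (the five letters with diacritics from Gaj's alphabet) is my own choice and is not taken from the page in [3]. The bytes are written as ISO-8859-2 but read according to the ISO-8859-1 label.

```python
# Sample letters are an assumption chosen for illustration only.
sample = "čćđšž"

# Bytes as the author's editor saved them (ISO-8859-2).
raw = sample.encode("iso8859_2")

# What the reader sees if the ISO-8859-1 label is obeyed.
as_labeled = raw.decode("iso8859_1")

# What the reader sees if the label is overridden with ISO-8859-2.
as_required = raw.decode("iso8859_2")

print(repr(as_labeled))
print(repr(as_required))
```

The plain ASCII letters of such a page survive either way, which is probably why a wrong label like this can go unnoticed by the author.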

> also not surprised by the indian localisations, had to be either utf-8 or
> win-1252. and guess win-1252 is a logical choice since firefox doesn't
> really support legacy encodings for Indian languages, and good percentage
> of legacy content in indian languages is misidentifying itself as
> iso-8859-1 or windows-1252 and relying on styling.

Styling? You mean the good old "font tag considered harmful" 
effect? Is it even possible to get that to work any more? I know that 
Hebrew on the Web used to apply similar tricks - I think they used 
"the default Latin encoding" and then "turned the text". But 
still, win-1252 isn't the default encoding for Hebrew?!

Do you have examples of such mislabeled Indian-language pages?

> On Mon, October 12, 2009 23:49, Henri Sivonen wrote:
>> The Vietnamese localization of Firefox defaults to UTF-8 and no
>> heuristic detector:
>> http://mxr.mozilla.org/l10n-mozilla1.9.1/source/vi/toolkit/chrome/global/intl.properties
>> For comparison, Japanese, Russian and Ukrainian have a heuristic
>> detector turned on by default:
>> http://mxr.mozilla.org/l10n-mozilla1.9.1/source/ja/toolkit/chrome/global/intl.properties
>> http://mxr.mozilla.org/l10n-mozilla1.9.1/source/ru/toolkit/chrome/global/intl.properties
>> http://mxr.mozilla.org/l10n-mozilla1.9.1/source/uk/toolkit/chrome/global/intl.properties
>> (Korean, Simplified Chinese and Traditional Chinese don't, BTW.)
>> Query of interest:
>> http://mxr.mozilla.org/l10n-mozilla1.9.1/find?string=global%2Fintl.properties&tree=l10n-mozilla1.9.1&hint=
>> In various Indian locales, the language itself does not use the Latin
>> alphabet but the default is still Windows-1252:
>> http://mxr.mozilla.org/l10n-mozilla1.9.1/source/hi-IN/toolkit/chrome/global/intl.properties
>> http://mxr.mozilla.org/l10n-mozilla1.9.1/source/bn-IN/toolkit/chrome/global/intl.properties
>> http://mxr.mozilla.org/l10n-mozilla1.9.1/source/gu-IN/toolkit/chrome/global/intl.properties
>> http://mxr.mozilla.org/l10n-mozilla1.9.1/source/pa-IN/toolkit/chrome/global/intl.properties
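
For anyone who wants to compare the locales Henri lists, here is a small Python sketch that reads the two relevant settings out of an intl.properties file. The key names intl.charset.default and intl.charset.detector are the ones I understand those files to use. The sample file contents below are hypothetical and not copied from the linked files; in particular the detector value "ruprob" and the empty detector for hi-IN are assumptions.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def summarize(locale, text):
    """Return a one-line summary of the charset default and detector."""
    settings = parse_properties(text)
    default = settings.get("intl.charset.default", "")
    detector = settings.get("intl.charset.detector", "")
    return (f"{locale}: default={default or '(unset)'}, "
            f"detector={detector or '(none)'}")


# Hypothetical file contents, for illustration only.
SAMPLES = {
    "vi": "intl.charset.default=UTF-8\nintl.charset.detector=\n",
    "ru": "intl.charset.default=windows-1251\nintl.charset.detector=ruprob\n",
    "hi-IN": "intl.charset.default=windows-1252\nintl.charset.detector=\n",
}

for locale, text in SAMPLES.items():
    print(summarize(locale, text))
```

Running the same summary over the real files from the mxr query would give a table like Ian's, but taken directly from the localizations.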

leif halvard silli
Received on Tuesday, 13 October 2009 19:25:23 UTC
