Re: HTML5 Issue 11 (encoding detection): I18N WG response...

Andrew Cunningham On 09-10-12 16.12:

> Thanks Henri, greatly appreciated. Useful data.
> Will be interesting to see what the trend will be in the future as the
> localisation effort builds up steam.

So Ian's table [1] is just a snapshot? And thus "re-engineering" 
is still needed?


> although it begs the question as to what happens with legacy encoded data in
> those languages, and with Vietnamese I'm still seeing bloggers using VNI,
> so still some content being produced in that encoding even today.
> not surprised with Russian, Japanese and Ukrainian, since legacy data may
> be in a few different encodings so heuristics make sense.

Belarusian has (according to Ian's table and Mozilla) a different 
default (ISO-8859-5) than Ukrainian and Russian (Win-1251). Can 
that be correct/optimal? I think not.
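
To illustrate why the choice of default matters, here is a minimal Python 
sketch. The sample word and the helper name are my own assumptions, not 
taken from any real page: bytes written as Windows-1251 still decode without 
error as ISO-8859-5, but the result is different Cyrillic text, so a wrong 
default fails silently.

```python
# Sketch only: SAMPLE and decode_with_default() are illustrative assumptions.
SAMPLE = "Беларусь"  # a Belarusian word, chosen as hypothetical page content


def decode_with_default(raw: bytes, default_encoding: str) -> str:
    """Decode unlabelled bytes the way a browser default would: no detection."""
    return raw.decode(default_encoding)


# The page author saved the text as Windows-1251.
raw_bytes = SAMPLE.encode("windows-1251")

# A browser whose default matches the page shows the intended text.
right = decode_with_default(raw_bytes, "windows-1251")

# A browser defaulting to ISO-8859-5 raises no error, because every byte
# in this sample is a valid ISO-8859-5 character, but the text is garbled.
wrong = decode_with_default(raw_bytes, "iso-8859-5")
```

Both encodings fill the upper half of the byte range with Cyrillic letters, 
which is why the mismatch gives readable-looking but wrong characters rather 
than an obvious failure.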

E.g. the Belarusian Department of Foreign Affairs - - is in Russian and 
uses Windows 1251. Other Belarusian pages that I found also used 
Windows 1251. I did not find a single page with ISO-8859-5. The 
Belarusian language and people interact with Russian(s) and 
Ukrainian(s), so I think that it is more likely that the default should 
be windows 

And why doesn't Mozilla use charset heuristics for Belarusian? 
(Because people use Russian localizations anyhow?) It is a wonder 
that this can work, if Mozilla operates with the wrong default 

Also interesting that all the Slavic languages of Yugoslavia 
default to UTF-8. But that is logical in their bi-alphabetic situation.

That said, according to Wikipedia, the preferred encoding for 
Croatian is ISO 8859-2 or UTF-8 [2]. And after a single Google 
search, I found a page which is *labelled* as ISO-8859-1, but 
which *requires* ISO-8859-2. [3] However, I am "pro UTF-8", so it 
is OK for me to recommend UTF-8.
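
The mislabelling effect can be reproduced in a few lines of Python. This 
is only a sketch: the sample letters and the function name are my 
assumptions and are not taken from the page in [3].

```python
# Sketch only: CROATIAN_LETTERS and decode_as_labelled() are assumptions.
CROATIAN_LETTERS = "šđčćž"  # letters that Latin-1 cannot represent


def decode_as_labelled(raw: bytes, label: str) -> str:
    """Decode bytes strictly according to the declared charset label."""
    return raw.decode(label)


# What the author actually saved: ISO-8859-2 bytes.
page_bytes = CROATIAN_LETTERS.encode("iso-8859-2")

# What the reader sees if the browser trusts the wrong ISO-8859-1 label.
mislabelled = decode_as_labelled(page_bytes, "iso-8859-1")

# What the reader sees once the encoding is overridden to ISO-8859-2.
corrected = decode_as_labelled(page_bytes, "iso-8859-2")
```

ASCII letters are identical in both encodings, so such a page looks mostly 
fine and only the accented letters come out wrong, which is probably why 
the wrong label goes unnoticed by its author.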


> also not surprised by the Indian localisations, had to be either UTF-8 or
> win-1252. and guess win-1252 is a logical choice since Firefox doesn't
> really support legacy encodings for Indian languages, and a good percentage
> of legacy content in Indian languages is misidentifying itself as
> iso-8859-1 or windows-1252 and relying on styling.

Styling? You mean, the good old "font tag considered harmful" 
effect? Is it even possible to get that to work any more? I know that 
Hebrew on the Web used to apply similar tricks - I think they used 
"the default Latin encoding" and then "turned the text". But 
still, win-1252 isn't the default encoding of Hebrew?!

Do you have example pages for such mislabelled Indian language pages?

> On Mon, October 12, 2009 23:49, Henri Sivonen wrote:
>> The Vietnamese localization of Firefox defaults to UTF-8 and no
>> heuristic detector:
>> For comparison, Japanese, Russian and Ukrainian have a heuristic
>> detector turned on by default:
>> (Korean, Simplified Chinese and Traditional Chinese don't, BTW.)
>> Query of interest:
>> In various Indian locales, the language itself does not use the Latin
>> alphabet but the default is still Windows-1252:
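
To make the difference between "locale default" and "heuristic detector" 
concrete, here is a rough Python sketch of a fallback chain. It is not how 
Firefox works. The table, the locale codes and the function name are my 
assumptions, with the default values taken from the figures quoted in this 
thread ("hi" stands in for "various Indian locales").

```python
from typing import Optional, Tuple

# Hypothetical table of per-locale defaults, as reported in this thread.
LOCALE_DEFAULTS = {
    "ru": "windows-1251",
    "uk": "windows-1251",
    "be": "iso-8859-5",
    "vi": "utf-8",
    "hi": "windows-1252",
}


def decode_page(raw: bytes, locale: str,
                declared: Optional[str] = None) -> Tuple[str, str]:
    """Try the declared label, then UTF-8, then the locale default."""
    candidates = []
    if declared:
        candidates.append(declared)
    candidates.append("utf-8")  # strict UTF-8 rarely accepts legacy bytes
    candidates.append(LOCALE_DEFAULTS.get(locale, "windows-1252"))
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue  # invalid bytes or unknown label: try the next one
    # Last resort: never fail, mark undecodable bytes instead.
    return raw.decode("windows-1252", errors="replace"), "windows-1252"


legacy_text, legacy_enc = decode_page("Привет".encode("windows-1251"), "ru")
utf8_text, utf8_enc = decode_page("Привет".encode("utf-8"), "ru")
```

The UTF-8 step is the only real "detection" here, since single-byte 
encodings accept almost any byte sequence. Telling Windows-1251 from 
ISO-8859-5 or KOI8-R needs a statistical detector of the kind that is turned 
on for Russian and Ukrainian.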

leif halvard silli

Received on Tuesday, 13 October 2009 19:25:24 UTC