- From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
- Date: Tue, 13 Oct 2009 21:24:36 +0200
- To: Andrew Cunningham <andrewc@vicnet.net.au>
- CC: Henri Sivonen <hsivonen@iki.fi>, Maciej Stachowiak <mjs@apple.com>, Ian Hickson <ian@hixie.ch>, Mark Davis ☕ <mark@macchiato.com>, "Martin J. Dürst" <duerst@it.aoyama.ac.jp>, "Phillips, Addison" <addison@amazon.com>, Richard Ishida <ishida@w3.org>, "public-html@w3.org" <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>, Larry Masinter <masinter@adobe.com>

Andrew Cunningham On 09-10-12 16.12:

> Thanks Henri, greatly appreciated. Useful data.
>
> Will be interesting to see what the trend will be in the future as the
> localisation effort builds up steam.

So Ian's table [1] is just a snapshot? And thus "re-engineering" is still needed?

[1] http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#determining-the-character-encoding

> although begs the question as to what happens with legacy encoded data in
> those languages, and with Vietnamese i'm still seeing bloggers using VNI,
> so still some content being produced in that encoding even today.
>
> not surprised with russian, japanese and ukrainian, since legacy data may
> be in a few different encodings so heuristics makes sense.

Belarusian has (according to Ian's table and Mozilla) a different default (ISO-8859-5) from Ukrainian and Russian (Windows-1251). Can that be correct/optimal? I think not. E.g. the site of the Belarusian Department of Foreign Affairs - http://www.mfa.gov.by/ - is in Russian and uses Windows-1251. The other Belarusian pages that I found also used Windows-1251. I did not find a single page in ISO-8859-5. The Belarusian language and its speakers interact with Russian(s) and Ukrainian(s), so I think it is more likely that the default should be Windows-1251.

And why doesn't Mozilla use charset heuristics for Belarusian? (Because people use Russian localizations anyhow?) It is a wonder that this can work, if Mozilla operates with the wrong default charset!

It is also interesting that all the Slavic languages of Yugoslavia default to UTF-8. But that is logical, given their bi-alphabetic situation. That said, according to Wikipedia, the preferred encoding for Croatian is ISO-8859-2 or UTF-8 [2]. And after a single Google search, I found a page which is *labelled* as ISO-8859-1 but which *requires* ISO-8859-2 [3]. However, I am "pro UTF-8", so it is OK for me to recommend UTF-8.
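
To make the effect of a wrong label or a wrong default concrete, here is a minimal Python sketch. The sample strings and the helper name `misread` are made up for illustration; they are not taken from the pages mentioned above. It only shows what happens when bytes written in one single-byte encoding are decoded as another.

```python
def misread(text: str, actual: str, assumed: str) -> str:
    """Encode text in its real encoding, then decode it with the assumed one."""
    return text.encode(actual).decode(assumed)


# Croatian letters stored as ISO-8859-2 but read as ISO-8859-1
# (the "labelled as ISO-8859-1, requires ISO-8859-2" case).
croatian = "čćšžđ"
print(misread(croatian, "iso-8859-2", "iso-8859-1"))  # prints èæ¹¾ð

# Cyrillic text stored as Windows-1251 but read with an ISO-8859-5 default.
# The result is still Cyrillic-looking, but it is the wrong letters.
belarusian = "Беларусь"
print(misread(belarusian, "cp1251", "iso-8859-5"))

# With the correct encoding on both sides the text survives unchanged.
print(misread(belarusian, "cp1251", "cp1251"))
```

Neither wrong decode raises an error, since every byte involved is valid in both encodings, which is why a browser cannot notice the problem without a correct label or a heuristic detector.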

[2] http://en.wikipedia.org/wiki/Gaj's_Latin_alphabet#Computing
[3] http://www.ifs.hr/

> also not surprised by the indian localisations, had to be either utf-8 or
> win-1252. and guess win-1252 is a logical choice since firefox doesn't
> really support legacy encodings for Indian languages, and good percentage
> of legacy content in indian languages is misidentifying itself as
> iso-8859-1 or windows-1252 and relying on styling.

Styling? You mean the good old "font tag considered harmful" effect? Is it even possible to get that to work any more? I know that Hebrew on the Web used to apply similar tricks - I think they used "the default Latin encoding" and then "turned the text". But still, win-1252 isn't the default encoding for Hebrew?! Do you have examples of wrongly labelled Indian-language pages?

> On Mon, October 12, 2009 23:49, Henri Sivonen wrote:
>> The Vietnamese localization of Firefox defaults to UTF-8 and no
>> heuristic detector:
>> http://mxr.mozilla.org/l10n-mozilla1.9.1/source/vi/toolkit/chrome/global/intl.properties
>>
>> For comparison, Japanese, Russian and Ukrainian have a heuristic
>> detector turned on by default:
>> http://mxr.mozilla.org/l10n-mozilla1.9.1/source/ja/toolkit/chrome/global/intl.properties
>> http://mxr.mozilla.org/l10n-mozilla1.9.1/source/ru/toolkit/chrome/global/intl.properties
>> http://mxr.mozilla.org/l10n-mozilla1.9.1/source/uk/toolkit/chrome/global/intl.properties
>>
>> (Korean, Simplified Chinese and Traditional Chinese don't, BTW.)
>>
>> Query of interest:
>> http://mxr.mozilla.org/l10n-mozilla1.9.1/find?string=global%2Fintl.properties&tree=l10n-mozilla1.9.1&hint=
>>
>> In various Indian locales, the language itself does not use the Latin
>> alphabet but the default is still Windows-1252:
>> http://mxr.mozilla.org/l10n-mozilla1.9.1/source/hi-IN/toolkit/chrome/global/intl.properties
>> http://mxr.mozilla.org/l10n-mozilla1.9.1/source/bn-IN/toolkit/chrome/global/intl.properties
>> http://mxr.mozilla.org/l10n-mozilla1.9.1/source/gu-IN/toolkit/chrome/global/intl.properties
>> http://mxr.mozilla.org/l10n-mozilla1.9.1/source/pa-IN/toolkit/chrome/global/intl.properties

--
leif halvard silli

Received on Tuesday, 13 October 2009 19:25:23 UTC