Re: Locale/default encoding table

Henri Sivonen On 09-10-14 15.28:

> On Oct 14, 2009, at 06:40, Leif Halvard Silli wrote:
> 
>> I especially picked the "os_RU" locale because it is situated in  
>> Russia and uses Cyrillic for everything. The Ossetic alphabet seems  
>> to be fully compatible with Windows-1251.
> 
> In that case, it would probably make sense to ship Windows-1251 as the  
> default for an Ossetian localization.


Then I suppose we agree that Ian's table must not simply say that 
"For all other locales, use Windows 1252 as default", right?


>> win1252 - bn-BD  - Not Latin: Bengali Bangladesh
>> win1252 - bn-IN  - Not Latin: Bengali India
> 
> I don't have data about Bengali Web pages, but if it turns out that  
> most Bengali content is labeled but that users of Bengali-localized  
> browsers also read a lot of unlabeled English content, Windows-1252  
> would make sense as the default.

But isn't English content supported by ASCII, and thus by UTF-8?

I could understand it if you had said that they read for example 
legacy French content - like many of the Arabic users certainly 
do. However, for the Arabic locale, you have UTF-8 as default ...

What is the purpose of setting UTF-8 as the default, other than as 
an encouragement to use that encoding, if that encoding is 
detectable even without such a default?
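
To make "detectable" concrete, here is a minimal Python sketch of the 
kind of check I have in mind. It is only an illustration, not any 
browser's actual algorithm; the function names and the cp1252 fallback 
are my own assumptions.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is valid UTF-8 and contains non-ASCII bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    # Pure ASCII is valid UTF-8 too, but tells us nothing about the encoding.
    return any(byte >= 0x80 for byte in data)


def decode_unlabeled(data: bytes, fallback: str = "cp1252") -> str:
    """Decode unlabeled bytes: UTF-8 if detected, else the assumed fallback."""
    if looks_like_utf8(data):
        return data.decode("utf-8")
    # errors="replace" covers the few byte values cp1252 leaves undefined.
    return data.decode(fallback, errors="replace")


welsh_utf8 = "dŵr a tŷ".encode("utf-8")   # made-up sample text
french_legacy = "café".encode("cp1252")   # made-up sample text

print(decode_unlabeled(welsh_utf8))
print(decode_unlabeled(french_legacy))
```

Legacy single-byte text with accented letters is rarely valid UTF-8 by 
accident, which is why a check like this tends to work without any 
locale default.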

>> UTF-8   - cy     - Win1252 doesn't fully cover Welsh
> 
> It seems very plausible that users of a Welsh browser UI read a lot of  
> English content. If it happens that Welsh content is labeled and the  
> English content is what's unlabeled, Windows-1252 would make sense as  
> the default.
> 
> This isn't about what encoding covers the language of the  
> localization. This is about what's the most common unlabeled encoding  
> that the users of a particular localization encounter.

For Croatian you have set it to UTF-8. It took me only one Google 
search to find Croatian content that was ISO-8859-2 but labeled as 
ISO-8859-1. Thus, it seems to me that the reason why the Slavic 
languages of former Yugoslavia have been set to UTF-8 is related 
to their culture of treating two different alphabets equally (from 
the very design of their alphabets to YUSCII and beyond ...). At 
least there seem to be more things involved than "the most common 
unlabeled encoding" for that user group.
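
For illustration, this is what such mislabeling does to the text. A 
small Python sketch; the sample words are made up by me and are not 
taken from the page I found.

```python
# Croatian text stored as ISO-8859-2 but decoded as ISO-8859-1.
croatian = "šećer, žuč, čađa"  # made-up sample words

raw = croatian.encode("iso-8859-2")

correct = raw.decode("iso-8859-2")     # what the author wrote
mislabeled = raw.decode("iso-8859-1")  # what the wrong label produces

print(correct)
print(mislabeled)  # ¹eæer, ¾uè, èaða
```

Every byte is valid in both encodings, so nothing fails outright; the 
letters with carons and strokes simply turn into other characters.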

As for Welsh: This is a minority market. Mozilla (and Google also) 
have won market share by allowing people to engage in localization 
work. There probably isn't a Welsh version of Internet Explorer 
(fingers crossed, hoping for the opposite). Anyway, if most 
English legacy content is supported by ASCII, then why not UTF-8?

If there /are/ reasons to have UTF-8 as default, then I can very 
well understand why the Welsh localizers chose UTF-8 as default!

Is there any data on how much unlabeled English content out there 
uses anything other than the ASCII repertoire? Doesn't most of the 
unlabeled English content use HTML entities for the "special" 
characters anyway?
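
A quick Python sketch of why entity-escaped content would not care 
about the default: the markup below is pure ASCII, so it decodes to the 
same text under every ASCII-compatible encoding. The sample markup and 
the list of encodings are my own choices for illustration.

```python
import html

# Made-up sample: "special" characters written as HTML entities only.
markup = b"It&rsquo;s a caf&eacute; &amp; bar"

# Decode the same bytes under several candidate defaults, then resolve
# the entities the way an HTML parser would for character references.
candidates = ("utf-8", "cp1252", "iso-8859-2")
texts = {enc: html.unescape(markup.decode(enc)) for enc in candidates}

for enc, text in texts.items():
    print(enc, "->", text)
```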

>> Why is it safer for Welsh to use UTF-8 as default?
> 
> I rather suspect that UTF-8 isn't the best default for any locale,  
> since real UTF-8 content is unlikely to rely on the last defaulting  
> step for decoding. I don't know why some Firefox localizations default  
> to UTF-8.


So *is* there any reason to have UTF-8 as default *anywhere*, 
other than the motto "yes, let's switch to UTF-8"?
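
As I understand the "last defaulting step", it could be sketched like 
this in Python. The table holds only values mentioned in this thread 
(I have not verified them against any browser), and the function name 
and signature are hypothetical.

```python
from typing import Optional

# Hypothetical per-locale defaults, taken from values quoted in this thread.
LOCALE_DEFAULTS = {
    "os-RU": "windows-1251",  # suggested above for Ossetian
    "be": "iso-8859-5",       # the Belarusian default questioned above
    "cy": "utf-8",            # the Welsh default questioned above
}

# The catch-all from the "for all other locales" rule.
CATCH_ALL = "windows-1252"


def pick_encoding(declared: Optional[str], locale: str) -> str:
    """Use the declared label if any; only otherwise consult the locale."""
    if declared:
        return declared
    return LOCALE_DEFAULTS.get(locale, CATCH_ALL)


print(pick_encoding("utf-8", "be"))  # labeled: the default is never consulted
print(pick_encoding(None, "be"))     # unlabeled: the locale default applies
print(pick_encoding(None, "en-US"))  # unlabeled, unknown locale: catch-all
```

Labeled UTF-8 content never reaches the table at all, which is what 
makes a UTF-8 entry there hard to justify on technical grounds.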

>> Also, again: I took up Belarusian. Why does it have ISO-8859-5 as  
>> default?
> 
> I filed a bug on this, FWIW. Maybe "why" is answered in the bug report  
> in due course:
> https://bugzilla.mozilla.org/show_bug.cgi?id=522218


Cool! I also wonder why you don't apply charset detection for that 
locale. (If I understood your localization files correctly.)

 
>> Do you just trust whatever comes out of Mozilla?
> 
> It would be helpful to dig up data on how Microsoft configures IE by  
> default in various locales. And Opera if Opera varies the default by  
> locale.

Indeed. But I wonder if it would be smarter to just document those 
things, including their effects, rather than saying that vendors 
and users (the text also speaks about user defined encodings) 
/should/ use those encodings.
-- 
leif halvard silli

Received on Wednesday, 14 October 2009 15:37:23 UTC