- From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
- Date: Thu, 8 Dec 2011 23:33:54 +0100
Henri Sivonen, Tue Dec 6 23:45:11 PST 2011:
> On Mon, Dec 5, 2011 at 7:42 PM, Leif Halvard Silli wrote:

> Mozilla grants localizers a lot of latitude here. The defaults you see
> are not carefully chosen by a committee of encoding strategists doing
> whole-Web optimization at Mozilla.

We could use such a committee for the Web!

> They are chosen by individual
> localizers. Looking at which locales default to UTF-8, I think the
> most probable explanation is that the localizers mistakenly tried to
> pick an encoding that fits the language of the localization instead of
> picking an encoding that's the most successful at decoding unlabeled
> pages most likely read by users of the localization

These localizations are nevertheless live tests. If we want to move more firmly in the direction of UTF-8, one could ask the users of those 'live tests' about their experience.

> (which means
> *other-language* pages when the language of the localization doesn't
> have a pre-UTF-8 legacy).

Do you have any concrete examples? And are there user complaints?

The Serbian localization uses UTF-8. The Croatian one uses Windows-1252, but only on Windows and Mac: on Linux it appears to use UTF-8, if I read the HG repository correctly. As for Croatian and Windows-1252: that encoding does not even support the Croatian alphabet in full - I am thinking of the digraphs. But I'm not sure about the pre-UTF-8 legacy for Croatian.

Some language communities in Russia are in a minority situation similar to that of Serbian Cyrillic, except that their minority script is Latin: they use Cyrillic, but they may also use Latin. In Russia, however, Cyrillic dominates. Hence it seems to be the case - according to my earlier findings - that the few letters per language which do not occur in Windows-1251 are inserted as NCRs (that is, when UTF-8 is not used). That way, Windows-1251 can be used for Latin-script text with non-ASCII letters inside. But given that the Croatian localization defaults to Windows-1252, Croatian authors could in theory just use NCRs too ...
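To illustrate the NCR trick, here is a minimal Python sketch (the sample word is hypothetical, and Python's 'xmlcharrefreplace' error handler merely stands in for whatever tools those authors actually use):

    # Latin-script text is forced into Windows-1251 (a Cyrillic encoding);
    # letters that Windows-1251 lacks are escaped as numeric character
    # references (NCRs), while plain ASCII letters pass through unchanged.
    text = 'Đaci uče'  # hypothetical sample; Đ and č do not exist in Windows-1251

    encoded = text.encode('windows-1251', errors='xmlcharrefreplace')
    print(encoded)  # b'&#272;aci u&#269;e'

    # The same text in UTF-8 needs no escaping at all:
    print(text.encode('utf-8'))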
Btw, for Safari on Mac, I'm unable to see any effect of switching locale: it is always Windows-1252 (Latin) - switching used to have an effect before ... But maybe there is a parameter I'm unaware of - like Apple's knowledge of where in the world I live ...

> I think that defaulting to UTF-8 is always a bug, because at the time
> these localizations were launched, there should have been no unlabeled
> UTF-8 legacy, because up until these locales were launched, no
> browsers defaulted to UTF-8 (broadly speaking). I think defaulting to
> UTF-8 is harmful, because it makes it possible for locale-siloed
> unlabeled UTF-8 content come to existence

The current legacy encodings nevertheless create siloed pages already. I'm also not sure that such a UTF-8 silo would be a problem: UTF-8 is possible for browsers to detect - Chrome seems to perform more such detection than other browsers.

Today, perhaps especially for English users, it happens all the time that a page - without notice - is decoded with the default encoding; and when the browser is used as an authoring tool, it likewise defaults to Windows-1252:

http://twitter.com/#!/komputist/status/144834229610614784

(I suppose he used that browser-based spec authoring tool that is in development.) In another message you suggested I 'lobby' against authoring tools. OK. But the browser is also an authoring tool. So how can we have authors output UTF-8 by default, without changing the parsing default?

> (instead of guiding all Web
> authors always to declare their use of UTF-8 so that the content works
> with all browser locale configurations).

One must guide authors to do this regardless.

> I have tried to lobby internally at Mozilla for stricter localizer
> oversight here but have failed. (I'm particularly worried about
> localizers turning the heuristic detector on by default for their
> locale when it's not absolutely needed, because that's actually
> performance-sensitive and less likely to be corrected by the user.
> Therefore, turning the heuristic detector on may do performance
> reputation damage.)

W.r.t. the heuristic detector: testing Firefox's default encoding behaviour was difficult. In the end I understood that I had to delete the cached version of the profile folder - only then would the encodings 'fall back' properly. But before I got that far, I tried, e.g., the Russian version of Firefox, and discovered that it enables the encoding heuristics - thus it worked! Had it not done that, it would instead have used Windows-1252 as the default ... So you perhaps need to be careful before telling localizers to disable heuristics ...

Btw: in Firefox, in one sense it is impossible to disable "automatic" character detection: an encoding override only lasts until the next reload. However, I just discovered that in Opera this is not the case: if you select Windows-1252 in Opera, then the page will always - though only in the current tab - be treated as Windows-1252, even if there is a BOM and everything. In a way, Opera's behaviour makes you want to avoid setting the encoding manually in Opera. Browsers are surprisingly different in these details ...

> (Note that zh-TW seems to be an exception to general observation that
> the locale's language has no browser-supported legacy encoding.
> However, zh-TW enables the universal heuristic encoding detector by
> default, so the fallback encoding matters less.)

Serbian has - or rather, there exists - a browser-supported legacy encoding: Windows-1251. I have not evaluated Croatian properly in that regard.
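To return to the detection point above, here is a minimal Python sketch of why unlabeled UTF-8 is detectable - not any browser's actual algorithm, and the sample word and the fallback parameter are just assumptions for illustration:

    # UTF-8's multi-byte patterns are so strict that non-ASCII text in a
    # legacy encoding almost never validates as UTF-8, so trying UTF-8
    # first is close to risk-free. (Pure-ASCII bytes also validate as
    # UTF-8, but that is harmless: the encodings agree in the ASCII range.)

    def sniff_encoding(data: bytes, locale_fallback: str = 'windows-1252') -> str:
        """Return 'utf-8' if the bytes are valid UTF-8, else the locale fallback."""
        try:
            data.decode('utf-8')  # strict decoding by default
            return 'utf-8'
        except UnicodeDecodeError:
            return locale_fallback

    sample = 'ћирилица'  # hypothetical Serbian Cyrillic sample

    print(sniff_encoding(sample.encode('utf-8')))                         # utf-8
    print(sniff_encoding(sample.encode('windows-1251'), 'windows-1251'))  # windows-1251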
Test page for checking the encoding default:
http://malform.no/testing/encodingdefault

Leif H Silli

Received on Thursday, 8 December 2011 14:33:54 UTC