- From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
- Date: Wed, 14 Oct 2009 05:45:25 +0200
- To: Andrew Cunningham <andrewc@vicnet.net.au>
- CC: Henri Sivonen <hsivonen@iki.fi>, Maciej Stachowiak <mjs@apple.com>, Ian Hickson <ian@hixie.ch>, Mark Davis ☕ <mark@macchiato.com>, Martin J. Dürst <duerst@it.aoyama.ac.jp>, "Phillips, Addison" <addison@amazon.com>, Richard Ishida <ishida@w3.org>, "public-html@w3.org" <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>, Larry Masinter <masinter@adobe.com>
Andrew Cunningham On 09-10-14 03.53:
> Leif Halvard Silli wrote:
>> Andrew Cunningham On 09-10-12 16.12:
>>> also not surprised by the indian localisations, had to be either
>>> utf-8 or win-1252.
[....]
>> Do you have example pages for wrong Indian language pages?
>
> Most modern Indian language websites use Unicode, so my comments were
> referring to legacy content, considering that is the context Ian has
> been referring to. If I understand correctly. Although not sure why
> HTML5 is concerning itself with legacy content, since its unlikely that
> HTML5 spec. can cover all the needs of all legacy content, best to just
> get HTML5 content right.

The reason, as far as I have picked up, is market share. And the
"poster child" here is Windows-1252. That browsers would lose market
share if they stopped treating ISO-8859-1 as Windows-1252 is fine; I
understand that. I also understand why browsers in my locale default to
the same encoding. But from there to say that Greek web browsers should
default to ISO-8859-1? Why? Henri? Ian? For reading of Greek texts?

> Personally I'm more concerned about limitations of correctly display
> some Unicode content than I am about supporting

L10n often meddles with the i18n, it seems. E.g. when Apple localized
OS X for Russian, Thunderbird started displaying Russian subject lines
as question marks unless you used the Russian locale.

> just going throw a few online Indian language newspapers that aren't in
> utf-8, pages fall into two categories
>
> 1) No encoding declaration - so use what ever the browser default is,
> e.g. http://www.aajkaal.net/

Whose ass and which browser's market share is saved by defaulting that
page to Windows-1252? Can someone explain, please?

> 2) Declare as iso-8859-1 (which browsers treat as win-1252), e.g.
> http://www.aajkaal.net/ http://www.abasar.net/ and http://www.manoramaonline.com/ http://www.anandabazar.com/
>
> 3) declare encoding as x-user-defined, e.g.
> http://www.anandabazar.com/
>
> although at least in IE (English UI) x-user-defined is parsed as
> Windows-1252, so in that version of the browser declaring x-user-defined
> was effectively the same as declaring iso-8859-1 or windows-1252.
>
> Which is why a lot of legacy content in some SE Asian scripts was always
> delivered as images or PDF files, rather than as text in HTML documents.

Which are served just as well as UTF-8?

> Browsers assumed a win-1252 fall back so it was impossible to markup up
> content in some languages using legacy content. The Karen languages
> tended to fall into this category, and content is still delivered this
> way by key websites in that language, although bloggers are migrating to
> using pseudo-Unicode font solutions.

What do you mean by "pseudo-Unicode"?

> Interetsing to note that there is
> limited take up of Unicode 5.1+ solutions for Karen, since web browsers
> are unable to correctly render or display Karen Unicode documents using
> existing fonts that support the karen languages. Partly this is due to
> limitations in CSS and in web browsers.
>
> And I'm not sure how web browsers will be able to deal with the Unicode
> vs pseudo-Unicode divisions occurring in Burmese, Karen, Mon and Shan
> web content. I suspect that for these languages, browser developers have
> limited or no knowledge of how the web developer community is developing
> content in these languages or what encodings are in use. Or that there
> is even a Unicode vs pseudo-Unicode content distinction in these languages.

Forgive me for being occupied with those languages which are already
supported. Here is some criticism of Mozilla: I find that Mozilla's
choice of default encodings appears, in a way, to have been put together
by accident. Or, rather, I wonder whether they have used the same
principles all the time. Probably not, because different localization
experts may see things differently. And how fresh are their judgments?
I already asked why the Greek locale (el) defaults to
Windows-1252/ISO-8859-1. And in another letter I asked why iso-8859-5 is
the default for Belarusian. Also see the table that I added at the end
of my last reply to Ian.

More fundamentally: Why is "UTF-8", in fact, defined as "legacy" in your
table, Ian, for some languages? Isn't UTF-8 detectable in the first
place? It seems obvious to me that UTF-8 has been defined as default for
the former Yugoslavia because of the good effects that it has. I am
missing a similar attitude towards many of the other languages that
currently are defined as having win1252 as default.
--
leif halvard silli
Received on Wednesday, 14 October 2009 03:46:09 UTC