- From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
- Date: Wed, 14 Oct 2009 05:45:25 +0200
- To: Andrew Cunningham <andrewc@vicnet.net.au>
- CC: Henri Sivonen <hsivonen@iki.fi>, Maciej Stachowiak <mjs@apple.com>, Ian Hickson <ian@hixie.ch>, Mark Davis ☕ <mark@macchiato.com>, Martin J. Dürst <duerst@it.aoyama.ac.jp>, "Phillips, Addison" <addison@amazon.com>, Richard Ishida <ishida@w3.org>, "public-html@w3.org" <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>, Larry Masinter <masinter@adobe.com>
Andrew Cunningham On 09-10-14 03.53:
> Leif Halvard Silli wrote:
>> Andrew Cunningham On 09-10-12 16.12:
>>> also not surprised by the indian localisations, had to be either
>>> utf-8 or win-1252.
[....]
>> Do you have example pages for wrong Indian language pages?
>
> Most modern Indian language websites use Unicode, so my comments were
> referring to legacy content, considering that is the context Ian has
> been referring to. If I understand correctly. Although not sure why
> HTML5 is concerning itself with legacy content, since it's unlikely that
> HTML5 spec. can cover all the needs of all legacy content, best to just
> get HTML5 content right.
The reason, as far as I have picked up, is market share,
and the "poster child" here is Windows-1252.
That browsers would lose market share if they stopped
treating ISO-8859-1 as Windows-1252 is fine; I understand that. I also
understand why browsers in my locale default to the same encoding.
But from there to say that Greek web browsers should default to
ISO-8859-1? Why? Henri? Ian? For reading of Greek texts?
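A minimal Python sketch of what such a default does to Greek text, assuming a page saved as ISO-8859-7 with no encoding declaration and a browser that falls back to Windows-1252. The sample word and the variable names are made up for illustration; this is not taken from any browser's code.

```python
# Hypothetical sample: a Greek word as it would sit on disk in ISO-8859-7.
greek = "Ελληνικά"
raw = greek.encode("iso8859_7")

# What a Windows-1252 fallback shows: every byte maps to some Latin
# character, so nothing fails, but the text is unreadable.
as_cp1252 = raw.decode("cp1252")

# What a Greek-appropriate fallback shows: the original text.
as_greek = raw.decode("iso8859_7")

print(as_cp1252)
print(as_greek)
```

The decode never raises an error in either case, which is why a wrong default goes unnoticed by the software and is only visible to the reader.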
> Personally I'm more concerned about limitations of correctly display
> some Unicode content than I am about supporting
L10n often meddles with the i18n, it seems. E.g. when Apple
localized OS X for Russian, Thunderbird started displaying Russian
subject lines as question marks unless you used the Russian locale.
> just going through a few online Indian language newspapers that aren't in
> utf-8, pages fall into two categories
>
> 1) No encoding declaration - so use whatever the browser default is,
> e.g. http://www.aajkaal.net/
Whose ass and which browser's market share is saved by defaulting
that page to Windows-1252? Can someone explain, please?
> 2) Declare as iso-8859-1 (which browsers treat as win-1252), e.g.
> http://www.aajkaal.net/ http://www.abasar.net/ and http://www.manoramaonline.com/ http://www.anandabazar.com/
>
> 3) declare encoding as x-user-defined, e.g. http://www.anandabazar.com/
>
> although at least in IE (English UI) x-user-defined is parsed as
> Windows-1252, so in that version of the browser declaring x-user-defined
> was effectively the same as declaring iso-8859-1 or windows-1252.
>
> Which is why a lot of legacy content in some SE Asian scripts was always
> delivered as images or PDF files, rather than as text in HTML documents.
Which are served just as well as UTF-8?
> Browsers assumed a win-1252 fall back so it was impossible to mark up
> content in some languages using legacy content. The Karen languages
> tended to fall into this category, and content is still delivered this
> way by key websites in that language, although bloggers are migrating to
> using pseudo-Unicode font solutions.
What do you mean by "pseudo-Unicode"?
> Interesting to note that there is
> limited take up of Unicode 5.1+ solutions for Karen, since web browsers
> are unable to correctly render or display Karen Unicode documents using
> existing fonts that support the Karen languages. Partly this is due to
> limitations in CSS and in web browsers.
>
> And I'm not sure how web browsers will be able to deal with the Unicode
> vs pseudo-Unicode divisions occurring in Burmese, Karen, Mon and Shan
> web content. I suspect that for these languages, browser developers have
> limited or no knowledge of how the web developer community is developing
> content in these languages or what encodings are in use. Or that there
> is even a Unicode vs pseudo-Unicode content distinction in these languages.
Forgive me for being occupied with those languages which are
already supported. Here is some criticism of Mozilla:
I find that Mozilla's choice of default encodings appears, in a way,
accidentally put together. Or, rather, I wonder whether they have
used the same principles all the time. Probably not, because
different localization experts may see things differently. And
how fresh are their judgments? I already asked why the Greek
locale (el) defaults to Windows-1252/ISO-8859-1. And in another
letter I asked why ISO-8859-5 is the default for Belarusian. Also see
the table that I added at the end of my last reply to Ian.
More fundamentally: Why is "UTF-8", in fact, defined as "legacy"
in your table, Ian, for some languages? Isn't UTF-8 detectable in
the first place?
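On the detectability question, a minimal Python sketch, assuming "detectable" means that a byte stream which decodes as strict UTF-8 and contains non-ASCII bytes is very unlikely to be anything else. The function name looks_like_utf8 is hypothetical, and the sniffing that real browsers do is more involved than this.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is valid UTF-8 and contains non-ASCII bytes."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    # Pure ASCII is valid in UTF-8 and in Windows-1252 alike, so it
    # tells us nothing; require at least one byte outside ASCII.
    return any(b >= 0x80 for b in data)


print(looks_like_utf8("বাংলা".encode("utf-8")))   # Bengali text stored as UTF-8
print(looks_like_utf8("café".encode("cp1252")))   # Windows-1252 bytes
print(looks_like_utf8(b"plain ascii"))            # undecidable, treated as "no"
```

The check works because text in a legacy single-byte encoding almost never happens to form well-formed UTF-8 multi-byte sequences, so a strict decode that succeeds on non-ASCII input is strong evidence.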
It seems obvious to me that UTF-8 has been defined as default for
the former Yugoslavia because of the good effects that it has. I
am missing a similar attitude towards many of the other languages
that currently are defined as having win1252 as default.
--
leif halvard silli
Received on Wednesday, 14 October 2009 03:46:06 UTC