Re: HTML5 Issue 11 (encoding detection): I18N WG response... from Leif Halvard Silli on 2009-10-14 (public-html@w3.org from October 2009)

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Wed, 14 Oct 2009 05:45:25 +0200
To: Andrew Cunningham <andrewc@vicnet.net.au>
CC: Henri Sivonen <hsivonen@iki.fi>, Maciej Stachowiak <mjs@apple.com>, Ian Hickson <ian@hixie.ch>, Mark Davis ☕ <mark@macchiato.com>, Martin J. Dürst <duerst@it.aoyama.ac.jp>, "Phillips, Addison" <addison@amazon.com>, Richard Ishida <ishida@w3.org>, "public-html@w3.org" <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>, Larry Masinter <masinter@adobe.com>
Message-ID: <4AD54955.6020002@xn--mlform-iua.no>

Andrew Cunningham On 09-10-14 03.53:

> Leif Halvard Silli wrote:
>> Andrew Cunningham On 09-10-12 16.12:
>>> also not surprised by the indian localisations, had to be either 
>>> utf-8 or win-1252.

    [....]

>> Do you have example pages for wrong Indian language pages?
> 
> Most modern Indian language websites use Unicode, so my comments were 
> referring to legacy content, considering that is the context Ian has 
> been referring to. If I understand correctly. Although not sure why 
> HTML5 is concerning itself with legacy content, since it's unlikely that 
> HTML5 spec. can cover all the needs of all legacy content, best to just 
> get HTML5 content right.

The reason, as far as I have picked up, is market share. 
And the "poster child" here is Windows-1252.

And that browsers would lose market share if they stopped 
treating ISO-8859-1 as Win 1252 is fine - I understand that. I also 
understand why browsers in my locale default to the same encoding.
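
For anyone unsure what the ISO-8859-1/Win 1252 conflation amounts 
to, here is a small sketch in Python. The sample bytes are made up 
for illustration. The two encodings differ only in the range 
0x80-0x9F, where ISO-8859-1 has C1 control characters and 
Windows-1252 has printable characters such as curly quotes.

```python
# Hypothetical bytes from a page that is labelled iso-8859-1.
sample = b"\x93legacy\x94"

# Read literally as ISO-8859-1, 0x93 and 0x94 are C1 control characters.
as_latin1 = sample.decode("iso-8859-1")

# Read as Windows-1252, the same bytes are curly quotation marks.
as_cp1252 = sample.decode("cp1252")

print(repr(as_latin1))
print(repr(as_cp1252))
```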

But from there to say that Greek web browsers should default to 
ISO-8859-1? Why? Henri? Ian? For reading of Greek texts?

> Personally I'm more concerned about limitations of correctly display 
> some Unicode content than I am about supporting

L10n often meddles with the i18n, it seems. E.g. when Apple 
localized OS X for Russian, Thunderbird started displaying Russian 
subject lines as question marks unless you used the Russian locale.

> just going through a few online Indian language newspapers that aren't in 
> utf-8, pages fall into three categories
> 
> 1) No encoding declaration - so use whatever the browser default is, 
> e.g. http://www.aajkaal.net/

Whose ass and which browser's market share is saved by defaulting 
that page to Windows-1252? Can someone explain, please?

> 2) Declare as iso-8859-1 (which browsers treat as win-1252), e.g. 
>  http://www.aajkaal.net/ http://www.abasar.net/ and http://www.manoramaonline.com/ http://www.anandabazar.com/
> 
> 3)  declare encoding as x-user-defined, e.g. http://www.anandabazar.com/
> 
> although at least in IE (English UI) x-user-defined is parsed as 
> Windows-1252, so in that version of the browser declaring x-user-defined 
> was effectively the same as declaring iso-8859-1 or windows-1252.
> 
> Which is why a lot of legacy content in some SE Asian scripts was always 
> delivered as images or PDF files, rather than as text in HTML documents. 

Which are served just as well as UTF-8?

> Browsers assumed a win-1252 fallback so it was impossible to mark up 
> content in some languages using legacy content. The Karen languages 
> tended to fall into this category, and content is still delivered this 
> way by key websites in that language, although bloggers are migrating to 
> using pseudo-Unicode font solutions.

What do you mean by "pseudo-Unicode"?

> Interesting to note that there is 
> limited take up of Unicode 5.1+ solutions for Karen, since web browsers 
> are unable to correctly render or display Karen Unicode documents using 
> existing fonts that support the Karen languages. Partly this is due to 
> limitations in CSS and in web browsers.
> 
> And I'm not sure how web browsers will be able to deal with the Unicode 
> vs pseudo-Unicode divisions occurring in Burmese, Karen, Mon and Shan 
> web content. I suspect that for these languages, browser developers have 
> limited or no knowledge of how the web developer community is developing 
> content in these languages or what encodings are in use. Or that there 
> is even a Unicode vs pseudo-Unicode content distinction in these languages.

Forgive me for being occupied with those languages which are 
already supported. Here is some criticism of Mozilla:

I find that Mozilla's choice of default encodings appears, in a 
way, to be accidentally put together. Or, rather, I wonder whether 
they have used the same principles all the time. Probably not, 
because different localization experts may see things differently. 
And how fresh are their judgments? I already asked why the Greek 
locale (el) defaults to Windows-1252/ISO-8859-1. And in another 
letter I asked why iso-8859-5 is the default for Belarusian. Also 
see the table that I added at the end of my last reply to Ian.

More fundamentally: Why is "UTF-8", in fact, defined as "legacy" 
in your table, Ian, for some languages? Isn't UTF-8 detectable in 
the first place?
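
To illustrate what I mean by "detectable", here is a minimal 
sketch in Python, assuming that a strict UTF-8 decode is an 
acceptable test. The function name looks_like_utf8 is made up for 
this example, and what browsers actually do when sniffing (e.g. 
looking only at a prefix of the stream) is more involved.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data contains non-ASCII bytes and decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # Text in a legacy 8-bit encoding rarely forms valid UTF-8 sequences.
        return False
    # Pure ASCII also decodes as UTF-8, but it says nothing about the encoding.
    return any(byte >= 0x80 for byte in data)


greek = "Ελληνικά"
print(looks_like_utf8(greek.encode("utf-8")))
print(looks_like_utf8(greek.encode("iso-8859-7")))
```

The point is only that non-ASCII text in a legacy 8-bit encoding 
almost never happens to be well-formed UTF-8, so the check is 
cheap and has few false positives.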

It seems obvious to me that UTF-8 has been defined as default for 
the former Yugoslavia because of the good effects that it has. I 
am missing a similar attitude towards many of the other languages 
that currently are defined as having win1252 as default.
-- 
leif halvard silli

Received on Wednesday, 14 October 2009 03:46:09 UTC