Re: HTML5 Issue 11 (encoding detection): I18N WG response... from Andrew Cunningham on 2009-10-14 (public-html@w3.org from October 2009)

From: Andrew Cunningham <andrewc@vicnet.net.au>
Date: Wed, 14 Oct 2009 15:34:21 +1100
To: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
CC: Henri Sivonen <hsivonen@iki.fi>, Maciej Stachowiak <mjs@apple.com>, Ian Hickson <ian@hixie.ch>, Mark Davis ☕ <mark@macchiato.com>, Martin J. Dürst <duerst@it.aoyama.ac.jp>, "Phillips, Addison" <addison@amazon.com>, Richard Ishida <ishida@w3.org>, "public-html@w3.org" <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>, Larry Masinter <masinter@adobe.com>
Message-ID: <4AD554CD.6000100@vicnet.net.au>

Leif Halvard Silli wrote:

> Andrew Cunningham On 09-10-14 03.53:
>
>
> The reason, as much as I have picked up, is about market shares. And 
> the "poster child" here is Windows-1252.
>
I realise that, but if market share is the issue, then the reality is
that Microsoft is setting the trends here, having the lion's share of
the market in terms of OS. And if you look at Microsoft policy, all new
languages, if not encompassed by an existing code page, are ONLY
supported via Unicode. It's been said often enough, in enough forums,
over the years.


>>
>>
>> 3)  declare encoding as x-user-defined, e.g. http://www.anandabazar.com/
>>
>> although at least in IE (English UI) x-user-defined is parsed as 
>> Windows-1252, so in that version of the browser declaring 
>> x-user-defined was effectively the same as declaring iso-8859-1 or 
>> windows-1252.
>>
>> Which is why a lot of legacy content in some SE Asian scripts was 
>> always delivered as images or PDF files, rather than as text in HTML 
>> documents. 
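
To make the fallback concrete, here is a minimal Python sketch of what
happens when glyph-encoded 8-bit text is decoded as Windows-1252, which
is how IE treated x-user-defined as described above. The byte values are
made up for illustration and do not come from any real font encoding;
decode_as_fallback is a hypothetical helper name.

```python
# Hypothetical bytes from a glyph-encoded 8-bit font (illustrative values only).
legacy_bytes = bytes([0xC1, 0xE9, 0xF1])


def decode_as_fallback(data: bytes) -> str:
    """Decode bytes the way a browser falling back to Windows-1252 would."""
    return data.decode("cp1252")


text = decode_as_fallback(legacy_bytes)
# The result is Latin letters, not the script the author intended.
# It only "looks right" if the reader has the matching hacked font installed.
print(text)
```

Without the specific legacy font, the reader sees accented Latin
letters, which is one reason publishers fell back to images or PDF.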
>
>
> Which are served just as well as UTF-8?
>
>> Browsers assumed a win-1252 fallback, so it was impossible to mark
>> up content in some languages using legacy content. The Karen 
>> languages tended to fall into this category, and content is still 
>> delivered this way by key websites in that language, although 
>> bloggers are migrating to using pseudo-Unicode font solutions.
>
>
> What do you mean by "pseudo-Unicode"?
>
Pseudo-Unicode is the practice of remapping glyph-based 8-bit legacy
encodings to Unicode fonts. In terms of the Myanmar script (for Burmese,
etc.) this means remapping some glyphs to actual Unicode codepoints and
assigning other glyphs to codepoints in the same block that are unused
by the language in question, or to the PUA, with glyphs accessed
directly by codepoint.

Unicode uses a character-based model.

Pseudo-Unicode uses a glyph-based model that in many instances reassigns
glyphs to codepoints required by other languages using the same script.

For instance, with Burmese, the majority of online content uses a
pseudo-Unicode font that reuses codepoints required for Mon, S'gaw
Karen, Shan and other languages. Pseudo-Unicode data cannot be correctly
displayed or read with Unicode-capable fonts, either the Unicode 4.1/5.0
version fonts or the Unicode 5.1+ fonts.
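
As a rough illustration of why that reuse matters, the Python sketch
below lists the codepoints in the Myanmar block (U+1000..U+109F) whose
Unicode character names mention Mon, S'gaw Karen or Shan. It only
queries the character database shipped with Python. It does not say
which of these codepoints any particular pseudo-Unicode font actually
reuses, and the function name is my own.

```python
import unicodedata

MYANMAR_BLOCK = range(0x1000, 0x10A0)  # Unicode Myanmar block, U+1000..U+109F
LANGUAGE_TAGS = ("MON", "SGAW KAREN", "SHAN")


def codepoints_by_language():
    """Group Myanmar-block codepoints by the language named in their Unicode name."""
    found = {tag: [] for tag in LANGUAGE_TAGS}
    for cp in MYANMAR_BLOCK:
        # Unassigned codepoints get an empty name and are skipped.
        name = unicodedata.name(chr(cp), "")
        padded = f" {name} "
        for tag in LANGUAGE_TAGS:
            # Match whole words only, e.g. " MON " but not "MONTH".
            if f" {tag} " in padded:
                found[tag].append(cp)
    return found


for tag, cps in codepoints_by_language().items():
    print(tag, [f"U+{cp:04X}" for cp in cps])
```

Any glyph-based font that parks Burmese glyph variants on codepoints in
these lists makes the same text unusable for the language that actually
needs them.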

At the moment pseudo-Unicode is more common for Burmese web content than
Unicode, and in some projects this has led to splintering, e.g. the
Burmese Wikipedia project that uses Unicode 5.1 vs a splinter group that
created a new wiki using pseudo-Unicode. It's a political issue in the
Burmese web development and IT communities.


>
> Forgive me for being occupied with those languages which are already 
> supported. Here is some Mozilla critic:
>
Nothing to forgive; I spent many, many years myself concerned about
those languages. But there are many languages whose needs are forgotten
by developers and specification writers.

Andrew

-- 
Andrew Cunningham
Senior Manager, Research and Development
Vicnet
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000

Ph: +61-3-8664-7430
Fax: +61-3-9639-2175

Email: andrewc@vicnet.net.au
Alt email: lang.support@gmail.com

http://home.vicnet.net.au/~andrewc/
http://www.openroad.net.au
http://www.vicnet.net.au
http://www.slv.vic.gov.au

Received on Wednesday, 14 October 2009 04:35:15 UTC