Re: HTML5 Issue 11 (encoding detection): I18N WG response...

Leif Halvard Silli wrote:
> Andrew Cunningham On 09-10-12 16.12:
>> also not surprised by the indian localisations, had to be either
>> utf-8 or win-1252. and guess win-1252 is a logical choice since
>> firefox doesn't really support legacy encodings for Indian
>> languages, and good percentage of legacy content in indian
>> languages is misidentifying itself as iso-8859-1 or windows-1252
>> and relying on styling.
>
> Styling? You mean, the good old "font tag considered harmful" effect? 
> Is that even possible to get to work any more? I know that Hebrew on 
> the Web used to apply similar tricks - I think they used  "the default 
> latin encoding" and then "turned the text". But still, win-1252 isn't 
> the default encoding of Hebrew?!
>
> Do you have example pages for wrong Indian language pages?

Most modern Indian language websites use Unicode, so my comments were
referring to legacy content, since that is the context Ian has been
referring to, if I understand correctly. Although I'm not sure why
HTML5 is concerning itself with legacy content: it's unlikely that the
HTML5 spec can cover all the needs of all legacy content, so it would
be best to just get HTML5 content right.

Personally I'm more concerned about the limitations in correctly
displaying some Unicode content than I am about supporting legacy
content.

Just going through a few online Indian language newspapers that aren't
in UTF-8, pages fall into three categories:

1) No encoding declaration - so use whatever the browser default is, 
e.g. http://www.aajkaal.net/

2) Declare as iso-8859-1 (which browsers treat as win-1252), e.g. 
http://www.abasar.net/ and http://www.manoramaonline.com/

3) Declare encoding as x-user-defined, e.g. http://www.anandabazar.com/

Although, at least in IE (English UI), x-user-defined is parsed as
Windows-1252, so in that version of the browser declaring x-user-defined
was effectively the same as declaring iso-8859-1 or windows-1252.
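
As a rough illustration of how the three categories above all end up at
the same decoder, here is a minimal Python sketch. The helper names
(declared_charset, effective_encoding), the regular expression, and the
windows-1252 default are my own assumptions for illustration, not how any
particular browser actually implements this; x-user-defined is included
only to mirror the IE (English UI) behaviour described above.

```python
import re

# Assumption: labels that end up decoded as windows-1252 in the browser
# behaviour described above. Other browsers and locales may differ.
TREATED_AS_WIN1252 = {"iso-8859-1", "windows-1252", "x-user-defined"}

# Hypothetical, deliberately simple pattern for a charset in a <meta> tag.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def declared_charset(head: bytes):
    """Return the lower-cased charset label from a <meta> tag, or None."""
    match = META_CHARSET.search(head)
    return match.group(1).decode("ascii").lower() if match else None


def effective_encoding(head: bytes, browser_default: str = "windows-1252") -> str:
    """Guess which decoder a page would get under the assumptions above."""
    label = declared_charset(head)
    if label is None:
        # Category 1: no declaration, so the browser default applies.
        return browser_default
    if label in TREATED_AS_WIN1252:
        # Categories 2 and 3: declared label is treated as windows-1252.
        return "windows-1252"
    return label
```

Under these assumptions, a page with no declaration, a page declaring
iso-8859-1 and a page declaring x-user-defined all report windows-1252,
while a page declaring UTF-8 reports utf-8.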

This win-1252 treatment is why a lot of legacy content in some SE Asian
scripts was always delivered as images or PDF files, rather than as text
in HTML documents. Browsers assumed a win-1252 fallback, so it was
impossible to mark up content in some languages using legacy encodings.
The Karen languages tended to fall into this category, and content is
still delivered this way by key websites in those languages, although
bloggers are migrating to pseudo-Unicode font solutions. Interesting to
note that there is limited take-up of Unicode 5.1+ solutions for Karen,
since web browsers are unable to correctly render Karen Unicode
documents using existing fonts that support the Karen languages. Partly
this is due to limitations in CSS and in web browsers.

And I'm not sure how web browsers will be able to deal with the Unicode 
vs pseudo-Unicode divisions occurring in Burmese, Karen, Mon and Shan 
web content. I suspect that for these languages, browser developers have 
limited or no knowledge of how the web developer community is developing 
content in these languages or what encodings are in use. Or that there 
is even a Unicode vs pseudo-Unicode content distinction in these languages.

Andrew

-- 
Andrew Cunningham
Senior Manager, Research and Development
Vicnet
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000

Ph: +61-3-8664-7430
Fax: +61-3-9639-2175

Email: andrewc@vicnet.net.au
Alt email: lang.support@gmail.com

http://home.vicnet.net.au/~andrewc/
http://www.openroad.net.au
http://www.vicnet.net.au
http://www.slv.vic.gov.au

Received on Wednesday, 14 October 2009 01:53:47 UTC