Re: Encoding: Referring people to a list of labels from Martin J. Dürst on 2014-01-26 (www-international@w3.org from January to March 2014)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Sun, 26 Jan 2014 15:08:46 +0900
To: Andrew Cunningham <lang.support@gmail.com>
CC: www-international@w3.org, Richard Ishida <ishida@w3.org>
Message-ID: <52E4A66E.4050001@it.aoyama.ac.jp>

Hello Andrew,

On 2014/01/25 16:39, Andrew Cunningham wrote:
> On 25/01/2014 6:06 PM, Martin J. Dürst <duerst@it.aoyama.ac.jp> wrote:

>> On 2014/01/25 6:23, Andrew Cunningham wrote:

>>> Most of the cases of contemporary uses of legacy encodings I know of
>>
>>
>> Can you give examples?
>>
>
> The key ones off the top of my head are KNU Version 2, used by the major
> international S'gaw Karen news service for their website.

Can you give a pointer or two?

> KNU version 1 is more common, though, and is used by some publishers.

Again, pointer appreciated.

> Some S'gaw content is in Unicode,  rare though. Some S'gaw blogs are using
> pseudo-Unicode solutions.  These would identify as UTF-8 but are not
> Unicode.
>
> Similar problem with Burmese where more than 50% of web content is
> pseudo-Unicode.

There was an interesting talk about this phenomenon at the 
Internationalization and Unicode Conference last year by Brian Kemler 
and Craig Cornelius from Google. The abstract is at 
http://www.unicodeconference.org/iuc37/program-d.htm#S8-3. It would be 
good to know how this work has progressed, or whether there's a publicly 
available version of the slides.

> Most eastern Cham content is using 8-bit encodings,  a number of different
> encodings depending on the site.

Again, pointers appreciated.

> Uptake of Cham Unicode is limited, mainly due to the fact that it can't be
> supported on most mobile devices.

"can't be supported" sounds too negative. "isn't supported" would be 
better. Or is there a technical reason that mobile devices can't do it?

> Cham block missing 7 characters for Western Cham.

Where in the pipeline are they?

> Waiting for inclusion of Pahawh Hmong and Leke scripts.
>
> Pahawh is next version.

You mean Unicode 7.0? Good to see progress.

> Leke is quite a while off. So 8-bit is the only way to
> go for that. And there are multiple encodings out there representing
> different versions of the script.

Virtually every script (/language) went through such a period.

>>> involve encodings not registered with IANA.
>>
>>
>> It's not really difficult to register an encoding if it exists and is
>> reasonably documented. Please give it a try or encourage others to give it
>> a try.
>>
>
> Problem is usually there is no documentation,  only a font.

Then it should be easy to create a Web page documenting the font. With 
the same 16x16 table, you can essentially document any 8-bit encoding. 
And font download these days also works quite well in many browsers.
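
To make the 16x16 table idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the function name `encoding_table`, the font family passed in, and the page layout are all hypothetical, and the font is assumed to be installed or served separately. The page is written as raw bytes and labelled iso-8859-1, so each cell holds the same byte a legacy page would contain.

```python
import html


def encoding_table(font_family: str, title: str) -> bytes:
    """Build an HTML page (as raw bytes) with a 16x16 grid of byte values.

    font_family is a hypothetical name for the legacy font being documented.
    """
    # Bytes that are markup-significant in HTML must be escaped.
    escapes = {0x26: b"&amp;", 0x3C: b"&lt;", 0x3E: b"&gt;"}
    head = (
        "<!DOCTYPE html>\n"
        '<meta charset="iso-8859-1">\n'
        f"<title>{html.escape(title)}</title>\n"
        f"<style>td {{ font-family: '{font_family}'; }}</style>\n"
        "<table>\n"
    ).encode("ascii", "xmlcharrefreplace")

    # Header row: low nibble of each byte value.
    column_labels = b"".join(b"<th>x%X</th>" % lo for lo in range(16))
    rows = [b"<tr><th></th>" + column_labels + b"</tr>\n"]

    for hi in range(16):
        cells = []
        for lo in range(16):
            byte = hi * 16 + lo
            if byte < 0x20 or byte == 0x7F:
                # Control characters have no glyph; leave the cell empty.
                cells.append(b"<td></td>")
            else:
                cells.append(b"<td>" + escapes.get(byte, bytes([byte])) + b"</td>")
        rows.append(b"<tr><th>%Xx</th>" % hi + b"".join(cells) + b"</tr>\n")

    return head + b"".join(rows) + b"</table>\n"
```

Writing the result to a file in binary mode and opening it in a browser with the font available should show one glyph per byte value. Bytes 0x80 to 0x9F may need extra care, because browsers handle the iso-8859-1 label as windows-1252.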

> Each font,  even from same font developer may be a different encoding.
>
> Just for S'gaw I'd have to go through 50-100 fonts and work out how many
> encodings there are.  Many more than I'd like.
>
> Documenting and listing encodings would be a large task.

Okay, then there's even more reason for working on and pushing towards 
Unicode and UTF-8.

>> I hope that's iso-8859-1, not iso-859-1, even if that's still a blatant
>> lie.
>
> Yes,  iso-8859-1
>
> A lie?  Probably,  but considering web browsers only support a small
> handful of encodings that have been used on the web,  the only way to get
> such content to work is by deliberately misidentifying it.

I know.
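
As a small illustration of why the mislabelling works at all, the following Python sketch shows that every byte value survives a round trip through an iso-8859-1 decoder, whereas the same bytes would often be rejected as UTF-8. The helper names are hypothetical, and this uses Python's codecs, which are not necessarily identical to what a browser does with the same label.

```python
def roundtrip_as_latin1(raw: bytes) -> bytes:
    """Decode bytes as iso-8859-1 and encode them back again."""
    return raw.decode("iso-8859-1").encode("iso-8859-1")


def is_valid_utf8(raw: bytes) -> bool:
    """Report whether the bytes form a well-formed UTF-8 sequence."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Every possible byte value maps to exactly one code point and back,
# so arbitrary 8-bit font-encoded content passes through unchanged.
all_bytes = bytes(range(256))
assert roundtrip_as_latin1(all_bytes) == all_bytes

# A lone high byte followed by a space is not well-formed UTF-8.
assert not is_valid_utf8(bytes([0xE9, 0x20]))
```

The decoder never fails, so the text reaches the font untouched; the price is that the declared encoding says nothing true about the content.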

> The majority of legacy encodings have always had to do this.

In that sense, I don't think that "majority" will ever change.

> To make it worse what happens in real life is that many such web pages use
> two encodings.  One for content and one for HTML markup
>
> I.e., a page in KNU v. 2 will have content in KNU, but KNU isn't ASCII
> compatible, so markup is in a separate encoding.

Well, the browser thinks it's iso-8859-1 anyway, so at least these parts 
are not lying :-(.

Regards,   Martin.

Received on Sunday, 26 January 2014 06:09:31 UTC