Re: Encoding: Referring people to a list of labels from Andrew Cunningham on 2014-01-25 (www-international@w3.org from January to March 2014)

From: Andrew Cunningham <lang.support@gmail.com>
Date: Sat, 25 Jan 2014 18:39:00 +1100
To: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Cc: www-international@w3.org, Richard Ishida <ishida@w3.org>
Message-ID: <CAGJ7U-VS=hCaEeU_yUACawta4inJ1Cns4qeEYdZx0R2HOHWiCg@mail.gmail.com>

On 25/01/2014 6:06 PM, Martin J. Dürst <duerst@it.aoyama.ac.jp> wrote:
>
> Hello Andrew,
>
>
> On 2014/01/25 6:23, Andrew Cunningham wrote:
>>
>> Hi Richard,
>>
>> Most of the cases of contemporary uses of legacy encodings I know of
>
>
> Can you give examples?
>

The key ones off the top of my head: KNU version 2, used by the major
international S'gaw Karen news service for its website.

KNU version 1 is more common, though, and is used by some publishers.

Some S'gaw content is in Unicode, though that is rare. Some S'gaw blogs use
pseudo-Unicode solutions. These identify as UTF-8 but are not
Unicode.

There is a similar problem with Burmese, where more than 50% of web content
is pseudo-Unicode.
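
To illustrate why the UTF-8 label says nothing about whether content is real
Unicode, here is a minimal Python sketch. It assumes a pseudo-Unicode font
that stores the Myanmar vowel sign E (U+1031) before the consonant, in visual
order, where Unicode stores it after the consonant. The helper names are
hypothetical, and the check is a crude hint, not a reliable detector.

```python
# Both strings are well-formed UTF-8, so a validator cannot tell them apart.
LOGICAL = "\u1000\u1031"  # MYANMAR LETTER KA + VOWEL SIGN E (Unicode order)
VISUAL = "\u1031\u1000"   # vowel sign first (assumed pseudo-Unicode order)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def starts_with_vowel_sign_e(text: str) -> bool:
    """Hypothetical hint: some whitespace-separated chunk begins with U+1031.

    In Unicode order a chunk should not start with this dependent vowel
    sign, so a leading U+1031 suggests visual-order storage.
    """
    return any(chunk.startswith("\u1031") for chunk in text.split())


for sample in (LOGICAL, VISUAL):
    print(is_valid_utf8(sample.encode("utf-8")), starts_with_vowel_sign_e(sample))
```

Both samples pass UTF-8 validation; only the ordering of the code points
differs, which is why such pages can only be told apart by looking at the
content itself.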

Most Eastern Cham content uses 8-bit encodings, with a number of different
encodings depending on the site.

Uptake of Cham Unicode is limited, mainly because it can't be supported on
most mobile devices.

The Cham block is missing 7 characters for Western Cham.

Still waiting for inclusion of the Pahawh Hmong and Leke scripts.

Pahawh Hmong is in the next version. Leke is quite a while off, so 8-bit is
the only way to go for that, and there are multiple encodings out there
representing different versions of the script.

>
>> involve encodings not registered with IANA.
>
>
> It's not really difficult to register an encoding if it exists and is
> reasonably documented. Please give it a try or encourage others to give it
> a try.
>

The problem is that usually there is no documentation, only a font.

Each font, even from the same font developer, may use a different encoding.

Just for S'gaw I'd have to go through 50-100 fonts and work out how many
encodings there are. Many more than I'd like.

Documenting and listing encodings would be a large task.

>
>> Historical solutions are to just identify these encodings as iso-859-1 /
>> windows-1252
>
>
> I hope that's iso-8859-1, not iso-859-1, even if that's still a blatant
> lie.
>

Yes, iso-8859-1.

A lie? Probably, but considering that web browsers support only a small
handful of the encodings that have been used on the web, the only way to get
such content to work is by deliberately misidentifying it.
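
A minimal Python sketch of why the mislabelling works at all: under
ISO-8859-1 every byte value maps to exactly one character, so arbitrary 8-bit
font-encoded data survives decoding unchanged and the font decides what is
drawn. This uses Python's "latin-1" codec only; how a given browser treats
the iso-8859-1 label is not modelled here, and the function name is made up
for the example.

```python
def survives_latin1_round_trip(raw: bytes) -> bool:
    """Check that the bytes pass through ISO-8859-1 decoding unchanged."""
    text = raw.decode("latin-1")  # never raises: all 256 byte values are mapped
    # Each character's code point equals the original byte value.
    same_values = [ord(ch) for ch in text] == list(raw)
    return same_values and text.encode("latin-1") == raw


# Every possible byte value, standing in for arbitrary font-encoded content.
print(survives_latin1_round_trip(bytes(range(256))))
```

The text layer carries no real characters in this arrangement, only byte
values that a specific font turns into the intended glyphs.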

The majority of legacy encodings have always had to do this.

To make it worse, what happens in real life is that many such web pages use
two encodings: one for content and one for HTML markup.

I.e. a page in KNU v. 2 will have content in KNU, but KNU isn't ASCII
compatible, so the markup is in a separate encoding.

Andrew

Received on Saturday, 25 January 2014 07:39:28 UTC