
Re: Encoding: Referring people to a list of labels

From: Andrew Cunningham <lang.support@gmail.com>
Date: Mon, 27 Jan 2014 13:11:26 +1100
Message-ID: <CAGJ7U-X+Qg_+nvTSqsPoUc820N157LEH4obRAudU1RS49=xjoA@mail.gmail.com>
To: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Cc: www-international@w3.org, Richard Ishida <ishida@w3.org>
Hi Martin,
On 26/01/2014 5:09 PM, Martin J. Dürst <duerst@it.aoyama.ac.jp> wrote:
>
> Hello Andrew,
>
>
> On 2014/01/25 16:39, Andrew Cunningham wrote:
>>
>> On 25/01/2014 6:06 PM, Martin J. Dürst<duerst@it.aoyama.ac.jp>  wrote:
>
>
>>> On 2014/01/25 6:23, Andrew Cunningham wrote:
>
>
>>>> Most of the cases of contemporary uses of legacy encodings I know of
>>>
>>>
>>>
>>> Can you give examples?
>>>
>>
>> The key ones off the top of my head are KNU Version 2, used by the major
>> international S'gaw Karen news service for their website.
>
>
> Can you give a pointer or two?
>

For KNU2:

Kwe Ka Lu
http://kwekalu.net/

The main S'gaw Karen newspaper outside of Myanmar/Burma.

>> Although KNU version 1 is more common, and is used by some publishers.
>
>
> Again, pointer appreciated.

For KNU1:
Drum Publications
http://www.drumpublications.org/dictionary.php?look4e=water&look4k=&submit=Lookup#

The main S'gaw Karen–English dictionary.

>> Some S'gaw content is in Unicode, though it is rare. Some S'gaw blogs
>> are using pseudo-Unicode solutions. These identify as UTF-8 but are not
>> Unicode.
>>
>> Similar problem with Burmese where more than 50% of web content is
>> pseudo-Unicode.
>
>
> There was an interesting talk about this phenomenon at the
> Internationalization and Unicode Conference last year by Brian Kemler and
> Craig Cornelius from Google. The abstract is at
> http://www.unicodeconference.org/iuc37/program-d.htm#S8-3. It would be
> good to know how this work has progressed, or whether there's a publicly
> available version of the slides.
>
>

It would be useful to have access to the paper, although their reference
to the MM3 font in the abstract worries me; it suggests they might still
have lessons to learn.

They discuss Burmese, but there are also pseudo-Unicode solutions for the
Shan/Tai, Mon, and Karen languages.

It had started to look like Unicode might replace them, but the prevalence
of mobile platforms has revived pseudo-Unicode.
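
As a rough illustration of how pseudo-Unicode can be told apart from real
Unicode at all: visual-order fonts in the Zawgyi tradition store vowel sign
E (U+1031) before its consonant, which never occurs in logical-order
Unicode Burmese. A toy heuristic in Python (a sketch only; production
detectors use far richer statistical models, and this single cue will
misclassify edge cases):

```python
# Toy pseudo-Unicode detector, illustration only.
# In logical-order Unicode Burmese, vowel sign E (U+1031) always follows
# a consonant (U+1000..U+1021) or a medial (U+103B..U+103E); visual-order
# pseudo-Unicode fonts store it before the consonant instead.
MYANMAR_CONSONANTS = set(range(0x1000, 0x1022))
MYANMAR_MEDIALS = set(range(0x103B, 0x103F))

def looks_visual_order(text: str) -> bool:
    """Return True if any U+1031 is not preceded by a consonant or medial."""
    valid_before = MYANMAR_CONSONANTS | MYANMAR_MEDIALS
    for i, ch in enumerate(text):
        if ord(ch) == 0x1031 and (i == 0 or ord(text[i - 1]) not in valid_before):
            return True
    return False
```

A real detector would score many such cues across a whole document rather
than flag on a single occurrence.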

>> Most eastern Cham content is using 8-bit encodings, a number of
>> different encodings depending on the site.
>
>
> Again, pointers appreciated.
>
>
I will send links when I am back in the office. Public holidays here.

>> Uptake of Cham Unicode is limited, mainly due to the fact that it can't
>> be supported on most mobile devices.
>
>
> "can't be supported" sounds too negative. "isn't supported" would be
> better. Or is there a technical reason that mobile devices can't do it?
>

"Isn't supported" may be better. As yet there is no official guidance in
the OpenType (OT) documentation on which OT features should be used.

So one option is to apply more commonly supported OpenType features to the
Cham script; although this may work in hb-ng, it may not work in other
renderers.

Likewise, the DFLT script tag could be used, but only a very limited set of
features is available.

The next issue is which OS version is needed, and for Android most devices
run older versions.

And how up-to-date the rendering system is.

Then there is the issue of how to get fonts onto the system, which usually
requires rooting or jailbreaking a device, and may require software piracy
as well.

>> Cham block missing 7 characters for Western Cham.
>
>
> Where in the pipeline are they?
>

Not in the pipeline. I am working on a draft proposal in my spare time.

>
>> Waiting for inclusion of the Pahawh Hmong and Leke scripts.
>>
>> Pahawh Hmong is in the next version.
>
>
> You mean Unicode 7.0? Good to see progress.
>
Yes, that's my understanding.

>> Leke is quite a while off, so 8-bit is the only way to
>> go for that. And there are multiple encodings out there representing
>> different versions of the script.
>
>
> Virtually every script (/language) went through such a period.
>
>

Yes, although there are quite a few scripts in that category, even if they
are in Unicode.

The issue is the lag between a script being in Unicode and OS and device
vendors supporting it.

Considering many vendors don't even fully support everything in Unicode 5.1
yet, and 7.0 is around the corner...

>>>> involve encodings not registered with IANA.
>>>
>>>
>>>
>>> It's not really difficult to register an encoding if it exists and is
>>> reasonably documented. Please give it a try or encourage others to
>>> give it a try.
>>>
>>>
>>
>> The problem is usually that there is no documentation, only a font.
>
>
> Then it should be easy to create a Web page documenting the font. With
> the same 16x16 table, you can essentially document any 8-bit encoding.
> And font download these days also works quite well in many browsers.
>

That part is simple, but insufficient by itself.

Ideally you need to document the glyph-to-Unicode codepoint(s) mapping.

And map all necessary reorderings of character sequences.

We have done TECkit mappings for some Karen fonts and are working on some
Cham mappings as well.

And when I have spare time I will work on porting a set of Karen and Cham
legacy fonts to Unicode.
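
To make concrete what such a mapping has to capture, here is a minimal
Python sketch. The byte values and the single visual-to-logical reordering
rule are invented for illustration; an actual TECkit mapping for a real
Karen or Cham font needs a full byte table and many more reordering rules:

```python
# Invented, two-entry byte->codepoint table for an imaginary legacy font.
LEGACY_MAP = {
    0x61: "\u1031",  # vowel sign E, stored visually *before* its consonant
    0x6B: "\u1000",  # consonant KA
}

def to_unicode(data: bytes) -> str:
    """Map legacy bytes to codepoints, then reorder E+consonant pairs."""
    chars = [LEGACY_MAP[b] for b in data]
    out = []
    i = 0
    while i < len(chars):
        # Visual order (E, consonant) becomes logical order (consonant, E).
        if chars[i] == "\u1031" and i + 1 < len(chars):
            out += [chars[i + 1], chars[i]]
            i += 2
        else:
            out.append(chars[i])
            i += 1
    return "".join(out)
```

The hard part in practice is discovering the table and the rules font by
font, which is exactly the documentation gap being discussed.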

>
>> Each font, even from the same font developer, may be a different encoding.
>>
>> Just for S'gaw I'd have to go through 50-100 fonts and work out how many
>> encodings there are.  Many more than I'd like.
>>
>> Documenting and listing encodings would be a large task.
>
>
> Okay, then there's even more reason for working on and pushing towards
> Unicode and UTF-8.
>

I totally agree.

We are working on building blocks:

* mappings to convert data to Unicode
* fonts that fit language-specific typographic requirements
* input systems that match user expectations and facilitate uptake of
  Unicode
* JavaScript scripts to overcome limitations in web browsers
* collation routines
* locale development

etc.

>>> I hope that's iso-8859-1, not iso-859-1, even if that's still a blatant
>>> lie.
>>
>> Yes, iso-8859-1.
>>
>> A lie? Probably, but considering web browsers only support a small
>> handful of the encodings that have been used on the web, the only way to
>> get such content to work is by deliberately misidentifying it.
>
>
> I know.
>
>
>> The majority of legacy encodings have always had to do this.
>
>
> In that sense, I don't think that "majority" will ever change.
>
>
>> To make it worse, what happens in real life is that many such web pages
>> use two encodings: one for the content and one for the HTML markup.
>>
>> I.e. a page in KNU v. 2 will have content in KNU, but KNU isn't ASCII
>> compatible, so the markup is in a separate encoding.
>
>
> Well, the browser thinks it's iso-8859-1 anyway, so at least these parts
> are not lying :-(.
>

Yep
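
For what it's worth, the reason the mis-labelling "works" at the byte level
is that iso-8859-1 decoding is lossless: every byte 0x00–0xFF maps to a
codepoint, so the font-encoded content bytes survive a decode/encode round
trip. A small Python sketch, with invented content bytes:

```python
# Hypothetical page fragment: ASCII markup around font-encoded content bytes.
raw = b'<p class="knu">\x8a\x9b\xe6</p>'

# iso-8859-1 decoding never fails: byte 0xNN simply becomes U+00NN.
text = raw.decode("iso-8859-1")
assert text.encode("iso-8859-1") == raw  # lossless round trip

# So the markup can be processed as text while the original content
# bytes remain recoverable for later conversion to real Unicode.
content = text[len('<p class="knu">'):-len("</p>")]
legacy_bytes = content.encode("iso-8859-1")
```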
Received on Monday, 27 January 2014 02:11:59 UTC
