- From: Andrew Cunningham <lang.support@gmail.com>
- Date: Mon, 27 Jan 2014 13:11:26 +1100
- To: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Cc: www-international@w3.org, Richard Ishida <ishida@w3.org>
- Message-ID: <CAGJ7U-X+Qg_+nvTSqsPoUc820N157LEH4obRAudU1RS49=xjoA@mail.gmail.com>
Hi Martin,

On 26/01/2014 5:09 PM, Martin J. Dürst <duerst@it.aoyama.ac.jp> wrote:
>
> Hello Andrew,
>
> On 2014/01/25 16:39, Andrew Cunningham wrote:
>>
>> On 25/01/2014 6:06 PM, Martin J. Dürst <duerst@it.aoyama.ac.jp> wrote:
>>
>>> On 2014/01/25 6:23, Andrew Cunningham wrote:
>>>
>>>> Most of the cases of contemporary uses of legacy encodings I know of
>>>
>>> Can you give examples?
>>>
>> The key ones off the top of my head are KNU Version 2, used by the major
>> international S'gaw Karen news service for their website.
>
> Can you give a pointer or two?
>

For KNU2: Kwe Ka Lu, http://kwekalu.net/ (the main S'gaw Karen newspaper
outside of Myanmar/Burma).

>> Although KNU version 1 is more common, and is used by some publishers.
>
> Again, pointer appreciated.

For KNU1: Drum Publications,
http://www.drumpublications.org/dictionary.php?look4e=water&look4k=&submit=Lookup#
(the main S'gaw Karen-English dictionary).

>> Some S'gaw content is in Unicode, though it is rare. Some S'gaw blogs
>> are using pseudo-Unicode solutions. These identify as UTF-8 but are not
>> Unicode.
>>
>> There is a similar problem with Burmese, where more than 50% of web
>> content is pseudo-Unicode.
>
> There was an interesting talk about this phenomenon at the
> Internationalization and Unicode Conference last year by Brian Kemler
> and Craig Cornelius from Google. The abstract is at
> http://www.unicodeconference.org/iuc37/program-d.htm#S8-3. It would be
> good to know how this work has progressed, or whether there's a publicly
> available version of the slides.
>

It would be useful to have access to the paper, although the reference to
the MM3 font in the abstract worries me that they may still have lessons
to learn. And they discuss Burmese, but there are also pseudo-Unicode
solutions for the Shan/Tai, Mon, and Karen languages as well as Burmese.
It had started to look like Unicode might have replaced them, but the
prevalence of mobile platforms has revived pseudo-Unicode.

>> Most Eastern Cham content is using 8-bit encodings, with a number of
>> different encodings depending on the site.
>
> Again, pointers appreciated.
>

I will send links when I am back in the office. Public holidays here.

>> Uptake of Cham Unicode is limited, mainly due to the fact that it can't
>> be supported on most mobile devices.
>
> "can't be supported" sounds too negative. "isn't supported" would be
> better. Or is there a technical reason that mobile devices can't do it?
>

"Isn't supported" may be better. As yet there is no official guidance in
the OpenType documentation on which OT features should be used. So the
options are to apply more common Indic features to the Cham script;
although this may work in hb-ng, it may not work in other renderers.
Likewise, the DFLT script could be used, but only a very limited set of
features is available. The next issue is which version of the OS is
needed (for Android most devices use older versions) and how up to date
the rendering system is. Then there is the issue of how to get fonts onto
the system, which usually requires rooting or jailbreaking a device, and
may require software piracy as well.

>> The Cham block is missing 7 characters for Western Cham.
>
> Where in the pipeline are they?
>

Not in the pipeline. I am working on a draft proposal in my spare time.

>> Waiting for inclusion of the Pahawh Hmong and Leke scripts.
>>
>> Pahawh is in the next version.
>
> You mean Unicode 7.0? Good to see progress.
>

Yes, that's my understanding.

>> Leke is quite a while off. So 8-bit is the only way to go for that.
>> And there are multiple encodings out there representing different
>> versions of the script.
>
> Virtually every script (/language) went through such a period.
>

Yes, although there are quite a few scripts in that category, even if
they are in Unicode. The issue is the lag between being in Unicode and OS
and device vendors supporting them. Considering many vendors don't even
fully support everything in Unicode 5.1 yet, and 7.0 is around the
corner....

>>>> involve encodings not registered with IANA.
>>>
>>> It's not really difficult to register an encoding if it exists and is
>>> reasonably documented. Please give it a try or encourage others to
>>> give it a try.
>>>
>> Problem is usually there is no documentation, only a font.
>
> Then it should be easy to create a Web page documenting the font. With
> the same 16x16 table, you can essentially document any 8-bit encoding.
> And font download these days also works quite well in many browsers.
>

That part is simple, but insufficient by itself. Ideally you need to
document the mapping from each glyph to its Unicode codepoint(s), and map
all the necessary reorderings of character sequences. We have done TECkit
mappings for some Karen fonts and are working on some Cham mappings as
well. And when I have spare time I will work on porting a set of Karen
and Cham legacy fonts to Unicode.

>> Each font, even from the same font developer, may be a different
>> encoding.
>>
>> Just for S'gaw I'd have to go through 50-100 fonts and work out how
>> many encodings there are. Many more than I'd like.
>>
>> Documenting and listing encodings would be a large task.
>
> Okay, then there's even more reason for working on and pushing towards
> Unicode and UTF-8.
>

I totally agree. We are working on the building blocks:

* mappings to convert data to Unicode
* fonts that fit the language-specific typographic requirements
* input systems that match user expectations and facilitate uptake of Unicode
* JavaScript to overcome limitations in web browsers
* collation routines
* locale development
* etc.

>>> I hope that's iso-8859-1, not iso-859-1, even if that's still a
>>> blatant lie.
>>
>> Yes, iso-8859-1.
>>
>> A lie? Probably, but considering web browsers only support a small
>> handful of the encodings that have been used on the web, the only way
>> to get such content to work is by deliberately misidentifying it.
>
> I know.
>
>> The majority of legacy encodings have always had to do this.
>
> In that sense, I don't think that "majority" will ever change.
>
>> To make it worse, what happens in real life is that many such web pages
>> use two encodings: one for the content and one for the HTML markup.
>>
>> I.e. a page in KNU v. 2 will have content in KNU, but KNU isn't
>> ASCII-compatible, so the markup is in a separate encoding.
>
> Well, the browser thinks it's iso-8859-1 anyway, so at least these parts
> are not lying :-(.

Yep
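PS: to make the mapping work described above a little more concrete, here
is a very rough sketch (in Python rather than TECkit syntax) of the kind
of conversion such a mapping ends up encoding. The byte values and the
single reordering rule are invented for illustration only; they are not
the actual KNU 1 or KNU 2 assignments, and a real mapping covers the full
byte range and needs several reordering and contextual rules.

    import re

    # Hypothetical byte -> Unicode assignments, for illustration only.
    # A real table for a legacy Karen font would cover most of 0x00-0xFF.
    LEGACY_TO_UNICODE = {
        0x75: "\u1000",  # e.g. a consonant
        0x6D: "\u1019",  # e.g. another consonant
        0x61: "\u1031",  # e.g. the vowel sign E, stored in visual order
    }

    def legacy_to_unicode(data: bytes) -> str:
        # 1. Map each byte; unmapped bytes (ASCII markup, punctuation)
        #    pass through unchanged, just as they do under the
        #    iso-8859-1 "lie".
        text = "".join(LEGACY_TO_UNICODE.get(b, chr(b)) for b in data)
        # 2. Reorder: legacy fonts store the vowel sign E before the
        #    consonant (visual order); Unicode stores it after the
        #    consonant (logical order).
        return re.sub("\u1031([\u1000-\u1021])", "\\1\u1031", text)

    # vowel byte + consonant byte -> consonant + vowel sign in Unicode
    print(legacy_to_unicode(bytes([0x61, 0x75])))

In practice we would express this declaratively in a TECkit map rather
than in code, but the shape of the problem is the same: a byte-to-
codepoint table plus reordering/contextual rules.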
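And to illustrate the two-encodings-in-one-page point: because every byte
survives an iso-8859-1 round trip unchanged, a converter can treat the
whole page as iso-8859-1, leave the ASCII markup alone, and apply a table
like the one above only to the text runs. Again, the content bytes here
are the invented ones from the sketch above, not real KNU data.

    # The markup is plain ASCII, so it reads correctly under the
    # iso-8859-1 label; only the element text carries the font-specific
    # legacy encoding and needs converting.
    raw = b'<p class="body">\x61\x75</p>'

    page = raw.decode("iso-8859-1")              # lossless: one char per byte
    assert page.startswith('<p class="body">')   # markup is untouched

    text_run = raw[len(b'<p class="body">'):-len(b'</p>')]
    print(legacy_to_unicode(text_run))           # only the content is remapped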
Received on Monday, 27 January 2014 02:11:59 UTC