- From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Date: Sun, 26 Jan 2014 15:08:46 +0900
- To: Andrew Cunningham <lang.support@gmail.com>
- CC: www-international@w3.org, Richard Ishida <ishida@w3.org>
Hello Andrew,

On 2014/01/25 16:39, Andrew Cunningham wrote:
> On 25/01/2014 6:06 PM, Martin J. Dürst <duerst@it.aoyama.ac.jp> wrote:
>> On 2014/01/25 6:23, Andrew Cunningham wrote:
>>> Most of the cases of contemporary uses of legacy encodings I know of
>>
>> Can you give examples?
>
> The key ones off the top of my head are KNU Version 2, used by the major international S'gaw Karen news service for their website.

Can you give a pointer or two?

> Although KNU version 1 is more common, and is used by some publishers.

Again, pointer appreciated.

> Some S'gaw content is in Unicode, rare though. Some S'gaw blogs are using pseudo-Unicode solutions. These would identify as UTF-8 but are not Unicode.
>
> Similar problem with Burmese, where more than 50% of web content is pseudo-Unicode.

There was an interesting talk about this phenomenon at the Internationalization and Unicode Conference last year by Brian Kemler and Craig Cornelius from Google. The abstract is at http://www.unicodeconference.org/iuc37/program-d.htm#S8-3. It would be good to know how this work has progressed, or whether there's a publicly available version of the slides.

> Most eastern Cham content is using 8-bit encodings, a number of different encodings depending on the site.

Again, pointers appreciated.

> Uptake of Cham Unicode is limited, mainly due to the fact that it can't be supported on most mobile devices.

"can't be supported" sounds too negative. "isn't supported" would be better. Or is there a technical reason that mobile devices can't do it?

> The Cham block is missing 7 characters for Western Cham.

Where in the pipeline are they?

> Waiting for inclusion of the Pahawh Hmong and Leke scripts.
>
> Pahawh is in the next version.

You mean Unicode 7.0? Good to see progress.

> Leke is quite a while off, so 8-bit is the only way to go for that. And there are multiple encodings out there representing different versions of the script.

Virtually every script (/language) went through such a period.

>>> involve encodings not registered with IANA.
>>
>> It's not really difficult to register an encoding if it exists and is reasonably documented. Please give it a try or encourage others to give it a try.
>
> The problem is that usually there is no documentation, only a font.

Then it should be easy to create a Web page documenting the font. With the same 16x16 table, you can essentially document any 8-bit encoding (a rough sketch of such a page follows at the end of this message). And font download these days also works quite well in many browsers.

> Each font, even from the same font developer, may be a different encoding.
>
> Just for S'gaw I'd have to go through 50-100 fonts and work out how many encodings there are. Many more than I'd like.
>
> Documenting and listing encodings would be a large task.

Okay, then there's even more reason for working on and pushing towards Unicode and UTF-8.

>> I hope that's iso-8859-1, not iso-859-1, even if that's still a blatant lie.
>
> Yes, iso-8859-1.
>
> A lie? Probably, but considering web browsers only support a small handful of the encodings that have been used on the web, the only way to get such content to work is by deliberately misidentifying it.

I know.

> The majority of legacy encodings have always had to do this.

In that sense, I don't think that "majority" will ever change.

> To make it worse, what happens in real life is that many such web pages use two encodings: one for the content and one for the HTML markup.
>
> I.e. a page in KNU v. 2 will have content in KNU, but KNU isn't ASCII-compatible, so the markup is in a separate encoding.
Well, the browser thinks it's iso-8859-1 anyway, so at least these parts are not lying :-(.

Regards,
Martin.
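
P.S. For what it's worth, the reason the deliberate iso-8859-1 label "works" at all is that iso-8859-1 proper maps every byte 0x00-0xFF one-to-one onto the code points U+0000-U+00FF, so the bytes reach the hack font unchanged (browsers in practice treat the label as windows-1252, but every byte still gets some fixed code point). A quick illustration in Python, nothing more than a sanity check:

data = bytes(range(256))
decoded = data.decode('iso-8859-1')
# every byte round-trips, and each byte value maps to the code point
# with the same number
assert decoded.encode('iso-8859-1') == data
assert all(ord(ch) == b for b, ch in zip(data, decoded))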
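
P.P.S. To make the 16x16 table suggestion above a bit more concrete, here is a rough sketch of a small Python script that generates such a documentation page for one of these fonts. The font file name ("KNU2.woff"), the output file name, and the assumption that the font puts its glyphs on the ASCII/Latin-1 code points are all made up for illustration; the real byte-to-glyph assignments would have to come from inspecting the actual font.

rows = []
for hi in range(16):
    cells = []
    for lo in range(16):
        byte = hi * 16 + lo
        if byte < 0x20 or 0x7F <= byte <= 0x9F:
            glyph = ''                      # leave the control ranges blank
        else:
            # hack fonts of this kind usually put their glyphs on the
            # ASCII/Latin-1 code points, so code point == byte value
            glyph = '&#x%02X;' % byte
        cells.append('<td>%s<br><small>0x%02X</small></td>' % (glyph, byte))
    rows.append('<tr><th>0x%X_</th>%s</tr>' % (hi, ''.join(cells)))

html  = '<!DOCTYPE html>\n<meta charset="utf-8">\n<style>\n'
html += '@font-face { font-family: Legacy; src: url("KNU2.woff"); }\n'
html += 'td { font-family: Legacy; border: 1px solid #ccc; padding: 4px; }\n'
html += 'small { font-family: monospace; color: #888; }\n'
html += '</style>\n<table>\n' + '\n'.join(rows) + '\n</table>\n'

with open('legacy-encoding-table.html', 'w', encoding='utf-8') as f:
    f.write(html)

Once a page like that exists, writing down the byte-to-Unicode mapping, and eventually registering the encoding, becomes a much smaller task.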
Received on Sunday, 26 January 2014 06:09:31 UTC