- From: Andrew Cunningham <lang.support@gmail.com>
- Date: Sat, 25 Jan 2014 18:39:00 +1100
- To: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Cc: www-international@w3.org, Richard Ishida <ishida@w3.org>
- Message-ID: <CAGJ7U-VS=hCaEeU_yUACawta4inJ1Cns4qeEYdZx0R2HOHWiCg@mail.gmail.com>
On 25/01/2014 6:06 PM, Martin J. Dürst <duerst@it.aoyama.ac.jp> wrote:

> Hello Andrew,
>
> On 2014/01/25 6:23, Andrew Cunningham wrote:
>>
>> Hi Richard,
>>
>> Most of the cases of contemporary uses of legacy encodings I know of
>
> Can you give examples?

The key ones off the top of my head are KNU version 2, used by the major international S'gaw Karen news service for their website, although KNU version 1 is more common and is used by some publishers. Some S'gaw content is in Unicode, though that is rare. Some S'gaw blogs are using pseudo-Unicode solutions: these identify as UTF-8 but are not Unicode. There is a similar problem with Burmese, where more than 50% of web content is pseudo-Unicode.

Most Eastern Cham content uses 8-bit encodings, a number of different encodings depending on the site. Uptake of Cham Unicode is limited, mainly due to the fact that it can't be supported on most mobile devices, and the Cham block is missing 7 characters needed for Western Cham.

We are waiting for the inclusion of the Pahawh Hmong and Leke scripts. Pahawh is in the next version; Leke is quite a while off. So 8-bit is the only way to go for those, and there are multiple encodings out there representing different versions of the script.

>> involve encodings not registered with IANA.
>
> It's not really difficult to register an encoding if it exists and is reasonably documented. Please give it a try or encourage others to give it a try.

The problem is that usually there is no documentation, only a font. Each font, even from the same font developer, may be a different encoding. Just for S'gaw I'd have to go through 50-100 fonts and work out how many encodings there are. Many more than I'd like. Documenting and listing the encodings would be a large task.

>> Historical solutions are to just identify these encodings as iso-859-1 /
>> windows-1252
>
> I hope that's iso-8859-1, not iso-859-1, even if that's still a blatant lie.

Yes, iso-8859-1.

A lie?
Probably, but considering that web browsers support only a small handful of the encodings that have been used on the web, the only way to get such content to work is by deliberately misidentifying it. The majority of legacy encodings have always had to do this.

To make it worse, what happens in real life is that many such web pages use two encodings: one for the content and one for the HTML markup. I.e. a page in KNU v2 will have content in KNU, but KNU isn't ASCII-compatible, so the markup is in a separate encoding.

Andrew
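The mislabelling works because of a property of iso-8859-1 (latin-1) rather than by accident: every byte value 0x00-0xFF maps to exactly one code point, so the raw font-encoded bytes survive the browser's decode step unchanged and reach the legacy font intact. A minimal Python sketch of the idea (the byte values are illustrative, not a real KNU sample):

```python
# Hypothetical content as it might appear in an undocumented
# font-specific 8-bit encoding; here we just use all 256 byte values.
font_encoded = bytes(range(256))

# What a page labelled "iso-8859-1" effectively gets: a lossless
# byte-to-code-point mapping, so nothing is corrupted in transit.
decoded = font_encoded.decode("latin-1")
assert decoded.encode("latin-1") == font_encoded  # perfect round trip

# Strict windows-1252, by contrast, leaves five bytes unassigned
# (0x81, 0x8D, 0x8F, 0x90, 0x9D), so a strict decode fails, although
# browsers map those bytes to C1 controls per the WHATWG Encoding
# Standard and so round-trip them in practice.
try:
    font_encoded.decode("cp1252")
except UnicodeDecodeError as e:
    print("strict cp1252 fails at byte", hex(font_encoded[e.start]))
```

This losslessness is exactly why "identify it as iso-8859-1 and rely on the font" became the standard workaround for scripts with no registered encoding.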
Received on Saturday, 25 January 2014 07:39:28 UTC