- From: Jungshik Shin <jshin@mailaps.org>
- Date: Fri, 4 Jul 2003 09:31:28 -0400 (EDT)
- To: "Kurosaka, Teruhiko" <Teruhiko.Kurosaka@iona.com>
- cc: <www-international@w3.org>
On Tue, 1 Jul 2003, Kurosaka, Teruhiko wrote:

> > Most, if not all, browsers **do** use Unicode (in one form or
> > another) as their internal character representation. Otherwise,
> > it's all but impossible to deal with bewildering arrays of legacy
> > encodings out in the wild.

> Netscape browser was supporting many legacy encodings
> before Unicode became popular. I don't think use of Unicode
> is a necessity to support legacy code sets, although it would make
> the internal design much easier.

I'm aware that it did, but I wouldn't say 'many' legacy encodings were supported (Netscape 2.x supported just a few). Netscape 4.x surely used Unicode as its internal character representation, and I'm not sure about Netscape 3.x. Nonetheless, I admit I went a bit too far in saying it's 'all but impossible'.

> > > but then displayed as a Yen sign on a Japanese system :-(.
> >
> > This is actually not a feature but a *bug* of Japanese and Korean
> > fonts included in MS Windows. Unicode cmaps in those TrueType fonts

> You may call it a bug.

How can't it be a bug? Please note that it's not 0x5C in legacy encodings but U+005C in Unicode that's at stake. When I view a web page or text document in UTF-8 (or UTF-16, UTF-32) with a Japanese (Korean) font included in Japanese (Korean) Windows, U+005C is rendered as YEN (WON) simply because the font has the glyph of the YEN (WON) SIGN for U+005C. That is, what I get for U+005C is solely dependent on what font I use. U+005C is REVERSE SOLIDUS, period, and I want it to be treated as such no matter what font (with a Unicode cmap) I use.

> But the reality is there are such many
> implementations that display U+005C that you cannot simply ignore,
> and they won't go away soon.

Implementations that 'display U+005C'? Again, I'm not sure what you meant by this. Did you mean implementations that render U+005C with the glyph for YEN, or implementations that treat U+005C as if it were YEN? As for the former, it's not the implementations but faulty fonts that give that 'illusion'. As for the latter, they're not compliant with Unicode/ISO 10646 and have to be fixed. For instance, an application localized to Japanese that uses U+005C for the currency sign is buggy.

If you meant that there are a lot of documents in Shift_JIS where 0x5C is meant as YEN, nobody would dispute that. There are lots of such documents in Shift_JIS. However, there are just as many documents in Shift_JIS in which 0x5C is used as REVERSE SOLIDUS. This has nothing to do with Unicode, though, and can never be an excuse for distributing the broken fonts I'm talking about. There's only one Unicode, without any ambiguity in the meaning of U+005C. The fact that 0x5C is overloaded in Shift_JIS presents a conversion hassle for legacy documents, but trying to solve that problem with broken fonts doesn't help with the conversion issue at all. It only results in more and more documents with an overloaded 0x5C and, even worse, an overloaded U+005C (which must not be overloaded).

Actually, this YEN/REVERSE SOLIDUS ambiguity is NOT a reason to keep on using Shift_JIS BUT a strong reason to switch over to Unicode as soon as possible, because Unicode doesn't have this ambiguity at all, provided that broken fonts are fixed. The switch-over would need some manual fix-ups (to tell REVERSE SOLIDUS from YEN), but once that's done, there's no more degeneracy to break.
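To make the distinction concrete, here is a minimal Python sketch (my own illustration, not part of the original exchange; it relies on Python's shift_jis codec, which maps the byte 0x5C to U+005C):

    import unicodedata

    # The overloaded Shift_JIS byte 0x5C decodes to REVERSE SOLIDUS;
    # the codec cannot know whether the author "meant" YEN.
    ch = b"\x5c".decode("shift_jis")
    print(hex(ord(ch)), unicodedata.name(ch))   # 0x5c REVERSE SOLIDUS

    # YEN SIGN is a distinct, unambiguous code point of its own.
    print(unicodedata.name("\u00a5"))           # YEN SIGN

    # Which glyph you *see* for U+005C is purely a property of the
    # font's cmap; the code point itself never changes.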
Back to the topic of this thread: based on my observations in Korea, I came up with a couple of reasons (other than that people don't have any incentive or need to switch, that the file/transfer size is bigger, and so forth) why UTF-8 is not as widely used as we think it should be. Note that Koreans don't have a prejudice against Unicode of the kind sometimes found among Japanese due to misunderstanding of the Hanzi/Kanji/Hanja unification. It's rather the opposite, in that Unicode was widely hailed as a new character set for the Korean script, free of the hindrances present in KS C 5601-based encodings (EUC-KR and such) that made it impossible to use the script to its full potential and 'expressive power'.

Although there are a number of Unicode-capable editors (as opposed to word processors) for the three major platforms these days, I found that most (Windows) users in Korea either don't know about them or have a hard time finding one that fits their needs. I was surprised to find that several popular shareware/freeware/commercial text editors for Korean still offer only two or three encodings for file operations: EUC-KR (WANSUNG), JOHAB and UHC (Windows-949). Nowhere is UTF-8 to be found. It looks as though the authors of those editors still lived in 1995.

Another possible cause is that one of the popular server-side scripting languages, PHP, didn't have good Unicode and multibyte encoding support (including UTF-8 [1]) until recently (version 4.x), and a lot of scripts are still based on PHP 3.x. The same is true of the widely used DBMS MySQL, which didn't support any multibyte encoding in 3.x. However, in the case of MySQL _without_ multibyte encoding support, UTF-8 is actually a better choice than legacy multibyte encodings: with UTF-8 there's no chance of hitting a false match in a DB search where the trailing byte of one character plus the lead byte of the next character happen to match a third, completely unrelated character (a short sketch below illustrates this).

Adding to these is the ignorance of online lecturers and authors of books on web authoring. Most of them 'preach' to their audiences to tag their documents with "EUC-KR" (and sometimes with the totally misleading 'ks_c_5601-1987', which should never have been used as a MIME charset name). Given that there are still a lot of Win 9x/ME users (Win98 seems to be the majority in Korea) and that stock tools like Notepad/Wordpad under the Unicode-savvy Win 2k/XP still give favorable [2] treatment to the legacy encoding corresponding to the default system locale (this can be changed, but not many web developers know that), it's not entirely their fault.

The situation on the Unix/Linux side is similar. Sun and IBM have shipped Solaris and AIX with UTF-8 locales for CJK for several years now (ko_KR.UTF-8 was one of the very first two UTF-8 locales for Solaris 2.6, along with en_US.UTF-8, which I believe is because EUC-KR is inadequate even for modern Korean unless the obscure 8-byte 'extension' in annex 3 of KS X 1001:1998/KS C 5601-1992 is implemented). Yet, surrounded by emails and web pages in legacy encodings, not many people had an incentive to switch. Besides, the adoption of Windows-949 in Korean Windows (which is upward compatible with EUC-KR and can represent the full repertoire of modern Korean syllables) effectively lengthened the life of EUC-KR. In the case of Linux, it was only about a year ago that its UTF-8 support became mature enough that I could tell ordinary users to switch.
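To illustrate that false-match point, here is a minimal Python sketch (again my own illustration; the byte values follow the standard EUC-KR mapping, and the substring test is the kind of raw byte-level match a multibyte-unaware DB search performs):

    # "가나", two 2-byte characters in EUC-KR.
    text = "가나".encode("euc-kr")     # b'\xb0\xa1' + b'\xb3\xaa'

    # Needle: trail byte of the 1st character + lead byte of the 2nd.
    needle = b"\xa1\xb3"
    print(needle in text)              # True: a false byte-level match
    print(needle.decode("euc-kr"))     # a valid but unrelated symbol

    # A byte-level search over UTF-8 cannot misfire this way: lead
    # bytes (0x00-0x7F, 0xC2-0xF4) and continuation bytes (0x80-0xBF)
    # occupy disjoint ranges, so no encoded character can begin in the
    # middle of another one.
    print(needle in "가나".encode("utf-8"))   # False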
It's unfortunate that last fall one of the major vendors of Linux distributions, RedHat, decided to ship RedHat Linux 8 without UTF-8 locales for CJK, while supporting zh_CN.gb18030 (which is just another UTF disguised as a 'legacy' encoding) and switching to UTF-8 for all other locales; see <https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=75829>. I believe RedHat 9 comes with UTF-8 locales for CJK.

To overcome these factors, some aggressive 'campaigning' and education seem to be necessary.

Jungshik

[1] Without multibyte encoding support, legacy encodings for CJK can still be used with PHP 3.x scripts, because in most cases the assumption that one octet corresponds to one column width (sometimes important in designing 'UI' elements) and other 'naive' assumptions (that don't hold for UTF-8) do hold.

[2] Unless a UTF-8 text file begins with a BOM (encoded in UTF-8), Notepad and Wordpad under Win2k/XP assume the encoding of the file to be the default system codepage. That is understandable, but there's no way to force them to use UTF-8 when opening a file (or after opening it), although you can save in UTF-8.
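Footnote [2] suggests an easy workaround when you control the file: put the UTF-8 signature at the front. A minimal Python sketch (my own illustration; the file name is hypothetical, and 'utf-8-sig' is Python's standard codec that writes the EF BB BF signature):

    # Write a UTF-8 file with the BOM/signature EF BB BF so that
    # BOM-sniffing editors such as Notepad pick UTF-8 instead of the
    # default system codepage.
    with open("readme-ko.txt", "w", encoding="utf-8-sig") as f:
        f.write("한글 text\n")

    with open("readme-ko.txt", "rb") as f:
        print(f.read(3))               # b'\xef\xbb\xbf'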
Received on Friday, 4 July 2003 09:31:49 UTC