- From: Jungshik Shin <jshin@mailaps.org>
- Date: Fri, 4 Jul 2003 09:31:28 -0400 (EDT)
- To: "Kurosaka, Teruhiko" <Teruhiko.Kurosaka@iona.com>
- cc: <www-international@w3.org>
On Tue, 1 Jul 2003, Kurosaka, Teruhiko wrote:

> > Most, if not all, browsers **do** use Unicode (in one form or
> > another) as their internal character representation. Otherwise,
> > it's all but impossible to deal with bewildering arrays of legacy
> > encodings out in the wild.

> Netscape browser was supporting many legacy encodings
> before Unicode became popular. I don't think use of Unicode
> is a necessity to support legacy code sets, although it would make
> the internal design much easier.

I'm aware that it did, but I wouldn't say 'many' legacy encodings were supported (Netscape 2.x supported just a few). Netscape 4.x surely used Unicode as its internal character representation, and I'm not sure about Netscape 3.x. Nonetheless, I admit I went a bit too far in saying it's 'all but impossible'.

> > > but then displayed as a Yen sign on a Japanese system :-(.
> >
> > This is actually not a feature but a *bug* of Japanese and Korean
> > fonts included in MS Windows. Unicode cmaps in those TrueType fonts

> You may call it a bug.

How can't it be a bug? Please note that it's not 0x5C in legacy encodings but U+005C in Unicode that's at stake. When I view a web page or text document in UTF-8 (or UTF-16, UTF-32) with a Japanese (Korean) font included in Japanese (Korean) Windows, U+005C is rendered as YEN (WON) simply because the font has the glyph of the YEN (WON) SIGN for U+005C. That is, what I get for U+005C is solely dependent on what font I use. U+005C is REVERSE SOLIDUS, period, and I want it to be treated as such no matter what font (with a Unicode cmap) I use.

> But the reality is there are such many
> implementations that display U+005C that you cannot simply ignore,
> and they won't go away soon.

Implementations that 'display U+005C'? Again, I'm not sure what you meant by this. Did you mean implementations that render U+005C with the glyph for YEN, or implementations that treat U+005C as if it were YEN? As for the former, it's not the implementations but faulty fonts that give that 'illusion'. As for the latter, they're not compliant with Unicode/ISO 10646 and have to be fixed. For instance, an application localized to Japanese that uses U+005C for the currency sign is buggy.

If you meant that there are a lot of documents in Shift_JIS where 0x5C is meant as YEN, nobody would dispute that. There are lots of such documents in Shift_JIS. However, there are just as many documents in Shift_JIS in which 0x5C is used as REVERSE SOLIDUS. This has nothing to do with Unicode, though, and can never be an excuse for distributing the broken fonts I'm talking about. There's only one Unicode, without any ambiguity in the meaning of U+005C. The fact that 0x5C is overloaded in Shift_JIS presents a conversion hassle for legacy documents, but trying to solve that problem with broken fonts doesn't help with the conversion issue at all. It only results in more and more documents with an overloaded 0x5C and, even worse, an overloaded U+005C (which must not be overloaded).

Actually, this YEN/REVERSE SOLIDUS ambiguity is NOT a reason to keep on using Shift_JIS BUT a strong reason to switch over to Unicode as soon as possible, because Unicode doesn't have this ambiguity at all, provided that broken fonts are fixed. The switch-over would need some manual fix-ups (to tell REVERSE SOLIDUS from YEN), but once that's done, there's no more degeneracy to break.
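To make the distinction concrete, here is a minimal Python sketch (my own illustration, not part of the original exchange; it relies on Python's shift_jis codec, which maps the byte 0x5C to U+005C):

    import unicodedata

    # The overloaded Shift_JIS byte 0x5C decodes to REVERSE SOLIDUS;
    # the codec cannot know whether the author "meant" YEN.
    ch = b"\x5c".decode("shift_jis")
    print(hex(ord(ch)), unicodedata.name(ch))   # 0x5c REVERSE SOLIDUS

    # YEN SIGN is a distinct, unambiguous code point of its own.
    print(unicodedata.name("\u00a5"))           # YEN SIGN

    # Which glyph you *see* for U+005C is purely a property of the
    # font's cmap; the code point itself never changes.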
Back to the topic of this thread: based on my observations in Korea, I came up with a couple of reasons (other than that people don't have any incentive or need to switch, that the file/transfer size is bigger, and so forth) why UTF-8 is not as widely used as we think it should be. Note that Koreans don't have a prejudice against Unicode of the kind sometimes found among Japanese due to misunderstanding of the Hanzi/Kanji/Hanja unification. It's rather the opposite, in that Unicode was widely hailed as a new character set for the Korean script, free of the hindrances present in KS C 5601-based encodings (EUC-KR and such) that made it impossible to use the script to its full potential and 'expressive power'.

Although there are a number of Unicode-capable editors (as opposed to word processors) for the three major platforms these days, I found that most (Windows) users in Korea either don't know about them or have a hard time finding one that fits their needs. I was surprised to find that several popular shareware/freeware/commercial text editors for Korean still offer only two or three encodings for file operations: EUC-KR (WANSUNG), JOHAB and UHC (Windows-949). Nowhere is UTF-8 to be found. It looks as though the authors of those editors still lived in 1995.

Another possible cause is that one of the popular server-side scripting languages, PHP, didn't have good Unicode and multibyte encoding support (including UTF-8 [1]) until recently (version 4.x), and a lot of scripts are still based on PHP 3.x. The same is true of the widely used DBMS MySQL, which didn't support any multibyte encoding in 3.x. However, in the case of MySQL _without_ multibyte encoding support, UTF-8 is actually a better choice than legacy multibyte encodings: with UTF-8 there's no chance of hitting a false match in a DB search where the trailing byte of one character plus the lead byte of the next character happen to match a third, completely unrelated character (a short sketch below illustrates this).

Adding to these is the ignorance of online lecturers and authors of books on web authoring. Most of them 'preach' to their audiences to tag their documents with "EUC-KR" (and sometimes with the totally misleading 'ks_c_5601-1987', which should never have been used as a MIME charset name). Given that there are still a lot of Win 9x/ME users (Win98 seems to be the majority in Korea) and that stock tools like Notepad/Wordpad under the Unicode-savvy Win 2k/XP still give favorable [2] treatment to the legacy encoding corresponding to the default system locale (this can be changed, but not many web developers know that), it's not entirely their fault.

The situation on the Unix/Linux side is similar. Sun and IBM have shipped Solaris and AIX with UTF-8 locales for CJK for several years now (ko_KR.UTF-8 was one of the very first two UTF-8 locales for Solaris 2.6, along with en_US.UTF-8, which I believe is because EUC-KR is inadequate even for modern Korean unless the obscure 8-byte 'extension' in annex 3 of KS X 1001:1998/KS C 5601-1992 is implemented). Yet, surrounded by emails and web pages in legacy encodings, not many people had an incentive to switch. Besides, the adoption of Windows-949 in Korean Windows (which is upward compatible with EUC-KR and can represent the full repertoire of modern Korean syllables) effectively lengthened the life of EUC-KR. In the case of Linux, it was only about a year ago that its UTF-8 support became mature enough that I could tell ordinary users to switch.
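To illustrate that false-match point, here is a minimal Python sketch (again my own illustration; the byte values follow the standard EUC-KR mapping, and the substring test is the kind of raw byte-level match a multibyte-unaware DB search performs):

    # "가나", two 2-byte characters in EUC-KR.
    text = "가나".encode("euc-kr")     # b'\xb0\xa1' + b'\xb3\xaa'

    # Needle: trail byte of the 1st character + lead byte of the 2nd.
    needle = b"\xa1\xb3"
    print(needle in text)              # True: a false byte-level match
    print(needle.decode("euc-kr"))     # a valid but unrelated symbol

    # A byte-level search over UTF-8 cannot misfire this way: lead
    # bytes (0x00-0x7F, 0xC2-0xF4) and continuation bytes (0x80-0xBF)
    # occupy disjoint ranges, so no encoded character can begin in the
    # middle of another one.
    print(needle in "가나".encode("utf-8"))   # False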
It's unfortunate that last fall one of the major vendors of Linux distributions, RedHat, decided to ship RedHat Linux 8 without UTF-8 locales for CJK, while supporting zh_CN.gb18030 (which is just another UTF disguised as a 'legacy' encoding) and switching to UTF-8 for all other locales; see <https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=75829>. I believe RedHat 9 comes with UTF-8 locales for CJK.

To overcome these factors, some aggressive 'campaigning' and education seem to be necessary.

Jungshik

[1] Without multibyte encoding support, legacy encodings for CJK can still be used with PHP 3.x scripts, because in most cases the assumption that one octet corresponds to one column width (sometimes important in designing 'UI' elements) and other 'naive' assumptions (that don't hold for UTF-8) do hold.

[2] Unless a UTF-8 text file begins with a BOM (encoded in UTF-8), Notepad and Wordpad under Win2k/XP assume the encoding of the file to be the default system codepage. That is understandable, but there's no way to force them to use UTF-8 when opening a file (or after opening it), although you can save in UTF-8.
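Footnote [2] suggests an easy workaround when you control the file: put the UTF-8 signature at the front. A minimal Python sketch (my own illustration; the file name is hypothetical, and 'utf-8-sig' is Python's standard codec that writes the EF BB BF signature):

    # Write a UTF-8 file with the BOM/signature EF BB BF so that
    # BOM-sniffing editors such as Notepad pick UTF-8 instead of the
    # default system codepage.
    with open("readme-ko.txt", "w", encoding="utf-8-sig") as f:
        f.write("한글 text\n")

    with open("readme-ko.txt", "rb") as f:
        print(f.read(3))               # b'\xef\xbb\xbf'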
Received on Friday, 4 July 2003 09:31:49 UTC