RE: Why is UTF8 not being taken up in Asia Pacific for Public Websites?

On Mon, 30 Jun 2003, Martin Duerst wrote:

> At 09:48 03/05/17 -0700, Kurosaka, Teruhiko wrote:
>
> >If UTF-8 is used at the web browser level, the mapping between

  I'm not sure what was meant by 'UTF-8 being used at the web browser level'.
Most, if not all, browsers **do** use Unicode (in one form or another) as
their internal character representation. Otherwise, it would be all but
impossible to deal with the bewildering array of legacy encodings out in
the wild.
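As a rough sketch of what that internal-conversion layer does (using Python's codecs purely for illustration; real browsers ship their own converter tables):

```python
# Conceptually, a browser decodes the bytes of a page in its declared
# legacy charset into internal Unicode, and only then renders, searches,
# or re-serializes the text in some other encoding.
page_bytes = '日本語のページ'.encode('euc_jp')  # bytes as served, charset=EUC-JP
internal = page_bytes.decode('euc_jp')          # browser-internal Unicode
as_utf8 = internal.encode('utf-8')              # same text, re-encoded as UTF-8
assert as_utf8.decode('utf-8') == internal
```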

 What internal character representation web browsers use has little,
if anything, to do with UTF-8's acceptance in Japan as a MIME charset
for web publication.


> >the legacy encoding and UTF-8 depends on the browser and/or
> >the OS platform (if browser uses the conversion facility provided
> >by the OS platform).  It is well known that certain characters in
> >Japanese computation map differently to Unicode (thus UTF-8)
> >depending on the OS/language platforms.
> >http://www.ingrid.org/java/i18n/unicode-utf8.html
> >For example, 0x5c in Shift JIS, which is supposed to mean
> >the Japanese currency YEN SIGN but acts like a backslash
> >(0x5c in ASCII),  is treated as though it were
> >a regular backslash, and mapped to Unicode U+005C on
> >Windows

> but then displayed as a Yen sign on a Japanese system :-(.

  This is actually not a feature but a *bug* in the Japanese and Korean
fonts included in MS Windows. The Unicode cmaps in those TrueType fonts
map U+005C (REVERSE SOLIDUS) to the glyph of the YEN (or WON) sign.
This is a clear violation of the Unicode standard because Unicode
and ISO 10646 never endorsed overloading U+005C this way. It's a
reverse solidus, period. To me (and to many Koreans and, I believe,
a lot of Japanese with a Unix and/or TeX background), it's really
annoying to see a YEN/WON sign where a reverse solidus is expected.

 Whether 0x5c in Shift_JIS (or Windows-949) is YEN or REVERSE SOLIDUS
may be debatable, but U+005C doesn't have such an issue at all.
The problem is NOT with Unicode BUT with legacy encodings like
Shift_JIS and Windows-949 (Unified Hangul Code). When converting
your existing documents in a legacy encoding to Unicode-based encodings,
you have to come up with a heuristic (or write an intelligent
program) that can tell one use of 0x5c in Shift_JIS (as a path separator
or escape character) from the other (YEN sign) and map it to either
U+005C or U+00A5 accordingly. Obviously, this can't be 100% automated
and requires human intervention. However, once that's done, there's
no further issue (provided that MS fixes their fonts) because Unicode
does NOT overload U+005C the way legacy encodings overload 0x5c.
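As a minimal sketch of the kind of heuristic described above (the digit-based rule is my own illustrative assumption, not a complete solution; real conversions need human review):

```python
def disambiguate_backslash(text: str) -> str:
    """Post-process a Shift_JIS -> Unicode conversion in which every
    0x5C became U+005C, remapping currency uses to U+00A5 YEN SIGN.

    Naive rule for illustration only: a backslash immediately followed
    by a digit is taken to be a yen amount; anything else (path
    separator, escape character) stays as U+005C.
    """
    out = []
    for i, ch in enumerate(text):
        if ch == '\u005c' and i + 1 < len(text) and text[i + 1].isdigit():
            out.append('\u00a5')   # YEN SIGN
        else:
            out.append(ch)         # keep REVERSE SOLIDUS
    return ''.join(out)

raw = b'C:\\temp\\price.txt \\1000'           # Shift_JIS bytes (ASCII range)
converted = disambiguate_backslash(raw.decode('shift_jis'))
# the path separators stay U+005C; the price becomes U+00A5 '1000'
```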

  It's regrettable that Microsoft keeps this issue "alive" by
shipping broken Japanese/Korean fonts in Japanese, Korean, and
other language versions of Windows 2k/XP, and by NOT fixing the
Japanese/Korean IMEs in such a way that U+005C, U+00A5 YEN SIGN
(U+20A9 WON SIGN), and U+FFE5 FULLWIDTH YEN SIGN (U+FFE6 FULLWIDTH
WON SIGN) can be entered *distinctly*. I showed how this can be
done on the Unicode list several months ago.


> >but it is mapped to U+00A5 (YEN SIGN) on
> >MacOS.

  This mapping is as problematic as Windows' because 0x5c
in Shift_JIS is overloaded, so the mapping should be *context-dependent*.
A Japanese LaTeX user would be very unhappy to find that
every reverse solidus in her paper typeset with LaTeX had been
turned into a yen sign.
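To illustrate that complaint, here is a sketch (in Python, for illustration) of what a blanket 0x5C-to-U+00A5 mapping does to LaTeX source:

```python
# A blanket 0x5C -> U+00A5 mapping, applied without regard to context,
# turns every TeX control sequence into a yen sign.
latex = r'\documentclass{article} \begin{document}'
mangled = latex.translate({0x5c: 0xa5})   # map U+005C to U+00A5 everywhere
# mangled is now '¥documentclass{article} ¥begin{document}',
# which is unusable as TeX input
```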


> >So the characters that the user perceives as the same are
> >handled and stored differently by the application, if we
> >take the approach to let the browser convert to UTF-8.

   As mentioned above, browsers do convert to Unicode internally.
This thread (at least when it started) is not about what browsers
do but about how to make UTF-8 (or other Unicode/UCS transformation
formats) more widely used.


> >Suppose the (half-width) YEN SIGN is entered from Mac OS and
> >stored in the database.  If somebody later views the data on
> >Windows, that data could be displayed as a square (meaning
> >the system cannot display this character).
>
> Browsers on windows systems should be able to display this
> character correctly. After all, this character is part of
> Latin-1, which is what the Web started with.

  Absolutely. I can't think of any version of Windows that does
not have at least one font that covers Latin-1.


> >Has anyone experienced problems like this in reality?  Do
> >popular browsers do code conversion by themselves, or
> >do they use OS facilities?
>
> I think it depends on the browser. Different browsers
> have different strategies of how much of the underlying
> platform they use.

  That's right. In the case of Mozilla,
it has built-in encoding converters (intl/uconv) and uses them
for most tasks, but when it comes to interaction with the local
system (saving to or opening local files; not the content of the
files but the file and path names), it uses facilities provided
by the OS. However, there is a build option that makes
Mozilla rely entirely on iconv(3) on POSIX systems. This option is
for small (Linux-based) embedded devices for which reducing the
memory footprint is very important.


>  From another mail:
>
>  > (2) Some HTML browser do not support UTF-8.  (All popular
>  > browsers for desktop support UTF-8 since a few years ago
>  > but web-phone browser support only a legacy encoding. See
>  > i-mode spec.)
>
> The newest mobile phones in Japan have started to support
> UTF-8.

  That's good to hear. Certainly, embedded devices
are rather slow to adopt Unicode. For instance, I suspect Apple's
iPod (which supports several different scripts) still stores file
names in legacy encodings. The same is true of many other MP3 players
and other embedded devices. They're not solely to blame, though,
because the limitations of ID3v1 are partly responsible for this.

  Back to web browsers: Lynx (a widely used text-mode browser) supports
UTF-8, but its internal converters don't know how to convert between
multibyte legacy encodings and Unicode, so it can't be a
text equivalent of GUI-based browsers for CJK users. Moreover, it
doesn't know that the number of bytes used to represent a single
character in UTF-8 has little to do with the column width of the
character (therefore, line wrapping is all screwed up when viewing
web pages in UTF-8 with a significant number of non-ASCII characters).
On the other hand, w3m-m17n (developed in Japan) is an excellent
text-mode equivalent of GUI-based browsers.
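The bytes-versus-columns distinction that trips Lynx up can be sketched in Python; `display_width` below is a simplified stand-in for wcwidth(3) that assumes only East Asian Wide/Fullwidth characters take two columns (it ignores combining marks and control characters):

```python
import unicodedata

def display_width(s: str) -> int:
    """Terminal column width: East Asian Wide ('W') and Fullwidth ('F')
    characters occupy two columns, everything else one. A rough sketch,
    not a full wcwidth(3) replacement."""
    return sum(2 if unicodedata.east_asian_width(ch) in ('W', 'F') else 1
               for ch in s)

word = '漢字'                       # two kanji
print(len(word.encode('utf-8')))    # 6 bytes in UTF-8...
print(display_width(word))          # ...but only 4 terminal columns
```

A line-wrapping routine that counts UTF-8 bytes (or even code points) instead of columns will break lines in the wrong places for CJK text, which is exactly the Lynx symptom described above.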


  Jungshik

Received on Monday, 30 June 2003 22:16:34 UTC