Re: Taiwan and Hong Kong Big5 HKSCS vs UAO analysis and conclusions

From: Kang-Hao (Kenny) Lu <kennyluck@csail.mit.edu>
Date: Mon, 23 Apr 2012 02:43:47 +0800
Message-ID: <4F945163.5000606@csail.mit.edu>
To: W3C HTML5 中文興趣小組 <public-html-ig-zh@w3.org>
CC: Ambrose LI <ambrose.li@gmail.com>, Philip Jägenstedt <philipj@opera.com>, Yuan Chao <yuanchao@gmail.com>
(12/04/22 4:26), Ambrose LI wrote:
> Personally speaking, I’d say that Big5 has always been a mess and it
> is still a mess, and the only sane way to solve this problem is to
> expose the underlying variants of Big5 in the encoding selection menu.
> Even if some sort of statistical AI technique were used there will
> still be occasions where what the machine chooses will be wrong. Just
> let the user choose if something doesn’t work.

Yeah, that's a must, but we still need to decide on a default behavior :(

(12/04/22 8:23), Yuan Chao wrote:
> http://lists.w3.org/Archives/Public/public-html-ig-zh/2011Aug/0052.html
> I didn't see any reply to Kenny's request since.

Well, I raised this issue at the time when Anne van Kesteren from Opera
started the Encoding Standard work, but no one on this list (myself
included, so no blame on anyone) expressed interest in it.

(12/04/22 1:07), Philip Jägenstedt wrote:
> On Sat, 21 Apr 2012 16:26:07 +0200, Yuan Chao <yuanchao@gmail.com> wrote:
>> A visible character is very useful instead of a fullwidth space,
>> which just hides things away.
> How U+FFFD is rendered appears to be a font issue, I presume you don't
> mean that random incorrect characters is preferable.

Unlike Yuan, my opinion is

successfully decoded characters >> U+FFFD rendered as fullwidth space >
U+FFFD rendered as some mysterious glyph > mojibake

Even if "how U+FFFD is rendered" is a font issue, if I want to make an
assessment of whether big5-uao or big5-hkscs is preferable for .tw
content by weighing the statistics (i.e. giving minus scores to the
190 sites that would yield U+FFFD with HKSCS), U+FFFD rendered as a
fullwidth space might actually make me think big5-hkscs is better than
unaltered big5-uao for .tw content.
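
To make the "weighing" above a bit more concrete, here is a minimal
sketch of how one might count U+FFFD occurrences for the same bytes
under two decoders. It is only an illustration, not what Philip's crawl
actually did. It assumes Python's built-in "big5" and "big5hkscs"
codecs as stand-ins (Python ships no big5-uao codec, so a UAO table
would have to be supplied separately), and the function names are
made up for this example. Note that it can only count U+FFFD; mojibake
decodes "successfully" and cannot be detected this way.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def count_replacement_chars(data: bytes, encoding: str) -> int:
    """Decode data with the given codec and count U+FFFD in the result."""
    # errors="replace" substitutes U+FFFD for undecodable byte sequences
    text = data.decode(encoding, errors="replace")
    return text.count(REPLACEMENT)


def compare_codecs(data: bytes, encodings=("big5", "big5hkscs")) -> dict:
    """Return a mapping of codec name to U+FFFD count for the same bytes."""
    return {enc: count_replacement_chars(data, enc) for enc in encodings}


if __name__ == "__main__":
    # Plain Big5 text that both codecs should decode without U+FFFD
    sample = "中文".encode("big5")
    print(compare_codecs(sample))
```

A per-site score could then be built from these counts, but how heavily
to weigh U+FFFD against mojibake is exactly the judgment call discussed
above.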

(12/04/22 8:23), Yuan Chao wrote:
> I'm an experimental high energy physicist. The best way to resolve a
> debate and validate a theory is to do an experiment and measure it. :)

To be fair, I think what Philip has done (crawling .hk and .tw) is
already a fine experiment, although I have limited confidence in the
conclusion, which was

(12/04/19 1:01), Philip Jägenstedt wrote:
> Using Big5-UAO for Taiwanese sites would give mixed results. Correctly
> encoded Big5-UAO is very rare, so the tested mapping (Firefox)
> introduces almost as many user-visible misencodings as it fixes and
> masks many others.

partly because some of the sites that this fixes are in Japanese, and I
am biased towards them. :p

(12/04/21 22:26), Yuan Chao wrote:
> I look up the market share on browser in HK: it's ~50% for IE, ~23%
> for Chrome, ~18% for Firefox (even much higher than Taiwan) and ~10%
> for Safari.
> Maybe this is the reason:

I think the discussion in this link gives us good confidence that a
zh-HK browser should do big5-hkscs by default when handling big5.

(12/04/21 5:32), Philip Jägenstedt wrote:
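> In this situation, I think there is still hope in trying to contact
> the affected sites.

We have contacted ptt.cc and so far there has been no response; as for
the other, smaller sites...

I will contact the PTT people again, but don't ask me about the
others :p

> In any case, this is the only way for users in Hong Kong and
> internationally to be able to see it too.

For now, Hong Kong and international users can also use Firefox.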

(12/04/21 19:55), Philip Jägenstedt wrote:
> Perhaps there does exist a non-HKSCS mapping that would work better
> than Firefox's Big5 for Taiwan sites, but I'm really not sure how to
> define it. One would probably need to figure out which OS and fonts
> were used to produce the content and base the mapping on that.

Yeah, I have no idea where that big5-2003 content comes from. Is there
a Linux distribution that interprets big5 the way the big5-2003
standard says? Or is there a UAO version that's closer to big5-2003?

> Still, Big5 should certainly be a synonym for Big5-HKSCS by default,
> so that other mapping would have to be either locale-dependent (i.e.
> not work for me) or depend on sniffing.

Well, for non-zh-HK users I think this is still debatable. I don't read
Cantonese but I do read Japanese, so this is like a 0 (big5-hkscs) vs. 8
(big5-uao) for me, based on your data. I think there might be more
people like me in the world than the contrary (this also depends on how
much people are annoyed by mojibake as compared to U+FFFD), but I am not

(12/04/22 1:07), Philip Jägenstedt wrote:
> We will of course implement whatever the spec eventually says, and my
> objective here is to make the spec mappings work well for real-world
> content.

I have no objection if the spec says big5 should be treated as
big5-hkscs, but even if the spec does say so, and even if I really
thought big5-hkscs was better, I still don't think I would be able to
convince the Mozilla Taiwan Community that this is where we should go.
(I should note again that the default encoding of zh-TW Firefox is
UTF-8, not big5, which is what the HTML spec says and what other
browsers do, and this certainly has a bigger impact on market share.)

Received on Sunday, 22 April 2012 18:44:17 UTC