
Re: Taiwan and Hong Kong: Big5 HKSCS vs UAO analysis and conclusions

From: Philip Jägenstedt <philipj@opera.com>
Date: Sun, 22 Apr 2012 11:05:33 +0200
To: "Yuan Chao" <yuanchao@gmail.com>
Cc: "Kang-Hao (Kenny) Lu" <kennyluck@csail.mit.edu>, "Chinese HTML Interest Group" <public-html-ig-zh@w3.org>
Message-ID: <op.wc5svjstsr6mfa@localhost.localdomain>
On Sun, 22 Apr 2012 02:23:12 +0200, Yuan Chao <yuanchao@gmail.com> wrote:

> On Sun, Apr 22, 2012 at 1:07 AM, Philip Jägenstedt <philipj@opera.com> wrote:
>>> Unlike ISO-2022-JP, which has a very clearly defined set of states, Big5
>>> has no error handling at all. (Just recall that Kenny was asking about
>>> this about a year ago on this ML.) A visible character is much more
>>> useful than a fullwidth space, which just hides things away.
>> <http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html#big5>
>> defines the error handling. However, it can probably be improved, see
>> <https://www.w3.org/Bugs/Public/show_bug.cgi?id=16771>.
> I wonder where this definition comes from?

Anne specified something similar to other multi-byte encodings, I think.  
One main goal is to not consume following ASCII characters after an error,  
but as you can see the current solution can sometimes instead break a  
Chinese character following an error.
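
To make the trade-off concrete, here is a minimal Python sketch of one
possible recovery policy. It is not the algorithm from the spec draft: the
function name decode_big5_sketch and the two-entry TOY_TABLE are made up for
illustration, and a real decoder would use the full Big5 index.

```python
REPLACEMENT = "\ufffd"

# Toy subset of a Big5 table, for illustration only (assumption: a real
# decoder would load the complete index instead).
TOY_TABLE = {
    (0xA4, 0x40): "\u4e00",  # 一
    (0xA4, 0x41): "\u4e59",  # 乙
}


def decode_big5_sketch(data: bytes, table=TOY_TABLE) -> str:
    """Decode Big5-like bytes; never swallow an ASCII byte after an error."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            # Plain ASCII passes through unchanged.
            out.append(chr(lead))
            i += 1
            continue
        if not (0x81 <= lead <= 0xFE) or i + 1 >= len(data):
            # Not a valid lead byte, or the input is truncated.
            out.append(REPLACEMENT)
            i += 1
            continue
        trail = data[i + 1]
        ch = table.get((lead, trail))
        if ch is not None:
            out.append(ch)
            i += 2
            continue
        # Unmapped pair: emit U+FFFD. An ASCII trail byte is left in the
        # stream so that markup such as "<" survives; a non-ASCII trail
        # byte is consumed together with the lead byte.
        out.append(REPLACEMENT)
        i += 1 if trail < 0x80 else 2
    return "".join(out)
```

With this policy, decode_big5_sketch(b"\x84<br>") keeps the "<br>" intact,
while decode_big5_sketch(b"\x84\xa4\x40") consumes the A4 lead byte of the
following character and leaves a stray "@", which is the kind of breakage
described above.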

> http://lists.w3.org/Archives/Public/public-html-ig-zh/2011Aug/0052.html

> I didn't see any reply to Kenny's request since.

I didn't see that at the time, but it seems like the new spec should
address this. I suggest discussing the error handling of Big5 in the bug I
filed; input from someone with more experience would be helpful.

> People who have been using Big5 since the DOS era should be used to
> the garbled characters caused by conflicts with (ext.) ASCII control
> codes and tables. This is the "feature" of Big5. hahaha... It is also a
> good "error message".
>> How U+FFFD is rendered appears to be a font issue, I presume you don't
>> mean that random incorrect characters are preferable.
> The current solution seems to treat all PUA as errors. I don't prefer that.

The mapping in the spec doesn't use any PUA code points, are you  
suggesting that it should?
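
For reference, a small Python sketch of what "uses PUA code points" means in
practice. The Private Use ranges are the ones defined by Unicode; the
candidate values for the bytes 84 B3 are simply the ones quoted further down
in this thread, and the names is_private_use, CANDIDATES_84B3 and
pua_variants are hypothetical.

```python
def is_private_use(cp: int) -> bool:
    """Return True if cp lies in one of Unicode's Private Use ranges."""
    return (
        0xE000 <= cp <= 0xF8FF          # BMP Private Use Area
        or 0xF0000 <= cp <= 0xFFFFD     # Supplementary Private Use Area-A
        or 0x100000 <= cp <= 0x10FFFD   # Supplementary Private Use Area-B
    )


# Mappings for Big5 bytes 84 B3 as reported in this thread.
# None means the variant has no mapping (decodes to U+FFFD).
CANDIDATES_84B3 = {
    "big5-hkscs": None,
    "big5-uao": 0x8FF3,
    "big5-2003": 0xF0E0,
}


def pua_variants(candidates: dict) -> list:
    """List the variants whose mapping lands in a Private Use range."""
    return sorted(
        name
        for name, cp in candidates.items()
        if cp is not None and is_private_use(cp)
    )
```

Calling pua_variants(CANDIDATES_84B3) singles out big5-2003, since U+F0E0 is
in the BMP Private Use Area while U+8FF3 and U+2192 are not.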

>>>>>> On Wed, 18 Apr 2012 22:05:22 +0200, Kang-Hao (Kenny) Lu
>>>>>>> Here is a bit of archaeology to go on: some of the encoding looks like big5-2003[1]... sigh
>>>>>>> 6. http://domestic.mytour.com.tw/list.asp?id=721

>>>>>>> hkscs: 不捨結束此行精采假期、踏上歸途<U+FFFD �>視情況休息<br>18:30~
>>>>>>> uao:   不捨結束此行精采假期、踏上歸途<U+8FF3 迳>視情況休息<br>18:30~
>>>>>>> In big5-2003, 84B3 is U+F0E0 (PUA); on Windows it looks like U+2192 (→
>>>>>>> RIGHTWARDS ARROW), but the two glyphs are not the same.
>>>>>> That's possible, but <U+3001 IDEOGRAPHIC COMMA 、> or <U+FF0C FULLWIDTH
>>>>>> COMMA ,> seems better.
>>>>> I would lean toward "→" here. (As additional info, we don't use commas
>>>>> as parentheses.)
>>>> It's mostly <http://www.wintan.com.tw/service_06_08.htm> that made me
>>> Oh. For this example, it's even more obvious that "→" makes sense.
>>> It tells you to look in to the menu bar for [證券帳務] menu item and
>>> *then* click on [庫存查詢] sub-menu. A "、" makes no sense at all!
>> In
>> <http://lists.w3.org/Archives/Public/public-html-ig-zh/2012Apr/0044.html>
>> you said that "、" was very likely, but if you're sure it should be "→"
>> then it looks like all 84B3 might be the same, which seems a lot saner.
> That was before Kenny's "interpretation". Don't you agree "→" makes more
> sense here? As I said, I'm neutral and support whatever is best.

Yes, you are probably right, I didn't actually read the content when  

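>>>>>>> I don't know what to say. It feels like adding some of the big5-2003
>>>>>>> mappings back into Big5-UAO would solve a large part of this. Also,
>>>>>>> none of the characters above are Japanese kanji, so this doesn't
>>>>>>> affect my requirements for Big5-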
>>> I tend to agree with Kenny's view here.
>> One of you will have to explain exactly what should be done, how should
>> Firefox's mappings be modified to make better sense?
> I think you understand how the community does things. We can try to bring
> this up and call for people's help.

Do you know of other places than this list where it would be helpful to  
ask about these issues?

>>>>>>> UAO :p. Does anyone know whether this part of the mapping is
>>>>>>> something we can still operate on, or not?
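>>>>>> Based on the above, using Big5-2003 is not perfect. The MozTW mapping
>>>>>> doesn't seem entirely reliable, so I don't know what to base a
>>>>>> definition of Big5-UAO on.
>>>>>> After all, the scope of the problem is a few characters on 0.043% of
>>>>>> Taiwanese web pages. Among modern browsers only Firefox can display
>>>>>> them, and its mapping causes other problems as well...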
>>> Unfortunately it causes some problems for non-native Chinese readers. :)
>> Certainly it's a problem for all readers of Chinese that random
>> characters show up where they don't belong?
> Emm... So you think the current Firefox solution is not perfect and
> the needs in Taiwan are negligible, so it's better to use big5-hkscs to
> replace big5 (which seems to be CP950?)? I'm an experimental high energy
> physicist. The best way to resolve a debate and validate a theory is
> to do an experiment and measure it. :) Maybe you can just implement it in
> Opera and run a survey to see how both HK and Taiwan users appreciate
> it?

It will definitely be an improvement for Opera since HKSCS will start  
working and UAO has never worked, but if there's something even better we  
could do I'd really prefer that. A better test would be to see the  
reactions if Firefox changed, but that's not an experiment I can run :)

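>>>>>> In this situation, I think trying to contact the affected sites is
>>>>>> still promising. In any case, it is the only way for Hong Kong and
>>>>>> international users to be able to see the content too.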
> Still as mentioned, HK users overwrite "big5-hkscs" as "big5". It's
> their government's choice to create the inconvenience to "encourage"
> people to move to unicode.
> http://my.opera.com/community/forums/topic.dml?id=191245

> It took quite a long time for Yahoo! Taiwan to move to Unicode. Pushing
> big5-hkscs to replace big5 in W3C would have a profound effect. I only
> ask for not breaking my current usage. Though I'd be happy to help
> bring the major variants of big5 to W3C. (There's very little info here:
> http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html#big5)

Which variants do you think should be specified and what should trigger  
them? Am I correct to assume that Firefox is the only current browser that  
*doesn't* break your current usage?

Philip Jägenstedt
Core Developer
Opera Software
Received on Sunday, 22 April 2012 09:06:15 UTC
