Re: Taiwan and Hong Kong: Big5 HKSCS vs UAO analysis and conclusions

On Sat, 21 Apr 2012 16:26:07 +0200, Yuan Chao <yuanchao@gmail.com> wrote:

> On Sat, Apr 21, 2012 at 7:21 PM, Philip Jägenstedt <philipj@opera.com>  
> wrote:
>
>>>>>>  • 298 pages with mixed/broken encodings
>>>>>>  • 190 pages that would yield U+FFFD with HKSCS, but instead
>>>>>> produce bogus Chinese characters using UAO, some of them
>>>>>> user-visible:
>>> Philip, which OS are you using? To me, they are all visible as
>>> squares with the code ID inside with HKSCS under Ubuntu!
>
>> Ubuntu. For testing user-visibility I was using Firefox's Big5. The best
>> existing approximation of the suggested Big5-HKSCS is Opera's, it's
>> possible
> Unlike ISO-2022-JP, which has a very clear state definition, Big5 has
> no error handling at all. (Just recall that Kenny was asking about
> this about a year ago on this ML.) A visible character is much more
> useful than a fullwidth space, which just hides things away.

<http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html#big5> defines  
the error handling. However, it can probably be improved, see  
<https://www.w3.org/Bugs/Public/show_bug.cgi?id=16771>.

How U+FFFD is rendered appears to be a font issue. I presume you don't
mean that random incorrect characters are preferable.
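To make "would yield U+FFFD" concrete, here is a minimal sketch using
CPython's built-in "big5" and "big5hkscs" codecs. These codecs are only a
stand-in: they are not the decoder from the Encoding spec draft and do not
match any browser's tables. The byte pair 0x84 0xB3 is the one discussed
further down; the assumption is that neither codec maps it.

```python
# 0x84 0xB3 is the byte pair discussed in this thread. It is used here only
# as sample input that CPython's codecs are assumed not to map.
SAMPLE = b"\x84\xb3"


def decode_lenient(data: bytes, codec: str) -> str:
    """Decode, substituting U+FFFD for anything the codec cannot map."""
    return data.decode(codec, errors="replace")


def decodes_cleanly(data: bytes, codec: str) -> bool:
    """Return True if a strict decode succeeds without errors."""
    try:
        data.decode(codec)
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    for codec in ("big5", "big5hkscs"):
        text = decode_lenient(SAMPLE, codec)
        code_points = [f"U+{ord(ch):04X}" for ch in text]
        print(codec, decodes_cleanly(SAMPLE, codec), code_points)
```

How many U+FFFD characters come out per bad sequence depends on the
decoder, which is exactly the sort of detail the error handling needs to
pin down.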

>>>> On Wed, 18 Apr 2012 22:05:22 +0200, Kang-Hao (Kenny) Lu
>
>>>>> To offer a direction for some archaeology: some of the encodings look
>>>>> like big5-2003[1]... 囧
>>>>> 6. http://domestic.mytour.com.tw/list.asp?id=721
>>>>> hkscs: 不捨結束此行精采假期、踏上歸途<U+FFFD �>視情況休息<br>18:30~
>>>>> uao:   不捨結束此行精采假期、踏上歸途<U+8FF3 迳>視情況休息<br>18:30~
>>>>>
>>>>> In big5-2003, 84B3 is U+F0E0 (PUA); on Windows it looks like U+2192 (→
>>>>> RIGHTWARDS ARROW), but the two glyphs are not the same.
>>>>
>>>> Possibly, but <U+3001 IDEOGRAPHIC COMMA 、> or <U+FF0C FULLWIDTH COMMA ,>
>>>> seems better.
>>> I would lean toward "→" here. (As additional info, we don't use a
>>> comma as parentheses.)
>
>> It's mostly <http://www.wintan.com.tw/service_06_08.htm> that made me
>> think that this must be 、 but maybe → can make sense there as well?
> Oh. For this example, it's even more obvious that "→" makes sense.
> It tells you to look in to the menu bar for [證券帳務] menu item and
> *then* click on [庫存查詢] sub-menu. A "、" makes no sense at all!

In  
<http://lists.w3.org/Archives/Public/public-html-ig-zh/2012Apr/0044.html>  
you said that "、" was very likely, but if you're sure it should be "→"  
then it looks like all 84B3 might be the same, which seems a lot saner.
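If all 84B3 do turn out to mean the same thing, the fix amounts to one
extra table entry. A rough sketch of that idea, assuming (not decided!)
that 84B3 should become U+2192: OVERRIDES and decode_with_overrides are
made-up names, the base table is simply CPython's "big5" codec, and the
byte walker treats every byte >= 0x80 as a lead byte, which is a
simplification and not the spec's algorithm.

```python
# Hypothetical override table: 0x84B3 -> U+2192 RIGHTWARDS ARROW.
# This mapping is the one being debated here, not an agreed result.
OVERRIDES = {0x84B3: "\u2192"}


def decode_with_overrides(data: bytes, codec: str = "big5") -> str:
    """Decode byte pairs with `codec`, consulting OVERRIDES first."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            # ASCII bytes pass through unchanged.
            out.append(chr(lead))
            i += 1
            continue
        if i + 1 >= len(data):
            # Truncated two-byte sequence at the end of the input.
            out.append("\ufffd")
            break
        pair = (lead << 8) | data[i + 1]
        if pair in OVERRIDES:
            out.append(OVERRIDES[pair])
        else:
            # Fall back to the base codec; unmapped pairs become U+FFFD.
            out.append(data[i:i + 2].decode(codec, errors="replace"))
        i += 2
    return "".join(out)


if __name__ == "__main__":
    sample = "中".encode("big5") + b"\x84\xb3" + "文".encode("big5")
    print(decode_with_overrides(sample))
```

The open question is of course what goes into the table, not how to apply
it.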

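>>>>> I don't know what to say any more. It feels like adding some of the
>>>>> big5-2003 stuff back into Big5-UAO would solve a large part of this.
>>>>> Also, none of the characters above are Japanese kanji, so this also
>>>>> doesn't affect what I ask of Big5-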
> I tend to agree with Kenny's view here.

One of you will have to explain exactly what should be done: how should
Firefox's mappings be modified to make better sense?

>>>>> UAO :p. Does anyone know whether this part of the code mapping is
>>>>> still within the range that can be operated on, or not?
>
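>>>> Judging from the above, using Big5-2003 is not perfect either. MozTW's
>>>> mapping doesn't seem entirely reliable, so I don't know what a
>>>> definition of Big5-UAO should be based on.
>>>>
>>>> After all, the scope of the problem is a few characters on 0.043% of
>>>> Taiwanese web pages. Among modern browsers only Firefox can display
>>>> them, and its mapping causes other problems as well...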
> Unfortunately it causes some problems for non-native Chinese readers. :)

Certainly it's a problem for all readers of Chinese that random characters  
show up where they don't belong?

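>>>> In this situation, I think trying to contact the affected sites still
>>>> has some hope. In any case, it is the only way for users in Hong Kong
>>>> and internationally to see them too.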
>>>
>>> I don't know... to me the original thought of big5-hkscs doesn't seem
>>> to dominate, and it looks like big5-uao is not dominant either
>>> according to "bing".
>
>> I'm not sure what you mean here. HKSCS is dominant in Hong Kong and UAO
>> is dominant in Taiwan. The difference is just that using HKSCS fixes a
>> lot more Hong Kong sites than UAO fixes
> I look up the market share on browser in HK: it's ~50% for IE, ~23%
> for Chrome, ~18% for Firefox (even much higher than Taiwan) and ~10%
> for Safari.
>
> Maybe this is the reason:
> http://productforums.google.com/forum/#!category-topic/chrome/discuss-chrome/m-rZuk5iAR4
> A simple option for the browser to "not switch to big5 if big5hkscs is
> selected" will do and it's a tunable option for Firefox. Maybe Opera
> can implement this to gain some share? ;) (Opera's share is ~0.4% in
> Taiwan and invisible in HK. ) This would make much more sense to me
> than eliminating other big5 variants than big5hkscs.
>
> http://gs.statcounter.com/#browser-TW-weekly-201201-201216-bar
> http://gs.statcounter.com/#browser-HK-weekly-201201-201216-bar

We will of course implement whatever the spec eventually says, and my
objective here is to make the spec mappings work well for real-world
content. In Opera, Big5 is currently a subset of Big5-HKSCS, so adopting
the suggested spec mapping can only improve things for us. Only Firefox
has separate mappings in the sense that would make sense for the spec;
both IE and Chrome use PUA mappings instead.

>> If there are popular/important sites using Big5-UAO, I would really
>> recommend asking them to either escape conflicting code points as
>> &#1234; or to use UTF-8. That would fix the problem for all browsers
>> immediately, instead of fixing it only for Taiwan-locale browsers in a
>> few years.
> I don't think either is practical.
> The same argument applies to HK sites too, right? Why don't HK sites
> move to UTF-8 as their government asks them to? Sites with the
> ability and vision have moved to UTF-8 already, while the remaining
> ones either have difficulties or simply don't want to do this.
> Popular/important sites would be even less willing, as they already
> have so many users and so much legacy content.

I certainly think it's worth trying. Kenny told me that ptt.cc has a UTF-8
proxy of some sort, so they already have part of the infrastructure in
place. Of course not all sites can be fixed, but given that Big5-UAO
currently only works in Firefox or on a patched Windows XP, I really doubt
anyone *wants* to keep using it.
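The escaping suggestion quoted above could be as small as this on the
site's side. It is only a sketch: escape_conflicting is a made-up helper,
and which characters count as "conflicting" is something each site would
have to work out from the mapping differences; the set passed in below is
purely illustrative.

```python
def escape_conflicting(text: str, conflicting: set) -> str:
    """Replace each character in `conflicting` with a decimal numeric
    character reference (&#NNNN;) and leave everything else untouched."""
    return "".join(
        f"&#{ord(ch)};" if ch in conflicting else ch
        for ch in text
    )


if __name__ == "__main__":
    # Illustrative only: treat the arrow as a character that has no safe
    # Big5 byte sequence across browsers.
    print(escape_conflicting("A\u2192B", {"\u2192"}))
```

The escaped output is plain ASCII for those characters, so it no longer
depends on which Big5 variant the browser picks.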

The situation for Hong Kong is quite different, since the numbers favor
treating Big5 as Big5-HKSCS by default. Remember, 1.4% of .hk Big5
pages depend on HKSCS, while only 0.13% of .tw sites depend on UAO.

If a way can be found to fix sites using Big5-UAO without modification  
that would be great, but it's not looking very promising.

> Unlike HK, the Taiwan government has very loose control here, and the
> big5 (de facto) standard is the best example. The choice in Firefox is
> based on community discussion, which would be different from other
> browsers' business models. I personally don't stand for either. I am
> just expressing my personal needs and trying to help by pointing out
> some directions.

That is much appreciated!

> To me, big5hkscs is a standard abandoned by the HK government. Leaving
> the difficulties and conflicts in big5/big5hkscs would be on purpose,
> to make people switch to Unicode. (I guess.) The best solution to
> the original problem you raised would be to implement an option of "not
> switch to big5 if big5hkscs is selected, even if the site declares
> 'big5' encoding".

What should the Big5 mapping be? If it is like the conservative Big5 that  
Opera currently supports, that really won't help Taiwan sites and users at  
all. What Firefox does is also not that great, so it would have to be a  
new mapping that no browser has ever supported so far.

-- 
Philip Jägenstedt
Core Developer
Opera Software

Received on Saturday, 21 April 2012 17:08:03 UTC