Re: Taiwan and Hong Kong: Big5 HKSCS vs UAO analysis and conclusions

On Sat, 21 Apr 2012 03:49:42 +0200, Yuan Chao <yuanchao@gmail.com> wrote:

> 2012/4/21 Philip Jägenstedt <philipj@opera.com>:
>>> (12/04/19 1:01), Philip Jägenstedt wrote:
>
>>>>  • 298 pages with mixed/broken encodings
>>>>   • 190 pages that would yield U+FFFD with HKSCS, but instead produce
>>>> bogus Chinese characters using UAO, some of them user-visible:
> Philip, which OS are you using? To me, they are all visible as squares
> with the code ID in them with HKSCS under Ubuntu!

Ubuntu. For testing user-visibility I was using Firefox's Big5. The best
existing approximation of the suggested Big5-HKSCS is Opera's, so it's
possible that other browsers map these byte sequences to something other than
U+FFFD. I've also noticed that U+FFFD is rendered differently in different
fonts. In particular, some Chinese fonts just render it as a fullwidth
space, at least in Opera.

>> On Wed, 18 Apr 2012 22:05:22 +0200, Kang-Hao (Kenny) Lu
>>> A pointer for some archaeology: some of these look like they are encoded
>>> as big5-2003[1]... 囧
>
>>> 6. http://domestic.mytour.com.tw/list.asp?id=721

>>> hkscs: 不捨結束此行精采假期、踏上歸途<U+FFFD �>視情況休息<br>18:30~
>>> uao:   不捨結束此行精采假期、踏上歸途<U+8FF3 迳>視情況休息<br>18:30~
>>>
>>> 84B3 is U+F0E0 (PUA) in big5-2003, which on Windows looks like U+2192
>>> (→ RIGHTWARDS ARROW), although the two glyphs are not the same.
>> Possibly, but <U+3001 IDEOGRAPHIC COMMA 、> or <U+FF0C FULLWIDTH COMMA
>> ,> seems like a better fit.
> I would lean toward "→" here. (As additional info, we don't use commas as
> parentheses.)

It's mostly <http://www.wintan.com.tw/service_06_08.htm> that made me  
think that this must be 、 but maybe → can make sense there as well?

>>>>     • http://www.goprint.com.tw/draw.asp (妇)
>>> 「妇」 83FC → U+F08C (PUA) → U+2776 (❶ DINGBAT NEGATIVE CIRCLED DIGIT
>>> ONE)
>> Not a great fit; it is probably mojibake.
> Actually Google says it's correct!
> https://www.google.com/search?ie=UTF-8&oe=UTF-8&q="繪圖軟體教學"  
> "photoshop基礎教學"
>
> =>繪圖軟體教學—Photoshop實作教學(一)

Huh, OK.
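
The two PUA readings quoted above (84B3 → U+F0E0 → U+2192 and 83FC →
U+F08C → U+2776) can be written down as a small lookup, sketched here in
Python. The table holds only those two pairs from this thread; a complete
table would have to come from the big5-2003 data itself, and the name
replace_known_pua is hypothetical.

```python
import unicodedata

# Only the two mappings mentioned in this thread (assumption: they are
# read correctly from big5-2003; nothing else is filled in).
PUA_TO_STANDARD = {
    "\uf0e0": "\u2192",  # Big5 84B3 -> U+F0E0 (PUA) -> RIGHTWARDS ARROW
    "\uf08c": "\u2776",  # Big5 83FC -> U+F08C (PUA) -> DINGBAT NEGATIVE CIRCLED DIGIT ONE
}

_TABLE = str.maketrans(PUA_TO_STANDARD)


def replace_known_pua(text: str) -> str:
    """Swap the known PUA code points for their standard equivalents."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    fixed = replace_known_pua("踏上歸途\uf0e0視情況休息")
    print(fixed)
    print(unicodedata.name(replace_known_pua("\uf08c")))
```

Any PUA code point that is not in the table passes through unchanged, so
unknown cases stay visible instead of being silently guessed.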

All in all, it seems like most of the examples are redundant bullet
points or typos, and never very important. U+FFFD rendered as a fullwidth
space seems like it wouldn't really lose any important information.

>>>> Using Big5-UAO for Taiwanese sites would give mixed results. Correctly
>>>> encoded Big5-UAO is very rare, so the tested mapping (Firefox)
>>>> introduces almost as many user-visible misencodings as it fixes and
>>>> masks many others.
>>>
>>>
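>>> I don't know what to say anymore. It feels like adding some of the
>>> big5-2003 material back into Big5-UAO would solve a large part of this.
>>> Also, none of the characters above are Japanese kanji, so this does not
>>> affect my requirements for Big5-UAO :p. Does anyone know whether this
>>> part of the mapping is still open to surgery, or not?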
>>
>>
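>> Going by the above, using Big5-2003 is not perfect either. The MozTW
>> mapping does not seem to be entirely reliable, so I don't know what a
>> definition of Big5-UAO should be based on.
>>
>> After all, the scope of the problem is a few characters on 0.043% of
>> Taiwanese web pages. Among modern browsers only Firefox can display them,
>> and its mapping causes other problems as well...
>>
>> Under these circumstances, I think trying to contact the affected sites
>> still has a chance of working. In any case, it is the only way to make
>> the content readable for Hong Kong and international users too.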
> I don't know... to me the original thought of big5-hkscs doesn't seem to
> dominate, and it looks like big5-uao is not dominant either according to
> "bing".

I'm not sure what you mean here. HKSCS is dominant in Hong Kong and UAO is
dominant in Taiwan. The difference is just that using HKSCS fixes a lot
more Hong Kong sites than UAO fixes Taiwanese sites.

> (I just realized that our "frequently visited sites" with big5-uao are not
> under "*.tw". Some of my treasured sites can only be found in the
> Internet Archive now.) To my surprise, quite a lot of cases can be
> explained with the big5-2003 PUA, though (probably to Kenny's surprise
> too). At least HK friends can live with a hack in Firefox to force
> big5-hkscs=big5; IE is OK if the official patch is installed (the font
> with the extended glyphs is needed up to Windows XP). I'm curious about
> the browser share in HK?

If there are popular/important sites using Big5-UAO, I would really  
recommend asking them to either escape conflicting code points as &#1234;  
or to use UTF-8. That would fix the problem for all browsers immediately,  
instead of fixing it only for Taiwan-locale browsers in a few years.
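
As a rough illustration of both options, here is a Python sketch. It
assumes the page text is already available as Unicode (Python has no
Big5-UAO decoder, so getting there is out of scope), and it treats "not
encodable in Python's plain big5 codec" as a stand-in for "conflicting",
which is only an approximation. The function names are made up.

```python
def escape_for_big5(text: str) -> bytes:
    """Encode as plain Big5, writing everything else as &#NNNN; references.

    The "xmlcharrefreplace" error handler is part of Python's standard
    codec machinery and emits decimal numeric character references.
    """
    return text.encode("big5", errors="xmlcharrefreplace")


def to_utf8(text: str) -> bytes:
    """The alternative: serve the page as UTF-8, no escaping needed."""
    return text.encode("utf-8")


if __name__ == "__main__":
    # "ä" is not part of Big5, so it becomes a numeric character reference.
    print(escape_for_big5("Jägenstedt"))
    print(to_utf8("Jägenstedt"))
```

Either output is decoded the same way by any browser, whichever Big5
variant it happens to implement.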

I have no idea about Hong Kong browser market share, but it's a safe bet  
that IE has been the market leader for a long time and probably still is.



-- 
Philip Jägenstedt
Core Developer
Opera Software

Received on Saturday, 21 April 2012 11:22:27 UTC