Re: Help: questions about Big5 and Big5-HKSCS

On Fri, 13 Apr 2012 11:56:39 +0200, Kang-Hao (Kenny) Lu  
<kennyluck@csail.mit.edu> wrote:

> (12/04/12 17:09), Yuan Chao wrote:

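> There have been two questions here all along that are closely related
> but not directly tied to each other:
>
> 1. What should a Taiwanese-locale browser (zh-TW) actually do when it
> encounters <meta charset="big5">?
>
> A. Keep the status quo (CP950?)
>
> B. Decode as 'big5-uao' (Firefox)
>
> C. Use 'big5-hkscs'
>
> ... being the options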
>
>
> 2. Which decoding mapping lets Taiwanese users see the most content
> correctly?
>
>
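> Whatever else, I think question 2 is a fairly scientific, archaeological
> question, and I think using the answer to question 2 for question 1
> should be good. For example, I think <meta charset="big5"> should at
> least decode the intersection of 'big5-uao' and 'big5-hkscs', which
> includes at least hiragana and katakana.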

Thanks Kenny, you summarized it very well. I also think question 2 is the
most crucial one, so I have done another round of research...

In English, since the methods used will also be of interest to Anne van
Kesteren and possibly others.

My goal was to find a large and representative sample of Big5 usage in
Taiwan. Alexa's top million sites [1] lists 2951 .tw sites. Running
"site:example.com.tw" searches for each of them through the Bing API [2]
generated a list of ~120k URLs.[3] ~116k of those were successfully
fetched using a Python script.[4] Another script [5] identified ~38k of
them as labeled Big5 and decoded them using the spec algorithm to collect
statistics. A final script [6] kept the ~36k pages with low error
rates, excluding misencoded pages; that is as close to a random sample of
Taiwanese Big5 pages as I can get.
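
As a rough illustration of the error-rate filter (this is not the actual
tw-analyze.py), here is a minimal sketch. It assumes Python's built-in
"big5hkscs" codec as a stand-in for the spec algorithm, and the names
error_rate, keep_page and the 1% threshold are made up for the example:

```python
REPLACEMENT = "\ufffd"  # U+FFFD, emitted where bytes cannot be decoded


def error_rate(data: bytes, codec: str = "big5hkscs") -> float:
    """Fraction of decoded characters that are U+FFFD replacements.

    "big5hkscs" is Python's codec, used here only as an approximation
    of the spec mapping (assumption).
    """
    text = data.decode(codec, errors="replace")
    if not text:
        return 0.0
    return text.count(REPLACEMENT) / len(text)


def keep_page(data: bytes, max_rate: float = 0.01) -> bool:
    """True if the page decodes cleanly enough to be treated as Big5.

    The 1% default threshold is hypothetical, not the value used for
    the ~36k sample.
    """
    return error_rate(data) <= max_rate
```

A page labeled Big5 that is really UTF-8 or GBK would typically produce
many replacement characters and be dropped by such a filter.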

The same script identified the pages that would yield different results
with the spec mapping (~HKSCS) and the Firefox mapping (~UAO), finding 294
such pages. Manually removing obviously misencoded nonsense left 190, which
will need more analysis.[7] My initial impression is that a lot of these
pages are likely to be garbage, but some are obviously
Big5-UAO...
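
The comparison step can be sketched along these lines. Python ships no
Big5-UAO codec, so this example uses "big5hkscs" and "cp950" purely as
placeholder mappings (an assumption; cp950 is not UAO), and the function
names are hypothetical:

```python
from typing import Dict, List


def decode(data: bytes, codec: str) -> str:
    """Decode leniently so that bad bytes become U+FFFD instead of raising."""
    return data.decode(codec, errors="replace")


def differs(data: bytes, codec_a: str = "big5hkscs",
            codec_b: str = "cp950") -> bool:
    """True if the two mappings produce different text for these bytes.

    The default codecs are placeholders (assumption); a real run would
    plug in the spec mapping and a UAO mapping.
    """
    return decode(data, codec_a) != decode(data, codec_b)


def differing_pages(pages: Dict[str, bytes], codec_a: str = "big5hkscs",
                    codec_b: str = "cp950") -> List[str]:
    """Return the sorted URLs whose content depends on the mapping chosen."""
    return sorted(url for url, data in pages.items()
                  if differs(data, codec_a, codec_b))
```

Only the pages returned by such a check need manual review, since every
other page reads the same under either mapping.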

[1] http://s3.amazonaws.com/alexa-static/top-1m.csv.zip

[2] https://gitorious.org/whatwg/big5/blobs/master/tw-urls.py

[3] https://gitorious.org/whatwg/big5/blobs/master/tw-urls.txt

[4] https://gitorious.org/whatwg/big5/blobs/master/get-urls.py

[5] https://gitorious.org/whatwg/big5/blobs/master/tw-json.py

[6] https://gitorious.org/whatwg/big5/blobs/master/tw-analyze.py

[7] https://gitorious.org/whatwg/big5/blobs/master/big5-hkscs-vs-uao.txt


-- 
Philip Jägenstedt
Core Developer
Opera Software

Received on Sunday, 15 April 2012 19:12:37 UTC