
Re: Help request: a question about Big5 and Big5-HKSCS

From: Philip Jägenstedt <philipj@opera.com>
Date: Sun, 15 Apr 2012 21:11:49 +0200
To: public-html-ig-zh@w3.org, "Kang-Hao (Kenny) Lu" <kennyluck@csail.mit.edu>, "Anne van Kesteren" <annevk@opera.com>
Cc: "Yuan Chao" <yuanchao@gmail.com>, "Timothy Chien" <timdream@gmail.com>
Message-ID: <op.wctl9ziosr6mfa@localhost.localdomain>
On Fri, 13 Apr 2012 11:56:39 +0200, Kang-Hao (Kenny) Lu  
<kennyluck@csail.mit.edu> wrote:

> (12/04/12 17:09), Yuan Chao wrote:

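> There have always been two questions here that are closely related but
> not directly the same:
> 1. How exactly should a Taiwanese-locale browser (zh-TW) handle
>    <meta charset="big5">?
> A. Keep the status quo (CP950?)
> B. Decode as 'big5-uao' (Firefox)
> C. Decode as 'big5-hkscs'
> ... and other options
> 2. Which decoding mapping lets Taiwanese users see the most correct content?
> Either way, I think question 2 is a fairly scientific question of
> archaeology, and I think using the answer to question 2 for question 1
> would be good. For example, I think <meta charset="big5"> should at least
> decode the intersection of 'big5-uao' and 'big5-hkscs', which at least
> includes hiragana and katakana.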


Replying in English, since the methods used will also be of interest to
Anne van Kesteren and possibly others.

My goal was to find a large and representative sample of Big5 usage in
Taiwan. Alexa's top million sites [1] lists 2951 .tw sites. Running
"site:example.com.tw" searches for each of those through the Bing API [2]
generated a list of ~120k URLs.[3] ~116k of those were successfully
fetched using a Python script.[4] Another script [5] identified ~38k of
them as labeled Big5 and decoded them using the spec algorithm to collect
statistics. A final script [6] kept the ~36k pages with low error rates,
to exclude misencodings; that is as close to a random sample of
Taiwanese Big5 pages as I can get.
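
To illustrate the low-error-rate filter, here is a minimal Python sketch. It is not the code from [6]: the names error_rate and keep_low_error_pages, the 1% threshold, and the use of Python's built-in 'big5hkscs' codec as a stand-in for the spec decoder are all assumptions made for the example.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, emitted for each undecodable byte sequence


def error_rate(data: bytes, encoding: str = "big5hkscs") -> float:
    """Return the fraction of decoded characters that are U+FFFD.

    'big5hkscs' is Python's built-in codec, used here only as an
    approximation of the spec mapping (assumption).
    """
    text = data.decode(encoding, errors="replace")
    if not text:
        return 0.0
    return text.count(REPLACEMENT) / len(text)


def keep_low_error_pages(pages: dict, max_error_rate: float = 0.01) -> dict:
    """Keep only pages whose error rate is at or below the threshold.

    `pages` maps URL -> raw response bytes. The 1% default is a
    hypothetical cut-off, not a value taken from the original script.
    """
    return {
        url: body
        for url, body in pages.items()
        if error_rate(body) <= max_error_rate
    }
```

Pages that are labeled Big5 but are really in some other encoding tend to produce many replacement characters, so they fall above the threshold and are dropped.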

The same script identified the pages that would yield different results
with the spec mapping (~HKSCS) and the Firefox mapping (~UAO), finding 294
such pages. Manually removing obviously misencoded nonsense left 190, which
will need more analysis.[7] My initial impression is that a lot of these
pages are likely to be garbage, but there are some which are obviously
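
The comparison of two mappings can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: Python's standard library has no Big5-UAO codec, so 'cp950' stands in for the second mapping, 'big5hkscs' only approximates the spec mapping, and the function names are hypothetical.

```python
def decodes_differently(data: bytes,
                        enc_a: str = "big5hkscs",
                        enc_b: str = "cp950") -> bool:
    """Return True if the two mappings produce different text.

    Undecodable bytes become U+FFFD under both mappings, so a byte
    sequence that only one of them can decode counts as a difference.
    """
    text_a = data.decode(enc_a, errors="replace")
    text_b = data.decode(enc_b, errors="replace")
    return text_a != text_b


def pages_needing_review(pages: dict,
                         enc_a: str = "big5hkscs",
                         enc_b: str = "cp950") -> list:
    """List the URLs whose content depends on the choice of mapping.

    `pages` maps URL -> raw response bytes.
    """
    return [
        url
        for url, body in pages.items()
        if decodes_differently(body, enc_a, enc_b)
    ]
```

Only the pages returned by pages_needing_review need manual inspection; for all others the choice between the two mappings makes no visible difference.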

[1] http://s3.amazonaws.com/alexa-static/top-1m.csv.zip

[2] https://gitorious.org/whatwg/big5/blobs/master/tw-urls.py

[3] https://gitorious.org/whatwg/big5/blobs/master/tw-urls.txt

[4] https://gitorious.org/whatwg/big5/blobs/master/get-urls.py

[5] https://gitorious.org/whatwg/big5/blobs/master/tw-json.py

[6] https://gitorious.org/whatwg/big5/blobs/master/tw-analyze.py

[7] https://gitorious.org/whatwg/big5/blobs/master/big5-hkscs-vs-uao.txt

Philip Jägenstedt
Core Developer
Opera Software
Received on Sunday, 15 April 2012 19:12:37 UTC
