- From: Philip Jägenstedt <philipj@opera.com>
- Date: Sun, 15 Apr 2012 21:11:49 +0200
- To: public-html-ig-zh@w3.org, "Kang-Hao (Kenny) Lu" <kennyluck@csail.mit.edu>, "Anne van Kesteren" <annevk@opera.com>
- Cc: "Yuan Chao" <yuanchao@gmail.com>, "Timothy Chien" <timdream@gmail.com>
On Fri, 13 Apr 2012 11:56:39 +0200, Kang-Hao (Kenny) Lu <kennyluck@csail.mit.edu> wrote:

> (12/04/12 17:09), Yuan Chao wrote:
>
> There have always been two questions here that are closely related but
> not directly the same:
>
> 1. How should a Taiwanese browser build (zh-TW) handle
>    <meta charset="big5">?
>
>    A. Keep the status quo (CP950?)
>    B. Decode as 'big5-uao' (Firefox)
>    C. Decode as 'big5-hkscs'
>    ... and other options
>
> 2. Which decoding mapping lets Taiwanese users see the most correct
>    content?
>
> Either way, I think question 2 is a fairly scientific, archaeological
> question, and I think using the answer to question 2 for question 1
> would be a good approach. For example, I think <meta charset="big5">
> should at least decode the intersection of 'big5-uao' and 'big5-hkscs',
> which includes at least hiragana and katakana.

Thank you, Kenny, you summarized it very well. I also think question 2 is the most important one, so I did another round of research...

In English, since the methods used will also be of interest to Anne van Kesteren and possibly others.

My goal was to find a large and representative sample of Big5 usage in Taiwan. Alexa's top million sites [1] lists 2951 .tw sites. Running "site:example.com.tw" searches for all of those through the Bing API [2] generated a list of ~120k URLs.[3] ~116k of those were successfully fetched using a Python script.[4] Another script [5] identified ~38k of them as labeled Big5 and decoded them using the spec algorithm to collect statistics. A final script [6] kept the ~36k pages with low error rates, to exclude misencodings. This is as close to a random sample of Taiwanese Big5 pages as I can get.

The same script identified the pages that would yield different results with the spec mapping (~HKSCS) and the Firefox mapping (~UAO), finding 294 such pages. Manually removing obviously misencoded nonsense left 190, which will need more analysis.[7] My initial impression is that a lot of these pages are likely to be garbage, but there are some which are obviously Big5-UAO...
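
The filtering and comparison steps described above could be sketched roughly as follows. This is not the code from tw-analyze.py [6], only a minimal illustration under stated assumptions. Python's standard library has no Big5-UAO codec, so "cp950" stands in here as an arbitrary second mapping, and "big5hkscs" stands in for the spec (~HKSCS) mapping. The function names and the max_error_rate threshold are hypothetical; the message does not state what counts as a "low" error rate.

```python
# Sketch only. "big5hkscs" approximates the spec (~HKSCS) mapping.
# "cp950" is NOT Big5-UAO; it is merely a second Big5 variant that ships
# with Python and is used here as a placeholder for the other mapping.
REPLACEMENT = "\ufffd"


def error_rate(data: bytes, encoding: str) -> float:
    """Fraction of decoded characters that are U+FFFD replacements."""
    text = data.decode(encoding, errors="replace")
    if not text:
        return 0.0
    return text.count(REPLACEMENT) / len(text)


def decodes_differently(data: bytes, enc_a: str, enc_b: str) -> bool:
    """True if the two mappings produce different text for the same bytes."""
    a = data.decode(enc_a, errors="replace")
    b = data.decode(enc_b, errors="replace")
    return a != b


def classify(data: bytes, enc_a: str = "big5hkscs", enc_b: str = "cp950",
             max_error_rate: float = 0.01) -> str:
    """Return 'misencoded', 'differs' or 'same'.

    max_error_rate is a hypothetical cut-off, not a value from the thread.
    """
    if error_rate(data, enc_a) > max_error_rate:
        return "misencoded"  # probably not really Big5 despite its label
    if decodes_differently(data, enc_a, enc_b):
        return "differs"     # candidate for manual review
    return "same"


if __name__ == "__main__":
    sample = b"\xa4\xa4\xa4\xe5"  # two characters in the common Big5 range
    print(classify(sample))
```

Pages classified as "differs" correspond to the ones that would need a manual look, since a difference between two mappings does not by itself say which mapping the author intended.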
[1] http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
[2] https://gitorious.org/whatwg/big5/blobs/master/tw-urls.py
[3] https://gitorious.org/whatwg/big5/blobs/master/tw-urls.txt
[4] https://gitorious.org/whatwg/big5/blobs/master/get-urls.py
[5] https://gitorious.org/whatwg/big5/blobs/master/tw-json.py
[6] https://gitorious.org/whatwg/big5/blobs/master/tw-analyze.py
[7] https://gitorious.org/whatwg/big5/blobs/master/big5-hkscs-vs-uao.txt

-- 
Philip Jägenstedt
Core Developer
Opera Software
Received on Sunday, 15 April 2012 19:12:37 UTC