- From: Philip Jägenstedt <philipj@opera.com>
- Date: Wed, 18 Apr 2012 19:01:39 +0200
- To: "Chinese HTML Interest Group" <public-html-ig-zh@w3.org>
- Cc: "Anne van Kesteren" <annevk@opera.com>
<https://gitorious.org/whatwg/big5/blobs/master/hkscs-vs-uao.txt>

Analysis of the HKSCS vs UAO samples from .hk and .tw sites.

The raw data collected is available at:

http://html5.org/temp/hk-data.tar.gz (199M)
SHA1: 26b5af227bd0c72280aeeba39b22d712fa8d6cae

http://html5.org/temp/tw-data.tar.gz (708M)
SHA1: 555c3a9dce5f93d00e9ae47e901091f6140bce52

The .content files for resources without 'text/html' in Content-Type
(mostly PDF) were removed to save space.

== Hong Kong ==

29396 pages

4627 pages labeled as Big5 (16% of total)
 • 100 of those labeled as Big5-HKSCS (not tested further)

88 pages with ambiguous HKSCS/UAO mappings:
 • 64 pages depend on HKSCS (1.4% of Big5, 0.22% of total)
   • Mostly correct Hong Kong / Cantonese usage
   • Some typos using similar-looking characters
   • 2 pages with simplified Chinese
 • 3 pages depend on UAO (0.065% of Big5, 0.010% of total)
   • 2 pages with <U+2665 ♥>
   • 1 page with simplified Chinese
 • 21 pages with mixed/broken encodings

Using Big5-HKSCS would be a net improvement for Hong Kong sites.

== Taiwan ==

109298 pages

34638 pages labeled as Big5 (32% of total)
 • None of those labeled as Big5-HKSCS

345 pages with ambiguous HKSCS/UAO mappings:
 • 47 pages depend on UAO (0.13% of Big5, 0.043% of total)
   • 8 pages with Japanese
   • 6 pages with ® or ™
   • 4 pages with Latin script, e.g. Moët and München
   • 5 pages with <U+2661 ♡> or <U+2665 ♥>
   • Some typos using simplified Chinese
   • Few deliberate uses of uncommon traditional Chinese characters
 • 298 pages with mixed/broken encodings
   • 190 pages that would yield U+FFFD with HKSCS, but instead produce
     bogus Chinese characters using UAO, some of them user-visible:
     • http://domestic.mytour.com.tw/list.asp?id=721 (迳)
     • http://edu.uuu.com.tw/events/090619_ocpsummer_blueshop.htm (轩)
     • http://hi-taiwan.ecserver.com.tw/eip/front/bin/ptdetail.phtml?Part=teams0088 (启)
     • http://oa.mingdao.edu.tw/~foo/www9/fenyes/h41.htm (财)
     • http://service.cph.com.tw/act/ps921203/proudect01-12.htm (财轩)
     • http://w3.csmu.edu.tw/~jjyang/ (汹刍脉)
     • http://www.be-wells.com.tw/ascendancy/ascendancy_SE.php?page=5 (时)
     • http://www.brain.com.tw/lecture/sale/sale_04.htm (阵)
     • http://www.chimei.com.tw/en/news-detail.asp?news_id=12 (毕)
     • http://www.flag.com.tw/book/5105.asp?bokno=FT476 (钉)
     • http://www.goprint.com.tw/draw.asp (妇)
     • http://www.iiiedu.org.tw/ites/PDPM.htm (贯)
     • http://www.kham.com.tw/ad.asp?P1=0000008355 (财)
     • http://www.misterdonut.com.tw/info/news.asp?id=267 (讫)
     • http://www.misterdonut.com.tw/info/news.asp?id=308 (讫)
     • http://www.muonline.com.tw/Guide/GameSystem/07_pvp.asp (轩)
     • http://www.nacs.gov.tw/01_about/00_about_page.asp?ID=JNNORPIQJNMMK (枭)
     • http://www.nca.org.tw/chhtml/newsdetail.asp?NewsID=933&NewsGroup=4 (围轩)
     • http://www.neweb.com.tw/neweb-G_080808.htm (阵)
     • http://www.nordic.com.tw/client/festival/food02_3.htm (财)
     • http://www.nordic.com.tw/client/festival/food02_5.htm (财)
     • http://www.ogilvy.com.tw/Works/CaseContent.asp?serial=71 (迳)
     • http://www.pccu.edu.tw/intl/page/english/english.htm (丗)
     • http://www.pycnogenol.com.tw/info.htm (钓)
     • http://www.songyan.com.tw/distribution.html (迳)
     • http://www.srbook.com.tw/show_book.htm?wno=9868017645 (间)
     • http://www.transglobe.com.tw/product/product-insurance-DSC.shtml (贯)
     • http://www.ukeas.com.tw/postgrad/university/exeter.htm (财)
     • http://www.wintan.com.tw/service_06_08.htm (迳)
     • https://freenet.smartnet.com.tw/product-item.php?sn=9322 (财轩)

Using Big5-UAO for Taiwanese sites would give mixed results. Correctly
encoded Big5-UAO is very rare, so the tested mapping (Firefox)
introduces almost as many user-visible misencodings as it fixes and
masks many others.

-- 
Philip Jägenstedt
Core Developer
Opera Software
Received on Wednesday, 18 April 2012 17:02:01 UTC
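
The percentages quoted in the message can be re-derived from its raw page counts. The following is a minimal sketch, not part of the original analysis: the dictionary keys and the `pct` helper are illustrative names, and only the counts themselves come from the message. It rounds to two significant figures, so the last digit can differ from a quoted figure if that figure was truncated rather than rounded.

```python
# Page counts copied from the message; the key names are assumptions
# chosen for this sketch only.
HK = {"total": 29396, "big5": 4627, "ambiguous": 88,
      "hkscs": 64, "uao": 3, "broken": 21}
TW = {"total": 109298, "big5": 34638, "ambiguous": 345,
      "uao": 47, "broken": 298}


def pct(part: int, whole: int, sig: int = 2) -> str:
    """Return part/whole as a percentage with `sig` significant figures."""
    return f"{100 * part / whole:.{sig}g}%"


def report(name: str, counts: dict) -> None:
    """Print the share of Big5 pages and each dependent category."""
    print(f"{name}: {pct(counts['big5'], counts['total'])} of pages labeled Big5")
    for key in ("hkscs", "uao"):
        if key in counts:
            print(f"  depend on {key.upper()}: "
                  f"{pct(counts[key], counts['big5'])} of Big5, "
                  f"{pct(counts[key], counts['total'])} of total")


if __name__ == "__main__":
    report("Hong Kong", HK)
    report("Taiwan", TW)
```

The sub-counts also add up to the stated number of ambiguous pages in each region (64 + 3 + 21 = 88 and 47 + 298 = 345), which is a quick sanity check on the breakdown.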