Analysis and conclusions: Big5 HKSCS vs UAO in Taiwan and Hong Kong

<https://gitorious.org/whatwg/big5/blobs/master/hkscs-vs-uao.txt>

Analysis of the HKSCS vs UAO samples from .hk and .tw sites.

The raw data collected is available at:

http://html5.org/temp/hk-data.tar.gz (199M)
SHA1: 26b5af227bd0c72280aeeba39b22d712fa8d6cae

http://html5.org/temp/tw-data.tar.gz (708M)
SHA1: 555c3a9dce5f93d00e9ae47e901091f6140bce52
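
A minimal sketch for checking a downloaded archive against the SHA1 values listed above, assuming the files were saved locally under their original names. The helper names `sha1_of_file` and `verify` and the chunk size are illustrative choices, not part of the published data set.

```python
import hashlib

# SHA1 digests as published alongside the archives.
EXPECTED = {
    "hk-data.tar.gz": "26b5af227bd0c72280aeeba39b22d712fa8d6cae",
    "tw-data.tar.gz": "555c3a9dce5f93d00e9ae47e901091f6140bce52",
}


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # Read fixed-size chunks so the large archives need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, name):
    """Return True if the file at `path` matches the published digest for `name`."""
    return sha1_of_file(path) == EXPECTED[name]
```

For example, `verify("hk-data.tar.gz", "hk-data.tar.gz")` should return True for an intact download of the Hong Kong archive.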

The .content files for resources without 'text/html' in Content-Type  
(mostly PDF) were removed to save space.

== Hong Kong ==

29396 pages

4627 pages labeled as Big5 (16% of total)
  • 100 of those labeled as Big5-HKSCS (not tested further)

88 pages with ambiguous HKSCS/UAO mappings:
  • 64 pages depend on HKSCS (1.4% of Big5, 0.22% of total)
    • Mostly correct Hong Kong / Cantonese usage
    • Some typos using similar-looking characters
    • 2 pages with simplified Chinese
  • 3 pages depend on UAO (0.065% of Big5, 0.010% of total)
    • 2 pages with <U+2665 ♥>
    • 1 page with simplified Chinese
  • 21 pages with mixed/broken encodings

Using Big5-HKSCS would be a net improvement for Hong Kong sites.
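
The percentages in the Hong Kong figures follow directly from the page counts. A short sketch that reproduces them, where the function name `share` and the variable names are only illustrative:

```python
def share(part, whole):
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


total_pages = 29396  # all Hong Kong pages sampled
big5_pages = 4627    # pages labeled as Big5
hkscs_pages = 64     # pages that depend on HKSCS
uao_pages = 3        # pages that depend on UAO

print(f"Big5 of total:  {share(big5_pages, total_pages):.0f}%")
print(f"HKSCS of Big5:  {share(hkscs_pages, big5_pages):.1f}%")
print(f"HKSCS of total: {share(hkscs_pages, total_pages):.2f}%")
print(f"UAO of Big5:    {share(uao_pages, big5_pages):.3f}%")
print(f"UAO of total:   {share(uao_pages, total_pages):.3f}%")
```

The same function applies to the Taiwan counts by substituting the corresponding page numbers.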

== Taiwan ==

109298 pages

34638 pages labeled as Big5 (32% of total)
  • None of those labeled as Big5-HKSCS

345 pages with ambiguous HKSCS/UAO mappings:
  • 47 pages depend on UAO (0.14% of Big5, 0.043% of total)
    • 8 pages with Japanese
    • 6 pages with ® or ™
    • 4 pages with Latin script, e.g. Moët and München
    • 5 pages with <U+2661 ♡> or <U+2665 ♥>
    • Some typos using simplified Chinese
    • A few deliberate uses of uncommon traditional Chinese characters
  • 298 pages with mixed/broken encodings
    • 190 pages that would yield U+FFFD with HKSCS but instead produce
      bogus Chinese characters using UAO, some of them user-visible:
      • http://domestic.mytour.com.tw/list.asp?id=721 (迳)
      • http://edu.uuu.com.tw/events/090619_ocpsummer_blueshop.htm (轩)
      • http://hi-taiwan.ecserver.com.tw/eip/front/bin/ptdetail.phtml?Part=teams0088 (启)
      • http://oa.mingdao.edu.tw/~foo/www9/fenyes/h41.htm (财)
      • http://service.cph.com.tw/act/ps921203/proudect01-12.htm (财轩)
      • http://w3.csmu.edu.tw/~jjyang/ (汹刍脉)
      • http://www.be-wells.com.tw/ascendancy/ascendancy_SE.php?page=5 (时)
      • http://www.brain.com.tw/lecture/sale/sale_04.htm (阵)
      • http://www.chimei.com.tw/en/news-detail.asp?news_id=12 (毕)
      • http://www.flag.com.tw/book/5105.asp?bokno=FT476 (钉)
      • http://www.goprint.com.tw/draw.asp (妇)
      • http://www.iiiedu.org.tw/ites/PDPM.htm (贯)
      • http://www.kham.com.tw/ad.asp?P1=0000008355 (财)
      • http://www.misterdonut.com.tw/info/news.asp?id=267 (讫)
      • http://www.misterdonut.com.tw/info/news.asp?id=308 (讫)
      • http://www.muonline.com.tw/Guide/GameSystem/07_pvp.asp (轩)
      • http://www.nacs.gov.tw/01_about/00_about_page.asp?ID=JNNORPIQJNMMK (枭)
      • http://www.nca.org.tw/chhtml/newsdetail.asp?NewsID=933&NewsGroup=4 (围轩)
      • http://www.neweb.com.tw/neweb-G_080808.htm (阵)
      • http://www.nordic.com.tw/client/festival/food02_3.htm (财)
      • http://www.nordic.com.tw/client/festival/food02_5.htm (财)
      • http://www.ogilvy.com.tw/Works/CaseContent.asp?serial=71 (迳)
      • http://www.pccu.edu.tw/intl/page/english/english.htm (丗)
      • http://www.pycnogenol.com.tw/info.htm (钓)
      • http://www.songyan.com.tw/distribution.html (迳)
      • http://www.srbook.com.tw/show_book.htm?wno=9868017645 (间)
      • http://www.transglobe.com.tw/product/product-insurance-DSC.shtml (贯)
      • http://www.ukeas.com.tw/postgrad/university/exeter.htm (财)
      • http://www.wintan.com.tw/service_06_08.htm (迳)
      • https://freenet.smartnet.com.tw/product-item.php?sn=9322 (财轩)

Using Big5-UAO for Taiwanese sites would give mixed results. Correctly  
encoded Big5-UAO is very rare, so the tested mapping (Firefox) introduces  
almost as many user-visible misencodings as it fixes and masks many others.
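
The U+FFFD check mentioned for the mixed/broken pages can be sketched with Python's built-in `big5hkscs` codec. This is only an illustration under stated assumptions: the helper names are hypothetical, Python's HKSCS tables may differ from the mapping in the tested browser, and the standard library ships no Big5-UAO codec, so the UAO side of the comparison is not covered here.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def count_replacements(raw, encoding="big5hkscs"):
    """Decode `raw` bytes and count how many U+FFFD substitutions occur."""
    text = raw.decode(encoding, errors="replace")
    return text.count(REPLACEMENT)


def decodes_cleanly(raw, encoding="big5hkscs"):
    """Return True if `raw` decodes without any U+FFFD substitution."""
    return count_replacements(raw, encoding) == 0


# Plain Big5 text decodes cleanly; a stray invalid byte does not.
print(decodes_cleanly(b"\xa4\xa4\xa4\xe5"))
print(decodes_cleanly(b"\xa4\xa4\xff"))
```

A page whose bytes fail this check under HKSCS but decode to Chinese characters under a UAO table would fall into the "bogus characters" group listed above.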

-- 
Philip Jägenstedt
Core Developer
Opera Software

Received on Wednesday, 18 April 2012 17:02:01 UTC