- From: Ambrose LI <ambrose.li@gmail.com>
- Date: Mon, 9 Apr 2012 10:08:38 -0400
- To: Philip Jägenstedt <philipj@opera.com>
- Cc: "public-html-ig-zh@w3.org" <public-html-ig-zh@w3.org>, Øistein E. Andersen <liszt@coq.no>
Hi,

I will be commenting on your quoted post ([whatwg] Encoding: big5 and big5-hkscs). But before doing that I’d like to note a couple of observations:

1. It is probably somewhat useful to know that a lot of people (especially non-Mandarin-speakers) don’t actually know how to type Chinese. So there are a couple of common scenarios that you might not expect, some of which could contribute to what we’re seeing:

   a. The person fires up the character recognition panel in Windows (or brings up a specialized tablet, with its accompanying specialized drivers) and starts writing. Whatever looks correct on the screen (which can be a variant of the intended character, or, of course, a typo) is picked. (Note that in the case of specialized tablets, the driver should but might not necessarily emit standardized characters.)

   b. The person activates an input method and starts typing from rules that he/she only half knows.

   c. The person runs special “Chinese language support” software and uses whatever input methods that piece of software provides (instead of what Windows provides). The software scans the screen for possible Chinese characters and if found converts them to glyphs. Obviously, this will, at least sometimes, cause regionally encoded characters (or, in the worst case, VENDOR-specific characters) to become mixed with Unicode.

   I really don’t want to believe 1(c) is still true, but given that these things are still being sold, I’ll tend to believe such a scenario still exists.

2. Unihan really is not exhaustive (which would help explain why vendor-specific characters still exist). A few months ago I was working on some Cantonese documents which required me to look in my Cantonese reference books; I found that some of the characters listed (that is, on actual printed books published relatively recently) could not be found in the Unihan database.
Worse, most of these are not really Cantonese characters per se but actual ancient characters that some have traced to be what some Cantonese sounds are supposed to represent.

[...]

>> leetm.mingpao.com/cfm/Forum3.cfm?CategoryID=2&TopicID=2720&TopicOrder=Desc&TopicPage=64
>> <0xA1: [('0x8b', '0xf8'), ('0x90', '0x5b')]
>> 0xA3: []
>> 0xC6-0xC8: []
>
>重唔變晒烏\x8b\xf8縮得就縮
>
>有本地產婦出現作動\x90\x5b象亦要輪候五天才可入住私家醫院

龜 is definitely correct (an obvious reference to the idiom 縮頭烏龜). 迹 looks correct.

>> board.phonehk.com/archiver/?tid-156148.html
>> <0xA1: [('0x9d', '0xeb')]
>> 0xA3: []
>> 0xC6-0xC8: []
>
>我唔識點係itune度轉mp4呀,解壓之後係.m4a\x9d\xeb,係咪即係呢個??

Yes, 噃 looks correct.

>> www.millionbook.net/gd/h/huishuianyangjiumin/qmt/006.htm
>> <0xA1: [('0x8f', '0x73'), ('0x8e', '0x4e'), ('0x8e', '0x4e')]
>> 0xA3: []
>> 0xC6-0xC8: []
>
>那朱媽媽正在廚下催臉水,剛進角門,听得里邊打罵,立住腳,向\x8f\x73子眼里一瞧,探知緣故。
>
>‘槐蔭未擎\x8e\x4e鷺足’,是宮槐之下,未列著鷺序\x8e\x4e班,喻未仕也。

Since the page bears an obvious signature of having been machine-converted from simplified Chinese, googling for the same piece of text encoded in simplified Chinese should be useful. Most pages google finds seem to be converted from the same or a similar defective big5 source, but it also found this

lib.bgu.edu.cn/websql/date%5CI%5CA2024756.pdf

, which gives the “correct” simplified characters for the first sentence as

朱妈妈正在厨下催脸水, 刚进角门, 听得里边打骂 , 立住脚, 隔子眼里一瞧, 探知缘故,

For the second sentence, the PDF file gives an unknown character (probably vendor-specific).

The text in question is from an (apparently banned) novel from the Qing dynasty, so, yes, it would be normal to see some classical Chinese. However, since the problematic page obviously contains errors (as the first sentence does not in fact match what the PDF file says), I suggest that we drop this page from our consideration.
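For anyone who wants to re-check byte pairs like the ones quoted above, here is a minimal sketch in Python. It assumes Python's bundled big5hkscs codec, whose mapping table will not necessarily agree with what any of the browsers do; unescape_bytes and try_decode are hypothetical helper names for illustration, not anything from the original report.

```python
import re

# Matches a literal backslash-x escape such as \x9d as it appears in the quoted text.
ESCAPE = re.compile(r"\\x([0-9a-fA-F]{2})")


def unescape_bytes(text: str) -> bytes:
    """Collect every literal \\xNN escape in `text` into a bytes object."""
    return bytes(int(h, 16) for h in ESCAPE.findall(text))


def try_decode(raw: bytes, codec: str = "big5hkscs"):
    """Decode `raw` with `codec`; return None if the codec rejects the bytes.

    Assumption: Python's own codec tables are used, which may differ from
    the tables a given browser ships.
    """
    try:
        return raw.decode(codec)
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    # Print whatever this Python build maps one of the quoted pairs to.
    pair = unescape_bytes(r"\x9d\xf7")
    print(pair, try_decode(pair))
```

Running try_decode on the same pair with a different codec name (for example "big5" or "cp950") is a quick way to see where the tables disagree.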
>
>> www.toysdaily.com/discuz/forum-24-2.html
>> <0xA1: []
>> 0xA3: []
>> 0xC6-0xC8: [('0xc7', '0x55')]
>
>This is "[個人收藏]一抽即中 (One Piece Q版盒蛋 ~ 海底\xc7\x55樂園)" which
>links to this item:
>
>http://www.toysdaily.com/discuz/thread-180080-1-2.html

Yes, の is correct, as shown here:
http://www.toysdaily.com/discuz/thread-179839-1-1.html

>> forum.mingpao.com/cfm/Forum3.cfm?OwnerID=1&CategoryID=3&TopicID=524&Page=5
>> <0xA1: [('0x8e', '0xe0'), ('0x9d', '0xf8'), ('0x9d', '0xf8'), ('0x9d', '0xf8')]
>> 0xA3: []
>> 0xC6-0xC8: []
>
>The source is a post by "又一痛\x8e\xe0":
>
>西方傳媒每逢見到這種新聞,都雀躍萬分,跟住\x9d\xf8反中亂港人仕就隨之而起舞,抺黑中國為首任.中國的發展是剛起步,一些黑暗的事一定會發生,我們不要以為西方普通的事在中國就一定會有,唔該俾\x9d\xf8耐性對中國,唔好一有事就跳出來協助西方人抺黑中國啦.唔通類似這些事情在一些民主國家無發生咩,例如印度,菲律賓等國家,為甚麼那班抺黑中國的人不提一\xfa\xef呢.公道\x9d\xf8好唔好,你都是中國人來的.

脚 and 啲 are definitely correct. 吓 is also correct for this discussion but it’s actually a “spelling mistake” (it should be just 下, no Cantonese character necessary).

>> www30.discuss.com.hk/archiver/?tid-9026420.html
>> <0xA1: [('0x9d', '0xef')]
>> 0xA3: []
>> 0xC6-0xC8: []
>
>師兄你\x9d\xef表達能力仲驚人, 一語道破成件事.

Yes, 嘅 is correct.

>> www28.discuss.com.hk/viewthread.php?tid=7539844&extra=page%3D1&page=10
>> <0xA1: [('0x9d', '0xf7')]
>> 0xA3: []
>> 0xC6-0xC8: []
>
>一開始用斯路在悟空下方出龜波,斯路死\x9d\xf7悟飯就爆氣,狂出龜波,如果死埋有神龍。
>
>\x9d\xf7 also appeared in another source:
>
>> www.hacken.cc/bbs/thread-318592-6-1.html
>> <0xA1: [('0x9d', '0xf7'), ('0x89', '0x59'), ('0x89', '0x72')]
>> 0xA3: []
>> 0xC6-0xC8: []
>
>This is from a comment in mixed simplified and traditional Chinese. First
>the traditional bit:
>
>我發言後就彈\x9d\xf7依句:

Yes, 咗 looks correct.

>This is the simplified bit:
>
>对不起,您暂时\xfc\xd3法\x89\x59言,可能是以下原因
>1,您申请加入该群,正在等待验证通过。
>2,您已\x89\x72退出该群。

Yes, definitely 无, 发, and 经.

无 is actually a rare non-simplified character. I can’t comment on the others but it’s true that HKSCS contains some simplified characters.
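The three buckets in the quoted reports (<0xA1, 0xA3, 0xC6-0xC8) appear to group the undecoded double-byte sequences by lead byte. A rough sketch of that grouping follows, under stated assumptions: the input is text in which the problem bytes already appear as literal \xNN\xNN escapes (as in the quotes above), the buckets are whole lead-byte ranges (the script that produced the reports may well have used narrower ranges), and bucket_pairs is a hypothetical name.

```python
import re

# Two consecutive literal \xNN escapes: a lead byte followed by a trail byte.
PAIR = re.compile(r"\\x([0-9a-fA-F]{2})\\x([0-9a-fA-F]{2})")


def bucket_pairs(text: str) -> dict:
    """Group escaped (lead, trail) pairs by lead-byte range.

    Assumption: bucket boundaries are inferred from the report labels only.
    Pairs whose lead byte falls outside all three ranges are ignored.
    """
    buckets = {"<0xA1": [], "0xA3": [], "0xC6-0xC8": []}
    for lead_hex, trail_hex in PAIR.findall(text):
        lead = int(lead_hex, 16)
        trail = int(trail_hex, 16)
        pair = (f"0x{lead:02x}", f"0x{trail:02x}")
        if lead < 0xA1:
            buckets["<0xA1"].append(pair)
        elif lead == 0xA3:
            buckets["0xA3"].append(pair)
        elif 0xC6 <= lead <= 0xC8:
            buckets["0xC6-0xC8"].append(pair)
    return buckets


if __name__ == "__main__":
    print(bucket_pairs(r"我發言後就彈\x9d\xf7依句"))
    print(bucket_pairs(r"海底\xc7\x55樂園"))
```

Under these assumptions a pair such as \xfa\xef lands in no bucket, which is consistent with it being absent from the forum.mingpao.com report even though it shows up in the quoted text.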
>> www28.discuss.com.hk/viewthread.php?tid=7319244&extra=page%3D1&page=10
>> <0xA1: [('0xa0', '0x4f')]
>> 0xA3: []
>> 0xC6-0xC8: []
>
>『飢餓穴』是臨食\xa0\x4f之前十五分鐘去按呢!

Definitely 嘢 (“thing”). 食嘢 means “to eat” (and in this context it is the gerund “eating” – “15 minutes before eating”).

>> www.fhs.gov.hk/tc_chi/health_info/class_life/child/child.html
>> <0xA1: [('0x8f', '0xc0')]
>> 0xA3: []
>> 0xC6-0xC8: []

Yes, definitely 衞

>> www.books.com.tw/exep/prod/books/editorial/publisher_booklist.php?pubid=sharppnt&qseries=sharppnt9B05
>> <0xA1: []
>> 0xA3: []
>> 0xC6-0xC8: [('0xc7', '0x5c'), ('0xc7', '0x66'), ('0xc7', '0x5c'), ('0xc7', '0x66')]
>
>These are hiragana in 柴門ふみ which is simply the name of a Japanese
>author: http://en.wikipedia.org/wiki/Fumi_Saimon
>
>\xc7\x5c =>
>
>opera-hk: U+3075 ふ
>firefox: U+3075 ふ
>chrome: U+F72B
>firefox-hk: U+3075 ふ
>opera: U+3075 ふ
>chrome-hk: U+3075 ふ
>internetexplorer: U+F72B
>
>\xc7\x66 =>
>
>opera-hk: U+307F み
>firefox: U+307F み
>chrome: U+F735
>firefox-hk: U+307F み
>opera: U+307F み
>chrome-hk: U+307F み
>internetexplorer: U+F735
>
>U+F72B and U+F735 are in the PUA, so U+307F and U+3075 are correct.
>
>Winners: opera-hk, firefox, firefox-hk, opera, chrome-hk
>
>
>== Mixed encodings and other nonsense ==
>

Personally, I’d say mixed encodings inside <script> or <style> suggest that the file started as gbk or big5 and was then machine-converted to utf-8. (Or, in the case of a utf-8 filename, it’s just dynamically generated in an environment where the developers are still using gb/big5 but the filesystem is using utf8.) There’s nothing we can do.

In the case of forums or other user generated content, I strongly suspect something along the lines of 1(c) above at work. Things can get even worse if the forum then converts the regional encoding to utf-8, sometimes by way of an incorrect intermediate regional encoding.

--
cheers,
-ambrose <http://gniw.ca>
Received on Monday, 9 April 2012 14:09:08 UTC