Request for help: questions about Big5 and Big5-HKSCS

From: Ambrose LI <ambrose.li@gmail.com>
Date: Mon, 9 Apr 2012 10:08:38 -0400
Message-ID: <CADJvFOU7=QiH926+=-WjqiBNUFp-S2o2PZPfqh29wLXUvAxWfQ@mail.gmail.com>
To: Philip Jägenstedt <philipj@opera.com>
Cc: "public-html-ig-zh@w3.org" <public-html-ig-zh@w3.org>, Øistein E. Andersen <liszt@coq.no>

I will comment on your quoted post ([whatwg] Encoding: big5 and
big5-hkscs) below. Before doing that, I’d like to note a couple of
things:

1. It is probably somewhat useful to know that a lot of people
(especially non-Mandarin-speakers) don’t actually know how to type
Chinese. So there are a couple of common scenarios that you might not
expect, some of which could contribute to what we’re seeing:

a. The person fires up the character recognition panel in Windows (or
brings up a specialized tablet, with its accompanying specialized
drivers) and starts writing. Whatever looks correct on the screen
(which can be a variant of the intended character, or, of course, a
typo) is picked. (Note that in the case of specialized tablets, the
driver should but might not necessarily emit standardized characters.)

b. The person activates an input method and starts typing from rules
that he/she only half knows.

c. The person runs special “Chinese language support” software and
uses whatever input methods that piece of software provides (instead
of what Windows provides). The software scans the screen for possible
Chinese characters and, if any are found, converts them to glyphs.
Obviously, this will, at least sometimes, cause regionally encoded
characters (or, in the worst case, vendor-specific characters) to
become mixed with Unicode.

I really don’t want to believe 1(c) is still true, but given that
these things are still being sold, I’ll tend to believe such a
scenario still exists.

2. Unihan really is not exhaustive (which would help explain why
vendor-specific characters still exist). A few months ago I was
working on some Cantonese documents which required me to look in my
Cantonese reference books; I found that some of the characters listed
(that is, in actual printed books published relatively recently) could
not be found in the Unihan database. Worse, most of these are not
really Cantonese characters per se but actual ancient characters that
some have traced to be what some Cantonese sounds are supposed to

>> leetm.mingpao.com/cfm/Forum3.cfm?CategoryID=2&TopicID=2720&TopicOrder=Desc&TopicPage=64
>> <0xA1: [('0x8b', '0xf8'), ('0x90', '0x5b')]
>> 0xA3: []
>> 0xC6-0xC8: []

龜 is definitely correct (an obvious reference to the idiom 縮頭烏龜). 迹
looks correct.

>> board.phonehk.com/archiver/?tid-156148.html
>> <0xA1: [('0x9d', '0xeb')]
>> 0xA3: []
>> 0xC6-0xC8: []

Yes, 噃 looks correct.

>> www.millionbook.net/gd/h/huishuianyangjiumin/qmt/006.htm
>> <0xA1: [('0x8f', '0x73'), ('0x8e', '0x4e'), ('0x8e', '0x4e')]
>> 0xA3: []
>> 0xC6-0xC8: []

Since the page bears an obvious signature of having been
machine-converted from simplified Chinese, googling for the same piece
of text encoded in simplified Chinese should be useful. Most pages
Google finds seem to be converted from the same or a similar defective
big5 source, but it also found this PDF,
lib.bgu.edu.cn/websql/date%5CI%5CA2024756.pdf, which gives the
“correct” simplified characters for the first sentence as

朱妈妈正在厨下催脸水, 刚进角门, 听得里边打骂 , 立住脚, 隔子眼里一瞧, 探知缘故,

For the second sentence, the PDF file gives an unknown character
(probably vendor-specific). The text in question is from an (apparently
banned) novel from the Qing dynasty, so, yes, it would be normal to
see some classical Chinese.

However, since the problematic page obviously contains errors (as the
first sentence does not in fact match what the PDF file says), I
suggest that we drop this page from our consideration.

>> www.toysdaily.com/discuz/forum-24-2.html
>> <0xA1: []
>> 0xA3: []
>> 0xC6-0xC8: [('0xc7', '0x55')]
>This is "[個人收藏]一抽即中 (One Piece Q版盒蛋 ~ 海底\xc7\x55樂園)" which
>links to this item:

Yes, の is correct, as shown here:

>> forum.mingpao.com/cfm/Forum3.cfm?OwnerID=1&CategoryID=3&TopicID=524&Page=5
>> <0xA1: [('0x8e', '0xe0'), ('0x9d', '0xf8'), ('0x9d', '0xf8'), ('0x9d',
>> '0xf8')]
>> 0xA3: []
>> 0xC6-0xC8: []
>The source is a post by "又一痛\x8e\xe0":

脚 and 啲 are definitely correct. 吓 is also correct for this discussion,
but it’s actually a “spelling mistake” (it should be just 下; no
Cantonese character is necessary).

>> www30.discuss.com.hk/archiver/?tid-9026420.html
>> <0xA1: [('0x9d', '0xef')]
>> 0xA3: []
>> 0xC6-0xC8: []
>師兄你\x9d\xef表達能力仲驚人, 一語道破成件事.

Yes, 嘅 is correct.
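
Byte pairs like this one have a lead byte below 0xA1, which is outside
the range plain Big5 uses, so only an HKSCS-aware decoder can make
sense of them. A minimal sketch of that difference, assuming Python 3
and its built-in `big5` and `big5hkscs` codecs (neither is mentioned
in this thread; `try_decode` is a hypothetical helper):

```python
# Byte pair quoted from the forum post above. The lead byte 0x9D is
# below 0xA1, i.e. in the rows that HKSCS adds on top of plain Big5.
pair = b"\x9d\xef"

def try_decode(raw, codec):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return raw.decode(codec)
    except UnicodeDecodeError:
        return None

plain = try_decode(pair, "big5")       # plain Big5: no 0x9D row
hkscs = try_decode(pair, "big5hkscs")  # HKSCS-aware codec

print("big5:     ", plain)
print("big5hkscs:", hkscs)
```

If the HKSCS result matches the character expected from context, the
page is most likely Big5-HKSCS rather than plain Big5.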

>> www28.discuss.com.hk/viewthread.php?tid=7539844&extra=page%3D1&page=10
>> <0xA1: [('0x9d', '0xf7')]
>> 0xA3: []
>> 0xC6-0xC8: []
>\x9d\xf7 also appeared in another source:
>> www.hacken.cc/bbs/thread-318592-6-1.html
>> <0xA1: [('0x9d', '0xf7'), ('0x89', '0x59'), ('0x89', '0x72')]
>> 0xA3: []
>> 0xC6-0xC8: []
>This is from a comment in mixed simplified and traditional Chinese. First
>the traditional bit:

Yes, 咗 looks correct.

>This is the simplified bit:

Yes, definitely 无, 发, and 经.

无 is actually a rare non-simplified character. I can’t comment on the
others but it’s true that HKSCS contains some simplified characters.

>> www28.discuss.com.hk/viewthread.php?tid=7319244&extra=page%3D1&page=10
>> <0xA1: [('0xa0', '0x4f')]
>> 0xA3: []
>> 0xC6-0xC8: []

Definitely 嘢 (“thing”). 食嘢 means “to eat” (and in this context it is
the gerund “eating” – “15 minutes before eating”).

>> www.fhs.gov.hk/tc_chi/health_info/class_life/child/child.html
>> <0xA1: [('0x8f', '0xc0')]
>> 0xA3: []
>> 0xC6-0xC8: []

Yes, definitely 衞.

>> www.books.com.tw/exep/prod/books/editorial/publisher_booklist.php?pubid=sharppnt&qseries=sharppnt9B05
>> <0xA1: []
>> 0xA3: []
>> 0xC6-0xC8: [('0xc7', '0x5c'), ('0xc7', '0x66'), ('0xc7', '0x5c'),
>> ('0xc7',
>> '0x66')]
>These are hiragana in 柴門ふみ which is simply the name of a Japanese
>author: http://en.wikipedia.org/wiki/Fumi_Saimon
>\xc7\x5c =>
>opera-hk: U+3075 ふ
>firefox: U+3075 ふ
>chrome: U+F72B 
>firefox-hk: U+3075 ふ
>opera: U+3075 ふ
>chrome-hk: U+3075 ふ
>internetexplorer: U+F72B 
>\xc7\x66 =>
>opera-hk: U+307F み
>firefox: U+307F み
>chrome: U+F735 
>firefox-hk: U+307F み
>opera: U+307F み
>chrome-hk: U+307F み
>internetexplorer: U+F735 
>U+F72B and U+F735 are in the PUA, so U+307F and U+3075 are correct.
>Winners: opera-hk, firefox, firefox-hk, opera, chrome-hk
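
The private-use argument in the quoted results can be checked
mechanically. A rough sketch, assuming Python 3, its built-in
`big5hkscs` codec, and the `unicodedata` module (none of which appear
in the original report; `is_private_use` is a hypothetical helper). It
decodes the two byte pairs and tests whether each result, and each of
the code points IE and Chrome produced, falls in a private-use area:

```python
import unicodedata

def is_private_use(ch):
    """True if the code point has general category Co (private use)."""
    return unicodedata.category(ch) == "Co"

# Byte pairs from the report, paired with what IE/Chrome produced.
pairs = {b"\xc7\x5c": "\uf72b", b"\xc7\x66": "\uf735"}

decoded = {}
for raw, browser_result in pairs.items():
    ch = raw.decode("big5hkscs")
    decoded[raw] = ch
    print(
        raw,
        "codec: U+%04X private-use=%s" % (ord(ch), is_private_use(ch)),
        "browser: U+%04X private-use=%s"
        % (ord(browser_result), is_private_use(browser_result)),
    )
```
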
>== Mixed encodings and other nonsense ==

Personally, I’d say mixed encodings inside <script> or <style> suggest
that the file started as gbk or big5 and was then machine-converted to
utf-8. (Or, in the case of a utf-8 filename, it’s just dynamically
generated in an environment where the developers are still using
gb/big5 but the filesystem is using utf-8.) There’s nothing we can do.
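
Such pages can at least be spotted, even if they cannot be repaired.
A rough diagnostic sketch, assuming Python 3 and a simple per-line
heuristic that is my own assumption rather than anything from the
thread: try strict UTF-8 first, then Big5-HKSCS. The names
`classify_line` and `encodings_in` are hypothetical, and a line that
happens to be valid in both encodings will be reported as utf-8.

```python
def classify_line(raw):
    """Label one line as 'ascii', 'utf-8', 'big5-hkscs' or 'unknown'."""
    try:
        raw.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    # Order matters: UTF-8 is stricter, so it is tried first.
    for label, codec in (("utf-8", "utf-8"), ("big5-hkscs", "big5hkscs")):
        try:
            raw.decode(codec)
            return label
        except UnicodeDecodeError:
            continue
    return "unknown"

def encodings_in(page):
    """Return the set of non-ASCII encodings seen across the lines."""
    return {classify_line(line) for line in page.splitlines()} - {"ascii"}

# A made-up page: the script line is UTF-8, the body line is Big5.
page = (
    "var t = '中文';".encode("utf-8")
    + b"\n"
    + "<p>中文</p>".encode("big5hkscs")
)
print(encodings_in(page))
```

More than one label in the result suggests the kind of mixed-encoding
page described above.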

In the case of forums or other user-generated content, I strongly
suspect something along the lines of 1(c) above is at work. Things can
get even worse if the forum then converts the regional encoding to
utf-8, sometimes by way of an incorrect intermediate regional

-ambrose <http://gniw.ca>
