
Re: Request for help: questions about Big5 and Big5-HKSCS

From: Philip Jägenstedt <philipj@opera.com>
Date: Tue, 10 Apr 2012 21:49:11 +0200
To: "Ambrose LI" <ambrose.li@gmail.com>
Cc: "public-html-ig-zh@w3.org" <public-html-ig-zh@w3.org>, Øistein E. Andersen <liszt@coq.no>, "Anne van Kesteren" <annevk@opera.com>
Message-ID: <op.wcken9yrsr6mfa@localhost.localdomain>
On Mon, 09 Apr 2012 16:08:38 +0200, Ambrose LI <ambrose.li@gmail.com>  
wrote:

> Hi,
>
> I will be commenting on your quoted post ([whatwg] Encoding: big5 and
> big5-hkscs).

Many thanks for your feedback, Ambrose! Additional thoughts on a few bits:

>>> www.millionbook.net/gd/h/huishuianyangjiumin/qmt/006.htm
>>> <0xA1: [('0x8f', '0x73'), ('0x8e', '0x4e'), ('0x8e', '0x4e')]
>>> 0xA3: []
>>> 0xC6-0xC8: []
>>
>> 那朱媽媽正在廚下催臉水,剛進角門,听得里邊打罵,立住腳,向\x8f\x73子眼里一瞧,探知緣故。
>>
>> ‘槐蔭未擎\x8e\x4e鷺足’,是宮槐之下,未列著鷺序\x8e\x4e班,喻未仕也。
>
> Since the page bears an obvious signature of having been
> machine-converted from simplified Chinese, googling for the same piece
> of text encoded in simplified Chinese should be useful.

Yes, now that you mention it, it is pretty obvious. It's not only
machine-converted but very poorly so, with unconverted simplified forms
like 甚么, 今后 and 几句...

> Most pages
> google finds seem to be converted from the same or a similar defective
> big5 source, but it also found this
> lib.bgu.edu.cn/websql/date%5CI%5CA2024756.pdf , which gives the
> “correct” simplified characters for the first sentence as
>
> 朱妈妈正在厨下催脸水, 刚进角门, 听得里边打骂 , 立住脚, 隔子眼里一瞧,  
> 探知缘故,
>
> For the second sentence, the PDF file gives an unknown character
> (probably vendor-specific). The text in question is from a (apparently
> banned) novel from the Qing dynasty, so, yes, it would be normal to
> see some classical Chinese.
>
> However, since the problematic page obviously contains errors (as the
> first sentence does not in fact match what the PDF file says), I
> suggest that we drop this page from our consideration.

Well hunted, thank you! Since no candidate mapping is correct for this  
page, there's nothing we can do about it.
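
For anyone wanting to reproduce listings like the one quoted above, here
is a minimal sketch. It assumes the report simply walks the page bytes in
Big5 fashion and buckets two-byte sequences by lead byte (<0xA1, 0xA3,
0xC6-0xC8). The function name and the exact bucketing rule are
assumptions for illustration, not the script that produced the quoted
output.

```python
def scan_extension_pairs(data: bytes) -> dict:
    """Bucket two-byte sequences by lead byte (hypothetical helper).

    Assumption: a byte in 0x81-0xFE starts a two-byte sequence and
    everything else is a single byte. Trail bytes are not validated.
    """
    buckets = {"<0xA1": [], "0xA3": [], "0xC6-0xC8": []}
    i = 0
    while i < len(data):
        lead = data[i]
        # ASCII, 0x80, 0xFF, or a dangling final byte: treat as one byte.
        if lead < 0x81 or lead == 0xFF or i + 1 >= len(data):
            i += 1
            continue
        trail = data[i + 1]
        pair = (hex(lead), hex(trail))
        if lead < 0xA1:
            buckets["<0xA1"].append(pair)
        elif lead == 0xA3:
            buckets["0xA3"].append(pair)
        elif 0xC6 <= lead <= 0xC8:
            buckets["0xC6-0xC8"].append(pair)
        i += 2
    return buckets
```

Bucketing on the lead byte alone will also catch some standard Big5
characters in the 0xA3 and 0xC6 rows, so a real survey would need to
narrow those ranges.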

>>> www28.discuss.com.hk/viewthread.php?tid=7319244&extra=page%3D1&page=10
>>> <0xA1: [('0xa0', '0x4f')]
>>> 0xA3: []
>>> 0xC6-0xC8: []
>>
>> 『飢餓穴』是臨食\xa0\x4f之前十五分鐘去按呢!
>
> Definitely 嘢 (“thing”). 食嘢 means “to eat” (and in this context it is
> the gerund “eating” – “15 minutes before eating”).

This is very unfortunate, because it means that there are pages labeled  
with <meta charset=big5> that depend on *different* extensions of Big5. To  
summarize the pages analyzed in  
<http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2012-April/035370.html>:

Works with the *-hk mappings, broken with the firefox mapping:

http://leetm.mingpao.com/cfm/Forum3.cfm?CategoryID=2&TopicID=2720&TopicOrder=Desc&TopicPage=64
http://forum.mingpao.com/cfm/Forum3.cfm?OwnerID=1&CategoryID=3&TopicID=524&Page=5
http://www.discuss.com.hk/archiver/?tid-9026420.html
http://www.discuss.com.hk/viewthread.php?tid=7539844&extra=page%253D1&page=10
http://www.hacken.cc/bbs/thread-318592-6-1.html
http://www.fhs.gov.hk/tc_chi/health_info/class_life/child/child.html

Works with the firefox mapping, broken with the *-hk mappings:

http://www.discuss.com.hk/viewthread.php?tid=7319244&extra=page%253D1&page=10

I don't know how representative the sample is, but I would be very
surprised if Firefox's interpretation of big5 fixes more pages than it
breaks, given that IE has only ever had a single big5 mapping and that
no other browser agrees with Firefox.
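
To see how much depends on the mapping, the same bytes can be decoded
with more than one table and the results compared. The sketch below uses
Python's built-in `big5` and `big5hkscs` codecs purely as stand-ins. They
are not the browser mappings discussed here, and the helper name is
hypothetical.

```python
def decode_with(data: bytes, codecs=("big5", "big5hkscs")) -> dict:
    """Decode the same bytes under several codecs (hypothetical helper).

    Undecodable sequences become U+FFFD instead of raising, so the
    results can be compared side by side.
    """
    return {name: data.decode(name, errors="replace") for name in codecs}


if __name__ == "__main__":
    # The byte pair reported for the discuss.com.hk page.
    for name, text in decode_with(b"\xa0\x4f").items():
        print(name, repr(text))
```

Where the decoders disagree, or one returns U+FFFD, the page depends on
an extension that only one of the tables covers.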

-- 
Philip Jägenstedt
Core Developer
Opera Software
Received on Tuesday, 10 April 2012 19:49:50 GMT
