[whatwg] Encoding: big5 and big5-hkscs

On Wed, 04 Apr 2012 18:05:14 +0200, Anne van Kesteren <annevk at opera.com>  
wrote:

> On Fri, 30 Mar 2012 14:00:38 +0200, Anne van Kesteren <annevk at opera.com>  
> wrote:
>> Ideally someone does detailed content analysis to figure out what the  
>> best path forward is here, though I'm not entirely sure how.
>
> I still don't know how, but thanks to Simon Pieters I gathered some URLs  
>  from http://dotnetdotcom.org/ and found that 22 pages (of which at  
> least two are big5-hkscs encoded) out of 609 have byte sequences in the  
> ranges that are distinct between big5 and big5-hkscs and in most  
> implementations (in IE they are identical, in Opera big5-hkscs is a  
> superset I believe). The byte sequences found per URL are published  
> here: http://lists.w3.org/Archives/Public/www-archive/2012Apr/0020.html
>
>
> To go from (lead, trail) to an index usable in big5.json you can use a  
> function such as:
>
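> RANGE = 0x7E - 0x40 + 1  # assumed value: 63 trail bytes in 0x40..0x7E
>
> def get_index(lead, trail):
>     row = 0xFE - 0xA1 + RANGE + 1  # cells per lead byte
>     cell = (trail - 0xA1 + RANGE) if trail > (0x7E + 1) else trail - 0x40
>     return (lead - 0x81) * row + cell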
>
> I can do that for the dataset, but I need someone who is able to  
> interpret the results to see which decoding makes more sense.
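
For reference, a minimal sketch of that conversion with a worked example. It
takes RANGE to be 63, the number of trail bytes in 0x40..0x7E; that value is
an assumption here, chosen because it gives 157 cells per lead byte, which
matches the Big5 index layout in the Encoding Standard. The name big5_index
is hypothetical.

```python
def big5_index(lead: int, trail: int) -> int:
    """Map a (lead, trail) byte pair to an index into a 157-column table."""
    if not 0x81 <= lead <= 0xFE:
        raise ValueError(f"lead byte out of range: {lead:#x}")
    if 0x40 <= trail <= 0x7E:
        offset = 0x40  # low trail range: cells 0..62
    elif 0xA1 <= trail <= 0xFE:
        offset = 0x62  # high trail range: cells 63..156
    else:
        raise ValueError(f"trail byte out of range: {trail:#x}")
    return (lead - 0x81) * 157 + (trail - offset)


# Byte pairs taken from the analysis in this message.
for lead, trail in [(0x8B, 0xF8), (0x90, 0x5B), (0xC7, 0x55)]:
    print(f"\\x{lead:02x}\\x{trail:02x} -> {big5_index(lead, trail)}")
```

Unlike the quoted one-liner, this version rejects trail bytes outside the two
valid ranges instead of silently computing an index for them.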

I've gone through the whole list of URLs and analyzed the pages. Using the  
*-hk mappings for data labeled as big5 would fix pretty much all of these  
pages. Not treating big5 and big5-hkscs as aliases is clearly breaking  
pages, so I would recommend a single mapping for both.

Of the existing mappings, opera-hk seems like the overall winner. As a  
starting point for the spec, I suggest taking the intersection of  
opera-hk, firefox-hk and chrome-hk.
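
As a rough illustration of what "taking the intersection" could mean in
practice, here is a sketch that keeps only the entries on which all three
tables agree. The function name, the (lead, trail) keys and the tiny sample
tables are assumptions for illustration; the code points are the ones
reported further down in this message.

```python
def intersect_mappings(*mappings):
    """Keep only the entries on which every mapping agrees."""
    first, *rest = mappings
    return {
        key: cp
        for key, cp in first.items()
        if all(m.get(key) == cp for m in rest)
    }


# Toy excerpts keyed by (lead, trail); real tables would cover the whole index.
opera_hk = {(0x8B, 0xF8): 0xF907, (0x8E, 0x4E): 0x259AC}
firefox_hk = {(0x8B, 0xF8): 0xF907, (0x8E, 0x4E): 0xE31F}
chrome_hk = {(0x8B, 0xF8): 0xF907, (0x8E, 0x4E): 0x259AC}

common = intersect_mappings(opera_hk, firefox_hk, chrome_hk)
for (lead, trail), cp in sorted(common.items()):
    print(f"\\x{lead:02x}\\x{trail:02x} => U+{cp:04X}")
```

Entries where the three disagree, such as \x8e\x4e, drop out of the result
and would need a separate decision.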

The tedious but fun (if you like Chinese) analysis follows. In case the  
encoding is messed up in transit, it's also available at  
<https://gitorious.org/whatwg/big5/blobs/master/big5.txt>.

== The useful sources ==

These are byte sequences that appear to be deliberate and that make some  
sort of sense in context. I've written the context in Chinese, with the  
byte sequences under investigation left escaped in the form \x00\x00.

> leetm.mingpao.com/cfm/Forum3.cfm?CategoryID=2&TopicID=2720&TopicOrder=Desc&TopicPage=64
> <0xA1: [('0x8b', '0xf8'), ('0x90', '0x5b')]
> 0xA3: []
> 0xC6-0xC8: []

?????\x8b\xf8????

?????????\x90\x5b???????????????

\x8b\xf8 =>

opera-hk: U+F907 ?
firefox: U+80E7 ?
chrome: U+F570 ?
firefox-hk: U+F907 ?
opera: U+FFFD ?
chrome-hk: U+F907 ?
internetexplorer: U+F570 ?

\x90\x5b =>

opera-hk: U+8FF9 ?
firefox: U+823B ?
chrome: U+E466 ?
firefox-hk: U+8FF9 ?
opera: U+FFFD ?
chrome-hk: U+8FF9 ?
internetexplorer: U+E466 ?

The *-hk mappings seem correct, since ?? means turtle.

Winners: opera-hk, firefox-hk, chrome-hk


> board.phonehk.com/archiver/?tid-156148.html
> <0xA1: [('0x9d', '0xeb')]
> 0xA3: []
> 0xC6-0xC8: []

?????itune??mp4?,?????.m4a\x9d\xeb,????????

(Cantonese, about "uncompressing" mp4 to m4a...)

This was quoted from the previous comment, where the character in question
was encoded as &#22083; (U+5643). That's 噃 (a modal particle), which seems
to make sense here.

\x9d\xeb =>

opera-hk: U+5643 ?
firefox: U+ECCD ?
chrome: U+ECCD ?
firefox-hk: U+5643 ?
opera: U+FFFD ?
chrome-hk: U+5643 ?
internetexplorer: U+ECCD ?

Winners: opera-hk, firefox-hk, chrome-hk


> www.millionbook.net/gd/h/huishuianyangjiumin/qmt/006.htm
> <0xA1: [('0x8f', '0x73'), ('0x8e', '0x4e'), ('0x8e', '0x4e')]
> 0xA3: []
> 0xC6-0xC8: []

?????????????????????????????\x8f\x73???????????

?????\x8e\x4e???????????????\x8e\x4e???????

This looks like classical Chinese, which I don't understand. However, it's  
interesting to look at alternative mappings:

\x8e\x4e =>

opera-hk: U+259AC ?
firefox: U+86F1 ?
chrome: U+E31F ?
firefox-hk: U+E31F ?
opera: U+FFFD ?
chrome-hk: U+259AC ?
internetexplorer: U+E31F ?

At least on my computer, U+E31F and U+259AC are rendered the same, and  
that rendering matches <http://www.unicode.org/charts/PDF/U20000.pdf>.  
U+E31F is in the PUA, so U+259AC is correct.

\x8f\x73 =>

opera-hk: U+25C91 ?
firefox: U+9F80 ?
chrome: U+E3E1 ?
firefox-hk: U+E3E1 ?
opera: U+FFFD ?
chrome-hk: U+25C91 ?
internetexplorer: U+E3E1 ?

U+25C91 is correct for the same reasons.

Winners: opera-hk, chrome-hk

(Needs verification by someone who can read classical Chinese.)
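
The PUA argument used for this page can be checked mechanically. The sketch
below is only an illustration: the helper names are hypothetical, and it
relies on Python's unicodedata module reporting general category "Co" for
private-use code points.

```python
import unicodedata


def is_private_use(cp: int) -> bool:
    """True if the code point has general category Co (private use)."""
    return unicodedata.category(chr(cp)) == "Co"


def plausible(candidates):
    """Drop U+FFFD and private-use results from a {browser: code point} table."""
    return {
        name: cp
        for name, cp in candidates.items()
        if cp != 0xFFFD and not is_private_use(cp)
    }


# Results for \x8e\x4e as listed above.
results = {
    "opera-hk": 0x259AC, "firefox": 0x86F1, "chrome": 0xE31F,
    "firefox-hk": 0xE31F, "opera": 0xFFFD, "chrome-hk": 0x259AC,
    "internetexplorer": 0xE31F,
}
print(sorted(plausible(results)))
```

This filter only rules out PUA and U+FFFD results; it cannot choose between
the remaining candidates, which still needs someone who can read the text.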


> www.toysdaily.com/discuz/forum-24-2.html
> <0xA1: []
> 0xA3: []
> 0xC6-0xC8: [('0xc7', '0x55')]

This is "[????]???? (One Piece Q??? ~ ??\xc7\x55??)" which  
links to this item:

http://www.toysdaily.com/discuz/thread-180080-1-2.html

\xc7\x55 is the Japanese hiragana の (U+306E), which is occasionally used
instead of 的 or 之; see <http://en.wiktionary.org/wiki/の#Mandarin>.

\xc7\x55 =>

opera-hk: U+306E ?
firefox: U+306E ?
chrome: U+F724 ?
firefox-hk: U+306E ?
opera: U+306E ?
chrome-hk: U+306E ?
internetexplorer: U+F724 ?

U+F724 is in the PUA, so U+306E is correct.

Winners: opera-hk, firefox, firefox-hk, opera, chrome-hk


> forum.mingpao.com/cfm/Forum3.cfm?OwnerID=1&CategoryID=3&TopicID=524&Page=5
> <0xA1: [('0x8e', '0xe0'), ('0x9d', '0xf8'), ('0x9d', '0xf8'), ('0x9d',
> '0xf8')]
> 0xA3: []
> 0xC6-0xC8: []

The source is a post by "???\x8e\xe0":

????????????,?????,??\x9d\xf8????????????,&#25274;??????.?????????,???????????,????????????????????,???\x9d\xf8?????,??????????????&#25274;????.???????????????????,????,??????,?????&#25274;????????\xfa\xef?.??\x9d\xf8???,????????.

(Cantonese, criticizing the western media's anti-Chinese bias.)

\x8e\xe0 =>

opera-hk: U+811A ?
firefox: U+9C82 ?
chrome: U+E38F ?
firefox-hk: U+811A ?
opera: U+FFFD ?
chrome-hk: U+811A ?
internetexplorer: U+E38F ?

\x9d\xf8 =>

opera-hk: U+5572 ?
firefox: U+9C53 ?
chrome: U+ECDA ?
firefox-hk: U+5572 ?
opera: U+FFFD ?
chrome-hk: U+5572 ?
internetexplorer: U+ECDA ?

\xfa\xef =>

opera-hk: U+5413 ?
firefox: U+7E92 ?
chrome: U+E08D ?
firefox-hk: U+5413 ?
opera: U+FFFD ?
chrome-hk: U+5413 ?
internetexplorer: U+E08D ?

The *-hk mappings look very plausible, especially given ????. The rest  
are pretty obviously wrong.

Winners: opera-hk, firefox-hk, chrome-hk

(Needs verification by someone who knows Cantonese.)


> www30.discuss.com.hk/archiver/?tid-9026420.html
> <0xA1: [('0x9d', '0xef')]
> 0xA3: []
> 0xC6-0xC8: []

???\x9d\xef???????, ???????.

\x9d\xef =>

opera-hk: U+5605 ?
firefox: U+9B8B ?
chrome: U+ECD1 ?
firefox-hk: U+5605 ?
opera: U+FFFD ?
chrome-hk: U+5605 ?
internetexplorer: U+ECD1 ?

? seems correct in context.

Winners: opera-hk, firefox-hk, chrome-hk


> www28.discuss.com.hk/viewthread.php?tid=7539844&extra=page%3D1&page=10
> <0xA1: [('0x9d', '0xf7')]
> 0xA3: []
> 0xC6-0xC8: []

??????????????????\x9d\xf7???????????????????

\x9d\xf7 also appeared in another source:

> www.hacken.cc/bbs/thread-318592-6-1.html
> <0xA1: [('0x9d', '0xf7'), ('0x89', '0x59'), ('0x89', '0x72')]
> 0xA3: []
> 0xC6-0xC8: []

This is from a comment in mixed simplified and traditional Chinese. First  
the traditional bit:

??????\x9d\xf7??:

\x9d\xf7 =>

opera-hk: U+5497 ?
firefox: U+9C26 ?
chrome: U+ECD9 ?
firefox-hk: U+5497 ?
opera: U+FFFD ?
chrome-hk: U+5497 ?
internetexplorer: U+ECD9 ?

U+5497 ? seems correct, the rest are obviously bogus.

This is the simplified bit:

???????\xfc\xd3?\x89\x59?????????
1??????????????????
2???\x89\x72?????

\xfc\xd3 =>

opera-hk: U+65E0 ?
firefox: U+75C3 ?
chrome: U+E1AB ?
firefox-hk: U+65E0 ?
opera: U+FFFD ?
chrome-hk: U+65E0 ?
internetexplorer: U+E1AB ?

\x89\x59 =>

opera-hk: U+53D1 ?
firefox: U+829C ?
chrome: U+F3B9 ?
firefox-hk: U+53D1 ?
opera: U+FFFD ?
chrome-hk: U+53D1 ?
internetexplorer: U+F3B9 ?

\x89\x72 =>

opera-hk: U+7ECF ?
firefox: U+8F93 ?
chrome: U+F3D2 ?
firefox-hk: U+7ECF ?
opera: U+FFFD ?
chrome-hk: U+7ECF ?
internetexplorer: U+F3D2 ?

It's complete news to me that Big5-HKSCS can encode some simplified
Chinese characters, but the *-hk mappings are correct.

Winners: opera-hk, firefox-hk, chrome-hk


> www28.discuss.com.hk/viewthread.php?tid=7319244&extra=page%3D1&page=10
> <0xA1: [('0xa0', '0x4f')]
> 0xA3: []
> 0xC6-0xC8: []

????????\xa0\x4f??????????

\xa0\x4f =>

opera-hk: U+24ABB ?
firefox: U+5622 ?
chrome: U+EE2A ?
firefox-hk: U+EE2A ?
opera: U+FFFD ?
chrome-hk: U+24ABB ?
internetexplorer: U+EE2A ?

This is Cantonese, which I don't really know, but from some searching the
firefox mapping looks plausible. However, U+EE2A (PUA) and U+24ABB look
the same in some fonts, so U+24ABB is probably correct.

Winners: ?


> www.fhs.gov.hk/tc_chi/health_info/class_life/child/child.html
> <0xA1: [('0x8f', '0xc0')]
> 0xA3: []
> 0xC6-0xC8: []

<a href="http://www.dh.gov.hk/" target="_blank"><img  
src="../../../images/health_info/health_info_02.jpg" alt="\x8f\xc0??"  
border="0"></a>

\x8f\xc0 =>

opera-hk: U+885E ?
firefox: U+7F33 ?
chrome: U+E40C ?
firefox-hk: U+885E ?
opera: U+FFFD ?
chrome-hk: U+885E ?
internetexplorer: U+E40C ?

Follow the link to http://www.dh.gov.hk/ and there can be no doubt that
衞生署 (starting with U+885E 衞) is correct.

Winners: opera-hk, firefox-hk, chrome-hk


> www.books.com.tw/exep/prod/books/editorial/publisher_booklist.php?pubid=sharppnt&qseries=sharppnt9B05
> <0xA1: []
> 0xA3: []
> 0xC6-0xC8: [('0xc7', '0x5c'), ('0xc7', '0x66'), ('0xc7', '0x5c'),
> ('0xc7', '0x66')]

These are the hiragana in 柴門ふみ, which is simply the name of a Japanese
author: http://en.wikipedia.org/wiki/Fumi_Saimon

\xc7\x5c =>

opera-hk: U+3075 ?
firefox: U+3075 ?
chrome: U+F72B ?
firefox-hk: U+3075 ?
opera: U+3075 ?
chrome-hk: U+3075 ?
internetexplorer: U+F72B ?

\xc7\x66 =>

opera-hk: U+307F ?
firefox: U+307F ?
chrome: U+F735 ?
firefox-hk: U+307F ?
opera: U+307F ?
chrome-hk: U+307F ?
internetexplorer: U+F735 ?

U+F72B and U+F735 are in the PUA, so U+3075 and U+307F are correct.

Winners: opera-hk, firefox, firefox-hk, opera, chrome-hk


== Mixed encodings and other nonsense ==

> hkhk.org/viewthread.php?tid=22286&extra=page%3D1
> <0xA1: []
> 0xA3: []
> 0xC6-0xC8: [('0xc8', '0xa1')]

GBK-encoded comments in <style> and <script>, e.g.:

//??classname?t_msgfontfix ??


> www.epochtimes.com/b5/7/1/12/n1588315.htm
> <0xA1: [('0x8b', '0x20')]
> 0xA3: []
> 0xC6-0xC8: []
>
> epochtimes.com/b5/7/12/23/n1951744.htm
> <0xA1: [('0x8b', '0x20')]
> 0xA3: []
> 0xC6-0xC8: []

In both of these, the bytes are UTF-8 inside a JavaScript comment:

/* DJY left 250x250, ??? 2010/11/18 */


> photo.pchome.com.tw/wen657476/045/
> <0xA1: []
> 0xA3: []
> 0xC6-0xC8: [('0xc6', '0xe4'), ('0xc6', '0xe4'), ('0xc7', '0xae')]

\xc6\xe4 is in a script encoded as GBK:

{Icon:'/s12/w/e/wen657476/book45/p121059706328s.jpg', PK:121059706328,  
Title:'DataSet[ 20079-??.jpg ]', Desc:'DataSet[ 20079-??.jpg ]'}

\xc7\xae is a link to http://photo.pchome.com.tw/wen657476/119307520020  
encoded as GBK:

<a href="/wen657476/119307520020">k005-????.jpg(1)</a>

???? means "roman wallet", which is exactly what is being sold.


> www.eye.hk/bbs/zboard.php?category=2&id=eyeglasses_collestables&page=1&page_num=999&sn=off&ss=on&sc=on&keyword=&select_arrange=headnum&desc=asc
> <0xA1: []
> 0xA3: []
> 0xC6-0xC8: [('0xc7', '0xd7'), ('0xc8', '0xd5'), ('0xc7', '0xd7'),  
> ('0xc8',
> '0xd5'), ('0xc7', '0xd7'), ('0xc8', '0xd5'), ('0xc8', '0xd5'), ('0xc8',
> '0xd5'), ('0xc8', '0xd5'), ('0xc8', '0xd5'), ('0xc8', '0xd5'), ('0xc8',
> '0xd5'), ('0xc8', '0xd5'), ('0xc8', '0xd5'), ('0xc8', '0xd5'), ('0xc8',
> '0xd5'), ('0xc8', '0xd5'), ('0xc8', '0xd5'), ('0xc8', '0xd5'), ('0xc6',
> '0xb0'), ('0xc6', '0xe4')]

This page mixes Big5 and GBK; nothing could save it.


> www.izincan.com/board/novelsys.php?arid=65987
> <0xA1: []
> 0xA3: []
> 0xC6-0xC8: [('0xc8', '0xeb')]

Comments in the JavaScript code at the end of the document are in GBK.


> tvcity.tvb.com/drama/wasabi_mon_amour/story/002.html
> <0xA1: [('0x8f', '0x58'), ('0x92', '0xe5'), ('0x8b', '0x95'), ('0x88',
> '0xe5'), ('0x8d', '0x80'), ('0x83', '0x3c'), ('0x8b', '0xe8'), ('0x88',
> '0x8a'), ('0x8b', '0xe8'), ('0x98', '0xe6'), ('0x92', '0xe5'), ('0x8b',
> '0x95'), ('0x81', '0x9e'), ('0x9f', '0xe6'), ('0x81', '0x93'), ('0x8f',
> '0xb8'), ('0x87', '0xe6'), ('0x96', '0x99'), ('0x8d', '0xe5'), ('0x8b',
> '0x99'), ('0x9d', '0xe6'), ('0x8a', '0xe6'), ('0x88', '0xb2'), ('0x88',
> '0xe7'), ('0x9f', '0xa5'), ('0x89', '0x8d'), ('0x9b', '0xe8'), ('0x81',
> '0x98'), ('0x91', '0x8a'), ('0x91', '0xe5'), ('0x91', '0x3c'), ('0x8f',
> '0xe9')]
> 0xA3: [('0xa3', '0xe5')]
> 0xC6-0xC8: [('0xc7', '0x55')]

The top part of the page is in Big5-HKSCS while the site navigation at the  
bottom is in UTF-8.


> www.china-holiday.com/big5/big5train/skbzhwsy3.asp?zrxx=ccxs&sfcc=???&cx=??
> <0xA1: [('0x97', '0xe4'), ('0x83', '0xa8'), ('0x97', '0xe4'), ('0x83',
> '0xa8')]
> 0xA3: []
> 0xC6-0xC8: []

??? and ?? are encoded as UTF-8 and end up in the <title>...


> www.iis.sinica.edu.tw/page/library/TechReport/tr2002/threebone02.html
> <0xA1: [('0x87', '0xe8'), ('0x93', '0xe5'), ('0xa0', '0xb1'), ('0x8a',
> '0x3c')]
> 0xA3: []
> 0xC6-0xC8: []

This page is mislabeled; it's actually encoded in UTF-8.


> bbs.rc-evo.com/viewthread.php?tid=73138&page=1&authorid=2487
> <0xA1: []
> 0xA3: []
> 0xC6-0xC8: [('0xc6', '0xbc'), ('0xc6', '0xbc'), ('0xc6', '0xbc')]

The page must have changed; I can't find \xc6\xbc in it.


> rumotan.com/guan/modules/tinyd3/
> <0xA1: [('0x90', '0xe8'), ('0x83', '0xe6'), ('0x90', '0xe8'), ('0x8f',
> '0xb4'), ('0x9d', '0xe8'), ('0x8f', '0xb4'), ('0x8f', '0xb4'), ('0x9c',
> '0x8b'), ('0x95', '0xab'), ('0x90', '0xe8'), ('0x95', '0xab'), ('0x8f',
> '0xb4'), ('0x9d', '0xe8'), ('0x8f', '0xb4'), ('0x8f', '0xb4'), ('0x81',
> '0xa3'), ('0x9d', '0xe8'), ('0x8f', '0xb4'), ('0x8f', '0xaf'), ('0x9d',
> '0xe8'), ('0x8f', '0xb4'), ('0x90', '0xe5'), ('0x9c', '0x8b'), ('0x90',
> '0xe8'), ('0x99', '0x2c'), ('0x82', '0xe6'), ('0x8f', '0x90'), ('0x9b',
> '0xe8'), ('0x97', '0x9d'), ('0x93', '0xe5'), ('0x90', '0xe8'), ('0x94',
> '0xb6'), ('0x8f', '0xe8'), ('0x88', '0x87'), ('0x95', '0xe8'), ('0x82',
> '0xe5'), ('0x8f', '0xb4'), ('0x8c', '0x31'), ('0x94', '0x9f'), ('0x97',
> '0xe6'), ('0x90', '0xe4'), ('0x8c', '0xe5'), ('0x9b', '0xe5'), ('0x8c',
> '0xe8'), ('0x99', '0x9f'), ('0x83', '0xe7'), ('0x95', '0xab'), ('0x8b',
> '0xe4'), ('0x82', '0xe6'), ('0x9b', '0xbe'), ('0x9a', '0xe6'), ('0x9c',
> '0x8b'), ('0x8e', '0xe8'), ('0x94', '0xe6'), ('0x9c', '0x83'), ('0x86',
> '0xe4'), ('0x81', '0xe8'), ('0x81', '0xb7'), ('0x82', '0x22'), ('0x92',
> '0xe5'), ('0x82', '0xe8'), ('0x97', '0x9d'), ('0x93', '0xe7'), ('0x90',
> '0xe8'), ('0x99', '0x2c')]
> 0xA3: [('0xa3', '0xe4')]
> 0xC6-0xC8: []

There's a chunk of UTF-8 in <meta>, so I looked no further.
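
Several of the pages in this section turned out to contain UTF-8 under a
Big5 label. A quick way to flag such chunks before looking at Big5 mappings
at all could look like the sketch below. It is a heuristic, not what was
actually used for this analysis, and the function name and sample text are
hypothetical.

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if data has non-ASCII bytes and decodes cleanly as UTF-8."""
    if all(b < 0x80 for b in data):
        return False  # pure ASCII says nothing about the encoding
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical sample text, not taken from any of the pages above.
sample = "中文"
print(looks_like_utf8(sample.encode("utf-8")))
print(looks_like_utf8(sample.encode("big5")))
```

A short Big5 run can occasionally be valid UTF-8 by accident, so a positive
result is a hint to inspect the page by hand rather than proof.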

-- 
Philip Jägenstedt
Core Developer
Opera Software

Received on Friday, 6 April 2012 03:54:53 UTC