- From: Ambrose LI <ambrose.li@gmail.com>
- Date: Mon, 23 Apr 2012 19:12:14 -0400
- To: "Kang-Hao (Kenny) Lu" <kennyluck@csail.mit.edu>
- Cc: W3C HTML5 中文興趣小組 <public-html-ig-zh@w3.org>, Philip Jägenstedt <philipj@opera.com>, Yuan Chao <yuanchao@gmail.com>
2012/4/22 Kang-Hao (Kenny) Lu <kennyluck@csail.mit.edu>:
> (12/04/22 4:26), Ambrose LI wrote:
>> Personally speaking, I’d say that Big5 has always been a mess and it
>> is still a mess, and the only sane way to solve this problem is to
>> expose the underlying variants of Big5 in the encoding selection menu.
>> Even if some sort of statistical AI technique were used there will
>> still be occasions where what the machine chooses will be wrong. Just
>> let the user choose if something doesn’t work.
>
> Yeah, that's a must, but we still need to decide on a default behavior :(

I don’t know… if we must decide on a default behaviour but all
behaviours are wrong, then maybe we should just train a MaxEnt
classifier and use that to determine the correct encoding on a
page-by-page basis =P

[...]

> (12/04/22 1:07), Philip Jägenstedt wrote:
>> On Sat, 21 Apr 2012 16:26:07 +0200, Yuan Chao <yuanchao@gmail.com> wrote:
>>> A visible character is very useful instead of a fullwidth space,
>>> which just hides things away.
>>
>> How U+FFFD is rendered appears to be a font issue, I presume you don't
>> mean that random incorrect characters are preferable.
>
> Unlike Yuan, my opinion is
>
>   successfully decoded characters >> U+FFFD rendered as fullwidth space > U+FFFD rendered as some mysterious glyph > mojibake
>
> Even if "how U+FFFD is rendered" is a font issue, if I want to make an
> assessment about whether big5-uao or big5-hkscs is preferable for .tw
> content based on weighing the statistics (i.e. give minus scores to the
> 190 sites that would yield U+FFFD with HKSCS), U+FFFD rendered as
> fullwidth space might actually make me think big5-hkscs is better than
> unaltered big5-uao for .tw content.

I believe “how U+FFFD is rendered is a font issue” is a real problem,
but not an HTML-specific problem.
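
(To make the "give minus scores to the sites that would yield U+FFFD"
idea quoted above a bit more concrete, here is a minimal sketch in
Python. It is only an illustration under stated assumptions: it uses
Python's built-in big5, cp950 and big5hkscs codecs as stand-ins for the
variants discussed, the Python standard library has no big5-uao codec so
UAO is not covered, and the function names are hypothetical.)

```python
# Candidate decoders. These are stdlib codec names used as stand-ins;
# big5-uao is NOT available in the Python standard library.
CANDIDATES = ("big5", "cp950", "big5hkscs")


def count_replacements(data: bytes, encoding: str) -> int:
    """Decode with errors='replace' and count the U+FFFD characters produced."""
    return data.decode(encoding, errors="replace").count("\ufffd")


def rank_encodings(data: bytes, candidates=CANDIDATES):
    """Return (encoding, U+FFFD count) pairs, fewest replacements first.

    sorted() is stable, so ties keep the order given in `candidates`.
    """
    scores = [(enc, count_replacements(data, enc)) for enc in candidates]
    return sorted(scores, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Hypothetical input: in practice this would be the raw bytes of a page.
    page_bytes = "中文".encode("big5")
    for enc, bad in rank_encodings(page_bytes):
        print(enc, bad)
```

(A count like this only penalises bytes that fail to decode. It says
nothing about bytes that decode "successfully" to the wrong character,
which is the mojibake case ranked last above.)
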
I have seen people who intentionally chose the “invalid character”
symbol (often when the invalid character symbol is some sort of a black
square or otherwise resembles a bullet or other usable dingbat) because
they thought that’s the glyph they wanted. But then there really isn’t
much anyone can do about it.

> (12/04/21 19:55), Philip Jägenstedt wrote:
>> Perhaps there does exist a non-HKSCS mapping that would work better
>> than Firefox's Big5 for Taiwan sites, but I'm really not sure how to
>> define it. One would probably need to figure out which OS and fonts
>> were used to produce the content and base the mapping on that.
>
> Yeah, I have no idea where that big5-2003 content is from. Is there a
> Linux distribution that interprets big5 as what the big5-2003 standard
> says? Or is there a UAO version that's closer to big5-2003?

I really wonder: Does anyone actually use that 中文系統 (“Chinese
system”) software in Hong Kong or Taiwan these days? I find it really
hard to believe that anyone would still use it, but it still seems to
be on sale, so presumably there must still be a demand… If people *are*
in fact still using these things then we’d have a ready explanation
(however unlikely it might be) for the Big5-2003 and other
oddly-encoded pages that we are seeing…

--
cheers,
-ambrose <http://gniw.ca>
Received on Monday, 23 April 2012 23:12:43 UTC