
Re: Taiwan and Hong Kong: Big5 HKSCS vs UAO analysis and conclusions

From: Ambrose LI <ambrose.li@gmail.com>
Date: Mon, 23 Apr 2012 19:12:14 -0400
Message-ID: <CADJvFOXq0C3GB8mmwTCWrzvu=HJQvPsaga==8q=EowvFZ_fZhg@mail.gmail.com>
To: "Kang-Hao (Kenny) Lu" <kennyluck@csail.mit.edu>
Cc: W3C HTML5 Chinese Interest Group <public-html-ig-zh@w3.org>, Philip Jägenstedt <philipj@opera.com>, Yuan Chao <yuanchao@gmail.com>

2012/4/22 Kang-Hao (Kenny) Lu <kennyluck@csail.mit.edu>:
> (12/04/22 4:26), Ambrose LI wrote:
>> Personally speaking, I’d say that Big5 has always been a mess and it
>> is still a mess, and the only sane way to solve this problem is to
>> expose the underlying variants of Big5 in the encoding selection menu.
>> Even if some sort of statistical AI technique were used, there would
>> still be occasions where what the machine chooses is wrong. Just
>> let the user choose if something doesn’t work.
>
> Yeah, that's a must, but we still need to decide on a default behavior :(

I don’t know… if we must decide on a default behaviour but all
behaviours are wrong, then maybe we should just train a MaxEnt
classifier and use that to determine the correct encoding on a
page-by-page basis =P
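
For what it’s worth, here is a minimal sketch of what such a classifier
could look like, written as plain multinomial logistic regression (which
is what MaxEnt amounts to). Everything specific in it is an assumption
made up for illustration: the feature choice (counts of double-byte
sequences), the labels "variant-a" and "variant-b", and the two toy
training strings. A real one would need a large set of hand-labelled
pages.

```python
import math
from collections import Counter, defaultdict

def features(page: bytes) -> Counter:
    """Count double-byte sequences whose first byte is in 0x81-0xFE."""
    counts = Counter()
    i = 0
    while i < len(page) - 1:
        if 0x81 <= page[i] <= 0xFE:
            counts[page[i:i + 2]] += 1
            i += 2          # skip the trail byte
        else:
            i += 1          # single-byte (ASCII) character
    return counts

def probabilities(weights, feats):
    """Softmax over per-label linear scores (the MaxEnt model)."""
    scores = {label: sum(w.get(f, 0.0) * n for f, n in feats.items())
              for label, w in weights.items()}
    top = max(scores.values())
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

def train(pages, epochs=50, rate=0.5):
    """Stochastic gradient ascent on the conditional log-likelihood."""
    weights = {label: defaultdict(float) for _, label in pages}
    data = [(features(raw), label) for raw, label in pages]
    for _ in range(epochs):
        for feats, gold in data:
            probs = probabilities(weights, feats)
            for label, w in weights.items():
                error = (1.0 if label == gold else 0.0) - probs[label]
                for f, n in feats.items():
                    w[f] += rate * error * n
    return weights

def classify(weights, page: bytes) -> str:
    """Return the label with the highest model probability."""
    probs = probabilities(weights, features(page))
    return max(probs, key=probs.get)

# Toy, invented training set: NOT real statistics about any Big5 variant.
TOY_PAGES = [
    (b"\xa4\xa4\x87\x40", "variant-a"),
    (b"\xa4\xa4\x87\x41", "variant-b"),
]
weights = train(TOY_PAGES)
print(classify(weights, b"\x87\x40\x87\x40"))
```

Even with good training data this would still guess wrong some of the
time, so it could only ever supply the default, with the menu as the
fallback.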

[...]
> (12/04/22 1:07), Philip Jägenstedt wrote:
>> On Sat, 21 Apr 2012 16:26:07 +0200, Yuan Chao <yuanchao@gmail.com> wrote:
>>> A visible character is very useful instead of a fullwidth space,
>>> which just hides things away.
>>
>> How U+FFFD is rendered appears to be a font issue; I presume you don't
>> mean that random incorrect characters are preferable.
>
> Unlike Yuan, my opinion is
>
> successfully decoded characters >> U+FFFD rendered as fullwidth space >
> U+FFFD rendered as some mysterious glyph > mojibake
>
> Even if "how U+FFFD is rendered" is a font issue, if I want to make an
> assessment about whether big5-uao or big5-hkscs is preferable for .tw
> content based on weighing the statistics (i.e. give minus scores to the
> 190 sites that would yield U+FFFD with HKSCS), U+FFFD rendered as
> fullwidth space might actually make me think big5-hkscs is better than
> unaltered big5-uao for .tw content.

I believe “how U+FFFD is rendered is a font issue” is a real problem,
but not an HTML-specific problem. I have seen people who intentionally
chose the “invalid character” symbol (often when the invalid character
symbol is some sort of a black square or otherwise resembles a bullet
or other usable dingbat) because they thought that’s the glyph they
wanted. But then there really isn’t much anyone can do about
it.
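
The kind of weighing Kenny describes (penalizing a decoder for every
U+FFFD it produces) can be sketched roughly as follows. This is only an
illustration under some assumptions: it uses the Big5-family codecs that
Python happens to ship ("big5", "big5hkscs", "cp950"), Python’s standard
library has no big5-uao codec so UAO is not covered, and the function
names are mine.

```python
REPLACEMENT = "\ufffd"
CANDIDATES = ("big5", "big5hkscs", "cp950")

def replacement_counts(raw: bytes, codecs=CANDIDATES) -> dict:
    """Decode raw with each codec and count the U+FFFD characters produced."""
    return {name: raw.decode(name, errors="replace").count(REPLACEMENT)
            for name in codecs}

def least_damaged(raw: bytes, codecs=CANDIDATES) -> str:
    """Pick the codec that yields the fewest U+FFFD; ties go to the earliest."""
    counts = replacement_counts(raw, codecs)
    return min(codecs, key=counts.__getitem__)

# U+5622 (嘢) is an HKSCS character, so plain Big5 cannot decode its bytes.
page = "好嘢".encode("big5hkscs")
print(replacement_counts(page))
print(least_damaged(page))
```

Note that a count of zero only says the bytes were decodable, not that
the resulting characters are the ones the author meant; mojibake scores
just as well as correct text here.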


> (12/04/21 19:55), Philip Jägenstedt wrote:
>> Perhaps there does exist a non-HKSCS mapping that would work better
>> than Firefox's Big5 for Taiwan sites, but I'm really not sure how to
>> define it. One would probably need to figure out which OS and fonts
>> were used to produce the content and base the mapping on that.
>
> Yeah, I have no idea where that big5-2003 content is from. Is there a
> Linux distribution that interprets big5 as what the big5-2003 standard
> says? Or is there a UAO version that's closer to big5-2003?

I really wonder: Does anyone actually use those 中文系統 (“Chinese
system”) software packages in Hong Kong or Taiwan these days? I find it
really hard to believe that anyone would still use them, but they still
seem to be on sale, so presumably there must still be a demand…

If people *are* in fact still using these things then we’d have a
ready explanation (however unlikely it might be) for the Big5-2003 and
other oddly-encoded pages that we are seeing…


-- 
cheers,
-ambrose <http://gniw.ca>
Received on Monday, 23 April 2012 23:12:43 UTC
