Re: [whatwg/encoding] Remove the last 14 characters PUA of GB18030-2005 (#27) from jungshik on 2016-09-10 (public-webapps-github@w3.org from September 2016)

From: jungshik <notifications@github.com>
Date: Sat, 10 Sep 2016 03:59:40 -0700
To: whatwg/encoding <encoding@noreply.github.com>
Message-ID: <whatwg/encoding/issues/27/246105676@github.com>

I raised this issue between last year and early this year in #22 (and https://www.w3.org/Bugs/Public/show_bug.cgi?id=28740#c1 ).

As I wrote there, the current mapping makes it impossible to display the characters involved [1] on some platforms (Android and Windows 10 [2]) when they're encoded in GB 18030, because there is NO font covering the corresponding PUA code points. To me, this is one of the most serious consequences of the current mapping (besides the other consequences mentioned earlier).

OTOH, if there are multiple fonts covering those PUA points with different interpretations, there's no easy way to pick *the right* one (if the only information at hand is code points) because the identity of a PUA code point is up to private parties and is indeterminate by definition.

( Needless to say, there'd be no such problem if UTF-8 were used with regular code points, and we want everybody to use UTF-8 on the web anyway. )

Given all this, removing any mapping to PUA code points (as long as regular Unicode characters exist for them) is desirable.

As mentioned in #22, I initially thought that GB18030:2005 had fixed all of these (by 2005, all the characters originally mapped to PUA code points had been encoded in Unicode) in a way similar to what was done for HKSCS. It turned out that this was not the case, which was rather disappointing. As a result (and because the 24 affected characters are rarely used, especially the U+FE1x ones), the change for #22 was minimal (only one code point was fixed per GB18030:2005).

Given that GB18030 will be revised soon (per @kenlunde) to eliminate the canonical mapping to PUA code points, Chromium is more than willing to go ahead with mapping the 24 byte-sequences in GB18030 to regular Unicode characters.

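[1] In addition to the 14 CJK ideographs/radicals listed earlier, there are vertical form variants that are still mapped to PUA code points (though U+FE1x will be virtually unused in gb18030-encoded documents). Each row below gives the GB 18030 byte sequence, the PUA code point it currently maps to, and the corresponding regular Unicode code point.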
\xA6\xD9 U+E78D U+FE10
\xA6\xDA U+E78E U+FE12
\xA6\xDB U+E78F U+FE11
\xA6\xDC U+E790 U+FE13
\xA6\xDD U+E791 U+FE14
\xA6\xDE U+E792 U+FE15
\xA6\xDF U+E793 U+FE16
\xA6\xEC U+E794 U+FE17
\xA6\xED U+E795 U+FE18
\xA6\xF3 U+E796 U+FE19
\xFE\x51 U+E816 U+20087
\xFE\x52 U+E817 U+20089
\xFE\x53 U+E818 U+200CC
\xFE\x59 U+E81E U+9FB4
\xFE\x61 U+E826 U+9FB5
\xFE\x66 U+E82B U+9FB6
\xFE\x67 U+E82C U+9FB7
\xFE\x6C U+E831 U+215D7
\xFE\x6D U+E832 U+9FB8
\xFE\x76 U+E83B U+2298F
\xFE\x7E U+E843 U+9FB9
\xFE\x90 U+E854 U+9FBA
\xFE\x91 U+E855 U+241FE
\xFE\xA0 U+E864 U+9FBB
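
To illustrate what the remapping amounts to, here is a minimal Python sketch that post-processes already-decoded text, replacing each PUA code point from the table above with its regular counterpart. The names PUA_TO_REGULAR and remap_pua are hypothetical and only for illustration. This is not how Chromium or the Encoding Standard implements the change (that would happen in the decoder's index itself), and it assumes that every occurrence of these PUA code points in the text came from GB 18030 decoding.

```python
# PUA code point -> regular Unicode code point, copied from the table in [1].
# Hypothetical helper table for illustration; not part of any library.
PUA_TO_REGULAR = {
    0xE78D: 0xFE10, 0xE78E: 0xFE12, 0xE78F: 0xFE11, 0xE790: 0xFE13,
    0xE791: 0xFE14, 0xE792: 0xFE15, 0xE793: 0xFE16, 0xE794: 0xFE17,
    0xE795: 0xFE18, 0xE796: 0xFE19,
    0xE816: 0x20087, 0xE817: 0x20089, 0xE818: 0x200CC, 0xE81E: 0x9FB4,
    0xE826: 0x9FB5, 0xE82B: 0x9FB6, 0xE82C: 0x9FB7, 0xE831: 0x215D7,
    0xE832: 0x9FB8, 0xE83B: 0x2298F, 0xE843: 0x9FB9, 0xE854: 0x9FBA,
    0xE855: 0x241FE, 0xE864: 0x9FBB,
}


def remap_pua(text: str) -> str:
    """Replace the 24 PUA code points with their regular equivalents.

    Assumes the PUA code points in `text` originate from GB 18030 decoding;
    PUA characters with any other private meaning would be altered too.
    """
    # str.translate accepts a dict mapping code points to code points.
    return text.translate(PUA_TO_REGULAR)
```

For example, remap_pua("\ue78d") returns "\ufe10", while text without any of these PUA code points is returned unchanged.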

[2] Android (at least Google's Nexus devices) does not have any font covering the PUA code points listed in [1].
Out of the box (except perhaps when the UI language is Simplified Chinese), Windows 10 does not have Simsun, which covers the PUA code points, while it does have a newer Chinese font, Microsoft YaHei, which covers the corresponding regular code points. One can manually add Simsun, though.
At the moment, Chrome OS does have a font covering them (MSung GB18030), but it may not in the future.

--
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/encoding/issues/27#issuecomment-246105676

Received on Saturday, 10 September 2016 11:00:09 UTC