Re: [whatwg/encoding] If gb18030 is revised, consider aligning the Encoding Standard (#27) from jungshik on 2017-10-13 (public-webapps-github@w3.org from October 2017)

From: jungshik <notifications@github.com>
Date: Fri, 13 Oct 2017 19:55:55 +0000 (UTC)
To: whatwg/encoding <encoding@noreply.github.com>
Cc: Subscribed <subscribed@noreply.github.com>
Message-ID: <whatwg/encoding/issues/27/336552061@github.com>

> It seems harmful, and against the goal of avoiding the PUA, to change byte sequences that previously decoded to non-PUA code points to decode to PUA code points. This means that data out there that previously decoded to (assigned in Unicode) non-PUA code points would start mapping to the PUA.

In principle, I agree with you. The practical question is which mapping is more widely used to represent a character whose glyph looks like that of U+9FB4: "0xA6D9 -> U+E78D" or "0x82359037 -> U+9FB4"?

My guess is that, in GB 18030 documents, 0xA6D9 has been used far more often than 0x82359037 to represent a character that looks like U+9FB4 (encoded at U+E78D in some fonts).

If 4-byte sequences for the 24 characters in question are extremely rare (virtually non-existent) while 2-byte sequences are relatively common (though still pretty rare), the harm of repeating the 2005 change for the 24 characters is relatively contained. It also has the benefit of making it possible to display legacy GB18030-encoded documents on Android and elsewhere where there's no font coverage for the U+Exxx PUA code points.
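
For anyone who wants to check the 4-byte mapping mentioned above by hand, here is a rough Python sketch of the linear arithmetic. It assumes the pointer formula used by the Encoding Standard's gb18030 decoder, and that the linear 4-byte range of GB 18030-2000 starts at 0x82358F33 -> U+9FA6. The function and constant names are made up for illustration and are not taken from any library.

```python
# Sketch only; names are hypothetical and the base values are assumptions.
FOUR_BYTE_BASE_POINTER = 19043      # pointer of 0x82358F33 (assumed range start)
FOUR_BYTE_BASE_CODE_POINT = 0x9FA6  # code point assumed at that pointer


def four_byte_pointer(seq: bytes) -> int:
    """Return the linear pointer of a GB 18030 four-byte sequence."""
    if len(seq) != 4:
        raise ValueError("expected exactly four bytes")
    b1, b2, b3, b4 = seq
    if not (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
        raise ValueError("not a well-formed four-byte sequence")
    # Bytes 2 and 4 each take 10 values, byte 3 takes 126 values.
    return ((b1 - 0x81) * (10 * 126 * 10)
            + (b2 - 0x30) * (10 * 126)
            + (b3 - 0x81) * 10
            + (b4 - 0x30))


def linear_code_point(seq: bytes) -> int:
    """Map a sequence to a code point by offset from the assumed range start.

    Only meaningful inside the linear range that begins at 0x82358F33.
    """
    pointer = four_byte_pointer(seq)
    if pointer < FOUR_BYTE_BASE_POINTER:
        raise ValueError("sequence lies before the assumed linear range")
    return FOUR_BYTE_BASE_CODE_POINT + (pointer - FOUR_BYTE_BASE_POINTER)


print(f"U+{linear_code_point(bytes.fromhex('82359037')):04X}")  # U+9FB4
```

This only reproduces the arithmetic behind the 4-byte form. The 2-byte mapping 0xA6D9 -> U+E78D is a table lookup and is not covered here.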




-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/encoding/issues/27#issuecomment-336552061

Received on Friday, 13 October 2017 19:56:31 UTC