- From: jungshik <notifications@github.com>
- Date: Fri, 13 Oct 2017 19:55:55 +0000 (UTC)
- To: whatwg/encoding <encoding@noreply.github.com>
- Cc: Subscribed <subscribed@noreply.github.com>
> It seems harmful, and against the goal of avoiding the PUA, to change byte sequences that previously decoded to non-PUA code points to decode to PUA code points. This means that data out there that previously decoded to (assigned in Unicode) non-PUA code points would start mapping to the PUA.

In principle, I agree with you. A practical question is which way is more widely used to represent a character whose glyph looks like that of U+9FB4: "0xA6D9 -> U+E78D" or "0x82359037 -> U+9FB4". My guess is that 0xA6D9 has been used far more often than 0x82359037 in GB18030 documents to represent that character (encoded at U+E78D in some fonts).

If the 4-byte sequences for the 24 characters in question are extremely rare (virtually non-existent) while the 2-byte sequences are relatively common (still pretty rare), the harm of repeating the 2005 change for those 24 characters is relatively contained. It also has the benefit of making legacy GB18030-encoded documents displayable on Android and elsewhere where there is no font coverage for the U+Exxx PUA code points.
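A minimal Python sketch, assuming a locally available gb18030 codec (e.g. Python's), for checking how the two byte sequences above decode; which code point each produces depends on whether that codec follows the GB18030-2000 or the GB18030-2005 mapping for these characters:

    # Decode the two candidate byte sequences with the local gb18030 codec.
    # The resulting code points depend on which edition of the mapping
    # table (2000 vs. 2005) the codec implements.
    two_byte = b"\xa6\xd9"             # legacy 2-byte sequence
    four_byte = b"\x82\x35\x90\x37"    # GB18030 4-byte sequence

    for label, raw in (("0xA6D9", two_byte), ("0x82359037", four_byte)):
        print(label, "->", "U+%04X" % ord(raw.decode("gb18030")))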
--
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/encoding/issues/27#issuecomment-336552061

Received on Friday, 13 October 2017 19:56:31 UTC