I18N-ISSUE-377 (BUG21556): How to handle cases when no character comes between shift sequences in ISO-2022 and HZ-GB-2312 encodings [encoding]

http://www.w3.org/International/track/issues/377

Raised by: Addison Phillips
On product: encoding

https://www.w3.org/Bugs/Public/show_bug.cgi?id=21556

This issue tracks the bug listed above and was created as part of the WG LC process. The bug was created prior to the WG LC.

---

Section 3.6.2 of Unicode Technical Report 36 says that conversion must use replacements or cause an error, even for "unrecognized or 'empty' state-change sequences". But this does not happen in the current encoding algorithms.

For example, in the "hz-gb-2312" algorithm:

0x7E 0x7B 0x7E 0x7D 0x20 results in U+0020, rather than a decoder error followed by U+0020 (since I presume that the empty shift sequence is illegal).

Similarly, 0x7E 0x7D 0x7E 0x7B causes no decoder error, even though it contains an empty shift sequence.
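
To make the "empty shift sequence" condition concrete, here is a minimal Python sketch that reports where one HZ shift sequence directly follows another with no character in between. It is not the algorithm from the Encoding specification; the function name find_empty_hz_shifts is hypothetical, and the "~" line-continuation form is deliberately treated as ordinary character data to keep the sketch short.

```python
SHIFT_TO_GB = b"~{"      # 0x7E 0x7B
SHIFT_TO_ASCII = b"~}"   # 0x7E 0x7D

def find_empty_hz_shifts(data: bytes) -> list[int]:
    """Return offsets of shift sequences that directly follow another shift.

    Hypothetical helper for illustration only; it does not decode anything.
    """
    offsets = []
    gb_mode = False       # True after "~{", False after "~}"
    after_shift = False   # True when the previous token was a shift sequence
    i = 0
    while i < len(data):
        pair = data[i:i + 2]
        if pair in (SHIFT_TO_GB, SHIFT_TO_ASCII):
            if after_shift:
                offsets.append(i)   # no character since the last shift
            gb_mode = pair == SHIFT_TO_GB
            after_shift = True
            i += 2
        else:
            # Anything else counts as character data: two bytes per
            # character in GB mode (or for the "~~" escape), else one.
            i += 2 if (gb_mode or pair == b"~~") else 1
            after_shift = False
    return offsets
```

Under these assumptions, both byte sequences above are reported at offset 2, where the second shift sequence starts, while a stream such as "~{<:~}" with a character between the shifts is not reported.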

In the "iso-2022-jp" algorithm:

0x1B 0x24 0x40 0x1B 0x28 0x42 0x20 (and other sequences like it) results in U+0020, rather than a decoder error followed by U+0020 (since I presume that the empty shift sequence is illegal).
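
The same check can be sketched for ISO-2022-JP escape sequences. The set of escapes below is an assumed, non-exhaustive selection (ESC ( B, ESC ( J, ESC $ @, ESC $ B), and find_empty_jp_escapes is a hypothetical name; the sketch only locates back-to-back escapes and does not decode or validate the stream.

```python
# Escape sequences recognised by this sketch (assumed, not exhaustive).
ESCAPES = (b"\x1b(B", b"\x1b(J", b"\x1b$@", b"\x1b$B")

def find_empty_jp_escapes(data: bytes) -> list[int]:
    """Return offsets of escape sequences that directly follow another escape."""
    offsets = []
    prev_was_escape = False
    i = 0
    while i < len(data):
        # Find a known escape sequence starting at offset i, if any.
        esc = next((e for e in ESCAPES if data.startswith(e, i)), None)
        if esc is not None:
            if prev_was_escape:
                offsets.append(i)   # no character since the last escape
            prev_was_escape = True
            i += len(esc)
        else:
            # Any other byte is treated as (part of) a character.
            prev_was_escape = False
            i += 1
    return offsets
```

For the example above, the sketch reports offset 3, where ESC ( B follows ESC $ @ with nothing in between. An escape sequence outside the assumed set is treated as character data here, which is exactly the kind of unrecognized-sequence handling that would need a separate decision.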

In the "iso-2022-kr" algorithm:

The byte sequence 0x0E 0x0E 0x0E ... results in no characters, rather than one or more decoder errors (at least for reaching the end of the stream with no characters).

The byte sequence 0x0F 0x0F 0x0F ... results in no characters, rather than one or more decoder errors (at least for reaching the end of the stream with no characters).

0x0E 0x0F 0x20 results in U+0020, rather than a decoder error followed by U+0020.
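
For ISO-2022-KR the shifts are the single bytes SO (0x0E) and SI (0x0F), so the check reduces to looking for adjacent shift bytes. This is again only a sketch: find_empty_kr_shifts is a hypothetical name, the header escape sequence is not given special treatment, and a shift left dangling at the end of the stream is not reported.

```python
SO, SI = 0x0E, 0x0F   # shift out / shift in

def find_empty_kr_shifts(data: bytes) -> list[int]:
    """Return offsets of shift bytes that directly follow another shift byte."""
    offsets = []
    prev_was_shift = False
    for i, byte in enumerate(data):
        is_shift = byte in (SO, SI)
        if is_shift and prev_was_shift:
            offsets.append(i)   # no character since the last shift
        prev_was_shift = is_shift
    return offsets
```

Under these assumptions, 0x0E 0x0E 0x0E is reported at offsets 1 and 2, and 0x0E 0x0F 0x20 at offset 1. Whether each such offset should become one decoder error, or whether a run should collapse into a single error, is the open question.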

All the cases above involve empty shift sequences that are not currently treated as decoder errors.

Should the encoding algorithms be changed to emit a decoder error if there are no characters in between shift sequences in "iso-2022-jp" and "iso-2022-kr"?  Or are the algorithms like this for compatibility? Another issue is how to deal with unrecognized ISO 2022 escape sequences; I feel that the current encoding algorithms don't deal with that well enough.

Received on Thursday, 10 July 2014 04:14:13 UTC