Re: [whatwg/encoding] Concatenating two ISO-2022-JP outputs from a conforming encoder doesn't result in conforming input (#115)

To address Mark Davis' [request for formal feedback](https://www.unicode.org/mail-arch/unicode-ml/y2018-m12/0006.html) I started writing something. While doing so, I came up with this:

---

Generate U+FFFD if:
    • A state transition was made such that the previous state had no content and the previous state was not the ASCII state. (I.e. stop generating U+FFFD if the zero-length state is the ASCII state.)
    • A state transition to the ASCII state was preceded by the Roman state and the next byte was not 0x5C, 0x7E or the end of the stream.
    • A state transition to the Roman state was made and the next byte was not 0x5C, 0x7E, 0x1B or the end of the stream. (0x1B is on this list to avoid a case where both this rule and the first rule would apply at the same time resulting in two U+FFFDs.)

---

This would actually provide the security properties that UTR 36 tries to uphold but fails to, _and_ it would avoid the unwanted U+FFFD generation reported in the email cases.
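
As a rough illustration, the three rules above could be sketched as a scanner that reports which escape sequences would trigger U+FFFD. This is not a decoder and not spec text. The function name `fffd_rule_matches`, the state names, and the rule labels are hypothetical; the sketch assumes the stream starts in the ASCII state, treats both JIS X 0208 escapes as the same state, and counts every non-escape byte as content. It reports each matching rule separately and does not decide how many U+FFFDs result when more than one rule matches at the same escape.

```python
# Escape sequences of ISO-2022-JP, mapped to hypothetical state names.
ESCAPES = {
    b"\x1b(B": "ascii",
    b"\x1b(J": "roman",
    b"\x1b(I": "katakana",
    b"\x1b$@": "jis0208",
    b"\x1b$B": "jis0208",
}


def fffd_rule_matches(data: bytes) -> list:
    """Return (offset, rule) pairs for escapes that match one of the rules."""
    hits = []
    state = "ascii"       # assumption: the stream starts in the ASCII state
    has_content = False   # has the current state seen any non-escape byte?
    i = 0
    while i < len(data):
        new_state = ESCAPES.get(data[i:i + 3])
        if new_state is None:
            # Not a recognized escape: treat the byte as content.
            has_content = True
            i += 1
            continue
        # Byte right after the escape; None stands for end of stream.
        nxt = data[i + 3] if i + 3 < len(data) else None
        # Rule 1: the previous state was empty and was not ASCII.
        if not has_content and state != "ascii":
            hits.append((i, "empty-state"))
        # Rule 2: Roman to ASCII without 0x5C, 0x7E or end of stream next.
        if new_state == "ascii" and state == "roman" and nxt not in (0x5C, 0x7E, None):
            hits.append((i, "roman-to-ascii"))
        # Rule 3: to Roman without 0x5C, 0x7E, 0x1B or end of stream next.
        if new_state == "roman" and nxt not in (0x5C, 0x7E, 0x1B, None):
            hits.append((i, "to-roman"))
        state = new_state
        has_content = False
        i += 3
    return hits
```

Under these assumptions, two concatenated encoder outputs such as `b"\x1b$B\x24\x22\x1b(B" * 2` match no rule, because the only zero-length state between them is the ASCII state.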

The key question is whether it is feasible, given the behavior of encoders out there, to require that transitions from ASCII to Roman and vice versa happen only when logically necessary, and then only at the last possible moment.

Does anyone want to volunteer to research this by checking the behavior of existing encoders or by searching archives of old Japanese email for the relevant byte patterns?
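
One possible starting point for the archive search, offered only as a sketch: tally which byte follows each switch to ASCII (`ESC ( B`) or Roman (`ESC ( J`) in a raw message body. The function name `tally_next_bytes` is hypothetical, the input is assumed to be undecoded ISO-2022-JP bytes, and the tally does not track which state preceded the escape, so it can only hint at how common "unnecessary" transitions are.

```python
import re
from collections import Counter

# Match ESC ( B or ESC ( J and peek at the following byte without consuming
# it, so that back-to-back escape sequences are each counted.
SWITCH = re.compile(rb"\x1b\(([BJ])(?=(.?))", re.DOTALL)


def tally_next_bytes(data: bytes) -> Counter:
    """Count (target_state, next_byte) pairs; next_byte None = end of data."""
    counts = Counter()
    for m in SWITCH.finditer(data):
        target = "ascii" if m.group(1) == b"B" else "roman"
        nxt = m.group(2)
        counts[(target, nxt[0] if nxt else None)] += 1
    return counts
```

Entries such as `("roman", 0x61)` in the result would indicate a switch to Roman that was not immediately followed by 0x5C or 0x7E.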

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/encoding/issues/115#issuecomment-446188066

Received on Tuesday, 11 December 2018 12:38:50 UTC