Re: [whatwg/encoding] Consider defining Japanese encoding sniffing (#157)

Firefox 69 implements the following algorithm. I'm providing this in the interest of documenting interop-relevant behavior. Since Chromium has adopted broader sniffing and Edge has switched from the "no sniffing" camp to Chromium's "sniff a lot" camp, I expect to make Gecko sniff for more encodings in the near future. That's why I'm not suggesting standardizing this iteration:

Let fallback be the encoding that would be used absent content-based sniffing.

If fallback is not one of ISO-2022-JP, Shift_JIS or EUC-JP, return fallback.

Set booleans "ISO-2022-JP disqualified" and "escape seen" to false.

Set byte "second byte in escape" to zero.

Initialize a decoder for Shift_JIS and another for EUC-JP.

For each byte in the stream:

1. If "ISO-2022-JP disqualified" is false:

   1. If byte is larger than 0x7F, set "ISO-2022-JP disqualified" to true and break out of these substeps to outer step 2.

   2. If "escape seen" is false and byte is 0x1B, set "escape seen" to true and continue.

   3. If "escape seen" is true and "second byte in escape" is zero and byte is either 0x28 or 0x24, set "second byte in escape" to byte and continue.

   4. If the pair ("second byte in escape", byte) is any of (0x28, 0x42), (0x28, 0x4A), (0x28, 0x49), (0x24, 0x40) or (0x24, 0x42), return ISO-2022-JP.

   5. If "escape seen" is true, set "ISO-2022-JP disqualified" to true and continue.

   6. Continue.

2. Pass byte to the handler of the EUC-JP decoder.

3. If the handler of the EUC-JP decoder returned error or a code point in the range U+FF61...U+FF9F, inclusive, return Shift_JIS.

4. Pass byte to the handler of the Shift_JIS decoder.

5. If the handler of the Shift_JIS decoder returned error or a code point in the range U+FF61...U+FF9F, inclusive, return EUC-JP.

6. If byte is end-of-stream, return fallback.
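
For illustration only, the steps above could be sketched in Python roughly as follows. This is not Gecko's code. The names `sniff_japanese` and `_rejects` are hypothetical, the input is taken as one complete buffer instead of a stream (so running out of bytes stands in for end-of-stream), and Python's built-in `euc_jp` and `shift_jis` codecs are assumed as stand-ins for the Encoding Standard decoders, which they do not match exactly.

```python
import codecs

# U+FF61..U+FF9F inclusive: half-width katakana.
HALF_WIDTH_KATAKANA = range(0xFF61, 0xFFA0)
# (second byte in escape, final byte) pairs that identify ISO-2022-JP.
ISO_2022_JP_ESCAPES = {
    (0x28, 0x42), (0x28, 0x4A), (0x28, 0x49), (0x24, 0x40), (0x24, 0x42),
}
JAPANESE_FALLBACKS = ("iso-2022-jp", "shift_jis", "euc-jp")


def _rejects(decoder, byte):
    """Feed one byte; True on a decode error or half-width katakana output."""
    try:
        text = decoder.decode(bytes([byte]))
    except UnicodeDecodeError:
        return True
    return any(ord(ch) in HALF_WIDTH_KATAKANA for ch in text)


def sniff_japanese(data: bytes, fallback: str) -> str:
    if fallback.lower() not in JAPANESE_FALLBACKS:
        return fallback
    iso_2022_jp_disqualified = False
    escape_seen = False
    second_byte_in_escape = 0
    # Assumption: Python's codecs approximate the spec decoders.
    euc_jp = codecs.getincrementaldecoder("euc_jp")()
    shift_jis = codecs.getincrementaldecoder("shift_jis")()
    for byte in data:
        if not iso_2022_jp_disqualified:
            if byte > 0x7F:
                # Step 1.1: fall through to the decoders below.
                iso_2022_jp_disqualified = True
            elif not escape_seen and byte == 0x1B:
                escape_seen = True
                continue
            elif escape_seen and second_byte_in_escape == 0 and byte in (0x28, 0x24):
                second_byte_in_escape = byte
                continue
            elif (second_byte_in_escape, byte) in ISO_2022_JP_ESCAPES:
                return "ISO-2022-JP"
            elif escape_seen:
                iso_2022_jp_disqualified = True
                continue
            else:
                continue
        # Steps 2-3: bytes that EUC-JP rejects point to Shift_JIS.
        if _rejects(euc_jp, byte):
            return "Shift_JIS"
        # Steps 4-5: bytes that Shift_JIS rejects point to EUC-JP.
        if _rejects(shift_jis, byte):
            return "EUC-JP"
    # Step 6: the input ended without a decision.
    return fallback
```

For example, `sniff_japanese(b"\x1b$B", "Shift_JIS")` returns ISO-2022-JP as soon as the three-byte escape sequence is complete, while pure ASCII input returns the fallback unchanged. Because the Python codecs differ from the Encoding Standard decoders in their coverage, results for unusual byte sequences may not match what Firefox does.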

https://github.com/whatwg/encoding/issues/157#issuecomment-519031981

Received on Wednesday, 7 August 2019 10:06:28 UTC