[whatwg/encoding] Consider defining Japanese encoding sniffing (#157) from Henri Sivonen on 2018-09-28 (public-webapps-github@w3.org from September 2018)

From: Henri Sivonen <notifications@github.com>
Date: Fri, 28 Sep 2018 03:44:39 -0700
To: whatwg/encoding <encoding@noreply.github.com>
Cc: Subscribed <subscribed@noreply.github.com>
Message-ID: <whatwg/encoding/issues/157@github.com>

This is just to write down some ideas about specifying Japanese encoding detection instead of saying anything goes. It may now be viable to ship a browser without Japanese encoding detection, but I believe all four major engines have some kind of Japanese encoding detection, so maybe it should become well-specified. I think detecting ISO-2022-JP is probably bad, but since at least Gecko and Blink detect it, I'm writing it into the algorithm. It's easy to remove if needed.

This algorithm is meant to be used only when step 7 in https://html.spec.whatwg.org/#determining-the-character-encoding is reached and step 8 would return Shift_JIS.

Set booleans "ISO-2022-JP disqualified" and "escape seen" to false.

Set byte "second byte in escape" to zero.

Set integers "Shift_JIS Kanji", "Shift_JIS Hiragana", "Shift_JIS Katakana", "EUC-JP Kanji", "EUC-JP Hiragana" and "EUC-JP Katakana" to zero.

Initialize a decoder for Shift_JIS and another for EUC-JP.

For each byte in the stream:

1. If "ISO-2022-JP disqualified" is false:

   1. If byte is larger than 0x7F, set "ISO-2022-JP disqualified" to true and break out of these substeps to outer step 2.

   2. If "escape seen" is false and byte is 0x1B, set "escape seen" to true and continue.

   3. If "escape seen" is true and "second byte in escape" is zero and byte is either 0x28 or 0x24, set "second byte in escape" to byte and continue.

   4. If the pair ("second byte in escape", byte) is any of (0x28, 0x42), (0x28, 0x4A), (0x28, 0x49), (0x24, 0x40) or (0x24, 0x42), return ISO-2022-JP.

   5. If "escape seen" is true, set "ISO-2022-JP disqualified" to true and continue.

   6. Continue.

2. Pass byte to the handler of the EUC-JP decoder.

3. If the handler of the EUC-JP decoder returned error, return Shift_JIS.

4. If the handler of the EUC-JP decoder returned a code point in the range U+3040...U+309F, inclusive, increment "EUC-JP Hiragana" by one.

5. If the handler of the EUC-JP decoder returned a code point in the range U+4E00...U+9FEF, inclusive, increment "EUC-JP Kanji" by one.

6. If the handler of the EUC-JP decoder returned a code point in the range U+30A0...U+30FF, inclusive, increment "EUC-JP Katakana" by one.

7. Pass byte to the handler of the Shift_JIS decoder.

8. If the handler of the Shift_JIS decoder returned error, return EUC-JP.

9. If the handler of the Shift_JIS decoder returned a code point in the range U+3040...U+309F, inclusive, increment "Shift_JIS Hiragana" by one.

10. If the handler of the Shift_JIS decoder returned a code point in the range U+4E00...U+9FEF, inclusive, increment "Shift_JIS Kanji" by one.

11. If the handler of the Shift_JIS decoder returned a code point in the range U+30A0...U+30FF, inclusive, increment "Shift_JIS Katakana" by one.

12. If byte is end-of-stream, return EUC-JP if "EUC-JP Kanji", "EUC-JP Hiragana" and "EUC-JP Katakana" are closer to a 30%, 60% and 10% distribution than "Shift_JIS Kanji", "Shift_JIS Hiragana" and "Shift_JIS Katakana" are, and return Shift_JIS otherwise. (Percentages according to Lunde.)
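
The byte loop above can be sketched in Python roughly as follows. This is an illustration under assumptions, not a reference implementation: Python's `euc_jp` and `shift_jis` codecs stand in for the Encoding Standard decoders and do not accept exactly the same byte sequences, and the names `scan`, `feed`, `tally` and `new_counts` are made up for this sketch. Step 12 is left to the caller, since the comparison math is still undefined.

```python
import codecs

# Escape pairs from step 1.4, as ("second byte in escape", byte).
ISO_2022_JP_PAIRS = {(0x28, 0x42), (0x28, 0x4A), (0x28, 0x49),
                     (0x24, 0x40), (0x24, 0x42)}


def new_counts():
    return {"kanji": 0, "hiragana": 0, "katakana": 0}


def tally(text, counts):
    """Steps 4-6 and 9-11: count decoded code points by range."""
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x309F:
            counts["hiragana"] += 1
        elif 0x4E00 <= cp <= 0x9FEF:
            counts["kanji"] += 1
        elif 0x30A0 <= cp <= 0x30FF:
            counts["katakana"] += 1


def scan(data):
    """Return (verdict, euc_counts, sjis_counts).

    verdict is None when no early return fired, i.e. step 12 must decide.
    """
    iso_disqualified = False
    escape_seen = False
    second_byte_in_escape = 0
    # Assumption: Python's codecs approximate the spec's decoders.
    euc = codecs.getincrementaldecoder("euc_jp")()
    sjis = codecs.getincrementaldecoder("shift_jis")()
    euc_counts, sjis_counts = new_counts(), new_counts()

    def feed(chunk, final=False):
        # Steps 2-11 for one byte, or the end-of-stream flush.
        try:
            tally(euc.decode(chunk, final), euc_counts)
        except UnicodeDecodeError:
            return "Shift_JIS"  # step 3
        try:
            tally(sjis.decode(chunk, final), sjis_counts)
        except UnicodeDecodeError:
            return "EUC-JP"  # step 8
        return None

    for byte in data:
        if not iso_disqualified:
            if byte > 0x7F:
                iso_disqualified = True  # step 1.1, then on to step 2
            elif not escape_seen and byte == 0x1B:
                escape_seen = True  # step 1.2
                continue
            elif escape_seen and second_byte_in_escape == 0 and byte in (0x28, 0x24):
                second_byte_in_escape = byte  # step 1.3
                continue
            elif (second_byte_in_escape, byte) in ISO_2022_JP_PAIRS:
                return "ISO-2022-JP", euc_counts, sjis_counts  # step 1.4
            elif escape_seen:
                iso_disqualified = True  # step 1.5
                continue
            else:
                continue  # step 1.6
        verdict = feed(bytes([byte]))
        if verdict is not None:
            return verdict, euc_counts, sjis_counts

    # End-of-stream: a dangling lead byte counts as a decoder error.
    return feed(b"", final=True), euc_counts, sjis_counts
```

For example, `scan(b"\x1b$B")[0]` yields `"ISO-2022-JP"`, because the three bytes walk through steps 1.2, 1.3 and 1.4 in turn.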

TODO: Define the math for establishing which distribution is closer.
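
One candidate for that math, offered only as a starting point: normalize each decoder's three counters to shares and take the sum of absolute differences (L1 distance) from the 30/60/10 target. Nothing above selects this metric; the names `TARGET`, `distance` and `pick` are hypothetical, and treating an all-zero tally as infinitely far away is an assumption of this sketch.

```python
# Target shares in the order (kanji, hiragana, katakana), as in step 12.
TARGET = (0.30, 0.60, 0.10)


def distance(kanji, hiragana, katakana):
    """L1 distance between the observed shares and TARGET."""
    total = kanji + hiragana + katakana
    if total == 0:
        # Assumption: no Japanese characters seen means no evidence at all.
        return float("inf")
    shares = (kanji / total, hiragana / total, katakana / total)
    return sum(abs(s - t) for s, t in zip(shares, TARGET))


def pick(euc_counts, sjis_counts):
    """Step 12: EUC-JP only if strictly closer, Shift_JIS otherwise."""
    if distance(*euc_counts) < distance(*sjis_counts):
        return "EUC-JP"
    return "Shift_JIS"
```

With this choice, ties and streams without any counted characters fall through to Shift_JIS, which matches the "Shift_JIS otherwise" wording in step 12.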

TODO: Should step 12 run before end-of-stream if some magic number of bytes has been seen?

TODO: Re-check the Hiragana vs. Kanji frequencies, e.g. [Wikipedia cites very different frequencies](https://en.wikipedia.org/wiki/Japanese_writing_system#Statistics).

Note: This issue is not for requesting spec action before we figure out if this is still needed, if the above works and if the ISO-2022-JP part should be included.

-- 
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/encoding/issues/157

Received on Friday, 28 September 2018 10:45:01 UTC