Re: [whatwg/encoding] Consider defining Japanese encoding sniffing (#157)

> closer to a 30%, 60% and 10% distribution

Japanese Wikipedia has 46% kanji, 28% hiragana, 27% katakana, and way less than 1% half-width katakana. However, article titles in Japanese Wikipedia have 42% kanji, 5% hiragana, 53% katakana, and almost no half-width katakana.
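
For anyone who wants to reproduce this kind of count on their own corpus, here is a minimal sketch. The Unicode ranges and the names `RANGES`, `classify`, and `script_shares` are my assumptions for illustration; the numbers above may have been counted with different class boundaries.

```python
from collections import Counter

# Assumed code point ranges for each class (inclusive); other boundaries are possible.
RANGES = {
    "kanji": (0x4E00, 0x9FFF),               # CJK Unified Ideographs
    "hiragana": (0x3040, 0x309F),            # Hiragana block
    "katakana": (0x30A0, 0x30FF),            # Katakana block (full-width)
    "halfwidth_katakana": (0xFF66, 0xFF9F),  # half-width katakana letters
}


def classify(ch):
    """Return the class name for a character, or None if it is in no class."""
    cp = ord(ch)
    for name, (lo, hi) in RANGES.items():
        if lo <= cp <= hi:
            return name
    return None


def script_shares(text):
    """Return each class's share of the classified characters in text."""
    counts = Counter(c for c in map(classify, text) if c is not None)
    total = sum(counts.values())
    return {name: (counts[name] / total if total else 0.0) for name in RANGES}
```

Running `script_shares` separately over body text and over titles would give the two distributions being compared here.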

This suggests that it's a bad idea to expect the general hiragana-to-katakana ratio if a detector only checks the first 1024 bytes of an HTML document, where it is likely to see the page title.

In general, looking at what happens to kana when Shift_JIS and EUC-JP are misinterpreted as each other, the kana ratio seems like a moot issue, but half-width katakana showing up is a very good indicator of having the wrong encoding.
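
A small sketch of one direction of this, EUC-JP bytes decoded as Shift_JIS, using Python's built-in codecs. The helper names `misread` and `halfwidth_katakana_share` are hypothetical, and I'm counting the whole U+FF61..U+FF9F range (half-width katakana plus the half-width punctuation next to it) as "half-width katakana". This is only an illustration of the signal, not a proposed detector.

```python
# Assumed range: half-width katakana letters plus adjacent half-width punctuation.
HALFWIDTH_KATAKANA = range(0xFF61, 0xFFA0)


def halfwidth_katakana_share(text):
    """Return the fraction of characters in text that fall in U+FF61..U+FF9F."""
    if not text:
        return 0.0
    return sum(ord(ch) in HALFWIDTH_KATAKANA for ch in text) / len(text)


def misread(text, actual, assumed):
    """Encode text with one encoding and decode the bytes with another."""
    return text.encode(actual).decode(assumed, errors="replace")


# Hiragana in EUC-JP uses lead byte 0xA4, which Shift_JIS treats as a
# single-byte half-width character, so the misread text is full of them.
garbled = misread("かな", "euc_jp", "shift_jis")
print(garbled, halfwidth_katakana_share(garbled))
```

Correctly decoded text of the same kind scores zero on this measure, which is what makes the presence of these characters a useful signal.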

-- 
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/encoding/issues/157#issuecomment-482162553

Received on Thursday, 11 April 2019 15:29:07 UTC