Re: [whatwg/encoding] Add UTF-7 to replacement encoding list? / Encoding sniffing (#68)

> FYI, CED is open-sourced and available in github. See https://github.com/google/compact_enc_det

I did a brief evaluation of https://github.com/google/compact_enc_det to form an opinion on whether we should adopt it for Gecko. I feel very uneasy about adopting it, mainly because a) it seems bad for the Web Platform to depend on a bunch of mystery C++ as opposed to having a spec with multiple independent interoperable implementations, b) it indeed is mystery C++ in the sense that it doesn't have sufficient comments explaining its design or the rationale behind it, and c) it has a bunch of non-browser-motivated code that would be dead code among the mystery C++.

In case we want to develop a spec for sniffing an encoding from content, here are some notes.

It's clear that compact_enc_det has not been developed for the purpose that it is used for in Chromium. It looks like it's been developed for the Google search engine and possibly also for Gmail. As seen from this issue, this results in compact_enc_det doing things that don't really make sense in a Web browser and then having to correct toward something that makes more sense in the browser context.

compact_enc_det does not make use of the data tables and algorithms that an implementation of the Encoding Standard would necessarily include. If I were designing a detector, I'd eliminate a candidate encoding whenever decoding the input as that encoding according to the Encoding Standard would result in an error or in C1 controls (and probably also in half-width katakana, as that's a great way to decide that the encoding is not Shift_JIS).
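To make the elimination idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: Python's built-in codecs stand in for the Encoding Standard decoders (they differ in details, e.g. Python's `cp1252` raises an error on the five unassigned bytes where the Encoding Standard's windows-1252 maps them to C1 controls), and the candidate list and function names are hypothetical.

```python
# Hypothetical candidate list; Python codec names stand in for
# Encoding Standard labels.
CANDIDATES = ["shift_jis", "euc_jp", "euc_kr", "gbk", "big5", "cp1252", "cp1251"]


def has_c1_control(text):
    """True if the decoded text contains a C1 control (U+0080..U+009F)."""
    return any(0x80 <= ord(ch) <= 0x9F for ch in text)


def has_halfwidth_katakana(text):
    """True if the decoded text contains U+FF61..U+FF9F."""
    return any(0xFF61 <= ord(ch) <= 0xFF9F for ch in text)


def surviving_candidates(data, candidates=CANDIDATES):
    """Return the candidates that are not eliminated by decoding `data`."""
    survivors = []
    for name in candidates:
        try:
            text = data.decode(name)  # strict mode: raises on invalid input
        except UnicodeDecodeError:
            continue  # decoding error: eliminate
        if has_c1_control(text):
            continue  # C1 controls: eliminate
        if name == "shift_jis" and has_halfwidth_katakana(text):
            continue  # half-width katakana: treat as "not Shift_JIS"
        survivors.append(name)
    return survivors
```

A real detector would run this incrementally over the byte stream rather than decoding the whole input once per candidate.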

compact_enc_det analyzes byte bigrams both for single-byte encodings and for legacy CJK encodings. In an implementation that assumes the presence of the Encoding Standard data and algorithms, it should be easy to distinguish between single-byte encodings and legacy CJK encodings by the process of elimination according to the above paragraph: If a legacy CJK encoding remains on the candidate list after seeing a couple of hundred non-ASCII bytes, chances are that the input is in a legacy CJK encoding, because it's so likely that single-byte-encoding content results in errors when decoded as a legacy CJK encoding. (Big5 and Shift_JIS are clearly structurally different from EUC-JP, EUC-KR and GBK. The decision among the EUC-based encodings, if the input is valid according to all of them, could be done as follows: If there's substantial kana according to the EUC-JP interpretation, it's EUC-JP. If the two-byte sequences stay mainly in the original KS X 1001 Hangul area, it's EUC-KR. Otherwise, it's GBK.)
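The decision among the EUC-based encodings could be sketched roughly as follows. This is a toy illustration, not a tuned detector: the function name and both thresholds are assumptions of mine, the input is assumed to have already passed validation for all three encodings, and the byte ranges used are lead bytes 0xA4/0xA5 for the kana rows of EUC-JP and 0xB0..0xC8 for the Hangul syllable rows of KS X 1001.

```python
def is_euc_byte(b):
    """True for bytes in the 0xA1..0xFE range shared by EUC-JP, EUC-KR and GBK."""
    return 0xA1 <= b <= 0xFE


def guess_euc_family(data, kana_min=0.2, hangul_min=0.9):
    """Guess among EUC-JP, EUC-KR and GBK; thresholds are illustrative guesses."""
    kana = hangul = pairs = 0
    i = 0
    while i + 1 < len(data):
        lead, trail = data[i], data[i + 1]
        if is_euc_byte(lead) and is_euc_byte(trail):
            pairs += 1
            if lead in (0xA4, 0xA5):      # hiragana / katakana rows in EUC-JP
                kana += 1
            elif 0xB0 <= lead <= 0xC8:    # Hangul syllable rows in KS X 1001
                hangul += 1
            i += 2
        else:
            i += 1  # ASCII or a byte outside the shared range: skip it
    if pairs == 0:
        return None  # no two-byte sequences seen, nothing to decide on
    if kana / pairs >= kana_min:
        return "EUC-JP"
    if hangul / pairs >= hangul_min:
        return "EUC-KR"
    return "GBK"
```

The Hangul threshold is deliberately high because common GBK hanzi also have lead bytes in 0xB0..0xC8, so only text that stays almost entirely in that area is taken to be Korean.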

compact_enc_det does not assign probabilities to bigrams directly. Instead, it takes the score for the first byte of an "aligned bigram" with the high bit XORed with the high bit of the second byte and the score for the second byte of an "aligned bigram". If the combination of the highest four bits of the first byte and the highest four bits of the second byte yields a marker for a more precise lookup, then a score for the low 5 bits of the first byte combined with the low 5 bits of the second byte is taken. It's unclear why that score isn't used unconditionally. "Aligned" appears to mean aligned to the start of the non-ASCII run, which appears to align for legacy CJK. The next bigram consists of the third and fourth byte of the run instead of the second and third byte of the run, which makes sense for legacy CJK, but at least superficially looks like an odd choice for analyzing non-Latin single-byte encodings.
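My reading of that lookup scheme, written out as a Python sketch. This is not compact_enc_det's code: all function and table names are hypothetical, the tables are placeholders passed in by the caller, and adding the precise score to the per-byte scores (rather than replacing them) is an assumption.

```python
def aligned_bigrams(run):
    """Split a non-ASCII run into non-overlapping pairs from the run's start."""
    return [(run[i], run[i + 1]) for i in range(0, len(run) - 1, 2)]


def first_byte_index(b1, b2):
    """First byte with its high bit XORed with the high bit of the second."""
    return b1 ^ (b2 & 0x80)


def high_nibble_index(b1, b2):
    """8-bit index from the high four bits of each byte (256 entries)."""
    return (b1 & 0xF0) | (b2 >> 4)


def low_bits_index(b1, b2):
    """10-bit index from the low five bits of each byte (1024 entries)."""
    return ((b1 & 0x1F) << 5) | (b2 & 0x1F)


def score_bigram(b1, b2, first_table, second_table, marker_table, precise_table):
    """Score one aligned bigram against hypothetical per-encoding tables."""
    score = first_table[first_byte_index(b1, b2)] + second_table[b2]
    if marker_table[high_nibble_index(b1, b2)]:
        # Marker set: also consult the more precise 1024-entry table.
        score += precise_table[low_bits_index(b1, b2)]
    return score
```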

I am guessing that "compact" in the name comes from not having a table of ((256 * 256) / 4) * 3 (all byte pairs where at least one byte is non-ASCII) per encoding and instead having the 256-entry table indexed by the combination of four and four high bits to decide if a table of 1024 entries indexed by five and five low bits should be used.
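The arithmetic behind that guess, as a quick Python check. The variable names are mine, and the compact figure counts one 256-entry table plus one 1024-entry table, which is an assumption about how the tables are laid out.

```python
# All byte pairs, minus the pairs where both bytes are ASCII.
all_pairs = 256 * 256                           # 65536
ascii_only_pairs = 128 * 128                    # 16384
full_table_entries = all_pairs - ascii_only_pairs

# The same number written as in the text: three quarters of all pairs.
assert full_table_entries == (256 * 256) // 4 * 3

# Compact scheme: 4+4 high bits select a marker, 5+5 low bits index a score.
marker_entries = 2 ** (4 + 4)                   # 256
precise_entries = 2 ** (5 + 5)                  # 1024
compact_entries = marker_entries + precise_entries

print(full_table_entries, compact_entries)
```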

There seems to be a capability for distinguishing between Western, Central European, and Baltic encodings using trigrams, but it's unclear if that code is actually turned on.

-- 
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/encoding/issues/68#issuecomment-490509401

Received on Wednesday, 8 May 2019 14:31:12 UTC