- From: alexelias <notifications@github.com>
- Date: Fri, 19 May 2017 12:15:43 -0700
- To: whatwg/encoding <encoding@noreply.github.com>
- Cc: Subscribed <subscribed@noreply.github.com>
- Message-ID: <whatwg/encoding/issues/102/302788418@github.com>
You can see the API to the CED detector in https://github.com/google/compact_enc_det/blob/master/compact_enc_det/compact_enc_det.h .

> we essentially stop the parser if we hit non-ASCII in the first 4k and then wait for the 4k or end-of-file to determine the encoding for the remainder of the bytes? [...] Does it only run as a last resort?

My thinking (and perhaps we can spec this) is that the algorithm would go:

- If the charset is specified in the HTTP header or in the first 1024 bytes, then we take that as gospel and do not run the detector at all.
- Failing that, we would run the content detector on the first 1024 bytes. If the detector returns `is_reliable` as true, we stop there and reparse just as if the meta charset had been set.
- Failing that, we wait for 4096 bytes and run the content detector on them (without feeding them into the parser, and without populating `meta_charset_hint` even if it's present in these 4096 bytes). Whether or not `is_reliable` is true, we treat this as the true final encoding. We reparse if needed with the detected encoding, and we never run the detector on any further bytes.

> Does that also mean we'd never detect ISO-2022-JP or UTF-16, which are ASCII-incompatible?

I think you meant to say UTF-7? We have the choice: we can set the `ignore_7bit_mail_encodings` parameter to exclude them (and I think we should, because of security concerns).

> Rather than using the user's locale, use the locale inferred from the domain's TLD. It would be interesting to know how that contrasts with a detector.

I agree TLD is an important signal. The CED detector already takes in a "url" parameter, of which it primarily looks at the TLD. It uses it as a tiebreaker in cases where the content encoding is ambiguous. The percentages in the data above already take the URL into consideration, so the 84% success rate represents the best we can do given the first 1024 bytes plus the TLD.

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub: https://github.com/whatwg/encoding/issues/102#issuecomment-302788418
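
For readers following the thread, the three-step fallback proposed in the message can be sketched roughly as below. This is only an illustration under stated assumptions, not the CED API or any browser's implementation: `choose_encoding`, `Detector`, and `at_eof` are hypothetical names, the detector is modeled as a plain callable returning an encoding name and an `is_reliable` flag, and the handling of end-of-file before 1024 or 4096 bytes arrive is an assumption borrowed from the quoted question.

```python
from typing import Callable, Optional, Tuple

# Hypothetical stand-in for a content detector such as CED:
# takes raw bytes, returns (encoding name, is_reliable).
Detector = Callable[[bytes], Tuple[str, bool]]

SHORT_PREFIX = 1024  # first detection window from the proposal
LONG_PREFIX = 4096   # second, final detection window


def choose_encoding(
    declared: Optional[str],
    data: bytes,
    detect: Detector,
    at_eof: bool,
) -> Optional[str]:
    """Return the encoding to use, or None if more bytes are needed."""
    # Step 1: a charset from the HTTP header or an early meta tag wins,
    # and the detector is not run at all.
    if declared is not None:
        return declared

    # Assumption: wait until the short window is full (or the stream ended).
    if len(data) < SHORT_PREFIX and not at_eof:
        return None

    # Step 2: detect on the first 1024 bytes; accept only a reliable answer.
    encoding, reliable = detect(data[:SHORT_PREFIX])
    if reliable:
        return encoding

    # Step 3: wait for 4096 bytes (or end of stream), detect once more,
    # and treat the result as final whether or not it is reliable.
    if len(data) < LONG_PREFIX and not at_eof:
        return None
    encoding, _ = detect(data[:LONG_PREFIX])
    return encoding
```

A caller would invoke this each time more bytes arrive, stop asking once it returns a non-`None` value, and reparse if the result differs from the encoding used so far.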
Received on Friday, 19 May 2017 19:16:16 UTC