Re: [whatwg/encoding] Amount of bytes to sniff for encoding detection (#102) from alexelias on 2017-05-19 (public-webapps-github@w3.org from May 2017)

From: alexelias <notifications@github.com>
Date: Fri, 19 May 2017 12:15:43 -0700
To: whatwg/encoding <encoding@noreply.github.com>
Cc: Subscribed <subscribed@noreply.github.com>
Message-ID: <whatwg/encoding/issues/102/302788418@github.com>

You can see the API to the CED detector in https://github.com/google/compact_enc_det/blob/master/compact_enc_det/compact_enc_det.h .

> we essentially stop the parser if we hit non-ASCII in the first 4k and then wait for the 4k or end-of-file to determine the encoding for the remainder of the bytes? [...] Does it only run as a last resort?

My thinking (and perhaps we can spec this) is that the algorithm would go:
- If the charset is specified in the HTTP header or in the first 1024 bytes, then we take that as gospel and do not run the detector at all.
- Failing that, we would run the content detector on the first 1024 bytes.  If the detector returns `is_reliable` as true, we stop there and reparse just as if the meta charset had been set.
- Failing that, we wait for 4096 bytes and run the content detector on them (without feeding them into the parser, and without populating `meta_charset_hint` even if a meta charset is present in these 4096 bytes).  Whether or not `is_reliable` is true, we treat the result as the final encoding.  We reparse with the detected encoding if needed, and never run the detector on any further bytes.
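
To make the three steps concrete, here is a rough Python sketch of the decision order only. It is not how any browser implements this: `choose_encoding`, the `detect` callable and the returned reason strings are hypothetical names for illustration, `detect` stands in for the CED call and is assumed to return an `(encoding, is_reliable)` pair, and the input is assumed to be already buffered, so the "wait for 4096 bytes" part and the reparse itself are not modeled.

```python
from typing import Callable, Optional, Tuple

# Hypothetical stand-in for the content detector:
# takes a byte prefix, returns (encoding_name, is_reliable).
Detector = Callable[[bytes], Tuple[str, bool]]

SMALL_WINDOW = 1024  # first sniffing window from the proposal
LARGE_WINDOW = 4096  # second and final sniffing window


def choose_encoding(
    data: bytes,
    declared_charset: Optional[str],
    detect: Detector,
) -> Tuple[str, str]:
    """Return (encoding, reason) following the three proposed steps.

    `declared_charset` is assumed to be whatever the caller already found
    in the HTTP header or in a meta charset within the first 1024 bytes.
    """
    # Step 1: a declared charset wins outright; the detector never runs.
    if declared_charset is not None:
        return declared_charset, "declared"

    # Step 2: run the detector on the first 1024 bytes and accept the
    # answer only if it is flagged as reliable.
    encoding, is_reliable = detect(data[:SMALL_WINDOW])
    if is_reliable:
        return encoding, "detected-1024"

    # Step 3: run the detector once more on the first 4096 bytes and treat
    # the answer as final, whether or not it is flagged as reliable.
    encoding, _ = detect(data[:LARGE_WINDOW])
    return encoding, "detected-4096"
```

The caller would reparse with the returned encoding if it differs from what the parser started with, and would not call the detector again for that document.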

> Does that also mean we'd never detect ISO-2022-JP or UTF-16, which are ASCII-incompatible?

I think you meant to say UTF-7?  We have the choice: we can set the `ignore_7bit_mail_encodings` parameter to exclude them (and I think we should because of security concerns).

> Rather than using the user's locale, use the locale inferred from the domain's TLD. It would be interesting to know how that contrasts with a detector.

I agree TLD is an important signal.  The CED detector already takes in a "url" parameter, from which it primarily looks at the TLD.  It uses the TLD as a tiebreaker in cases where the content encoding is ambiguous.  The percentages in the data above already take the URL into consideration, so the 84% success rate represents the best we can do given the first 1024 bytes plus the TLD.

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/encoding/issues/102#issuecomment-302788418

Received on Friday, 19 May 2017 19:16:16 UTC