Re: [whatwg/encoding] Amount of bytes to sniff for encoding detection (#102) from Anne van Kesteren on 2017-05-19 (public-webapps-github@w3.org from May 2017)

From: Anne van Kesteren <notifications@github.com>
Date: Fri, 19 May 2017 00:44:27 -0700
To: whatwg/encoding <encoding@noreply.github.com>
Cc: Subscribed <subscribed@noreply.github.com>
Message-ID: <whatwg/encoding/issues/102/302633363@github.com>

I'd like some more detail on this detector. I assume the goal is to avoid reloading, so would we essentially stop the parser if we hit a non-ASCII byte in the first 4k, and then wait until we have 4k or reach end-of-file to determine the encoding for the remainder of the bytes?
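
To make the question concrete, here is a rough sketch of the behavior I'm describing. It is not taken from any implementation: `gate`, `toy_detect`, and the `limit` parameter are hypothetical names, and the detector is a stand-in that only tells UTF-8 apart from a windows-1252 fallback.

```python
from typing import Callable, Iterable, Optional, Tuple

SNIFF_LIMIT = 4096  # the "first 4k" in question


def toy_detect(window: bytes) -> str:
    """Stand-in detector (an assumption): UTF-8 if it decodes, else windows-1252."""
    try:
        window.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "windows-1252"


def gate(
    chunks: Iterable[bytes],
    detect: Callable[[bytes], str] = toy_detect,
    limit: int = SNIFF_LIMIT,
) -> Tuple[bytes, Optional[str]]:
    """Return (parsed_early, encoding).

    parsed_early: the ASCII-only prefix the parser could consume before stalling.
    encoding: the detector's answer for the sniffed window, or None if the
              first `limit` bytes (or the whole input) were pure ASCII.
    """
    window = bytearray()
    stall_at = None  # offset of the first non-ASCII byte, if any
    for chunk in chunks:
        for b in chunk:
            if len(window) == limit:
                break
            if stall_at is None and b >= 0x80:
                stall_at = len(window)  # the parser would stop here
            window.append(b)
        if len(window) == limit:
            break  # sniff window is full; stop buffering
    if stall_at is None:
        return bytes(window), None  # never stalled, detector never ran
    # Stalled: the detector runs once the window is full or input has ended.
    return bytes(window[:stall_at]), detect(bytes(window))
```

Under this sketch, a non-ASCII byte that first appears after the 4k window never triggers detection at all, which is part of what I'd like clarified.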

Does that also mean we'd never detect ISO-2022-JP or UTF-16, which are ASCII-incompatible?
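
A quick illustration of why I ask, assuming the trigger is "any byte >= 0x80" (my assumption, not something stated in the proposal). ISO-2022-JP is a 7-bit encoding, and UTF-16 markup made of ASCII characters contains only ASCII bytes and NULs, so neither would trip such a trigger. `has_high_byte` is a hypothetical helper; the codec names are Python's.

```python
def has_high_byte(data: bytes) -> bool:
    """True if any byte is outside the ASCII range (the assumed trigger)."""
    return any(b >= 0x80 for b in data)


# ISO-2022-JP switches character sets with escape sequences and stays 7-bit.
iso2022jp = "日本語".encode("iso2022_jp")

# UTF-16LE markup consisting of ASCII characters: each code unit is the
# ASCII byte followed by 0x00.
utf16 = "<html>".encode("utf-16-le")

# For contrast, the same Japanese text in Shift_JIS does use high bytes.
shift_jis = "日本語".encode("shift_jis")

print(has_high_byte(iso2022jp))  # False
print(has_high_byte(utf16))      # False
print(has_high_byte(shift_jis))  # True
```

Non-Latin text in UTF-16 would produce high bytes eventually, but by then the preceding bytes would already have been parsed as something else.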

Does it only run as a last resort?

I also know that Henri has been exploring alternative strategies for locale-specific fallback: rather than using the user's locale, use the locale inferred from the domain's TLD. It would be interesting to know how that contrasts with a detector.
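
For comparison, a TLD-based fallback could be as small as a lookup. This is only a sketch of the general idea, not Henri's actual approach: `TLD_FALLBACK`, `DEFAULT_FALLBACK`, and `fallback_for_url` are hypothetical names, and the table entries are illustrative picks rather than a vetted mapping.

```python
from urllib.parse import urlsplit

# Illustrative entries only (an assumption): TLD -> fallback encoding label.
TLD_FALLBACK = {
    "jp": "Shift_JIS",
    "ru": "windows-1251",
    "gr": "windows-1253",
}
DEFAULT_FALLBACK = "windows-1252"  # used for generic or unlisted TLDs


def fallback_for_url(url: str) -> str:
    """Pick a fallback encoding from the last label of the URL's host."""
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    return TLD_FALLBACK.get(tld, DEFAULT_FALLBACK)


print(fallback_for_url("https://example.jp/page"))  # Shift_JIS
print(fallback_for_url("https://example.com/"))     # windows-1252
```

Unlike a detector, this needs no buffering and never stalls the parser, but it says nothing about content on generic TLDs.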

-- 
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/encoding/issues/102#issuecomment-302633363

Received on Friday, 19 May 2017 07:45:09 UTC