Re: [whatwg/encoding] Amount of bytes to sniff for encoding detection (#102)

I think we should do 1024 bytes for consistency with `meta` even though it gives relatively little opportunity for sniffing when there's no non-ASCII in `title` and a lot of scripts/styles. The number 1024 comes from WebKit legacy, IIRC.

Firefox had the 1024-byte limit for content-based sniffing during the Firefox 4 development cycle. However, that was known to break one site: [Japanese Planet Debian](http://planet.debian.or.jp/) (EUC-JP, but these days properly declared). Decision-makers postulated that the one site was likely only the tip of an iceberg, so I had to change the detector to run on more bytes.

So today, if the Japanese detector is enabled, it runs on the first 1024 bytes before the parse starts. Then it continues to run, and if it revises its guess during the parse, the page is re-navigated to with the newly-guessed encoding.
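To make the flow concrete, here is a minimal sketch of that two-phase behavior. The detector itself is a hypothetical stand-in (Firefox's real detector is far more involved); the point is only the control flow: guess on the first 1024 bytes, keep detecting as more bytes arrive, and flag a re-navigation if the guess is revised.

```python
def toy_detect(buf: bytes) -> str:
    """Hypothetical stand-in detector, NOT Firefox's real logic:
    all-ASCII -> windows-1252, valid UTF-8 -> utf-8, else EUC-JP."""
    if all(b < 0x80 for b in buf):
        return "windows-1252"
    try:
        buf.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "EUC-JP"

def load(page: bytes):
    """Guess on the first 1024 bytes before the parse starts, keep
    the detector running on the rest, and signal a re-navigation
    with the new guess if the detector revises its answer."""
    initial_guess = toy_detect(page[:1024])
    revised_guess = toy_detect(page)  # detector keeps running during the parse
    renavigate = revised_guess != initial_guess
    return initial_guess, revised_guess, renavigate
```

This illustrates why a hard 1024-byte cutoff can misfire: if the only non-ASCII bytes sit past the first kilobyte (long ASCII `head`, scripts, styles), the initial guess is wrong and the page has to be reloaded with the revised encoding.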

I guess we should gather telemetry to see how often this happens.

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/encoding/issues/102#issuecomment-299419881

Received on Friday, 5 May 2017 09:21:57 UTC