[whatwg/encoding] Amount of bytes to sniff for encoding detection (#102)

This issue is spun off from https://github.com/whatwg/encoding/issues/68 to provide a dedicated bug for a specific question raised in that thread: how much data should be used for encoding detection.

Currently, Blink feeds the encoding detector the first chunk of data it receives from the network to guess the text encoding used in a given document. The size of that chunk varies depending on external conditions like network configuration, so the detected encoding can vary as well. I'm experimenting with the idea discussed in https://bugs.chromium.org/p/chromium/issues/detail?id=691985#c16, so that the same amount of data is always fed to the encoding detector for a document, in order to get a consistent result regardless of other conditions.

The size I have in mind is 4K, though 1K was initially suggested in the Chromium bug entry. If the document is smaller than that, the entire document will be used. Are there any values used in other browsers for reference? I prefer more than 1K for encoding detection because, in some anecdotal observations I made, documents contain only ASCII tags and scripts in the first 1K, which does not give the detector enough of a clue to make a correct guess.
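For illustration, here is a minimal sketch of the proposed behavior in Python. The `detect_encoding` callback is a hypothetical stand-in for whatever detector the engine actually uses (e.g. CED in Blink); the point is only that detection runs over a fixed-size prefix rather than an arbitrary first network chunk:

```python
SNIFF_LIMIT = 4096  # 4K as proposed above; 1K was the Chromium bug's initial suggestion

def sniff_encoding(chunks, detect_encoding):
    """chunks: iterable of byte strings as they arrive from the network.
    detect_encoding: hypothetical callback wrapping the actual detector."""
    buffered = bytearray()
    for chunk in chunks:
        buffered.extend(chunk)
        if len(buffered) >= SNIFF_LIMIT:
            break
    # Feed exactly SNIFF_LIMIT bytes, or the whole document if it is smaller,
    # so the result no longer depends on how the network chunked the data.
    return detect_encoding(bytes(buffered[:SNIFF_LIMIT]))
```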

FWIW, Blink looks at the entire <head> for a charset meta tag, or scans up to the first 1024 bytes to find one even past <head>. There must be a historical reason behind that which I'm not aware of.
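For comparison, a rough sketch of that 1024-byte meta prescan (the real scanner in Blink is an HTML tokenizer, not a regex, so this only illustrates the byte bound, not the actual parsing):

```python
import re

# Case-insensitive search for a charset declaration inside a meta tag.
META_CHARSET = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9._:+-]+)', re.I)

def prescan_for_charset(data):
    """Scan only the first 1024 bytes of the document, mirroring the bound
    described above; returns the declared charset name or None."""
    match = META_CHARSET.search(data[:1024])
    return match.group(1).decode('ascii', 'replace') if match else None
```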
