Re: [whatwg/encoding] Amount of bytes to sniff for encoding detection (#102) from alexelias on 2017-05-08 (public-webapps-github@w3.org from May 2017)

From: alexelias <notifications@github.com>
Date: Mon, 08 May 2017 12:38:29 -0700
To: whatwg/encoding <encoding@noreply.github.com>
Cc: Subscribed <subscribed@noreply.github.com>
Message-ID: <whatwg/encoding/issues/102/299968814@github.com>

Why would consistency with `<meta>` be a goal here?  I can't think of an argument for a concrete benefit of such consistency.

`<meta>` is a header thing whereas the bulk of bytes revealing the encoding are going to be part of `<body>`.  The only guaranteed-useful part of the header is `<title>`, but it may well end up near the end of the header, or be too short for reliable detection.  So, even in theory, it makes sense to choose a different constant for encoding detection.  I think the principled target for the constant should be "larger than typical header length".
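
To make the point concrete, here is a minimal sketch (not taken from any browser's implementation) of why a sniffing window shorter than the header can leave a detector with nothing to work with. The page contents, the helper `non_ascii_bytes`, and both limit values are hypothetical and chosen only for illustration.

```python
def non_ascii_bytes(data: bytes, limit: int) -> int:
    """Count bytes >= 0x80 within the first `limit` bytes of `data`."""
    return sum(1 for b in data[:limit] if b >= 0x80)


# Hypothetical page: an ASCII-only header padded with script, a very short
# title, and a body that carries all of the non-ASCII text.
head = "<html><head><script>" + "/* padding */" * 100 + "</script><title>x</title></head>"
body = "<body>" + "Привет, мир! " * 20 + "</body></html>"
page = (head + body).encode("cp1251")  # Python's codec name for windows-1251

SMALL_LIMIT = 1024  # illustrative value only
LARGE_LIMIT = 4096  # illustrative value only

# Bytes >= 0x80 are the only ones that distinguish legacy encodings from ASCII,
# so a window that contains none of them gives a detector no evidence at all.
small = non_ascii_bytes(page, SMALL_LIMIT)
large = non_ascii_bytes(page, LARGE_LIMIT)

print(f"header length: {len(head)} bytes")
print(f"non-ASCII bytes in first {SMALL_LIMIT}: {small}")
print(f"non-ASCII bytes in first {LARGE_LIMIT}: {large}")
```

In this made-up page the header alone is longer than the smaller window, so every byte the detector sees there is plain ASCII; only the larger window reaches the body text.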

> Then it continues to run, and if it revises its guess during the parse, the page is re-navigated to with the newly-guessed encoding.

This sounds complex and bug-prone, so I don't think we would be willing to introduce similar behavior in Chromium.  I would much rather increase the constant than resort to this.

> I guess we should gather telemetry to see how often this happens.

Yes, I would be very interested in hearing the results of telemetry.

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/encoding/issues/102#issuecomment-299968814

Received on Monday, 8 May 2017 19:39:03 UTC