Re: [whatwg/encoding] Amount of bytes to sniff for encoding detection (#102) from Henri Sivonen on 2020-06-08 (public-webapps-github@w3.org from June 2020)

From: Henri Sivonen <notifications@github.com>
Date: Mon, 08 Jun 2020 05:03:21 -0700
To: whatwg/encoding <encoding@noreply.github.com>
Cc: Subscribed <subscribed@noreply.github.com>
Message-ID: <whatwg/encoding/issues/102/640560637@github.com>

In the non-file: URL case for unlabeled content, since Firefox 73, when the meta prescan fails at 1024 bytes, Firefox feeds those 1024 bytes to [chardetng](https://hsivonen.fi/chardetng/), asks chardetng to make a guess, and then starts parsing HTML based on that guess. All the remaining bytes are still fed to chardetng in addition to being fed to the HTML parser. When EOF is reached, chardetng guesses again. If that guess differs from the previous guess, the page is reloaded with this second guess applied.
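
To make the two-guess flow concrete, here is a minimal sketch in Python. It is not Firefox code and does not use chardetng's real API: `ToyDetector`, its `feed`/`guess` methods, and `load_unlabeled` are hypothetical names invented for illustration, and the toy guessing rule (valid non-ASCII UTF-8 versus everything else) only stands in for a real detector's heuristics. The sketch assumes the meta prescan has already failed.

```python
from typing import Tuple

PRESCAN_LIMIT = 1024  # number of bytes the meta prescan examines, per the text above


class ToyDetector:
    """Hypothetical stand-in for a detector; NOT chardetng's API or logic."""

    def __init__(self) -> None:
        self._seen = bytearray()

    def feed(self, chunk: bytes) -> None:
        # Accumulate every byte seen so far.
        self._seen += chunk

    def guess(self) -> str:
        # Toy rule: pure ASCII or invalid UTF-8 -> windows-1252,
        # valid UTF-8 with at least one non-ASCII byte -> UTF-8.
        data = bytes(self._seen)
        if all(b < 0x80 for b in data):
            return "windows-1252"
        try:
            data.decode("utf-8")
        except UnicodeDecodeError:
            return "windows-1252"
        return "UTF-8"


def load_unlabeled(data: bytes, detector: ToyDetector) -> Tuple[str, bool]:
    """Return (final_encoding, reloaded) for unlabeled content."""
    head, rest = data[:PRESCAN_LIMIT], data[PRESCAN_LIMIT:]

    # First guess: made from the bytes the failed prescan already looked at.
    detector.feed(head)
    first_guess = detector.guess()
    # (HTML parsing would start here, decoding with first_guess.)

    # The remaining bytes go to the detector as well as to the parser.
    detector.feed(rest)

    # Second guess at EOF; a different answer means a reload.
    final_guess = detector.guess()
    return final_guess, final_guess != first_guess
```

With this toy rule, a document whose first 1024 bytes are ASCII and whose only non-ASCII UTF-8 character appears later gets a first guess of windows-1252 and a reload as UTF-8 at EOF.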

chardetng does not run on the .in, .lk, and .jp TLDs. The former two don't trigger a detector at all, and .jp uses a special-purpose detector discussed in [another issue](https://github.com/whatwg/encoding/issues/157).
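
The per-TLD behavior can be summarized as a small dispatch function. This is only an illustrative sketch of the rule as described here: the function name `detector_for_host` and the return labels are hypothetical, and the way the TLD is extracted from the host is a simplification rather than how Firefox does it.

```python
from typing import Optional


def detector_for_host(host: str) -> Optional[str]:
    """Pick a detector label for unlabeled non-file: content (hypothetical)."""
    # Simplified TLD extraction: last dot-separated label, case-insensitive.
    tld = host.rstrip(".").rsplit(".", 1)[-1].lower()
    if tld in ("in", "lk"):
        return None  # no detector is triggered at all
    if tld == "jp":
        return "japanese"  # special-purpose Japanese detector
    return "chardetng"  # general-purpose detector everywhere else
```

For example, `detector_for_host("example.co.jp")` yields `"japanese"`, while `detector_for_host("example.in")` yields `None`.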

As for Safari, I didn't find any evidence of the ICU-based detector mentioned in the previous comment being in use in Safari on Mac. However, the Japanese-specific detector in WebKit appears to be in effect when Safari's UI language is set to Japanese.

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/encoding/issues/102#issuecomment-640560637

Received on Monday, 8 June 2020 12:04:08 UTC