Re: [whatwg/encoding] Add UTF-7 to replacement encoding list? / Encoding sniffing (#68)

> Realized these questions better be answered by the person who ported the new detector to Blink

Thank you for the answer.

> The TLD is used to set the default encoding, which is chosen as the document's encoding if auto-detection fails or doesn't get triggered for whatever reason.

Have you estimated the success rate of looking at the TLD only versus looking at the first network buffer and falling back to looking at the TLD? How significant a win is it to look inside the first network buffer when you have the capability to look at the TLD?

Do you guess windows-1252 for the new TLDs, like Gecko does?
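
For readers following along, here's a minimal sketch of what TLD-based defaulting looks like conceptually. The table entries and the windows-1252 fallback for unknown TLDs are illustrative assumptions (the fallback mirrors Gecko's behavior as I understand it), not Blink's actual code:

```cpp
// Hypothetical sketch of TLD-based encoding defaulting. The table
// entries and the fallback are illustrative, not Blink's actual table.
#include <iostream>
#include <map>
#include <string>

std::string DefaultEncodingForTld(const std::string& tld) {
  static const std::map<std::string, std::string> kTldDefaults = {
      {"jp", "Shift_JIS"},     // historically the Japanese default
      {"ru", "windows-1251"},  // historically the Russian default
      {"hu", "ISO-8859-2"},    // historically the Hungarian default
  };
  auto it = kTldDefaults.find(tld);
  // Unknown TLDs (including the new ones asked about above) fall back
  // to windows-1252 in this sketch, as Gecko does.
  return it != kTldDefaults.end() ? it->second : "windows-1252";
}

int main() {
  std::cout << DefaultEncodingForTld("jp") << "\n";   // Shift_JIS
  std::cout << DefaultEncodingForTld("dev") << "\n";  // windows-1252
}
```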

> It feeds the first chunk of the document received from the network, which in almost all cases is big enough for the detector to come up with the right guess.

I'm quite unhappy about Blink reintroducing a dependency on network buffer boundaries in text/html handling without discussion at the WHATWG prior to making the change.

For historical context, prior to Firefox 4, text/html handling in Firefox was dependent on where in the input stream the buffer boundaries fell. This was bad, and getting rid of it, thanks to the insight from WebKit, was one of the accomplishments of the HTML parsing standardization effort at the WHATWG.

> There is a way to artificially force the first chunk to be small so that it doesn't contain a significant part of the document content but only tags and scripts that are plain ASCII, which will trick the detector into thinking the document is plain ASCII. But it would rarely happen in real life.

Unfortunately, the Web is vast enough for rare cases to be real.

> One possible bug I'm aware of is that a document may be written in plain ASCII for most of its text but have a couple of UTF-8 characters, say, at the end of it. The detector guesses it is ASCII, because it never sees those characters if they are not in the first chunk.

When the guess is "ASCII", does it result in a windows-1252 decoder being instantiated? Or the decoder being chosen based on the TLD? Or something else?
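
To make the failure mode concrete, here's a toy sketch showing how a guess made from the first chunk alone depends on where the buffer boundary falls. The detector logic here is deliberately simplified and hypothetical, not Blink's actual detector:

```cpp
// Toy illustration of first-chunk-only detection; not Blink's detector.
#include <iostream>
#include <string>

// Simplified "detector": guesses ASCII if every byte seen is < 0x80.
std::string GuessFromFirstChunk(const std::string& first_chunk) {
  for (unsigned char b : first_chunk) {
    if (b >= 0x80) return "non-ASCII (e.g. UTF-8)";
  }
  return "ASCII";
}

int main() {
  // A document that is ASCII markup except for UTF-8 text near the end.
  std::string document =
      "<html><head><script>/* lots of ASCII */</script></head>"
      "<body>caf\xC3\xA9</body></html>";  // "café" in UTF-8

  // If the first network buffer ends before the non-ASCII bytes, the
  // detector guesses ASCII and never revisits the guess:
  std::cout << GuessFromFirstChunk(document.substr(0, 50)) << "\n";  // ASCII
  // With a bigger first buffer, the guess differs:
  std::cout << GuessFromFirstChunk(document) << "\n";  // non-ASCII
}
```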

> https://bugs.chromium.org/p/chromium/issues/detail?id=233002

Even if the HTML parser has only seen ASCII by the time it revises its encoding detection guess, the encoding may already have been inherited by CSS, JS and iframes, so revising the guess without causing a full reload seems like a bad idea.

>> Other than ISO-2022-JP (per your previous comment) and, I'm extrapolating, EUC-JP, can the detector (as used in Blink) guess an encoding that wasn't previously the default for any Chrome localization? That is, does the detector make pages "work" when they previously didn't for any localization defaults?
>
> Not sure if I got the question right - AFAICT, Blink respects the detected encoding and uses it for decoding documents regardless of the localization default.

I meant that there is a set of encodings that were the default in some localization and a set of encodings that were not the default in any localization. When the guess is from the first set, the behavior that ensues matches the behavior of some pre-existing default configuration. When the guess is from the second set, a new Web-exposed behavior has been introduced. Apart from Japanese encodings, does Blink ever guess an encoding from the second set (e.g. ISO-8859-13)?

> Yes it can also detect UTF-8. I wouldn't view it as 'inadvertently encouraging' non-labelling though.

The vast majority of Web authors have not read your comment here and don't know how Blink works. Since Web authors do what seems to work, they don't know that UTF-8 sniffing is unreliable until it fails them. As noted in your comment about the case where the first network buffer is all ASCII, UTF-8 sniffing in Blink is unreliable. Some Web authors are going to feel free not to label UTF-8 as long as they haven't experienced the failure yet (i.e. they only test in Blink and e.g. use a language that makes it probable for `<title>` to have some non-ASCII). This is bad.

(Detecting UTF-8 for file:-URL HTML by looking at the *entire* file is fine, IMO, and I want to implement that in Gecko in due course.)
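
For what it's worth, here's a minimal sketch of that whole-file check, assuming the rule "treat the file as UTF-8 iff the entire byte stream is valid UTF-8". This is my sketch, not Gecko code; a real implementation might additionally require at least one non-ASCII byte, since pure ASCII decodes identically either way:

```cpp
// Sketch: validate an entire buffer as UTF-8 before deciding, so the
// result cannot depend on buffer boundaries. Not actual Gecko code.
#include <cstddef>
#include <cstdint>

bool IsValidUtf8(const uint8_t* data, size_t len) {
  size_t i = 0;
  while (i < len) {
    uint8_t b = data[i];
    if (b <= 0x7F) { i++; continue; }  // ASCII byte
    size_t trail;                      // number of continuation bytes
    uint8_t min2 = 0x80, max2 = 0xBF;  // allowed range for 2nd byte
    if (b >= 0xC2 && b <= 0xDF)      { trail = 1; }
    else if (b == 0xE0)              { trail = 2; min2 = 0xA0; }  // no overlongs
    else if (b >= 0xE1 && b <= 0xEC) { trail = 2; }
    else if (b == 0xED)              { trail = 2; max2 = 0x9F; }  // no surrogates
    else if (b >= 0xEE && b <= 0xEF) { trail = 2; }
    else if (b == 0xF0)              { trail = 3; min2 = 0x90; }  // no overlongs
    else if (b >= 0xF1 && b <= 0xF3) { trail = 3; }
    else if (b == 0xF4)              { trail = 3; max2 = 0x8F; }  // <= U+10FFFF
    else return false;                   // invalid lead byte
    if (i + trail >= len) return false;  // truncated sequence at EOF
    uint8_t second = data[i + 1];
    if (second < min2 || second > max2) return false;
    for (size_t j = 2; j <= trail; j++) {
      if (data[i + j] < 0x80 || data[i + j] > 0xBF) return false;
    }
    i += trail + 1;
  }
  return true;
}
```

Because the decision is made only after the last byte has been seen, it is deterministic for a given file, unlike first-chunk sniffing.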
