Re: [whatwg/encoding] Add UTF-7 to replacement encoding list? / Encoding sniffing (#68)

I realized these questions are better answered by the person who ported the new detector to Blink (which is me):

> Did you explore guessing the encoding from the top-level domain name instead of guessing it from content?

The TLD is used to set the default encoding, which is chosen for the document if auto-detection fails or doesn't get triggered for whatever reason. So it is involved in the overall decoding process, but not in the detection step itself.
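The fallback order described above can be sketched roughly as follows. This is only an illustration of the decision order, not Blink's actual API; the function and table names (and the sample TLD defaults) are made up:

```python
def choose_encoding(declared, detected, tld):
    """Hypothetical sketch of the decision order described above.

    declared: encoding from an explicit label (HTTP header / <meta>), or None
    detected: result of content auto-detection, or None if it didn't run/failed
    tld:      top-level domain of the page, e.g. "jp"
    """
    # Illustrative locale defaults; not Blink's real table.
    TLD_DEFAULTS = {"jp": "Shift_JIS", "ru": "windows-1251", "hu": "ISO-8859-2"}
    if declared:
        return declared          # an explicit label wins
    if detected:
        return detected          # otherwise trust the detector's guess
    # detection failed or never ran: fall back to the TLD-based default
    return TLD_DEFAULTS.get(tld, "windows-1252")
```

So the TLD only matters on the last line: it supplies the default when both the label and the detector have nothing to offer.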

> How many bytes do you feed to the detector?

Strictly speaking, that's not up to the detector itself but to how the Blink decoder is designed. It feeds the detector the first chunk of the document received from the network, which in most cases is big enough for the detector to come up with the right guess.

> What happens if the network stalls before that many bytes have been received? Do you wait or do you stop detecting based on a timer? (Timers have their own problems, and Firefox got rid of the timer in Firefox 4.)
> Once the detector has seen as many bytes as you usually feed it, do you continue feeding data to the detector to allow it to revise its guess? If yes, what do you do if the detector revises its guess?

As answered above, Blink feeds the detector the first chunk of the data only. After that, no more data is fed to it, so there is no revising its guess either. It is possible to artificially force the first chunk to be small enough that it contains no significant part of the document text, only tags and scripts that are plain ASCII, which will trick the detector into thinking the document is ASCII. But that would rarely happen in real life.
One bug I am aware of: a document can be plain ASCII for most of its text but contain a few non-ASCII UTF-8 characters, say, at the end. The detector guesses ASCII because it never sees those characters when they are not in the first chunk. https://bugs.chromium.org/p/chromium/issues/detail?id=233002
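The bug can be reproduced in miniature. The toy detector and the 1024-byte chunk size below are both made up for illustration; the point is only that a sniffer which sees nothing but an all-ASCII first chunk has no evidence for UTF-8:

```python
def naive_detect(chunk: bytes) -> str:
    # Toy stand-in for a content sniffer that only sees one chunk:
    # any byte >= 0x80 is taken as evidence of UTF-8, else assume ASCII.
    return "ASCII" if all(b < 0x80 for b in chunk) else "UTF-8"

# Mostly-ASCII document with a couple of UTF-8 characters at the very end.
doc = b"<html><body>" + b"plain ascii text " * 100 + "caf\u00e9".encode("utf-8")

first_chunk = doc[:1024]   # the detector only ever sees this
print(naive_detect(first_chunk))   # ASCII  -- wrong: misses the trailing UTF-8
print(naive_detect(doc))           # UTF-8  -- the full document would reveal it
```

Since Blink never feeds the later chunks to the detector, the trailing characters get mojibaked when the page is decoded with the wrong guess.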


> Other than ISO-2022-JP (per your previous comment) and, I'm extrapolating, EUC-JP, can the detector (as used in Blink) guess an encoding that wasn't previously the default for any Chrome localization? That is, does the detector make pages "work" when they previously didn't for any localization defaults?

I'm not sure I got the question right - AFAICT, Blink respects the detected encoding and uses it to decode the document regardless of the localization default.

> Can the detector (as used in Blink) guess UTF-8? That is, can Blink's use of the detector inadvertently encourage the non-labeling of newly-created UTF-8 content?

Yes, it can also detect UTF-8. I wouldn't say it 'inadvertently encourages' non-labelling, though. Most modern web sites use UTF-8 and label their documents well. I don't have stats on how detection actually contributes to sites neglecting proper labelling. The detector is mainly for legacy web sites.

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/encoding/issues/68#issuecomment-272993181

Received on Tuesday, 17 January 2017 00:40:16 UTC