Re: [whatwg/encoding] Add UTF-7 to replacement encoding list? / Encoding sniffing (#68) from JinsukKim on 2017-01-18 (public-webapps-github@w3.org from January 2017)

From: JinsukKim <notifications@github.com>
Date: Wed, 18 Jan 2017 04:13:29 -0800
To: whatwg/encoding <encoding@noreply.github.com>
Cc: Subscribed <subscribed@noreply.github.com>
Message-ID: <whatwg/encoding/issues/68/273460147@github.com>

Please note that I'm not working on Blink. I ported CED as a drop-in replacement for the ICU encoding detector so that any Blink-based Web browser can benefit from it. I'm not entirely familiar with all the terms used in your questions.

> Do you guess windows-1252 for the new TLDs, like Gecko does?

What are 'new' TLDs? What I can say is that Blink uses a typical set of TLDs and their mappings to default encodings, which can be found in the Chromium/Android resources. Mappings for countries whose legacy default encoding is not obvious (ge-Georgian, id-Indonesian, for instance) are not included.
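
As a rough illustration of that kind of lookup, here is a minimal Python sketch. The table entries and the name `default_encoding_for_url` are hypothetical and are not copied from the Chromium/Android resources; the sketch only shows the shape of a TLD-to-default-encoding mapping in which some TLDs are deliberately left out.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical entries for illustration only; the real table lives in the
# Chromium/Android resources and may differ.
TLD_DEFAULT_ENCODING = {
    "jp": "Shift_JIS",
    "kr": "EUC-KR",
    "ru": "windows-1251",
    # No entry for "ge" or "id": their legacy default is not obvious.
}


def default_encoding_for_url(url: str) -> Optional[str]:
    """Return the default encoding mapped to the URL's TLD, or None if unmapped."""
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return TLD_DEFAULT_ENCODING.get(tld)
```

What happens for an unmapped TLD is not covered above, so the sketch just returns `None` and leaves that decision to the caller.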

> When the guess is "ASCII", does it result in a windows-1252 decoder being instantiated? Or the decoder being chosen based on the TLD? Or something else?

Yes, it goes with windows-1252.
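
In other words, something along the lines of the Python sketch below. The function name and the literal label "ASCII" are assumptions made for illustration, not the actual Blink or CED identifiers.

```python
def decoder_label_for_guess(guess: str) -> str:
    """Map a detector guess to the label of the decoder to instantiate."""
    # Assumption: the detector reports pure-ASCII input with the label "ASCII".
    if guess.upper() == "ASCII":
        return "windows-1252"
    return guess


# Bytes that arrive after the guess are then decoded as windows-1252,
# so a later 0xE9 byte becomes "é" rather than an error.
label = decoder_label_for_guess("ASCII")
text = b"caf\xe9".decode(label)
```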

> I meant that there is a set of encodings that were the default in some localization and a set of encodings that were not the default in any localization. When the guess is from the first set, the behavior that ensues matches the behavior of some pre-existing default configuration. When the guess is from the second set, a new Web-exposed behavior has been introduced. Apart from Japanese encodings, does Blink ever guess an encoding from the second set (e.g. ISO-8859-13)?

I don't know what belongs to the second set. You can see for yourself what the detector can return here: https://cs.chromium.org/chromium/src/third_party/ced/src/compact_enc_det/compact_enc_det_unittest.cc?l=4733

> The vast majority of Web authors have not read your comment here and don't know how Blink works. Since Web authors do what seems to work, they don't know that UTF-8 sniffing is unreliable until it fails them. As noted in your comment about the case where the first network buffer is all ASCII, UTF-8 sniffing in Blink is unreliable. Some Web authors are going to feel free not to label UTF-8 as long as they haven't experienced the failure yet (i.e. they only test in Blink and e.g. use a language that makes it probable for <title> to have some non-ASCII). This is bad.

I can see your point. What would be your suggestion?
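
To make the failure mode in the quoted paragraph concrete, here is a toy Python sketch of a sniffer that looks only at the first network buffer. It is not how CED or Blink is implemented; `sniff_first_buffer` and the sample pages are made up for illustration, and the sketch ignores the case where a buffer boundary splits a multi-byte sequence.

```python
def sniff_first_buffer(first_buffer: bytes) -> str:
    """Toy sniffer: decides the encoding from the first buffer alone."""
    if first_buffer.isascii():
        # An all-ASCII buffer gives no evidence of UTF-8.
        return "windows-1252"
    try:
        first_buffer.decode("utf-8")
    except UnicodeDecodeError:
        return "windows-1252"
    return "utf-8"


# Unlabeled UTF-8 page with non-ASCII text early on: sniffing works.
early = "<title>Café</title>".encode("utf-8")
early_guess = sniff_first_buffer(early)

# Unlabeled UTF-8 page whose first buffer happens to be all ASCII:
# the guess is fixed before the non-ASCII bytes arrive.
head = b"<!doctype html><title>Menu</title>"
rest = "<p>Café</p>".encode("utf-8")
late_guess = sniff_first_buffer(head)
rendered = (head + rest).decode(late_guess)
```

Here the first page is detected as UTF-8, while the second is decoded as windows-1252 and shows "CafÃ©", even though both are valid, unlabeled UTF-8.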


-- 
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/encoding/issues/68#issuecomment-273460147

Received on Wednesday, 18 January 2017 12:14:27 UTC