Re: [whatwg/encoding] Add UTF-7 to replacement encoding list? / Encoding sniffing (#68)

> What are 'new' TLDs?

I mean TLDs that are neither the ancient ones (.com, .org, .net, .edu, .gov, .mil) nor country TLDs. That is, TLDs like .mobi, .horse, .xyz, .goog, etc.

> What I can say is Blink makes use of a typical set of TLDs and mappings to default encoding which can be found in Chromium/Android resource.

Where can I find these mappings? (I searched the Chromium codebase for Punycode country TLDs but didn't find this kind of mapping for those.)

>> When the guess is "ASCII", does it result in a windows-1252 decoder being instantiated? Or the decoder being chosen based on the TLD? Or something else?
>
> Yes, it goes with windows-1252.

What kind of situation is the TLD used in, then?

> I don't know what belongs to the second set.

The second set is:

* EUC-JP
* ISO-2022-JP
* IBM866
* ISO-8859-3
* ISO-8859-4
* ISO-8859-5
* ISO-8859-6
* ISO-8859-8
* ISO-8859-8-I
* ISO-8859-10
* ISO-8859-13
* ISO-8859-14
* ISO-8859-15
* ISO-8859-16
* KOI8-R
* KOI8-U
* macintosh
* x-mac-cyrillic
* x-user-defined
* gb18030

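For concreteness, membership in this second set could be checked like this (a hypothetical helper for illustration; the labels are taken verbatim from the list above, lowercased for comparison):

```python
# Encodings from the "second set" above: labels a general-purpose
# detector might emit but that, per this discussion, a browser
# arguably should not guess. Illustrative helper, not browser code.
SECOND_SET = frozenset({
    "euc-jp", "iso-2022-jp", "ibm866",
    "iso-8859-3", "iso-8859-4", "iso-8859-5", "iso-8859-6",
    "iso-8859-8", "iso-8859-8-i", "iso-8859-10", "iso-8859-13",
    "iso-8859-14", "iso-8859-15", "iso-8859-16",
    "koi8-r", "koi8-u", "macintosh", "x-mac-cyrillic",
    "x-user-defined", "gb18030",
})

def in_second_set(label: str) -> bool:
    """Case-insensitive membership test against the second set."""
    return label.strip().lower() in SECOND_SET
```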
(In Gecko, the Japanese and Cyrillic encodings are potential detector outcomes, but my current belief (an educated guess, not based on proper data) is that the Cyrillic detectors do more harm than good in Gecko, and I want to remove them. Also, ISO-2022-JP has a structure that is known to be dangerous on a general level and has been shown to be dangerous for other encodings with a similar structure; ISO-2022-JP is just waiting for someone to demonstrate an attack. In Gecko, the Japanese detector is enabled for the Japanese locale, the Russian detector for the Russian locale, and the Ukrainian detector for the Ukrainian locale. Other localizations, including Cyrillic ones, ship with the detector off.)

> Maybe you can see it for yourself here about what the detector can return: https://cs.chromium.org/chromium/src/third_party/ced/src/compact_enc_det/compact_enc_det_unittest.cc?l=4733

So the answer is that it can return non-Japanese and even non-Cyrillic guesses from the second set. :-(

>> The vast majority of Web authors have not read your comment here and don't know how Blink works. Since Web authors do what seems to work, they don't know that UTF-8 sniffing is unreliable until it fails them. As noted in your comment about the case where the first network buffer is all ASCII, UTF-8 sniffing in Blink is unreliable. Some Web authors are going to feel free not to label UTF-8 as long as they haven't experienced the failure yet (i.e. they only test in Blink and e.g. use a language that makes it probable for <title> to have some non-ASCII). This is bad.
>
> I can see your point. What would be your suggestion?

Never guessing UTF-8.

>> Have you estimated the success rate of looking at the TLD only versus looking at the first network buffer and falling back to looking at the TLD? How significant a win is it to look inside the first network buffer when you have the capability to look at the TLD?
>
> Unfortunately I don't have a good estimation on that.

That seems like the fundamental question for assessing whether, in the absence of the manual override menu, the detector is necessary complexity or whether it adds unnecessary complexity to the Web Platform.
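To make the TLD-only alternative concrete, here is a minimal sketch of choosing a fallback encoding from the TLD alone. The mapping values below are illustrative assumptions, not Chromium's or Gecko's actual tables; the point is only that the lookup needs no content sniffing and defaults to windows-1252 for unmapped (e.g. "new") TLDs:

```python
# Illustrative TLD -> fallback-encoding mapping. Real browser tables
# (e.g. the Chromium/Android resources mentioned above) differ.
TLD_FALLBACKS = {
    "jp": "Shift_JIS",
    "ru": "windows-1251",
    "hu": "ISO-8859-2",
    "gr": "ISO-8859-7",
}

def fallback_encoding(host: str) -> str:
    """Guess a fallback encoding from the host's TLD alone,
    defaulting to windows-1252 (as for new TLDs like .xyz)."""
    tld = host.rsplit(".", 1)[-1].lower()
    return TLD_FALLBACKS.get(tld, "windows-1252")
```

For example, `fallback_encoding("example.jp")` yields "Shift_JIS" while `fallback_encoding("example.xyz")` falls back to "windows-1252".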

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/encoding/issues/68#issuecomment-274018028

Received on Friday, 20 January 2017 09:12:14 UTC