Re: Re: Guessing the fallback encoding from the top-level domain name before trying to guess from the browser localization from Henri Sivonen on 2014-02-26 (www-international@w3.org from January to March 2014)

From: Henri Sivonen <hsivonen@hsivonen.fi>
Date: Wed, 26 Feb 2014 13:25:17 +0200
To: John Cowan <cowan@mercury.ccil.org>
Cc: "www-international@w3.org" <www-international@w3.org>
Message-ID: <CANXqsRLgjyUyLNqQFe5uMA48osyYyNSCH3cU5_N2rE7Se8sqXw@mail.gmail.com>

On Mon, Feb 24, 2014 at 9:31 PM, John Cowan <cowan@mercury.ccil.org> wrote:
> John Cowan scripsit:
>
>> That works only if you can get 100% of the documents on the legacy Web
>> correctly labeled or correctly guessable.  Until recently, there was a
>> spectacular bug in Chrome whereby any table-of-contents page generated by
>> latex2html, e.g. <http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-2.html>,
>> would show up as Chinese mojibake.  Without the ability to change the
>> interpretation of the encoding, the only alternative would have been to
>> load another browser just to read those pages (I used IETab on that site
>> for a while).
>
> I found a page with the same or a similar problem just today:
> <http://www.gnu.org/software/freefont/coverage.html> is pure ASCII
> and lacks a Content-Encoding: header, yet Chrome renders it as
> mojibake if you set the encoding to 8859-1 before loading the page.
> It does display correctly if you then change the encoding to UTF-8.
> That shouldn't happen.

It indeed shouldn't. The page looks reasonable in a hex editor. Have
you filed a Chrome bug? I think the conclusion is that Chrome needs to
fix something (works for me, though) and not that the pages you
mention are justifications for character encoding override UI.

- -

On the original topic of the thread:

The TLD-based guessing feature is on Firefox trunk now. However, not
all country TLDs are participating. I figured it is better to leave
unsure cases the way they were. It doesn't make sense to put a lot of
effort into researching those before seeing if the general approach
works for the case that it was designed for, specifically Traditional
Chinese (but see below about Traditional Chinese!). The success metric
I expect to be looking at is the usage of the character encoding menu
(whether it falls).

If this change turns out to be successful for the first batch of
obvious TLDs then I think it will be worthwhile to research the
unobvious cases.

The TLDs listed in
https://mxr.mozilla.org/mozilla-central/source/dom/encoding/nonparticipatingdomains.properties
do not participate at present (i.e. get a browser UI
localization-based guess like before). The TLDs listed in
https://mxr.mozilla.org/mozilla-central/source/dom/encoding/domainsfallbacks.properties
get the fallbacks listed in that file. All other TLDs map to
windows-1252.
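
To make the three-way rule above concrete, here is a rough sketch in
Python. It is not the Firefox code. The function name
fallback_encoding, the locale_fallback parameter and the few table
entries are made up for illustration; the real lists are whatever the
two .properties files contain. Treating a host without a dot as
non-participating is also my assumption, and IP addresses are not
handled.

```python
from urllib.parse import urlsplit

# Illustrative entries only (assumptions); the real lists live in
# nonparticipatingdomains.properties and domainsfallbacks.properties.
NON_PARTICIPATING = {"com", "net", "org"}
DOMAIN_FALLBACKS = {"tw": "Big5", "jp": "Shift_JIS", "ru": "windows-1251"}
DEFAULT_FALLBACK = "windows-1252"


def fallback_encoding(url, locale_fallback):
    """Pick a fallback encoding for an unlabeled page at `url`.

    `locale_fallback` stands for the old guess derived from the
    browser UI localization.
    """
    host = (urlsplit(url).hostname or "").rstrip(".")
    if "." not in host:
        # Assumption: no TLD to look at, so keep the old behavior.
        return locale_fallback
    tld = host.rpartition(".")[2]
    if tld in NON_PARTICIPATING:
        # Non-participating TLDs get the localization-based guess.
        return locale_fallback
    # Listed TLDs get their listed fallback; all others windows-1252.
    return DOMAIN_FALLBACKS.get(tld, DEFAULT_FALLBACK)
```

With these made-up tables, a .tw host gets Big5 regardless of the UI
language, a .com host gets whatever the localization would have
guessed, and a TLD that appears in neither table gets windows-1252.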

Since landing the feature, I've learned that I've misattributed the
cause of high frequency of character encoding menu usage in the case
of the Traditional Chinese localization. We've been shipping the
wrong fallback encoding (UTF-8) even after the fallback encoding was
supposedly fixed (to Big5). This shows what kind of a mess our
previous mechanism for setting the fallback encoding in a
locale-dependent way was. The fallback encoding for Traditional
Chinese will change to Big5 for real in Firefox 28.

I might have improved (hopefully; to be seen still) Firefox for the
wrong reason. Oops. :-)

Also, more baseline telemetry data (i.e. data without TLD-based
guessing) is now available. The last 3 weeks of Firefox 25 on the
release channel:
https://bug965707.bugzilla.mozilla.org/attachment.cgi?id=8381393 . The
last 3 weeks of Firefox 26 on the release channel:
https://bug965707.bugzilla.mozilla.org/attachment.cgi?id=8381394 . The
rows for locales with so little usage overall that even a couple of
sessions with encoding menu use put them at the top of the list
percentage-wise are grayed. In both cases, the top entries in black
are Traditional Chinese and Thai, both of which have the wrong
fallback due to the mess that the old localizability mechanism was. Up
next are CJK followed by the Cyrillic locales that have a detector on
by default (Russian and Ukrainian), which makes one wonder if the
detectors are doing more harm than good. Up next is Arabic, which has
the wrong fallback.

These wrong fallbacks are fixed in Firefox 28. In Firefox 28, no
locale falls back to UTF-8. Therefore, new baseline data from Firefox
28 and 29 is needed before eventually comparing with Firefox 30.

-- 
Henri Sivonen
hsivonen@hsivonen.fi
https://hsivonen.fi/

Received on Wednesday, 26 February 2014 11:25:49 UTC