- From: Henri Sivonen <hsivonen@hsivonen.fi>
- Date: Wed, 26 Feb 2014 13:25:17 +0200
- To: John Cowan <cowan@mercury.ccil.org>
- Cc: "www-international@w3.org" <www-international@w3.org>
On Mon, Feb 24, 2014 at 9:31 PM, John Cowan <cowan@mercury.ccil.org> wrote:
> John Cowan scripsit:
>
>> That works only if you can get 100% of the documents on the legacy Web
>> correctly labeled or correctly guessable. Until recently, there was a
>> spectacular bug in Chrome whereby any table-of-contents page generated by
>> latex2html, e.g. <http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-2.html>,
>> would show up as Chinese mojibake. Without the ability to change the
>> interpretation of the encoding, the only alternative would have been to
>> load another browser just to read those pages (I used IETab on that site
>> for a while).
>
> I found a page with the same or a similar problem just today:
> <http://www.gnu.org/software/freefont/coverage.html> is pure ASCII
> and lacks a Content-Encoding: header, yet Chrome renders it as
> mojibake if you set the encoding to 8859-1 before loading the page.
> It does display correctly if you then change the encoding to UTF-8.
> That shouldn't happen.

It indeed shouldn't. The page looks reasonable in a hex editor. Have you filed a Chrome bug?

I think the conclusion is that Chrome needs to fix something (works for me, though) and not that the pages you mention are justifications for character encoding override UI.

- -

On the original topic of the thread:

The TLD-based guessing feature is on Firefox trunk now. However, not all country TLDs are participating. I figured it is better to leave unsure cases the way they were. It doesn't make sense to put a lot of effort into researching those before seeing if the general approach works for the case that it was designed for, specifically Traditional Chinese (but see below about Traditional Chinese!). The success metric I expect to be looking at is whether usage of the character encoding menu falls. If this change turns out to be successful for the first batch of obvious TLDs, then I think it will be worthwhile to research the unobvious cases.
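To make the lookup order concrete, here is a minimal Python sketch of the three cases spelled out in the next paragraph. It is not the Firefox implementation: the function name `guess_fallback`, the two tables, and their entries are hypothetical stand-ins for the two .properties files, and the sketch ignores details such as IP-address hosts.

```python
from urllib.parse import urlsplit

# Hypothetical stand-ins for nonparticipatingdomains.properties and
# domainsfallbacks.properties; the entries are illustrative assumptions,
# not copied from mozilla-central.
NON_PARTICIPATING = {"com", "org", "net"}
TLD_FALLBACKS = {"tw": "Big5", "jp": "Shift_JIS", "ru": "windows-1251"}
DEFAULT_FALLBACK = "windows-1252"


def guess_fallback(url, locale_fallback):
    """Return the fallback encoding for an unlabeled page at `url`.

    `locale_fallback` is the guess derived from the browser UI
    localization, i.e. the behavior from before TLD-based guessing.
    """
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    if not tld or tld in NON_PARTICIPATING:
        # Non-participating TLD: keep the locale-based guess.
        return locale_fallback
    if tld in TLD_FALLBACKS:
        # Participating TLD with a listed fallback.
        return TLD_FALLBACKS[tld]
    # Every other TLD maps to windows-1252.
    return DEFAULT_FALLBACK
```

With these illustrative tables, `guess_fallback("http://example.tw/", "windows-1252")` yields Big5 regardless of the UI locale, while a .com page still gets whatever the locale-based guess was.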
The TLDs listed in https://mxr.mozilla.org/mozilla-central/source/dom/encoding/nonparticipatingdomains.properties do not participate at present (i.e. get a browser UI localization-based guess like before). The TLDs listed in https://mxr.mozilla.org/mozilla-central/source/dom/encoding/domainsfallbacks.properties get the fallbacks listed in that file. All other TLDs map to windows-1252.

Since landing the feature, I've learned that I've misattributed the cause of high frequency of character encoding menu usage in the case of the Traditional Chinese localization. We've been shipping with the wrong fallback encoding (UTF-8) even after the fallback encoding was supposedly fixed (to Big5). This shows what kind of a mess our previous mechanism for setting the fallback encoding in a locale-dependent way was. The fallback encoding for Traditional Chinese will change to Big5 for real in Firefox 28. I might have improved (hopefully; to be seen still) Firefox for the wrong reason. Oops. :-)

Also, more baseline telemetry data (i.e. data without TLD-based guessing) is now available.

The last 3 weeks of Firefox 25 on the release channel: https://bug965707.bugzilla.mozilla.org/attachment.cgi?id=8381393 .

The last 3 weeks of Firefox 26 on the release channel: https://bug965707.bugzilla.mozilla.org/attachment.cgi?id=8381394 .

The rows for locales with so little usage overall that even a couple of sessions with encoding menu use put them at the top of the list percentage-wise are grayed. In both cases, the top entries in black are Traditional Chinese and Thai, both of which have the wrong fallback due to the mess that the old localizability mechanism was. Up next are the CJK locales, followed by the Cyrillic locales that have a detector on by default (Russian and Ukrainian), which makes one wonder whether the detectors are doing more harm than good. Up next is Arabic, which has the wrong fallback.

These wrong fallbacks are fixed in Firefox 28. In Firefox 28, no locale falls back to UTF-8.
Therefore, new baseline data from Firefox 28 and 29 is needed before eventually comparing with Firefox 30.

--
Henri Sivonen
hsivonen@hsivonen.fi
https://hsivonen.fi/
Received on Wednesday, 26 February 2014 11:25:49 UTC