[whatwg] Update on fallback encoding findings from Henri Sivonen on 2014-05-09 (public-whatwg-archive@w3.org from May 2014)

From: Henri Sivonen <hsivonen@hsivonen.fi>
Date: Fri, 9 May 2014 13:40:00 +0300
To: WHATWG <whatwg@whatwg.org>
Message-ID: <CANXqsR+4Vi3cZDBeXc-u8Om3vLiZKCFvRkWA+zUcEfMh+6=yXQ@mail.gmail.com>
A while ago, Hixie pinged me on IRC to ask if there is any news about
the character encoding stuff. While there is no news yet about
guessing the fallback encoding from the TLD of the site, there is now
some news about guessing the fallback encoding from the locale.

Data for Firefox 25:
https://bug965707.bugzilla.mozilla.org/attachment.cgi?id=8381393

Data for Firefox 26:
https://bug965707.bugzilla.mozilla.org/attachment.cgi?id=8381394

Data for Firefox 27:
https://bug965707.bugzilla.mozilla.org/attachment.cgi?id=8420031

Data for Firefox 28:
https://bug965707.bugzilla.mozilla.org/attachment.cgi?id=8420032

Specific findings:
 1) Prior to Firefox 28, Traditional Chinese Firefox had a bug that
caused the fallback to be UTF-8. Changing the fallback to Big5 in
Firefox 28 reduced the usage of the Character Encoding menu. (Please
note, however, that Firefox's notion of Big5 does not yet comply with
the Encoding Standard notion of Big5.)

 2) Prior to Firefox 28, Thai Firefox had a bug that caused the
fallback to be windows-1252. Changing the fallback to windows-874 in
Firefox 28 reduced the usage of the Character Encoding menu.

There were also other locales that had their fallback corrected per
spec in Firefox 28. However, for those locales, the changes were
within the variation seen between releases previously.
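
To make findings 1 and 2 concrete, here is a rough sketch of what a
reader sees when unlabeled bytes are decoded with the wrong fallback.
It is not derived from the telemetry above. It uses Python's built-in
codecs (cp874, cp1252, big5), which differ in details from both
Firefox's decoders and the Encoding Standard; the sample strings and
the helper name decode_with_fallback are made up for illustration.

```python
def decode_with_fallback(raw: bytes, fallback: str) -> str:
    """Decode unlabeled bytes using the given fallback encoding.

    Hypothetical helper: a browser does more than this (BOM sniffing,
    <meta> prescan, and so on). Undecodable bytes become U+FFFD.
    """
    return raw.decode(fallback, errors="replace")


# Sample content as a legacy-encoded page might serve it, without a label.
thai_text = "สวัสดี"
thai_bytes = thai_text.encode("cp874")      # Python's name for windows-874
chinese_text = "中文"
big5_bytes = chinese_text.encode("big5")

# Thai: the corrected fallback versus the old, buggy one.
thai_right = decode_with_fallback(thai_bytes, "cp874")
thai_wrong = decode_with_fallback(thai_bytes, "cp1252")

# Traditional Chinese: the corrected fallback versus the old, buggy one.
big5_right = decode_with_fallback(big5_bytes, "big5")
big5_wrong = decode_with_fallback(big5_bytes, "utf-8")

print(repr(thai_right), repr(thai_wrong))
print(repr(big5_right), repr(big5_wrong))
```

With the wrong fallback the Thai text comes out as Latin-script
mojibake and the Big5 text as replacement characters, which is the
kind of breakage that sends users to the Character Encoding menu.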

I think the finding about Traditional Chinese supports the conclusion
that we should not fall back to UTF-8 everywhere. I think the finding
about Thai supports the conclusion that we should not fall back to
windows-1252 everywhere. However, the results being in the noise for
some locales that had their fallback changed suggest that the labeling
practice isn't uniform around the world and some locales are relying
on the fallback less than others.
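
A locale-based fallback amounts to a lookup table with a default. The
following is a minimal sketch under stated assumptions, not Firefox's
implementation: the table holds only the three locale/encoding pairs
named in this message, the windows-1252 default is an assumption, and
the names LOCALE_FALLBACKS and fallback_for_locale are hypothetical.

```python
# Assumed default for locales not listed below.
DEFAULT_FALLBACK = "windows-1252"

# Deliberately incomplete: only pairs mentioned in this message.
LOCALE_FALLBACKS = {
    "zh-TW": "Big5",
    "th": "windows-874",
    "zh-CN": "gbk",  # what Firefox uses; the spec says gb18030
}


def fallback_for_locale(locale: str) -> str:
    """Return the fallback encoding label for a UI locale tag."""
    # Try the full tag first (e.g. "zh-TW"), then the bare language
    # subtag (e.g. "th" for "th-TH"), then the default.
    if locale in LOCALE_FALLBACKS:
        return LOCALE_FALLBACKS[locale]
    language = locale.split("-")[0]
    return LOCALE_FALLBACKS.get(language, DEFAULT_FALLBACK)


for tag in ("zh-TW", "th-TH", "zh-CN", "en-US"):
    print(tag, "->", fallback_for_locale(tag))
```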

Since locales using a non-Latin script are the leaders in Character
Encoding menu use even when there's only one dominant legacy encoding
within the locale, it seems that there is a continued tension between
the locale-specific fallback and fallback to windows-1252. Guessing
the fallback from the TLD is supposed to address this. I will report
findings once the TLD guessing has been on the release channel for six
weeks.

Also, the relatively high level of Character Encoding menu use for the
Korean locale continues to puzzle me. Judging merely by the structure
of the legacy encodings and by the neighboring locales using different
ones, one would expect the situation with the Korean locale and the
Hebrew locale to be very similar. Yet, it is not.

Finally worth noting: Firefox is committing a willful violation of the
spec when it comes to Simplified Chinese: The spec says gb18030, but
Firefox uses gbk. Starting with Firefox 29, the gbk *decoder* will be
the same as the gb18030 decoder. However, because we've previously
seen problems with EUC-JP and Big5 when expanding the range of byte
sequences that an *encoder* can produce in form submission, we are
keeping the gbk encoder distinct from the gb18030 encoder, at least for now.
I'm willing to reconsider if another browser (that has high market
share in China) successfully starts using the gb18030 encoder for form
submissions for sites that declare gbk (or gb2312) or don't declare an
encoding.
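
The encoder concern can be illustrated with Python's codecs, with the
caveat that Python's gbk and gb18030 codecs are not identical to
Firefox's or to the Encoding Standard's definitions. The sketch only
shows the general point that a gb18030 encoder can emit byte sequences
(here a four-byte one) that a gbk encoder never produces, which is
what a server expecting gbk form submissions might not handle. The
helper name can_encode and the sample characters are made up.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if `text` is fully representable in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


common = "中"      # representable in both encodings
outside = "😀"     # outside the gbk repertoire

# For a common character the two encoders produce the same bytes.
same_bytes = common.encode("gbk") == common.encode("gb18030")

# For a character outside gbk, only gb18030 can encode it,
# and it does so with a four-byte sequence.
gbk_ok = can_encode(outside, "gbk")
gb18030_ok = can_encode(outside, "gb18030")
four_byte = outside.encode("gb18030")

print(same_bytes, gbk_ok, gb18030_ok, len(four_byte))
```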

-- 
Henri Sivonen
hsivonen@hsivonen.fi
https://hsivonen.fi/
Received on Friday, 9 May 2014 10:40:26 UTC