- From: Henri Sivonen <hsivonen@hsivonen.fi>
- Date: Fri, 9 May 2014 13:40:00 +0300
- To: WHATWG <whatwg@whatwg.org>
A while ago, Hixie pinged me on IRC to ask if there was any news about the character encoding stuff. While there is no news yet about guessing the fallback encoding from the TLD of the site, there is now some news about guessing the fallback encoding from the locale.

Data for Firefox 25: https://bug965707.bugzilla.mozilla.org/attachment.cgi?id=8381393
Data for Firefox 26: https://bug965707.bugzilla.mozilla.org/attachment.cgi?id=8381394
Data for Firefox 27: https://bug965707.bugzilla.mozilla.org/attachment.cgi?id=8420031
Data for Firefox 28: https://bug965707.bugzilla.mozilla.org/attachment.cgi?id=8420032

Specific findings:

1) Prior to Firefox 28, Traditional Chinese Firefox had a bug that caused the fallback to be UTF-8. Changing the fallback to Big5 in Firefox 28 reduced the usage of the Character Encoding menu. (Please note, however, that Firefox's notion of Big5 does not yet comply with the Encoding Standard notion of Big5.)

2) Prior to Firefox 28, Thai Firefox had a bug that caused the fallback to be windows-1252. Changing the fallback to windows-874 in Firefox 28 reduced the usage of the Character Encoding menu.

There were also other locales that had their fallback corrected per spec in Firefox 28. However, for those locales, the changes were within the variation seen between releases previously.

I think the finding about Traditional Chinese supports the conclusion that we should not fall back to UTF-8 everywhere. I think the finding about Thai supports the conclusion that we should not fall back to windows-1252 everywhere. However, the results being in the noise for some locales that had their fallback changed suggest that the labeling practice isn't uniform around the world and that some locales rely on the fallback less than others.
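To make the locale-based fallback concrete, here is a minimal sketch in Python. It is not Firefox's implementation: the table name, the function name and the subtag-trimming lookup are assumptions made for illustration, and the table lists only the locales mentioned in this message, with windows-1252 as the default for everything else.

```python
# Hypothetical sketch of a locale-to-fallback-encoding lookup.
# Only the locales discussed in this message are listed; a real table
# (for example the one in the Encoding Standard) has many more entries.
FALLBACK_BY_LOCALE = {
    "zh-tw": "big5",         # Traditional Chinese
    "th": "windows-874",     # Thai
    "zh-cn": "gb18030",      # Simplified Chinese, per the spec
}

# Default used when no locale-specific entry matches.
DEFAULT_FALLBACK = "windows-1252"


def fallback_encoding(locale: str) -> str:
    """Return the fallback encoding label for a UI locale tag.

    The tag is normalized to lowercase with hyphens, then trailing
    subtags are trimmed one at a time until an entry matches.
    """
    tag = locale.strip().lower().replace("_", "-")
    while tag:
        if tag in FALLBACK_BY_LOCALE:
            return FALLBACK_BY_LOCALE[tag]
        # Drop the last subtag: "th-th" becomes "th", "th" becomes "".
        tag = tag.rpartition("-")[0]
    return DEFAULT_FALLBACK


if __name__ == "__main__":
    for example in ("zh-TW", "th_TH", "en-US"):
        print(example, fallback_encoding(example))
```

The lookup only decides what to use when a page carries no encoding declaration at all. It says nothing about how often pages in a given locale are unlabeled, which is what the menu usage data above measures.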
Since locales using a non-Latin script are the leaders in Character Encoding menu use even when there's only one dominant legacy encoding within the locale, it seems that there is a continued tension between the locale-specific fallback and fallback to windows-1252. Guessing the fallback from the TLD is supposed to address this. I will report findings once the TLD guessing has been on the release channel for six weeks.

Also, the relatively high level of Character Encoding menu use for the Korean locale continues to puzzle me. From looking at the mere structure of the legacy or the neighboring locales being different, one should expect the situation with the Korean locale and the Hebrew locales to be very similar. Yet, it is not.

Finally, worth noting: Firefox is committing a willful violation of the spec when it comes to Simplified Chinese. The spec says gb18030, but Firefox uses gbk. Starting with Firefox 29, the gbk *decoder* will be the same as the gb18030 decoder. However, because we've previously seen problems with EUC-JP and Big5 when expanding the range of byte sequences that an *encoder* can produce in form submission, we are keeping the gbk encoder distinct from the gb18030 encoder, at least for now. I'm willing to reconsider if another browser (one that has high market share in China) successfully starts using the gb18030 encoder for form submissions for sites that declare gbk (or gb2312) or don't declare an encoding.

-- 
Henri Sivonen
hsivonen@hsivonen.fi
https://hsivonen.fi/
Received on Friday, 9 May 2014 10:40:26 UTC