- From: Henri Sivonen <hsivonen@hsivonen.fi>
- Date: Thu, 2 Jan 2014 11:48:37 +0200
- To: "www-international@w3.org" <www-international@w3.org>
On Mon, Dec 23, 2013 at 1:56 AM, Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no> wrote:
> But there is also a chance - especially if the gotcha becomes a
> frequent issue - that authors would as well discover how to *trigger*
> UTF-8 detection.
...
> Why not simply us a BOM
...

Right. If you want to use an early UTF-8 byte sequence to trigger UTF-8 treatment, use the BOM as your early UTF-8 byte sequence. It already works cross-browser and cross-locale. (Though, granted, of possible UTF-8 byte sequences, it's the most brittle one in terms of text editors maybe silently removing it.)

> Why does Europe’s largest social network, www.vk.com,
> use Windows-1251 - even for Asian scripts?

From personal experience from over a decade ago, if you face a legacy code base written without much thought about encodings and a database containing bytes in the local legacy encoding plus numeric character references submitted by browsers and not sanitized in any way, it may be less disruptive to continuous operations to add code that formalizes the use of numeric character references in the database than to migrate everything to UTF-8. Of course, in the long term, you'd have been better off doing the UTF-8 migration up front.

On Mon, Dec 23, 2013 at 2:00 AM, Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no> wrote:
> Henri Sivonen, Thu, 19 Dec 2013 16:29:37 +0200:
>>
>> The list of TLDs that participate in the guessing and are not
>> windows-1252-affiliated is currently:
>>
> https://bugzilla.mozilla.org/attachment.cgi?id=8341644&action=diff#a/dom/encoding/domainsfallbacks.properties_sec2
>>
>> UTF-8 is never guessed, since it is not a legacy encoding.
>
> But not all domains are “legacy domains” either. Consider, from the
> above list, line 139 and 140:
>
> 139 ru=windows-1251
> 140 xn--p1ai=windows-1251
>
> where xn--p1ai refers to the RF-domain - .рф. Is there really no
> correlation between UTF-8 based domain names and use of the UTF-8
> encoding ... ?

xn--p1ai isn't a UTF-8 domain name. It's a Punycode domain name. :-)

Anyway, the feature avoids guessing outcomes that aren't already possible under the current localization-based guessing regime. That means never guessing UTF-8. Would you rather guess windows-1252 for xn--p1ai?

On Mon, Dec 23, 2013 at 11:17 AM, "Martin J. Dürst" <duerst@it.aoyama.ac.jp> wrote:
> Overall, I agree with the question by others of what's the expected "ROI" on
> this is. With UTF-8 being more and more popular for Web sites, the return
> for changing fallback encodings is definitely deminishing.

The return is definitely diminishing, but the issue of undeclared legacy encodings hasn't diminished far enough that no one ever asks for a character encoding menu. I think it is terribly sad that one was already added to Firefox for Android.

I think TLD-based guessing will have been worthwhile if it successfully prevents the addition of a character encoding override menu to the browser app on Firefox OS. (In practice, that means making sure that people who read Chinese and are involved in Firefox OS feature triage don't experience the need to override the encoding, even when using an en-US build to read unlabeled legacy-encoded Simplified or Traditional Chinese sites or when using zh-TW builds to read unlabeled legacy-encoded Simplified Chinese sites.) I think the mission would be completely accomplished if this feature allowed for the removal of the character encoding menu from all Gecko-based products down the road.

The character encoding override menu is not only bad UI; it also introduces a lot of complexity to the Web engine if you want the Web engine to be secure even when operated by users who don't know about the security properties of character encodings. In 2012 and 2013, I spent a significant amount of time making Gecko secure against the sort of XSS that involves tricking the user into using the character encoding menu, and making the character encoding menu UI less terrible in terms of usability.

One might take the stance that the bulk of that work is already sunk cost and there's no need to get rid of the menu after it's *almost* been fixed in the browser *I* work on, but I've instead taken the stance that neither I nor anyone else should have to deal with the complication arising from the character encoding menu in the future in Firefox or in another code base.

-- 
Henri Sivonen
hsivonen@hsivonen.fi
https://hsivonen.fi/
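
The decision order described in this message (a UTF-8 BOM triggers UTF-8 treatment; otherwise an unlabeled page gets a legacy fallback chosen by TLD, and UTF-8 is never a guessed outcome) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Gecko's actual code: the names `guess_encoding`, `tld_to_unicode`, `TLD_FALLBACKS` and `DEFAULT_FALLBACK` are hypothetical, the table holds only the two entries quoted above, the windows-1252 default for unlisted TLDs is an assumption, and HTTP headers and `<meta>` declarations are ignored entirely.

```python
import codecs

# Only the two entries quoted in the message; the real list is longer.
TLD_FALLBACKS = {
    "ru": "windows-1251",
    "xn--p1ai": "windows-1251",
}

# Assumption for this sketch: unlisted TLDs fall back to windows-1252.
DEFAULT_FALLBACK = "windows-1252"


def guess_encoding(head: bytes, hostname: str) -> str:
    """Pick an encoding for an unlabeled page (hypothetical helper)."""
    # A UTF-8 BOM is an explicit signal, so it wins over any guessing.
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # Otherwise guess from the TLD. UTF-8 is never a guessed outcome.
    tld = hostname.rstrip(".").rsplit(".", 1)[-1].lower()
    return TLD_FALLBACKS.get(tld, DEFAULT_FALLBACK)


def tld_to_unicode(tld: str) -> str:
    """Decode a Punycode (ACE) label such as xn--p1ai for display."""
    return tld.encode("ascii").decode("idna")
```

For example, `guess_encoding(b"<html>", "example.xn--p1ai")` looks up the ASCII label `xn--p1ai` directly, which is the form the table stores; `tld_to_unicode("xn--p1ai")` is needed only to display the label as `.рф`.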
Received on Thursday, 2 January 2014 09:49:05 UTC