- From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
- Date: Sun, 11 Dec 2011 12:21:40 +0100
Henri Sivonen Fri Dec 9 05:34:08 PST 2011:
> On Fri, Dec 9, 2011 at 12:33 AM, Leif Halvard Silli:
>> Henri Sivonen Tue Dec 6 23:45:11 PST 2011:
>> These localizations are nevertheless live tests. If we want to move
>> more firmly in the direction of UTF-8, one could ask users of those
>> 'live tests' about their experience.
>
> Filed https://bugzilla.mozilla.org/show_bug.cgi?id=708995

This is brilliant. Looking forward to the results!

>>> (which means
>>> *other-language* pages when the language of the localization doesn't
>>> have a pre-UTF-8 legacy).
>>
>> Do you have any concrete examples?
>
> The example I had in mind was Welsh.

A logical candidate. What do you know about the Farsi and Arabic locales? HTML5 specifies UTF-8 for them - due to the way Firefox behaves, I think. IE seems to be the dominant browser for these locales - at least in Iran. Firefox was number two in Iran, but still only at around 5 percent in the stats I saw.

By the way, as I looked into Iran a bit, I discovered that "UNICODE" is used as an alias for "UTF-16" in IE and WebKit. For XML, WebKit, Firefox and Opera see it as a non-fatal error (though Opera just treats all illegal encoding names that way), while IE9 seems to see it as fatal. Filed an HTML5 bug: https://www.w3.org/Bugs/Public/show_bug.cgi?id=15142

Seemingly, this has not affected Firefox users too much. Which must EITHER mean that many of these pages *are* UTF-16 encoded, OR that their content is predominantly US-ASCII, so that the artefacts of parsing UTF-8 pages ("UTF-16" should be treated as "UTF-8" when the content isn't actually UTF-16) as WINDOWS-1252 do not affect users too much.

I mention it here for three reasons:

(1) charset=Unicode inside <meta> is caused by MSHTML, including Word. And Boris mentioned Word's behaviour as a reason for keeping the legacy defaulting. However, when MSHTML saves with charset=UNICODE, then defaulting to the legacy encoding is not the correct behaviour for browsers. (I don't know exactly when MSHTML spits out charset=UNICODE, though - or whether it depends on the locale - or what.)

(2) For the user tests you suggested in Mozilla bug 708995 (above), the presence of <meta charset=UNICODE> would trigger a need for Firefox users to select UTF-8 manually - unless the locale already defaults to UTF-8.

(3) That HTML5 bug 15142 (see above) has been unknown (?) till now, despite affecting Firefox and Opera, hints that for the "WINDOWS-1252 languages", when pages are served as UTF-8 but parsed as WINDOWS-1252 (by Firefox and Opera), users survive. (Of course, some of these pages will be "picked up" by an Apache Content-Type header declaring the encoding, or perhaps by chardet?)

>> And are there user complaints?
>
> Not that I know of, but I'm not part of a feedback loop if there even
> is a feedback loop here.
>
>> The Serb localization uses UTF-8. The Croat uses Win-1252, but only on
>> Windows and Mac: On Linux it appears to use UTF-8, if I read the HG
>> repository correctly.
>
> OS-dependent differences are *very* suspicious. :-(

Mmm, yes.

>>> I think that defaulting to UTF-8 is always a bug, because at the time
>>> these localizations were launched, there should have been no unlabeled
>>> UTF-8 legacy, because up until these locales were launched, no
>>> browsers defaulted to UTF-8 (broadly speaking). I think defaulting to
>>> UTF-8 is harmful, because it makes it possible for locale-siloed
>>> unlabeled UTF-8 content come to existence
>>
>> The current legacy encodings nevertheless creates siloed pages already.
>> I'm also not sure that it would be a problem with such a UTF-8 silo:
>> UTF-8 is possible to detect, for browsers - Chrome seems to perform
>> more such detection than other browsers.
>
> While UTF-8 is possible to detect, I really don't want to take Firefox
> down the road where users who currently don't have to suffer page load
> restarts from heuristic detection have to start suffering them. (I
> think making incremental rendering any less incremental for locales
> that currently don't use a detector is not an acceptable solution for
> avoiding restarts. With English-language pages, the UTF-8ness might
> not be apparent from the first 1024 bytes.)

FIRSTLY, HTML5 says:

]]
8.2.2.4 Changing the encoding while parsing
[...] This might happen if the encoding sniffing algorithm described above failed to find an encoding, or if it found an encoding that was not the actual encoding of the file.
[[

Thus, trying to detect UTF-8 is the second-to-last step of the sniffing algorithm. If it correctly detects UTF-8, then, while the detection probably affects performance, detecting UTF-8 should not lead to a need for re-parsing the page?

SECONDLY: If there is a UTF-8 silo which leads to undeclared UTF-8 pages, then it is the browsers *outside* that silo which eventually suffer (browsers that do default to UTF-8 do not need to perform UTF-8 detection, I suppose - or what?). So it is partly a matter of how large the silo is. Regardless, we must consider the alternative: instead of undeclared UTF-8 pages we would, roughly speaking, have undeclared legacy-encoded pages, which the browsers outside the silo would then have to detect - and which would be more *demanding* to detect than simply detecting UTF-8.

However, what you had in mind was changing the default encoding for a particular silo from a legacy encoding to UTF-8. This, I agree, would lead to some pages being treated as UTF-8 - to begin with. But when the browser detects that this is wrong, it would have to switch to - probably - the "old" default, the legacy encoding. However, why would it switch *from* UTF-8 if UTF-8 is the default? We must keep the problem in mind: for the siloed browser, UTF-8 will be its fall-back encoding.

>> In another message you suggested I 'lobby' against authoring tools. OK.
>> But the browser is also an authoring tool.
>
> In what sense?

The problem with defaults is that they take effect without one's knowledge. Or one may think everything is OK, until one sees that it isn't.

The ReSpec.js script works in your browser, and saving the output is one of the problems it has: http://dev.w3.org/2009/dap/ReSpec.js/documentation.html#saving-the-generated-specification Quote: "And sadly enough browsers aren't very good at saving HTML that's been modified by script." The docs do not discuss the encoding problem, but I believe that is exactly one of the problems it has:

* A browser will save the page using the page's encoding.
* A browser will not add a META element if the page doesn't have one.

Thus, if it is HTTP which specifies the encoding, then saving the page to the computer means that the next time it is opened - from the hard disk - it will default to the locale default, meaning that one must select UTF-8 manually to make the page readable. (MSHTML - aka IE - will add the encoding, such as charset=UNICODE, if you switch the encoding during saving; I'm not exactly sure about the exact requirements.)

This probably needs more thought and more ideas, but what can be done to make this better?
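One possible direction, sketched very roughly in Python just to illustrate (the function and its names are made up by me; no browser actually does this today): when the encoding is known only from HTTP, write it into the document before saving, so that the information survives the round trip.

import re

META_CHARSET = re.compile(rb'<meta[^>]+charset\s*=', re.IGNORECASE)

def save_page(body_bytes, http_charset, path):
    """Save fetched HTML so that its encoding survives the round trip.

    body_bytes   -- the undecoded bytes as received from the server
    http_charset -- the charset from the Content-Type header, or None
    path         -- where to save the file on disk
    """
    # Look for a charset declaration the way the prescan does: in the
    # first 1024 bytes only.
    has_meta = META_CHARSET.search(body_bytes[:1024]) is not None
    if http_charset and not has_meta:
        # The encoding was declared only on the HTTP layer; once the file
        # is opened from disk that information is gone, so write it into
        # the document itself. (The label written here should be a sane,
        # canonical one - not something like "UNICODE", which is what
        # MSHTML emits.)
        meta = ('<meta charset="%s">' % http_charset).encode('ascii')
        body_bytes = re.sub(rb'<head[^>]*>',
                            lambda m: m.group(0) + meta,
                            body_bytes, count=1, flags=re.IGNORECASE)
    with open(path, 'wb') as f:
        f.write(body_bytes)

That is only one option, of course, and it has its own problems: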
One reason for the browser not to add <meta charset="something" /> if the page doesn't have it already is, perhaps, that it could be incorrect - maybe because the user changed the encoding manually. But if we consider how text editors - e.g. on the Mac - have been working for a while now, then you have to take steps if you *don't* want to save the page as UTF-8. So perhaps browsers could start to behave the same way? That is: regardless of the original encoding, save it as UTF-8, unless the user overrides it?

* Another idea: perform heuristics more extensively when the file is on the hard disk than when it is online? No, this could lead users to think a page works online just because it works offline.

>> So how can we have authors
>> output UTF-8, by default, without changing the parsing default?
>
> Changing the default is an XML-like solution: creating breakage for
> users (who view legacy pages) in order to change author behavior.

That reasoning doesn't consider that everyone who saves an HTML page from the Web to their hard disk is an author. One avoids making the round-trip behaviour more reliable because there exists an ever-diminishing amount of legacy-encoded pages.

> To the extent a browser is a tool Web authors use to test stuff, it's
> possible to add various whining to console without breaking legacy
> sites for users. See
> https://bugzilla.mozilla.org/show_bug.cgi?id=672453
> https://bugzilla.mozilla.org/show_bug.cgi?id=708620

Good stuff!

>> Btw: In Firefox, then in one sense, it is impossible to disable
>> "automatic" character detection: In Firefox, overriding of the encoding
>> only lasts until the next reload.
>
> A persistent setting for changing the fallback default is in the
> "Advanced" subdialog of the font prefs in the "Content" preference
> pane.

I know. I was not commenting, here, on the "global" default encoding, but on a subtle difference between the effect of a manual override in Firefox (and IE) compared to, especially, Safari. In Safari - if you have e.g. a UTF-8 page that is otherwise correctly made, e.g. with <meta charset="UTF-8"> - a manual switch to e.g. KOI8-R has a lasting effect in the current tab: you can reload the page as many times as you wish, and each time it will be treated as KOI8-R. In Firefox and IE, by contrast, the manual switch to KOI8-R only lasts for one reload; the next time you reload, the browser listens to the encoding signals from the page and from the server again. Opera, instead, remembers your manual switch of the encoding even if you open the page in a new tab or window, and even after a browser restart. Opera is alone in doing this, which I think is against HTML5: HTML5 only allows the browser to override what the page says *provided* that the page doesn't say anything ... (As such, even the Safari behaviour is dubious, I'd say. FWIW, iCab allows you to tell it to "please start listening to the signals from the page and the server again".)

SO: What I meant by "impossible to disable" was that Firefox and IE, from the user's perspective, behave "automatically" even if auto-detect is disabled: they listen to the signals from the page and the server rather than, like Safari and Opera, to the "last signal from the user".

> It's rather counterintuitive that the persistent autodetection
> setting is in the same menu as the one-off override.

You mean View -> Character Encoding -> Auto-Detect -> Off? Anyway: I agree that the encoding menus could be simpler and clearer.
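Before I continue about the menus: to make concrete what I mean when I say that UTF-8 is possible to detect, and what I will below call guessing *only* on UTF-8, here is a rough sketch, again in Python purely for illustration. It is my own idea of such a check, not what any existing detector actually does, and the fallback order is mine as well.

import codecs

def looks_like_utf8(prefix):
    """Return True if the first bytes of a page look like non-ASCII UTF-8."""
    decoder = codecs.getincrementaldecoder('utf-8')()
    try:
        # final=False tolerates a multi-byte character that the prefix
        # happens to cut in half, but still rejects invalid sequences.
        text = decoder.decode(prefix, final=False)
    except UnicodeDecodeError:
        return False
    # Pure ASCII proves nothing: the legacy fall-back would render it
    # identically anyway, so leave such pages to the fall-back.
    return any(ord(ch) > 0x7f for ch in text)

def choose_encoding(prefix, parent_frame_encoding=None,
                    locale_default='windows-1252'):
    """Guess only on UTF-8; otherwise use the parent frame or the locale."""
    if looks_like_utf8(prefix):
        return 'utf-8'
    return parent_frame_encoding or locale_default

The point is that such a check can only answer "yes, UTF-8" or "no opinion" - it never tries to choose between the legacy encodings, which is where the existing detectors get greedy (I come back to a concrete iframe example below).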
I think the most counter-intuitive thing is to use the word "auto-detect" for the heuristic detection - see what I said above about Firefox and IE behaving "automatically" even when auto-detect is disabled. Opera's default setting is called "Automatic selection". So it is "all automatic" ...

> As for heuristic detection based on the bytes of the page, the only
> heuristic that can't be disabled is the heuristic for detecting
> BOMless UTF-16 that encodes Basic Latin only. (Some Indian bank was
> believed to have been giving that sort of files to their customers and
> it "worked" in pre-HTML5 browsers that silently discarded all zero
> bytes prior to tokenization.) The Cyrillic and CJK detection
> heuristics can be turned on and off by the user.

I always wondered what the "Universal" detection meant. Is it simply UTF-8 detection? Or does it also detect other encodings? Unicode is sometimes referred to as the "universal" encoding/character set; if that is what is meant, then "Unicode" would have been clearer than "Universal". Hm, it seems that "universal" rather than "Unicode" is what is meant, in which case "All" or something similar would have been better ...

So it seems that it is not possible to *enable* only *UTF-8 detection*: the only option for getting UTF-8 detection is to use the Universal detection - which enables everything. It seems to me that if you offered *only* UTF-8 detection, then you would have something useful up your sleeve if you want to tempt the localizers *away* from defaulting to UTF-8. Because, as I said above: if the browser *defaults* to UTF-8, then UTF-8 detection isn't so useful (it would then only be useful for detecting that a page is *not* Unicode). So let's say that you tell your Welsh localizer: "Please switch to WINDOWS-1252 as the default, and then, in exchange, I'll allow you to enable this brand new UTF-8 detection." Would that make sense?

> Within an origin, Firefox considers the parent frame and the previous
> document in the navigation history as sources of encoding guesses.
> That behavior is not user-configurable to my knowledge.

With regard to iframes: the "big in Norway" newspaper Dagbladet.no is declared ISO-8859-1 encoded, and it includes at least one ads iframe that is undeclared ISO-8859-1 encoded.

* If I change the default encoding of Firefox to UTF-8, then the main page works but that ad fails, encoding-wise.
* But if I enable the Universal encoding detector, the ad does not fail.
* Let's say that I *kept* ISO-8859-1 as the default encoding, but instead enabled the Universal detector. The frame then works.
* But if I make the frame page very short, with ten times the letter "?" as its only content, then the Universal detector fails: in a test on my own computer, it guessed the page to be Cyrillic rather than Norwegian.
* What's the problem? The Universal detector is too greedy - it tries to fix more problems than I have. I only want it to guess on UTF-8, and if it doesn't detect UTF-8, then it should fall back to the locale default (including falling back to the encoding of the parent frame). Wouldn't that be an idea?

> Firefox also remembers the encoding from previous visits as long as
> Firefox otherwise has the page in cache. So for testing, it's
> necessary to make Firefox forget about previous visits to the test
> case.

--
Leif H Silli
Received on Sunday, 11 December 2011 03:21:40 UTC