[whatwg] Default encoding to UTF-8?

Henri Sivonen Fri Dec 9 05:34:08 PST 2011:
> On Fri, Dec 9, 2011 at 12:33 AM, Leif Halvard Silli:
>> Henri Sivonen Tue Dec 6 23:45:11 PST 2011:
>> These localizations are nevertheless live tests. If we want to move
>> more firmly in the direction of UTF-8, one could ask users of those
>> 'live tests' about their experience.
> 
> Filed https://bugzilla.mozilla.org/show_bug.cgi?id=708995

This is brilliant. Looking forward to the results!

>>> (which means
>>> *other-language* pages when the language of the localization doesn't
>>> have a pre-UTF-8 legacy).
>>
>> Do you have any concrete examples?
> 
> The example I had in mind was Welsh.

Logical candidate. What do you know about the Farsi and Arabic 
locales? HTML5 specifies UTF-8 for them - due to the way Firefox 
behaves, I think. IE seems to be the dominant browser for these 
locales - at least in Iran. Firefox was number two in Iran, but still 
only at around 5 percent in the stats I saw.

Btw, as I looked into Iran a bit ... I discovered that "UNICODE" is 
used as an alias for "UTF-16" in IE and WebKit. For XML, WebKit, 
Firefox and Opera treat it as a non-fatal error (though Opera treats 
all illegal encoding names that way), while IE9 seems to treat it as 
fatal. I have filed an HTML5 bug:

https://www.w3.org/Bugs/Public/show_bug.cgi?id=15142

Seemingly, this has not affected Firefox users too much. Which must 
EITHER mean that many of these pages *are* UTF-16 encoded, OR that 
their content is predominantly US-ASCII, so that the artefacts of 
parsing UTF-8 pages as WINDOWS-1252 (a "UTF-16" label should be 
treated as "UTF-8" when the content isn't actually UTF-16) do not 
affect users too much.

I mention it here for 3 reasons: 

 (1) charset=Unicode inside <meta> is produced by MSHTML, including 
by Word. And Boris mentioned Word's behaviour as a reason for keeping 
the legacy defaulting. However, when MSHTML saves a page with 
charset=UNICODE, then falling back to the legacy default is not the 
correct behaviour for browsers. (I don't know exactly when MSHTML 
spits out charset=UNICODE, though - or whether that depends on the 
locale - or what.)

 (2) for the user tests you suggested in Mozilla bug 708995 (above), 
the presence of <meta charset=UNICODE> would force Firefox users to 
select UTF-8 manually - unless the locale already defaults to UTF-8; 

 (3) that HTML5 bug 15142 (see above) has apparently been unknown 
till now, despite affecting Firefox and Opera, hints that for the 
"WINDOWS-1252 languages", when pages are served as UTF-8 but parsed 
as WINDOWS-1252 (by Firefox and Opera), users survive. (Of course, 
some of these pages will be "picked up" by an Apache Content-Type: 
header declaring the encoding, or perhaps by chardet.)
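
To make the "users survive" point concrete, here is a minimal sketch 
(Python, with a made-up Norwegian sample string) of what happens when 
UTF-8 bytes are mis-decoded as WINDOWS-1252: every US-ASCII byte maps 
to the same character in both encodings, so only the non-ASCII 
characters turn into mojibake:

  # Minimal sketch: UTF-8 bytes mis-decoded as WINDOWS-1252.
  # The US-ASCII part survives untouched; only non-ASCII garbles.
  sample = "Blåbær og rømme, takk!"          # hypothetical page text
  utf8_bytes = sample.encode("utf-8")

  as_cp1252 = utf8_bytes.decode("windows-1252", errors="replace")
  print(as_cp1252)
  # -> "BlÃ¥bÃ¦r og rÃ¸mme, takk!"
  # Bytes 0x00-0x7F are identical in UTF-8 and WINDOWS-1252, which is
  # why predominantly US-ASCII pages stay readable despite the wrong
  # decoder.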

>> And are there user complaints?
> 
> Not that I know of, but I'm not part of a feedback loop if there even
> is a feedback loop here.
> 
>> The Serb localization uses UTF-8. The Croat uses Win-1252, but only on
>> Windows and Mac: On Linux it appears to use UTF-8, if I read the HG
>> repository correctly.
> 
> OS-dependent differences are *very* suspicious. :-(

Mmm, yes. 

>>> I think that defaulting to UTF-8 is always a bug, because at the time
>>> these localizations were launched, there should have been no unlabeled
>>> UTF-8 legacy, because up until these locales were launched, no
>>> browsers defaulted to UTF-8 (broadly speaking). I think defaulting to
>>> UTF-8 is harmful, because it makes it possible for locale-siloed
>>> unlabeled UTF-8 content come to existence
>>
>> The current legacy encodings nevertheless creates siloed pages already.
>> I'm also not sure that it would be a problem with such a UTF-8 silo:
>> UTF-8 is possible to detect, for browsers - Chrome seems to perform
>> more such detection than other browsers.
> 
> While UTF-8 is possible to detect, I really don't want to take Firefox
> down the road where users who currently don't have to suffer page load
> restarts from heuristic detection have to start suffering them. (I
> think making incremental rendering any less incremental for locales
> that currently don't use a detector is not an acceptable solution for
> avoiding restarts. With English-language pages, the UTF-8ness might
> not be apparent from the first 1024 bytes.)

FIRSTLY, HTML5:

]] 8.2.2.4 Changing the encoding while parsing
[...] This might happen if the encoding sniffing algorithm described 
above failed to find an encoding, or if it found an encoding that was 
not the actual encoding of the file. [[

Thus, trying to detect UTF-8 is the second-to-last step of the 
sniffing algorithm. If it correctly detects UTF-8, then, while the 
detection probably affects performance, it should not lead to a need 
for re-parsing the page?
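
Just to illustrate the idea (not the spec's exact algorithm), a 
pre-parse UTF-8 check can be as simple as testing whether an initial 
chunk of the byte stream decodes as well-formed UTF-8 - here as a 
Python sketch with a hypothetical looks_like_utf8 helper:

  def looks_like_utf8(prefix):
      """Return True if the byte prefix is consistent with UTF-8.

      A multi-byte sequence cut off at the very end of the prefix is
      tolerated, since the cut-off point is arbitrary."""
      try:
          prefix.decode("utf-8")
          return True
      except UnicodeDecodeError as e:
          return e.start >= len(prefix) - 3

  # If a check like this runs on, say, the first 1024 bytes *before*
  # tokenization starts, a correct UTF-8 guess never forces a
  # re-parse; the cost is the scan itself, not a page-load restart.
  print(looks_like_utf8("Blåbær".encode("utf-8")))         # True
  print(looks_like_utf8("Blåbær".encode("windows-1252")))  # False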

SECONDLY: If there is a UTF-8 silo that leads to undeclared UTF-8 
pages, then it is the browsers *outside* that silo which eventually 
suffer (browsers that do default to UTF-8 do not need to perform 
UTF-8 detection, I suppose - or do they?). So it is partly a matter 
of how large the silo is.

Regardless, we must consider: the alternative to undeclared UTF-8 
pages would, roughly speaking, be undeclared legacy-encoded pages, 
which the browsers outside the silo would then have to detect - and 
which would be more *demanding* to detect than simply detecting UTF-8.

However, what you had in mind was changing the default encoding for a 
particular silo from a legacy encoding to UTF-8. This, I agree, would 
lead to some pages being treated as UTF-8 - to begin with. But when 
the browser detects that this is wrong, it would have to switch to - 
probably - the "old" default, the legacy encoding.

However, why would it switch *from* UTF-8 if UTF-8 is the default? We 
must keep the problem in mind: For the siloed browser, UTF-8 will be 
its fall-back encoding.

>> In another message you suggested I 'lobby' against authoring tools. OK.
>> But the browser is also an authoring tool.
> 
> In what sense?

The problem with defaults is that they take effect without one's 
knowledge - or one may think everything is OK, until one sees that it 
isn't. 

The ReSpec.js script works in your browser, and saving the output is 
one of the problems it has: 
http://dev.w3.org/2009/dap/ReSpec.js/documentation.html#saving-the-generated-specification

Quote: "And sadly enough browsers aren't very good at saving HTML 
that's been modified by script."

The docs do not discuss the encoding problem. But I believe that is 
exactly one of the problems it has.

* A browser will save the page using the page's encoding.
* A browser will not add a META element if the page doesn't have 
  one. Thus, if it is HTTP that specifies the encoding, then saving 
  the page on the computer means that the next time it opens - from 
  the hard disk - the page will default to the locale default, 
  meaning that one must select UTF-8 to make the page readable. 
  (MSHTML - aka IE - will add the encoding - such as 
  charset=UNICODE ... - if you switch the encoding during saving
  - I'm not exactly sure about the requirements.)

This probably needs more thought and more ideas, but what can be done 
to make this better? One reason for a browser not to add <meta 
charset="something" /> if the page doesn't already have it is, 
perhaps, that it could be incorrect - maybe because the user changed 
the encoding manually. But if we consider how text editors - e.g. on 
the Mac - have been working for a while now, then you have to take 
steps if you *don't* want to save the page as UTF-8. So perhaps 
browsers could start to behave the same way? That is: regardless of 
the original encoding, save it as UTF-8, unless the user overrides it?
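
As a rough sketch of what "save as UTF-8 unless the user overrides 
it" could look like - Python, with a hypothetical save_page helper; a 
real browser would of course serialize its own DOM rather than do 
string surgery:

  import re

  def save_page(html_text, path):
      """Hypothetical helper: write a page to disk as UTF-8 and make
      sure the encoding survives the round trip.

      If the markup has no in-document encoding declaration, inject
      <meta charset="UTF-8"> right after <head>, so that re-opening
      the file from the hard disk does not fall back to the locale
      default."""
      if not re.search(r'<meta[^>]+charset', html_text, re.IGNORECASE):
          html_text = re.sub(r'(<head[^>]*>)',
                             r'\1<meta charset="UTF-8">',
                             html_text, count=1, flags=re.IGNORECASE)
      with open(path, "w", encoding="utf-8") as f:
          f.write(html_text)

  save_page("<!DOCTYPE html><html><head><title>t</title></head>"
            "<body>Blåbær</body></html>", "saved.html")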

 * Another idea: perform heuristics more extensively when the file is 
on the hard disk than when it is online? No, this could lead users to 
think a page works online because it works offline. 
 
>> So how can we have authors
>> output UTF-8, by default, without changing the parsing default?
> 
> Changing the default is an XML-like solution: creating breakage for
> users (who view legacy pages) in order to change author behavior.

That reasoning doesn't consider that everyone who saves an HTML page 
from the Web to their hard disk is an author. One avoids making the 
round-trip behaviour more reliable because there exists an ever 
diminishing amount of legacy-encoded pages.

> To the extent a browser is a tool Web authors use to test stuff, it's
> possible to add various whining to console without breaking legacy
> sites for users. See
> https://bugzilla.mozilla.org/show_bug.cgi?id=672453
> https://bugzilla.mozilla.org/show_bug.cgi?id=708620

Good stuff!

>> Btw: In Firefox, then in one sense, it is impossible to disable
>> "automatic" character detection: In Firefox, overriding of the encoding
>> only lasts until the next reload.
> 
> A persistent setting for changing the fallback default is in the
> "Advanced" subdialog of the font prefs in the "Content" preference
> pane.

I know. I was not commenting, here, on the "global" default encoding, 
but on a subtle difference between the effect of a manual override in 
Firefox (and IE) compared to especially Safari. In Safari - if you 
have e.g. a UTF-8 page that is otherwise correctly made, e.g. with 
<meta charset="UTF-8"> - then a manual switch to e.g. KOI8-R has a 
lasting effect in the current tab: you can reload the page as many 
times as you wish; each time it will be treated as KOI8-R. In Firefox 
and IE, by contrast, the manual switch to KOI8-R only lasts for one 
reload. The next time you reload, the browser listens to the encoding 
signals from the page and from the server again. 

Opera, instead, remembers your manual switch of the encoding even if 
you open the page in a new tab or window, and even after a browser 
restart. Opera is alone in doing this, which I think is against 
HTML5: HTML5 only allows the browser to override what the page says 
*provided* that the page doesn't say anything ... (As such, even the 
Safari behaviour is dubious, I'd say. FWIW, iCab allows you to tell 
it to "please start listening to the signals from the page and server 
again".)

SO: What I meant by "impossible to disable", thus, was that Firefox 
and IE, from the user's perspective, behave "automatically" even if 
auto-detection is disabled: they listen to the signals from the page 
and server rather than, like Safari and Opera, listening to the "last 
signal from the user". 

> It's rather counterintuitive that the persistent autodetection
> setting is in the same menu as the one-off override.

You mean View->Character_Encoding->Auto-Detect->Off? Anyway: I agree 
that the encoding menus could be simpler/clearer.

I think the most counter-intuitive thing is to use the word 
"auto-detect" for the heuristic detection - see what I said above 
about "behaves automatically even when auto-detect is disabled". 
Opera's default setting is called "Automatic selection". So there it 
is "all automatic" ... 

> As for heuristic detection based on the bytes of the page, the only
> heuristic that can't be disabled is the heuristic for detecting
> BOMless UTF-16 that encodes Basic Latin only. (Some Indian bank was
> believed to have been giving that sort of files to their customers and
> it "worked" in pre-HTML5 browsers that silently discarded all zero
> bytes prior to tokenization.) The Cyrillic and CJK detection
> heuristics can be turned on and off by the user.

I have always wondered what the "Universal" detection means. Is it 
simply UTF-8 detection? Or does it also detect other encodings? 
Unicode is sometimes referred to as the "Universal" 
encoding/character set. If that is what it means, then "Unicode" 
would have been a clearer name than "Universal". 

Hm, it seems that "Universal" really is what is meant, not "Unicode", 
in which case "All" or similar would have been better ...

So it seems to me that it is not possible to *enable* only *UTF-8 
detection*: The only option for getting UTF-8 detection is to use the 
Universal detection - which enables everything.

It seems to me that if you offered *only* UTF-8 detection, then you 
would have something useful up your sleeve if you want to tempt the 
localizers *away* from UTF-8 as the default. Because, as I said 
above: if the browser *defaults* to UTF-8, then UTF-8 detection isn't 
so useful (it would then only be useful for detecting that a page is 
*not* UTF-8).

So let's say that you tell your Welsh localizer: "Please switch to 
WINDOWS-1252 as the default, and then instead I'll allow you to 
enable this brand new UTF-8 detection." Would that make sense?
 
> Within an origin, Firefox considers the parent frame and the previous
> document in the navigation history as sources of encoding guesses.
> That behavior is not user-configurable to my knowledge.

W.r.t. iframes: the "big in Norway" newspaper Dagbladet.no is 
declared ISO-8859-1 encoded, and it includes at least one ad iframe 
that is undeclared ISO-8859-1 encoded.

* If I change the default encoding of Firefox to UTF-8, then the main 
page works but that ad fails, encoding-wise. 
* But if I enable the Universal encoding detector, the ad does not fail.

* Let's say that I *kept* ISO-8859-1 as the default encoding, but 
instead enabled the Universal detector. The frame then works.
* But if I make the frame page very short, with 10 times the letter 
"?" as the only content, then the Universal detector fails - in a 
test on my own computer, it guessed the page to be Cyrillic rather 
than Norwegian.
* What's the problem? The Universal detector is too greedy - it tries 
to fix more problems than I have. I only want it to guess "UTF-8". 
And if it doesn't detect UTF-8, then it should fall back to the 
locale default (including falling back to the encoding of the parent 
frame).

Wouldn't that be an idea?
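
Something like this is the decision logic I have in mind - a Python 
sketch; the parameters (parent frame encoding, locale default) are 
hypothetical inputs, not any existing browser API:

  def is_probably_utf8(prefix):
      # Same idea as the earlier sketch, trimmed down.
      try:
          prefix.decode("utf-8")
          return True
      except UnicodeDecodeError:
          return False

  def guess_encoding(body_prefix, declared, parent_frame_encoding,
                     locale_default):
      """UTF-8-only detection: never guess exotic encodings.

      An explicit declaration wins; otherwise accept UTF-8 only if
      the bytes actually look like UTF-8; otherwise inherit the
      parent frame's encoding (the Dagbladet.no ad case); otherwise
      use the locale default."""
      if declared:
          return declared
      if is_probably_utf8(body_prefix):
          return "UTF-8"
      return parent_frame_encoding or locale_default

  # Ten "?" characters are plain ASCII and therefore valid UTF-8, so
  # this guesses "UTF-8" - which is harmless, since ASCII decodes the
  # same under UTF-8 and ISO-8859-1. A greedy universal detector, by
  # contrast, can wander off to Cyrillic on input this short.
  print(guess_encoding(b"??????????", None, "ISO-8859-1", "windows-1252"))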

> Firefox also remembers the encoding from previous visits as long as
> Firefox otherwise has the page in cache. So for testing, it's
> necessary to make Firefox forget about previous visits to the test
> case.
-- 
Leif H Silli

Received on Sunday, 11 December 2011 03:21:40 UTC