RE: Auto-detect and encodings in HTML5

+1 (since Henri answered your question so nicely :)

-----Original Message-----
From: Henri Sivonen
Sent: Tuesday, June 02, 2009 1:11 AM
To: Larry Masinter
Cc: Chris Wilson; Maciej Stachowiak; M.T. Carrasco Benitez; Travis Leithead; Erik van der Poel; Richard Ishida; Ian Hickson; Harley Rosnow
Subject: Re: Auto-detect and encodings in HTML5

On Jun 1, 2009, at 20:44, Larry Masinter wrote:

> Chris, in your note below you claim that the "current de facto"  
> value was "Win1252"
> which seems to contradict what I thought was claimed in another  
> message that the
> "de facto" default was "unknown" (which was my understanding, i.e.,  
> that browsers
> used a wide variety of heuristics to determine charset).

The de facto default is Windows-1252 except for locales where it  
isn't. If a user mostly browses pages written in Simplified Chinese,  
it makes sense to make GBK the default (GBK is to GB2312 what  
Windows-1252 is to ISO-8859-1) at least when heuristics are turned off.
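
A minimal sketch of the locale-dependent fallback described above. The `LOCALE_DEFAULTS` table, the `default_encoding` function and the `can_encode` helper are hypothetical names made up for illustration; real browsers ship their own, larger tables. The second half illustrates the superset relationship mentioned in the parenthetical, using Python's built-in codecs.

```python
# Hypothetical locale-to-fallback table (an assumption for illustration).
# Only the Simplified Chinese case named in the message is listed.
LOCALE_DEFAULTS = {
    "zh-CN": "gbk",
}
FALLBACK = "windows-1252"


def default_encoding(locale: str) -> str:
    """Fallback encoding when nothing is declared and heuristics are off."""
    return LOCALE_DEFAULTS.get(locale, FALLBACK)


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Bytes 0x93 and 0x94 are curly quotes in Windows-1252 but C1 control
# characters in ISO-8859-1.
curly = b"\x93hi\x94"
print(repr(curly.decode("windows-1252")))
print(repr(curly.decode("iso-8859-1")))

# U+9555 is a character that GBK covers and GB2312 does not.
print(can_encode("\u9555", "gbk"), can_encode("\u9555", "gb2312"))
```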

At least for the U.S. locale, Firefox and IE8 default to heuristics  
off. The user can enable heuristics for CJK and Cyrillic (in various  
groupings: all, both kinds of Chinese only, only Simplified Chinese,  
only Russian, etc.). Firefox (but not IE8) also supports a grouping  
for all of CJK (excluding Cyrillic).

In Opera, the heuristic detector may be enabled for any one of C, J  
and K or Cyrillic. (Opera doesn't seem to have Russian-only,  
Ukrainian-only, Traditional Chinese-only or Simplified Chinese-only  
modes.) It's unclear if Opera's default behavior is heuristics off or  
universal heuristics on. Perhaps someone from Opera could enlighten us.

Safari doesn't have heuristic selection UI. It is unclear if Safari  
has no heuristics or whether it has always-on heuristics by default.  
Chrome's UI is slightly differently ambiguous.

(For clarity: Above I'm using the word "heuristics" to mean  
exclusively the frequency and chaining analysis on bytes.)

> I'm interested in reducing ambiguity and making web transactions  
> more reliable,
> and associating a new version indicator (DOCTYPE) with a more  
> constrained default
> (charset default UTF8, rather than 'unknown') is reasonable, while I  
> also would
> be opposed to making an incompatible change with actual current  
> behavior.

We already have 3 reliable version indicators for the encoding axis:
 - charset=utf-8 on the HTTP layer
 - charset=utf-8 in <meta>
 - the UTF-8 BOM

We don't need a new indicator that wouldn't be as compatible with  
existing user agents as the indicators we already have. (Consider the  
Degrade Gracefully principle.)
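
As a rough illustration of the three existing indicators, the sketch below reports which of them declare UTF-8 for a given response. The function name `utf8_indicators`, the 1024-byte scan window and the regular expressions are assumptions made for this example; this is a simplified check, not the HTML5 encoding-sniffing algorithm, and it deliberately takes no position on which indicator wins when they disagree.

```python
import codecs
import re
from typing import List, Optional

# Simplified patterns (assumptions for this sketch, not the spec's prescan).
_HTTP_UTF8 = re.compile(r"charset\s*=\s*\"?utf-8(?![\w-])", re.IGNORECASE)
_META_TAG = re.compile(rb"<meta\b[^>]*>", re.IGNORECASE)
_META_CHARSET = re.compile(rb"charset\s*=\s*[\"']?\s*([A-Za-z0-9._:-]+)", re.IGNORECASE)


def utf8_indicators(content_type: Optional[str], body: bytes) -> List[str]:
    """List which of the three indicators declare UTF-8 for this response."""
    found = []

    # 1. charset=utf-8 on the HTTP layer (Content-Type header value).
    if content_type and _HTTP_UTF8.search(content_type):
        found.append("http")

    # 2. The UTF-8 BOM at the very start of the byte stream.
    if body.startswith(codecs.BOM_UTF8):
        found.append("bom")

    # 3. charset=utf-8 in a <meta> element near the start of the document.
    for tag in _META_TAG.finditer(body[:1024]):
        match = _META_CHARSET.search(tag.group(0))
        if match and match.group(1).lower() == b"utf-8":
            found.append("meta")
            break

    return found


print(utf8_indicators("text/html; charset=utf-8", b"<p>hi</p>"))
print(utf8_indicators("text/html", b'<meta charset="utf-8"><p>hi</p>'))
```

All three checks work with user agents that predate HTML5, which is the compatibility point made above.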

On Jun 2, 2009, at 03:48, Leif Halvard Silli wrote:

> Is it the choice of UTF-8 as default you don't understand? If so,  
> then I'd like to quote the "Support World Languages" principle.

The Support World Languages principle is satisfied by HTML5 allowing  
authors to opt in to UTF-8 easily. It has to be opt-in due to the  
Support Existing Content and Degrade Gracefully principles.

Henri Sivonen

Received on Tuesday, 2 June 2009 18:11:55 UTC