Re: HTML5 Issue 11 (encoding detection): I18N WG response...

On Wed, 19 Aug 2009, Phillips, Addison wrote:
> We remain concerned about the text in Step 7 in this section:
> Your current text reads:
> "Otherwise, return an implementation-defined or user-specified default 
> character encoding, with the confidence tentative. In non-legacy 
> environments, the more comprehensive UTF-8 encoding is recommended. Due 
> to its use in legacy content, windows-1252 is recommended as a default 
> in predominantly Western demographics instead. Since these encodings can 
> in many cases be distinguished by inspection, a user agent may 
> heuristically decide which to use as a default."
> Our concerns about this text are:
> 1. It isn't clear what constitutes a "legacy" or "non-legacy 
> environment".

The Web is a legacy environment. Non-legacy environments are new walled 

> We think that, for modern implementations, a bare recommendation of 
> UTF-8 would be preferable.

Indeed, when legacy concerns do not apply, that's what the spec suggests.

> 2. The sentence starting "Since these encodings can {...} be 
> distinguished by inspection" is not really accurate. If the user agent 
> has performed the optional step (6), then heuristic detection has 
> already been applied and failed. If the user agent has not done step 
> (6), then the only reasonable encoding that can reliably be detected 
> based solely on bit-pattern is UTF-8.

Good point. I've removed that text.

> 3. We think your intention is to permit the feature most browsers have 
> of allowing the user to configure (from a base default) the character 
> encoding to use when displaying a given page.

Right, the requirement is to return an "implementation-defined or 
_user-specified_ default character encoding" (emphasis added).
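
As a rough illustration only (not spec text), that fallback could be 
sketched as below. The function name, the constant, and the choice of 
windows-1252 as the built-in default are assumptions made for the example; 
the only parts taken from the quoted step are "implementation-defined or 
user-specified" and the tentative confidence.

```python
from typing import Optional, Tuple

# Assumption: this hypothetical user agent ships windows-1252 as its
# built-in default; another implementation might ship UTF-8 instead.
IMPLEMENTATION_DEFAULT = "windows-1252"


def default_encoding(user_specified: Optional[str] = None) -> Tuple[str, str]:
    """Return (encoding, confidence) for the final fallback step.

    A default configured by the user takes precedence over the
    implementation's own default. Either way the confidence is
    "tentative", since nothing in the document confirmed the choice.
    """
    encoding = user_specified or IMPLEMENTATION_DEFAULT
    return encoding, "tentative"
```

A user who has configured a default gets it back unchanged; everyone else 
gets whatever the implementation ships with.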

> The sentence starting "Due to its use..." mentions "predominantly 
> Western demographics", which we find troublesome, especially given that 
> it is associated with the keyword "recommended".


> 4. We suggest adding to step (6) this note:
> "Note: The UTF-8 encoding has a highly detectable bit pattern. Documents 
> that contain bytes above 0x7F which match the UTF-8 pattern are very likely to 
> be UTF-8, while documents that do not match it definitely are not. While 
> not full autodetection, it may be appropriate for a user-agent to search 
> for this common encoding."

I haven't added this, as I don't want this step to turn into a long list 
of possible algorithms to use. However, if you have other papers I should 
reference in addition to [UNIVCHARDET], I'm happy to add references.
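
For readers who want to see what such a check amounts to, here is a minimal 
sketch, assuming the whole byte stream is available (a truncated prefix 
could cut a multi-byte sequence in half and be rejected wrongly). The 
function name is hypothetical and this is not proposed as spec text: it 
treats input as likely UTF-8 only when it contains at least one byte above 
0x7F and the entire input decodes as strict UTF-8.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True if data has non-ASCII bytes and is valid UTF-8."""
    # Pure ASCII gives no evidence either way, since ASCII bytes mean the
    # same thing in UTF-8 and in windows-1252.
    if all(b <= 0x7F for b in data):
        return False
    try:
        # Strict decoding rejects stray continuation bytes, overlong
        # forms, and truncated sequences.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

For example, "café" encoded as UTF-8 passes the check, while the same word 
encoded as windows-1252 (where é is the lone byte 0xE9) does not.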

On Thu, 20 Aug 2009, Maciej Stachowiak wrote:
> Based on further discussion with you and Henri, I filed the following:
> "Suggest heuristic detection of UTF-8"
> "Clarify default encoding wording and add some examples for non-latin
> locales."

Thanks. I will get to these in due course.

Ian Hickson               U+1047E                )\._.,--....,'``.    fL
                          U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Sunday, 30 August 2009 02:35:36 UTC