Re: HTML5 Issue 11 (encoding detection): I18N WG response... from Ian Hickson on 2009-08-30 (public-html@w3.org from August 2009)

From: Ian Hickson <ian@hixie.ch>
Date: Sun, 30 Aug 2009 02:37:13 +0000 (UTC)
To: "Phillips, Addison" <addison@amazon.com>, Maciej Stachowiak <mjs@apple.com>, Henri Sivonen <hsivonen@iki.fi>, Anne van Kesteren <annevk@opera.com>, Andrew Cunningham <andrewc@vicnet.net.au>, Richard Ishida <ishida@w3.org>
Cc: "public-html@w3.org" <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>
Message-ID: <Pine.LNX.4.62.0908300224440.6775@hixie.dreamhostps.com>

On Wed, 19 Aug 2009, Phillips, Addison wrote:
> 
> We remain concerned about the text in Step 7 in this section:
>    
> http://dev.w3.org/html5/spec/Overview.html#determining-the-character-encoding
> 
> Your current text reads:
> 
> "Otherwise, return an implementation-defined or user-specified default 
> character encoding, with the confidence tentative. In non-legacy 
> environments, the more comprehensive UTF-8 encoding is recommended. Due 
> to its use in legacy content, windows-1252 is recommended as a default 
> in predominantly Western demographics instead. Since these encodings can 
> in many cases be distinguished by inspection, a user agent may 
> heuristically decide which to use as a default."
> 
> Our concerns about this text are:
> 
> 1. It isn't clear what constitutes a "legacy" or "non-legacy 
> environment".

The Web is a legacy environment. Non-legacy environments are new walled 
gardens.


> We think that, for modern implementations, a bare recommendation of 
> UTF-8 would be preferable.

Indeed, when legacy concerns do not apply, that's what the spec suggests.


> 2. The sentence starting "Since these encodings can {...} be 
> distinguished by inspection" is not really accurate. If the user agent 
> has performed the optional step (6), then heuristic detection has 
> already been applied and failed. If the user agent has not done step 
> (6), then the only reasonable encoding that can reliably be detected 
> based solely on bit-pattern is UTF-8.

Good point. I've removed that text.


> 3. We think your intention is to permit the feature most browsers have 
> of allowing the user to configure (from a base default) the character 
> encoding to use when displaying a given page.

Right, the requirement is to return an "implementation-defined or 
_user-specified_ default character encoding" (emphasis added).
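
As a rough illustration of the fallback step quoted above, here is a minimal Python sketch. The function name, its parameters, and the boolean "legacy Western" flag are hypothetical and do not come from the spec; the only things taken from the quoted text are the two recommended defaults and the "tentative" confidence.

```python
from typing import Optional, Tuple


def default_encoding(user_choice: Optional[str],
                     legacy_western: bool) -> Tuple[str, str]:
    """Pick a fallback encoding when nothing else determined one.

    user_choice: encoding the user configured, if any (hypothetical input).
    legacy_western: True if the implementation targets legacy content in a
        predominantly Western locale (an assumption of this sketch).
    Returns (encoding, confidence); the confidence is always "tentative".
    """
    if user_choice:
        # A user-specified default takes priority over the built-in one.
        return user_choice, "tentative"
    # Implementation-defined default: windows-1252 for legacy Western
    # content, UTF-8 otherwise.
    encoding = "windows-1252" if legacy_western else "utf-8"
    return encoding, "tentative"
```

For example, default_encoding("shift_jis", True) returns the user's choice, still marked tentative, so a later step could override it.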


> The sentence starting "Due to its use..." mentions "predominantly 
> Western demographics", which we find troublesome, especially given that 
> it is associated with the keyword "recommended".

Why?


> 4. We suggest adding to step (6) this note:
> 
> "Note: The UTF-8 encoding has a highly detectable bit pattern. Documents 
> that contain bytes > 0x7F which match the UTF-8 pattern are very likely to 
> be UTF-8, while documents that do not match it definitely are not. While 
> not full autodetection, it may be appropriate for a user-agent to search 
> for this common encoding."

I haven't added this, as I don't want this step to turn into a long list 
of possible algorithms to use. However, if you have other papers I should 
reference in addition to [UNIVCHARDET], I'm happy to add references.
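
For readers unfamiliar with the check the proposed note describes, here is a minimal Python sketch. It is only an illustration under stated assumptions, not text from the spec or from [UNIVCHARDET]; the function name is hypothetical, and it assumes the whole byte stream (not a truncated prefix) is available.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True if data contains non-ASCII bytes that all form
    well-formed UTF-8 sequences.

    Pure-ASCII input returns False, because it gives no evidence either
    way (ASCII is valid in UTF-8 and in windows-1252 alike).
    """
    if not any(b > 0x7F for b in data):
        return False
    try:
        # Strict decoding rejects any byte sequence that does not match
        # the UTF-8 bit pattern.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

Note that a buffer cut off in the middle of a multi-byte sequence would fail this check, so an implementation sniffing only a prefix of the document would need to tolerate an incomplete final sequence.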


On Thu, 20 Aug 2009, Maciej Stachowiak wrote:
> 
> Based on further discussion with you and Henri, I filed the following:
> 
> http://www.w3.org/Bugs/Public/show_bug.cgi?id=7380
> "Suggest heuristic detection of UTF-8"
> 
> http://www.w3.org/Bugs/Public/show_bug.cgi?id=7381
> "Clarify default encoding wording and add some examples for non-latin
> locales."

Thanks. I will get to these in due course.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Sunday, 30 August 2009 02:35:36 UTC