- From: Ian Hickson <ian@hixie.ch>
- Date: Sun, 30 Aug 2009 02:37:13 +0000 (UTC)
- To: "Phillips, Addison" <addison@amazon.com>, Maciej Stachowiak <mjs@apple.com>, Henri Sivonen <hsivonen@iki.fi>, Anne van Kesteren <annevk@opera.com>, Andrew Cunningham <andrewc@vicnet.net.au>, Richard Ishida <ishida@w3.org>
- Cc: "public-html@w3.org" <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>

On Wed, 19 Aug 2009, Phillips, Addison wrote:
>
> We remain concerned about the text in Step 7 in this section:
>
> http://dev.w3.org/html5/spec/Overview.html#determining-the-character-encoding
>
> Your current text reads:
>
> "Otherwise, return an implementation-defined or user-specified default
> character encoding, with the confidence tentative. In non-legacy
> environments, the more comprehensive UTF-8 encoding is recommended. Due
> to its use in legacy content, windows-1252 is recommended as a default
> in predominantly Western demographics instead. Since these encodings can
> in many cases be distinguished by inspection, a user agent may
> heuristically decide which to use as a default."
>
> Our concerns about this text are:
>
> 1. It isn't clear what constitutes a "legacy" or "non-legacy
> environment".

The Web is a legacy environment. Non-legacy environments are new walled
gardens.

> We think that, for modern implementations, a bare recommendation of
> UTF-8 would be preferable.

Indeed, when legacy concerns do not apply, that's what the spec suggests.

> 2. The sentence starting "Since these encodings can {...} be
> distinguished by inspection" is not really accurate. If the user agent
> has performed the optional step (6), then heuristic detection has
> already been applied and failed. If the user agent has not done step
> (6), then the only reasonable encoding that can reliably be detected
> based solely on bit-pattern is UTF-8.

Good point. I've removed that text.

> 3. We think your intention is to permit the feature most browsers have
> of allowing the user to configure (from a base default) the character
> encoding to use when displaying a given page.

Right, the requirement is to return an "implementation-defined or
_user-specified_ default character encoding" (emphasis added).

> The sentence starting "Due to its use..." mentions "predominantly
> Western demographics", which we find troublesome, especially given that
> it is associated with the keyword "recommended".

Why?
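
For illustration, the step 7 fallback as quoted above could be sketched
roughly as follows. This is only a sketch under stated assumptions, not
spec text: the function name, its parameters, and the idea of passing the
implementation's default as an argument are all hypothetical. The only
parts taken from the quoted text are that a user-specified encoding is
allowed, that UTF-8 is recommended for non-legacy environments, that
windows-1252 is the recommended default for predominantly Western
demographics, and that the confidence is always tentative.

```python
from typing import Optional, Tuple


def default_encoding(
    user_choice: Optional[str] = None,
    legacy_environment: bool = True,
    implementation_default: str = "windows-1252",
) -> Tuple[str, str]:
    """Return (encoding, confidence) for the last-resort fallback.

    All names here are hypothetical. `implementation_default` stands in
    for whatever a user agent ships for its market; windows-1252 is only
    the value the quoted text recommends for Western demographics.
    """
    if user_choice:
        # A user-specified default takes priority over the built-in one.
        return (user_choice, "tentative")
    if not legacy_environment:
        # Non-legacy environments: UTF-8 is the recommended default.
        return ("utf-8", "tentative")
    # Legacy content (i.e. the Web): fall back to the shipped default.
    return (implementation_default, "tentative")
```

A user agent built for a non-Western market would presumably pass a
different `implementation_default`; which value that should be is exactly
what the spec leaves implementation-defined.
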
> 4. We suggest adding to step (6) this note:
>
> "Note: The UTF-8 encoding has a highly detectable bit pattern. Documents
> that contain bytes > 0x7F which match the UTF-8 pattern are very likely to
> be UTF-8, while documents that do not match it definitely are not. While
> not full autodetection, it may be appropriate for a user-agent to search
> for this common encoding."

I haven't added this, as I don't want this step to turn into a long list
of possible algorithms to use. However, if you have other papers I should
reference in addition to [UNIVCHARDET], I'm happy to add references.
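
As a rough illustration of the check the proposed note describes (not
something the spec mandates), a user agent could test whether the bytes
it has seen so far contain at least one byte above 0x7F and form valid
UTF-8. The function name below is hypothetical, and the sketch leans on
Python's strict UTF-8 decoder rather than spelling out the bit patterns
by hand.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True if data has non-ASCII bytes and is valid UTF-8.

    Hypothetical helper for illustration only. Pure-ASCII input returns
    False because it carries no evidence for or against UTF-8.
    """
    if not any(b > 0x7F for b in data):
        return False
    try:
        # Strict decoding rejects any byte sequence that does not match
        # the UTF-8 pattern (stray lead bytes, missing continuations...).
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

One caveat with this sketch: if only a prefix of the document is
examined, the buffer may end in the middle of a multi-byte sequence and
be rejected even though the full document is valid UTF-8, so a real
implementation would need to tolerate a truncated final sequence.
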
On Thu, 20 Aug 2009, Maciej Stachowiak wrote:
>
> Based on further discussion with you and Henri, I filed the following:
>
> http://www.w3.org/Bugs/Public/show_bug.cgi?id=7380
> "Suggest heuristic detection of UTF-8"
>
> http://www.w3.org/Bugs/Public/show_bug.cgi?id=7381
> "Clarify default encoding wording and add some examples for non-latin
> locales."

Thanks. I will get to these in due course.

--
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Sunday, 30 August 2009 02:35:34 UTC