- From: Maciej Stachowiak <mjs@apple.com>
- Date: Thu, 20 Aug 2009 00:32:58 -0700
- To: "Phillips, Addison" <addison@amazon.com>
- Cc: "public-html@w3.org" <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>
- Message-id: <E2B1860A-6BDD-4008-B157-901FE4A59005@apple.com>
Based on further discussion with you and Henri, I filed the following:
http://www.w3.org/Bugs/Public/show_bug.cgi?id=7380
"Suggest heuristic detection of UTF-8"
http://www.w3.org/Bugs/Public/show_bug.cgi?id=7381
"Clarify default encoding wording and add some examples for non-latin
locales."
Would you be willing to close ISSUE-11 in favor of the above two bugs?
Regards,
Maciej
On Aug 19, 2009, at 9:22 PM, Phillips, Addison wrote:
> Dear HTML5,
>
> The I18N Core WG would like to respond to your issue located here:
>
> http://www.w3.org/html/wg/tracker/issues/11
>
> We remain concerned about the text in Step 7 in this section:
>
> http://dev.w3.org/html5/spec/Overview.html#determining-the-character-encoding
>
> Your current text reads:
>
> --
> Otherwise, return an implementation-defined or user-specified
> default character encoding, with the confidence tentative. In non-
> legacy environments, the more comprehensive UTF-8 encoding is
> recommended. Due to its use in legacy content, windows-1252 is
> recommended as a default in predominantly Western demographics
> instead. Since these encodings can in many cases be distinguished by
> inspection, a user agent may heuristically decide which to use as a
> default.
> --
>
> Our concerns about this text are:
>
> 1. It isn't clear what constitutes a "legacy" or "non-legacy
> environment". We think that, for modern implementations, a bare
> recommendation of UTF-8 would be preferable.
>
> 2. The sentence starting "Since these encodings can {...} be
> distinguished by inspection" is not really accurate. If the user
> agent has performed the optional step (6), then heuristic detection
> has already been applied and failed. If the user agent has not done
> step (6), then the only reasonable encoding that can reliably be
> detected based solely on bit-pattern is UTF-8.
>
> 3. We think your intention is to permit the feature most browsers
> have of allowing the user to configure (from a base default) the
> character encoding to use when displaying a given page. The sentence
> starting "Due to its use..." mentions "predominantly Western
> demographics", which we find troublesome, especially given that it
> is associated with the keyword "recommended".
>
> We would like to request that you reword this paragraph along the
> lines of something like:
>
> --
> Otherwise, return an implementation-defined or user-specified
> default character encoding, with the confidence tentative. The UTF-8
> encoding is recommended as a default. The default may also be set
> according to the expectations and predominant legacy content
> encodings for a given demographic or audience. For example,
> windows-1252 is recommended as the default encoding for Western
> European language environments. Other encodings may also be used.
> For example, "windows-949" might be an appropriate default in a
> Korean language runtime environment.
> --
>
> 4. We suggest adding to step (6) this note:
>
> --
> Note: The UTF-8 encoding has a highly detectable bit pattern.
> Documents that contain bytes > 0x7F which match the UTF-8 pattern
> are very likely to be UTF-8, while documents that do not match it
> definitely are not. While not full autodetection, it may be
> appropriate for a user-agent to search for this common encoding.
> --
>
> Addison (for I18N WG)
>
> Addison Phillips
> Globalization Architect -- Lab126
>
> Internationalization is not a feature.
> It is an architecture.
>
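The note proposed for step (6) in the quoted message, combined with the reworded step 7 default, can be illustrated with a rough sketch. This is not text from the spec or from either bug. The function names, the LOCALE_DEFAULTS table, and the use of a bare language tag as the lookup key are assumptions made for illustration. Only the windows-1252 and windows-949 examples come from the proposal above.

```python
# Hypothetical sketch: check for the UTF-8 byte pattern first, then fall
# back to a locale-dependent default, as the quoted proposal suggests.

# Illustrative table only; the two entries mirror the examples in the
# proposed wording. A real user agent would carry a fuller list.
LOCALE_DEFAULTS = {
    "en": "windows-1252",  # Western European language environments
    "ko": "windows-949",   # Korean language runtime environment
}


def looks_like_utf8(data: bytes) -> bool:
    """Return True if data has bytes > 0x7F and all of it is valid UTF-8.

    Pure ASCII input returns False, because it carries no evidence that
    separates UTF-8 from other ASCII-compatible encodings.
    """
    if not any(b > 0x7F for b in data):
        return False
    try:
        # Strict decoding rejects stray continuation bytes, truncated
        # sequences, overlong forms, and encoded surrogates.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def tentative_encoding(data: bytes, language: str) -> str:
    """Pick a tentative encoding label for undeclared content."""
    if looks_like_utf8(data):
        return "utf-8"
    # No UTF-8 evidence: use the locale default, or UTF-8 if none is known.
    return LOCALE_DEFAULTS.get(language, "utf-8")
```

One caveat: a user agent that only inspects a prefix of the stream may cut a multi-byte sequence in half, so a real check would need to tolerate an incomplete sequence at the end of the sniffed buffer. The sketch above treats that case as a mismatch.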
Received on Thursday, 20 August 2009 07:33:42 UTC