HTML5 Issue 11 (encoding detection): I18N WG response... from Phillips, Addison on 2009-08-20 (public-i18n-core@w3.org from July to September 2009)

From: Phillips, Addison <addison@amazon.com>
Date: Wed, 19 Aug 2009 21:22:30 -0700
To: "public-html@w3.org" <public-html@w3.org>
CC: "public-i18n-core@w3.org" <public-i18n-core@w3.org>
Message-ID: <4D25F22093241741BC1D0EEBC2DBB1DA01ACCE9AF4@EX-SEA5-D.ant.amazon.com>

Dear HTML5,

The I18N Core WG would like to respond to your issue located here:

http://www.w3.org/html/wg/tracker/issues/11

We remain concerned about the text in Step 7 in this section:

http://dev.w3.org/html5/spec/Overview.html#determining-the-character-encoding

Your current text reads:

--
Otherwise, return an implementation-defined or user-specified default character encoding, with the confidence tentative. In non-legacy environments, the more comprehensive UTF-8 encoding is recommended. Due to its use in legacy content, windows-1252 is recommended as a default in predominantly Western demographics instead. Since these encodings can in many cases be distinguished by inspection, a user agent may heuristically decide which to use as a default.
--

Our concerns about this text are:

1. It isn't clear what constitutes a "legacy" or "non-legacy environment". We think that, for modern implementations, a bare recommendation of UTF-8 would be preferable.

2. The sentence starting "Since these encodings can {...} be distinguished by inspection" is not really accurate. If the user agent has performed the optional step (6), then heuristic detection has already been applied and failed. If the user agent has not done step (6), then the only reasonable encoding that can reliably be detected based solely on bit-pattern is UTF-8.

3. We think your intention is to permit the feature, found in most browsers, that lets the user configure (from a base default) the character encoding used to display a given page. The sentence starting "Due to its use..." mentions "predominantly Western demographics", which we find troublesome, especially because it is tied to the keyword "recommended".

We would like to request that you reword this paragraph along the following lines:

--
Otherwise, return an implementation-defined or user-specified default character encoding, with the confidence tentative. The UTF-8 encoding is recommended as a default. The default may also be set according to the expectations and predominant legacy content encodings for a given demographic or audience. For example, windows-1252 is recommended as the default encoding for Western European language environments. Other encodings may also be used. For example, "windows-949" might be an appropriate default in a Korean language runtime environment.
--
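
As a rough illustration of how the proposed wording might translate into a user agent's fallback logic, here is a minimal Python sketch. It is not part of the requested spec text. The function name, the `prefer_legacy` switch, and the language lists are hypothetical; only the three encodings and their associations (UTF-8 as the general default, windows-1252 for Western European environments, windows-949 for Korean) come from the paragraph above.

```python
from typing import Optional

# Hypothetical, deliberately incomplete set of Western European language
# subtags; a real user agent would maintain its own table.
WESTERN_EUROPEAN = {"en", "fr", "de", "es", "it", "pt", "nl", "sv", "da", "fi"}

# Hypothetical table of other legacy defaults, keyed by language subtag.
OTHER_LEGACY_DEFAULTS = {"ko": "windows-949"}


def default_encoding(locale_tag: Optional[str], prefer_legacy: bool = False) -> str:
    """Pick a tentative default encoding for the last fallback step.

    UTF-8 is the recommended default. A legacy encoding is returned only
    when the caller opts in and the locale has a known legacy default.
    """
    if not prefer_legacy or not locale_tag:
        return "utf-8"
    # Accept both "ko-KR" and "ko_KR" style tags; keep the language subtag.
    lang = locale_tag.replace("_", "-").split("-")[0].lower()
    if lang in WESTERN_EUROPEAN:
        return "windows-1252"
    return OTHER_LEGACY_DEFAULTS.get(lang, "utf-8")
```

In this sketch, whatever value is returned would still carry the "tentative" confidence described in the paragraph.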

4. We suggest adding to step (6) this note:

--
Note: The UTF-8 encoding has a highly detectable bit pattern. Documents that contain bytes > 0x7F which match the UTF-8 pattern are very likely to be UTF-8, while documents that do not match it definitely are not. While not full autodetection, it may be appropriate for a user-agent to search for this common encoding.
--
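
To make the note concrete, the check it describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a proposed algorithm for the spec: the function name is hypothetical, and it relies on Python's built-in strict UTF-8 decoder to test whether the bytes match the UTF-8 pattern.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data contains bytes > 0x7F and is well-formed UTF-8.

    Pure ASCII input returns False here because it carries no evidence
    either way: it is valid in UTF-8 and in most legacy encodings alike.
    """
    if not any(b > 0x7F for b in data):
        return False
    try:
        # Strict decoding rejects any byte sequence that is not valid UTF-8.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# The same text encoded two ways:
utf8_bytes = "café".encode("utf-8")      # b'caf\xc3\xa9'
legacy_bytes = "café".encode("cp1252")   # b'caf\xe9'
```

With these inputs, `looks_like_utf8(utf8_bytes)` is true and `looks_like_utf8(legacy_bytes)` is false. One caveat for anyone adapting this: a user agent that inspects only a prefix of the document may cut a multibyte sequence at the end of the buffer, so a real implementation would need to tolerate a truncated final sequence.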

Addison (for I18N WG)

Addison Phillips
Globalization Architect -- Lab126

Internationalization is not a feature.
It is an architecture.

Received on Thursday, 20 August 2009 04:23:17 UTC