RE: HTML5 Issue 11 (encoding detection): I18N WG response... from Phillips, Addison on 2009-08-20 (public-html@w3.org from August 2009)

From: Phillips, Addison <addison@amazon.com>
Date: Thu, 20 Aug 2009 07:25:57 -0700
To: Maciej Stachowiak <mjs@apple.com>
CC: "public-html@w3.org" <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>
Message-ID: <4D25F22093241741BC1D0EEBC2DBB1DA01ACCE9D8F@EX-SEA5-D.ant.amazon.com>

The I18N working group would accept closure of ISSUE-11 if you were to accept our proposed textual modifications.

We are also likely to accept some modifications to our proposed text (for example, removing the UTF-8 recommendation that both the original and our proposal share), but I think the WG would like to see the specific text first.

Regards,

Addison

Addison Phillips
Globalization Architect -- Lab126
Chair -- W3C Internationalization WG

Internationalization is not a feature.
It is an architecture.

From: Maciej Stachowiak [mailto:mjs@apple.com]
Sent: Thursday, August 20, 2009 12:33 AM
To: Phillips, Addison
Cc: public-html@w3.org; public-i18n-core@w3.org
Subject: Re: HTML5 Issue 11 (encoding detection): I18N WG response...


Based on further discussion with you and Henri, I filed the following:

http://www.w3.org/Bugs/Public/show_bug.cgi?id=7380

"Suggest heuristic detection of UTF-8"


http://www.w3.org/Bugs/Public/show_bug.cgi?id=7381

"Clarify default encoding wording and add some examples for non-latin locales."


Would you be willing to close ISSUE-11 in favor of the above two bugs?


Regards,
Maciej

On Aug 19, 2009, at 9:22 PM, Phillips, Addison wrote:


Dear HTML5,

The I18N Core WG would like to respond to your issue located here:

  http://www.w3.org/html/wg/tracker/issues/11


We remain concerned about the text in Step 7 in this section:

  http://dev.w3.org/html5/spec/Overview.html#determining-the-character-encoding


Your current text reads:

--
Otherwise, return an implementation-defined or user-specified default character encoding, with the confidence tentative. In non-legacy environments, the more comprehensive UTF-8 encoding is recommended. Due to its use in legacy content, windows-1252 is recommended as a default in predominantly Western demographics instead. Since these encodings can in many cases be distinguished by inspection, a user agent may heuristically decide which to use as a default.
--

Our concerns about this text are:

1. It isn't clear what constitutes a "legacy" or "non-legacy environment". We think that, for modern implementations, a bare recommendation of UTF-8 would be preferable.

2. The sentence starting "Since these encodings can {...} be distinguished by inspection" is not really accurate. If the user agent has performed the optional step (6), then heuristic detection has already been applied and failed. If the user agent has not done step (6), then the only reasonable encoding that can reliably be detected based solely on bit-pattern is UTF-8.

3. We think your intention is to permit the feature most browsers have of allowing the user to configure (from a base default) the character encoding to use when displaying a given page. The sentence starting "Due to its use..." mentions "predominantly Western demographics", which we find troublesome, especially given that it is associated with the keyword "recommended".

We would like to request that you reword this paragraph along the following lines:

--
Otherwise, return an implementation-defined or user-specified default character encoding, with the confidence tentative. The UTF-8 encoding is recommended as a default. The default may also be set according to the expectations and predominant legacy content encodings for a given demographic or audience. For example, windows-1252 is recommended as the default encoding for Western European language environments. Other encodings may also be used. For example, "windows-949" might be an appropriate default in a Korean language runtime environment.
--
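
To illustrate the proposed wording, here is a minimal Python sketch of how a user agent might choose its default. It is an illustration only and not part of the proposed spec text. The function name default_encoding, the LEGACY_DEFAULTS table, and the particular language codes listed are all hypothetical; only the encoding names (UTF-8, windows-1252, windows-949) and the "tentative" confidence come from the text above.

```python
from typing import Optional, Tuple

# Hypothetical, deliberately incomplete table of legacy defaults keyed by
# primary language subtag. Which languages belong here is an assumption
# made for illustration; a real user agent would carry its own table.
LEGACY_DEFAULTS = {
    "en": "windows-1252",  # Western European language environments
    "fr": "windows-1252",
    "de": "windows-1252",
    "ko": "windows-949",   # Korean language runtime environment
}


def default_encoding(locale_tag: str,
                     user_choice: Optional[str] = None) -> Tuple[str, str]:
    """Return (encoding, confidence) for the final fallback step.

    A user-specified encoding wins; otherwise a locale-based legacy
    default is used if one is known; otherwise UTF-8. The confidence
    is always "tentative", as in the proposed wording.
    """
    if user_choice:
        return user_choice, "tentative"
    # Accept both "ko-KR" and "ko_KR" style tags and keep the language part.
    language = locale_tag.replace("_", "-").split("-")[0].lower()
    return LEGACY_DEFAULTS.get(language, "utf-8"), "tentative"
```

With this sketch, default_encoding("ko-KR") yields windows-949, default_encoding("fr_FR") yields windows-1252, and any locale missing from the table falls back to UTF-8.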

4. We suggest adding to step (6) this note:

--
Note: The UTF-8 encoding has a highly detectable bit pattern. Documents that contain bytes > 0x7F which match the UTF-8 pattern are very likely to be UTF-8, while documents that do not match it definitely are not. While not full autodetection, it may be appropriate for a user-agent to search for this common encoding.
--
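
The check described in the note can be sketched in a few lines of Python. This is a rough illustration, not a proposed algorithm: the function name looks_like_utf8 is hypothetical, and the sketch assumes the whole byte stream is available and relies on Python's strict UTF-8 decoder to test the bit pattern. A user agent that sniffs only a prefix of the document would also need to tolerate a multi-byte sequence cut off at the end of the buffer.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic from the note: non-ASCII bytes that all fit the UTF-8 pattern.

    Returns False for pure ASCII input, because without any byte > 0x7F
    there is no evidence for or against UTF-8.
    """
    if not any(byte > 0x7F for byte in data):
        return False
    try:
        # Strict decoding rejects stray continuation bytes, truncated
        # sequences, overlong forms, and other malformed input.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False  # does not match the pattern: definitely not UTF-8
    return True  # matches the pattern: very likely UTF-8
```

For example, the UTF-8 bytes of "café" pass the check, while the same word encoded as windows-1252 (where "é" is the single byte 0xE9) fails it, because 0xE9 announces a three-byte sequence that never follows.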

Addison (for I18N WG)

Addison Phillips
Globalization Architect -- Lab126

Internationalization is not a feature.
It is an architecture.

Received on Thursday, 20 August 2009 14:26:37 UTC