Re: HTML5 Issue 11 (encoding detection): I18N WG response... from Maciej Stachowiak on 2009-08-20 (public-html@w3.org from August 2009)

From: Maciej Stachowiak <mjs@apple.com>
Date: Thu, 20 Aug 2009 00:32:58 -0700
To: "Phillips, Addison" <addison@amazon.com>
Cc: "public-html@w3.org" <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>
Message-id: <E2B1860A-6BDD-4008-B157-901FE4A59005@apple.com>

Based on further discussion with you and Henri, I filed the following:

http://www.w3.org/Bugs/Public/show_bug.cgi?id=7380
"Suggest heuristic detection of UTF-8"

http://www.w3.org/Bugs/Public/show_bug.cgi?id=7381
"Clarify default encoding wording and add some examples for non-latin  
locales."

Would you be willing to close ISSUE-11 in favor of the above two bugs?

Regards,
Maciej

On Aug 19, 2009, at 9:22 PM, Phillips, Addison wrote:

> Dear HTML5,
>
> The I18N Core WG would like to respond to your issue located here:
>
>   http://www.w3.org/html/wg/tracker/issues/11
>
> We remain concerned about the text in Step 7 in this section:
>
>   http://dev.w3.org/html5/spec/Overview.html#determining-the-character-encoding
>
> Your current text reads:
>
> --
> Otherwise, return an implementation-defined or user-specified
> default character encoding, with the confidence tentative. In
> non-legacy environments, the more comprehensive UTF-8 encoding is
> recommended. Due to its use in legacy content, windows-1252 is
> recommended as a default in predominantly Western demographics
> instead. Since these encodings can in many cases be distinguished by
> inspection, a user agent may heuristically decide which to use as a
> default.
> --
>
> Our concerns about this text are:
>
> 1. It isn't clear what constitutes a "legacy" or "non-legacy  
> environment". We think that, for modern implementations, a bare  
> recommendation of UTF-8 would be preferable.
>
> 2. The sentence starting "Since these encodings can {...} be  
> distinguished by inspection" is not really accurate. If the user  
> agent has performed the optional step (6), then heuristic detection  
> has already been applied and failed. If the user agent has not done  
> step (6), then the only reasonable encoding that can reliably be  
> detected based solely on bit-pattern is UTF-8.
>
> 3. We think your intention is to permit the feature most browsers  
> have of allowing the user to configure (from a base default) the  
> character encoding to use when displaying a given page. The sentence  
> starting "Due to its use..." mentions "predominantly Western  
> demographics", which we find troublesome, especially given that it  
> is associated with the keyword "recommended".
>
> We would like to request that you reword this paragraph along the  
> lines of something like:
>
> --
> Otherwise, return an implementation-defined or user-specified  
> default character encoding, with the confidence tentative. The UTF-8  
> encoding is recommended as a default. The default may also be set  
> according to the expectations and predominant legacy content  
> encodings for a given demographic or audience. For example,  
> windows-1252 is recommended as the default encoding for Western  
> European language environments. Other encodings may also be used.  
> For example, "windows-949" might be an appropriate default in a  
> Korean language runtime environment.
> --
>
> 4. We suggest adding to step (6) this note:
>
> --
> Note: The UTF-8 encoding has a highly detectable bit pattern.  
> Documents that contain bytes > 0x7F which match the UTF-8 pattern  
> are very likely to be UTF-8, while documents that do not match it  
> definitely are not. While not full autodetection, it may be  
> appropriate for a user-agent to search for this common encoding.
> --
>
> Addison (for I18N WG)
>
> Addison Phillips
> Globalization Architect -- Lab126
>
> Internationalization is not a feature.
> It is an architecture.
>
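
For illustration only, here is a rough Python sketch of the behaviour discussed in points 3 and 4 above: check whether the bytes match the UTF-8 pattern, and otherwise fall back to a locale-dependent default. It is not text from the spec or from either bug. The names looks_like_utf8, default_encoding and LEGACY_DEFAULTS are hypothetical, the table holds only the two examples named in the thread, and the returned values are plain labels rather than Python codec names.

```python
# Hypothetical table of legacy defaults keyed by language tag. Only the two
# examples mentioned in the thread are listed; a real user agent would need
# a much larger, carefully researched table.
LEGACY_DEFAULTS = {
    "en": "windows-1252",
    "ko": "windows-949",
}


def looks_like_utf8(data: bytes) -> bool:
    """Return True if data has at least one byte > 0x7F and is valid UTF-8.

    Pure ASCII input gives no evidence either way, so it returns False.
    Assumes data is not truncated in the middle of a multi-byte sequence.
    """
    if all(b < 0x80 for b in data):
        return False
    try:
        data.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return False
    return True


def default_encoding(data: bytes, lang: str) -> str:
    """Pick a tentative default: UTF-8 if the bytes match its pattern,
    otherwise the legacy default for lang, otherwise UTF-8."""
    if looks_like_utf8(data):
        return "utf-8"
    return LEGACY_DEFAULTS.get(lang, "utf-8")
```

For example, default_encoding("café".encode("cp1252"), "en") falls through to the windows-1252 default, because the lone 0xE9 byte does not form a valid UTF-8 sequence. A user agent sniffing only a prefix of the stream would also have to tolerate a multi-byte sequence cut off at the end of the buffer, which this sketch does not handle.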

Received on Thursday, 20 August 2009 07:33:42 UTC