Re: HTML5 Issue 11 (encoding detection): I18N WG response... from Maciej Stachowiak on 2009-08-20 (public-html@w3.org from August 2009)

From: Maciej Stachowiak <mjs@apple.com>
Date: Wed, 19 Aug 2009 21:38:16 -0700
To: "Phillips, Addison" <addison@amazon.com>
Cc: "public-html@w3.org" <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>
Message-id: <3C97F421-5253-4D21-86C7-DB6387426923@apple.com>

On Aug 19, 2009, at 9:22 PM, Phillips, Addison wrote:

> Dear HTML5,
>
> The I18N Core WG would like to respond to your issue located here:
>
>   http://www.w3.org/html/wg/tracker/issues/11
>
> We remain concerned about the text in Step 7 in this section:
>
>   http://dev.w3.org/html5/spec/Overview.html#determining-the-character-encoding
>
> Your current text reads:
>
> --
> Otherwise, return an implementation-defined or user-specified
> default character encoding, with the confidence tentative. In
> non-legacy environments, the more comprehensive UTF-8 encoding is
> recommended. Due to its use in legacy content, windows-1252 is
> recommended as a default in predominantly Western demographics
> instead. Since these encodings can in many cases be distinguished by
> inspection, a user agent may heuristically decide which to use as a
> default.
> --
>
> Our concerns about this text are:
>
> 1. It isn't clear what constitutes a "legacy" or "non-legacy  
> environment". We think that, for modern implementations, a bare  
> recommendation of UTF-8 would be preferable.

That recommendation is not suitable for compatible processing of the  
public web. I don't believe any browser is prepared to implement such  
a requirement or recommendation. I don't think it makes sense to make  
a recommendation that is unlikely to be followed.

>
> 2. The sentence starting "Since these encodings can {...} be  
> distinguished by inspection" is not really accurate. If the user  
> agent has performed the optional step (6), then heuristic detection  
> has already been applied and failed. If the user agent has not done  
> step (6), then the only reasonable encoding that can reliably be  
> detected based solely on bit-pattern is UTF-8.
>
> 3. We think your intention is to permit the feature most browsers  
> have of allowing the user to configure (from a base default) the  
> character encoding to use when displaying a given page. The sentence  
> starting "Due to its use..." mentions "predominantly Western  
> demographics", which we find troublesome, especially given that it  
> is associated with the keyword "recommended".

Browsers for Latin-script locales pretty much universally use  
Windows-1252 as the default of last resort. This is necessary to be  
compatible with legacy content on the existing Web.

>
> We would like to request that you reword this paragraph along the  
> lines of something like:
>
> --
> Otherwise, return an implementation-defined or user-specified  
> default character encoding, with the confidence tentative. The UTF-8  
> encoding is recommended as a default. The default may also be set  
> according to the expectations and predominant legacy content  
> encodings for a given demographic or audience. For example,  
> windows-1252 is recommended as the default encoding for Western  
> European language environments. Other encodings may also be used.  
> For example, "windows-949" might be an appropriate default in a  
> Korean language runtime environment.
> --

I don't actually have a technical objection to this wording. But it  
seems a little misleading. It leads with the UTF-8 recommendation, but  
in practice that recommendation won't be used, because browsers will  
use windows-1252 or something locale-specific, and content will expect  
this. What's the benefit of leading with a UTF-8 recommendation, but  
then following it with alternatives that nearly everyone will have to  
choose in practice?
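
To make the trade-off concrete, here is a rough sketch of how a
last-resort default might be chosen under the proposed wording. It is
illustrative only and not taken from any browser: the function name,
the table, and its language keys are assumptions. The only encodings
used are the ones named in this thread (windows-1252 for Western
European locales, windows-949 for Korean, UTF-8 as the generic
default).

```python
from typing import Optional

# Hypothetical table of legacy defaults, keyed by primary language subtag.
# Only the encodings mentioned in the thread appear here.
LEGACY_DEFAULTS = {
    "en": "windows-1252",
    "fr": "windows-1252",
    "de": "windows-1252",
    "ko": "windows-949",
}


def fallback_encoding(locale: str,
                      user_override: Optional[str] = None,
                      generic_default: str = "utf-8") -> str:
    """Pick the tentative default encoding of last resort.

    A user-specified encoding wins; otherwise the locale's legacy
    default is used; otherwise the generic default (an assumption
    here, following the proposed wording) applies.
    """
    if user_override:
        return user_override
    # Reduce "en_US" or "ko-KR" to the primary language subtag.
    language = locale.replace("_", "-").split("-")[0].lower()
    return LEGACY_DEFAULTS.get(language, generic_default)
```

In a sketch like this the generic UTF-8 default is only reached for
locales that have no entry in the table, which is the practical
concern raised above.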

>
> 4. We suggest adding to step (6) this note:
>
> --
> Note: The UTF-8 encoding has a highly detectable bit pattern.  
> Documents that contain bytes > 0x7F which match the UTF-8 pattern  
> are very likely to be UTF-8, while documents that do not match it  
> definitely are not. While not full autodetection, it may be  
> appropriate for a user-agent to search for this common encoding.
> --

That suggestion makes sense.
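
As a minimal sketch of the check the note describes (not text from
the spec or from any implementation), the test can be expressed as:
the data contains at least one byte above 0x7F, and the whole buffer
decodes as strict UTF-8. The function name is hypothetical, and the
sketch assumes the complete document is available; a user agent that
inspects only a prefix would also have to tolerate a multi-byte
sequence cut off at the end of the buffer.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data has non-ASCII bytes and is valid UTF-8."""
    # Pure ASCII is valid UTF-8 but gives no evidence either way,
    # so it is not treated as a positive match.
    if all(byte <= 0x7F for byte in data):
        return False
    try:
        # Strict decoding rejects any byte sequence that does not
        # follow the UTF-8 bit pattern.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

For example, the UTF-8 bytes for "café" pass the check, while the
windows-1252 bytes for the same word fail it, because the lone 0xE9
byte is not followed by the continuation bytes UTF-8 requires.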

Regards,
Maciej

Received on Thursday, 20 August 2009 04:39:04 UTC