- From: Maciej Stachowiak <mjs@apple.com>
- Date: Wed, 19 Aug 2009 21:38:16 -0700
- To: "Phillips, Addison" <addison@amazon.com>
- Cc: "public-html@w3.org" <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>
On Aug 19, 2009, at 9:22 PM, Phillips, Addison wrote:
> Dear HTML5,
>
> The I18N Core WG would like to respond to your issue located here:
>
> http://www.w3.org/html/wg/tracker/issues/11
>
> We remain concerned about the text in Step 7 in this section:
>
> http://dev.w3.org/html5/spec/Overview.html#determining-the-character-encoding
>
> Your current text reads:
>
> --
> Otherwise, return an implementation-defined or user-specified
> default character encoding, with the confidence tentative. In non-
> legacy environments, the more comprehensive UTF-8 encoding is
> recommended. Due to its use in legacy content, windows-1252 is
> recommended as a default in predominantly Western demographics
> instead. Since these encodings can in many cases be distinguished by
> inspection, a user agent may heuristically decide which to use as a
> default.
> --
>
> Our concerns about this text are:
>
> 1. It isn't clear what constitutes a "legacy" or "non-legacy
> environment". We think that, for modern implementations, a bare
> recommendation of UTF-8 would be preferable.
That recommendation is not suitable for compatible processing of the
public web. I don't believe any browser is prepared to implement such
a requirement or recommendation. I don't think it makes sense to make
a recommendation that is unlikely to be followed.
>
> 2. The sentence starting "Since these encodings can {...} be
> distinguished by inspection" is not really accurate. If the user
> agent has performed the optional step (6), then heuristic detection
> has already been applied and failed. If the user agent has not done
> step (6), then the only reasonable encoding that can reliably be
> detected based solely on bit-pattern is UTF-8.
>
> 3. We think your intention is to permit the feature most browsers
> have of allowing the user to configure (from a base default) the
> character encoding to use when displaying a given page. The sentence
> starting "Due to its use..." mentions "predominantly Western
> demographics", which we find troublesome, especially given that it
> is associated with the keyword "recommended".
Browsers for Latin-script locales pretty much universally use
Windows-1252 as the default of last resort. This is necessary to be
compatible with legacy content on the existing Web.
>
> We would like to request that you reword this paragraph along the
> lines of something like:
>
> --
> Otherwise, return an implementation-defined or user-specified
> default character encoding, with the confidence tentative. The UTF-8
> encoding is recommended as a default. The default may also be set
> according to the expectations and predominant legacy content
> encodings for a given demographic or audience. For example,
> windows-1252 is recommended as the default encoding for Western
> European language environments. Other encodings may also be used.
> For example, "windows-949" might be an appropriate default in a
> Korean language runtime environment.
> --
I don't actually have a technical objection to this wording. But it
seems a little misleading. It leads with the UTF-8 recommendation, but
in practice that recommendation won't be used, because browsers will
use windows-1252 or something locale-specific, and content will expect
this. What's the benefit of leading with a UTF-8 recommendation, but
then following it with alternatives that nearly everyone will have to
choose in practice?
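
For illustration only, here is a minimal Python sketch of the fallback that the proposed wording describes: a user-specified value wins, then a per-language legacy default, then UTF-8. The function name `default_encoding` and the table `LEGACY_DEFAULTS` are hypothetical, and the table's entries are assumptions drawn from the two examples in the proposed text (Western European languages and Korean). This is not a list any browser is known to ship.

```python
from typing import Optional, Tuple

# Hypothetical table: primary language subtag -> legacy default encoding.
# Only the examples named in the proposed wording are covered; the choice
# of "en", "fr", "de" as Western European samples is an assumption.
LEGACY_DEFAULTS = {
    "en": "windows-1252",
    "fr": "windows-1252",
    "de": "windows-1252",
    "ko": "windows-949",
}


def default_encoding(locale: str, user_choice: Optional[str] = None) -> Tuple[str, str]:
    """Return (encoding, confidence) for the default of last resort."""
    if user_choice:
        # A user-specified default takes priority over any built-in table.
        return user_choice, "tentative"
    # Reduce "ko-KR" or "ko_KR" to the primary language subtag "ko".
    language = locale.replace("_", "-").split("-")[0].lower()
    # Fall back to UTF-8 when no legacy default is known for the language.
    return LEGACY_DEFAULTS.get(language, "utf-8"), "tentative"
```

With a table like this, the UTF-8 branch is reached only for locales that have no entry, which is the situation questioned above.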
>
> 4. We suggest adding to step (6) this note:
>
> --
> Note: The UTF-8 encoding has a highly detectable bit pattern.
> Documents that contain bytes > 0x7F which match the UTF-8 pattern
> are very likely to be UTF-8, while documents that do not match it
> definitely are not. While not full autodetection, it may be
> appropriate for a user-agent to search for this common encoding.
> --
That suggestion makes sense.
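
As a rough illustration of the check the note describes, here is a minimal Python sketch. The function name `looks_like_utf8` is hypothetical, and the sketch assumes the whole byte stream (or a prefix that does not end in the middle of a multi-byte sequence) is available. It is not the spec's algorithm.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data contains bytes > 0x7F and is well-formed UTF-8."""
    if not any(byte > 0x7F for byte in data):
        # Pure ASCII gives no evidence for or against UTF-8.
        return False
    try:
        # Strict decoding rejects any byte sequence that is not valid UTF-8.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # Bytes that do not match the UTF-8 pattern: definitely not UTF-8.
        return False
    # Non-ASCII bytes that all match the pattern: very likely UTF-8.
    return True
```

A caller that only inspects a fixed-size prefix would need to tolerate a sequence truncated at the end of the buffer, which this sketch does not handle.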
Regards,
Maciej
Received on Thursday, 20 August 2009 04:39:02 UTC