- From: Maciej Stachowiak <mjs@apple.com>
- Date: Wed, 19 Aug 2009 21:38:16 -0700
- To: "Phillips, Addison" <addison@amazon.com>
- Cc: "public-html@w3.org" <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>
On Aug 19, 2009, at 9:22 PM, Phillips, Addison wrote:

> Dear HTML5,
>
> The I18N Core WG would like to respond to your issue located here:
>
> http://www.w3.org/html/wg/tracker/issues/11
>
> We remain concerned about the text in Step 7 in this section:
>
> http://dev.w3.org/html5/spec/Overview.html#determining-the-character-encoding
>
> Your current text reads:
>
> --
> Otherwise, return an implementation-defined or user-specified
> default character encoding, with the confidence tentative. In non-
> legacy environments, the more comprehensive UTF-8 encoding is
> recommended. Due to its use in legacy content, windows-1252 is
> recommended as a default in predominantly Western demographics
> instead. Since these encodings can in many cases be distinguished by
> inspection, a user agent may heuristically decide which to use as a
> default.
> --
>
> Our concerns about this text are:
>
> 1. It isn't clear what constitutes a "legacy" or "non-legacy
> environment". We think that, for modern implementations, a bare
> recommendation of UTF-8 would be preferable.

That recommendation is not suitable for compatible processing of the
public web. I don't believe any browser is prepared to implement such
a requirement or recommendation. I don't think it makes sense to make
a recommendation that is unlikely to be followed.

>
> 2. The sentence starting "Since these encodings can {...} be
> distinguished by inspection" is not really accurate. If the user
> agent has performed the optional step (6), then heuristic detection
> has already been applied and failed. If the user agent has not done
> step (6), then the only reasonable encoding that can reliably be
> detected based solely on bit-pattern is UTF-8.
>
> 3. We think your intention is to permit the feature most browsers
> have of allowing the user to configure (from a base default) the
> character encoding to use when displaying a given page. The sentence
> starting "Due to its use..." mentions "predominantly Western
> demographics", which we find troublesome, especially given that it
> is associated with the keyword "recommended".

Browsers for Latin-script locales pretty much universally use
Windows-1252 as the default of last resort. This is necessary to be
compatible with legacy content on the existing Web.

>
> We would like to request that you reword this paragraph along the
> lines of something like:
>
> --
> Otherwise, return an implementation-defined or user-specified
> default character encoding, with the confidence tentative. The UTF-8
> encoding is recommended as a default. The default may also be set
> according to the expectations and predominant legacy content
> encodings for a given demographic or audience. For example,
> windows-1252 is recommended as the default encoding for Western
> European language environments. Other encodings may also be used.
> For example, "windows-949" might be an appropriate default in a
> Korean language runtime environment.
> --

I don't actually have a technical objection to this wording. But it
seems a little misleading. It leads with the UTF-8 recommendation, but
in practice that recommendation won't be used, because browsers will
use windows-1252 or something local-specific, and content will expect
this. What's the benefit of leading with a UTF-8 recommendation, but
then following it with alternatives that nearly everyone will have to
choose in practice?

>
> 4. We suggest adding to step (6) this note:
>
> --
> Note: The UTF-8 encoding has a highly detectable bit pattern.
> Documents that contain bytes > 0x7F which match the UTF-8 pattern
> are very likely to be UTF-8, while documents that do not match it
> definitely are not. While not full autodetection, it may be
> appropriate for a user-agent to search for this common encoding.
> --

That suggestion makes sense.

Regards,
Maciej
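For illustration only, the check described in the note proposed for step (6) could be sketched roughly as follows. This is a minimal sketch, not text from the spec or from any browser: the function name `guess_encoding` and its `fallback` parameter are hypothetical, and it assumes the whole document is available as one byte buffer (a real user agent would more likely sniff a prefix and would need to handle a multi-byte sequence cut off at the end).

```python
def guess_encoding(data: bytes, fallback: str = "windows-1252") -> str:
    """Return "utf-8" only when data contains bytes > 0x7F that all
    form well-formed UTF-8; otherwise return the configured fallback.

    Hypothetical helper: the name and the fallback parameter are
    illustrative assumptions, not part of the HTML5 algorithm.
    """
    # Pure ASCII gives no evidence either way, so keep the default.
    if all(b < 0x80 for b in data):
        return fallback
    try:
        # Strict decoding rejects any byte sequence that does not
        # match the UTF-8 bit pattern.
        data.decode("utf-8")
    except UnicodeDecodeError:
        # "documents that do not match it definitely are not" UTF-8
        return fallback
    return "utf-8"
```

The fallback stays a locale-dependent default (windows-1252 here, or something like "windows-949" in a Korean environment), which matches the point made above that the default of last resort is what legacy content expects.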
Received on Thursday, 20 August 2009 04:39:02 UTC