- From: Maciej Stachowiak <mjs@apple.com>
- Date: Thu, 20 Aug 2009 00:32:58 -0700
- To: "Phillips, Addison" <addison@amazon.com>
- Cc: "public-html@w3.org" <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>
- Message-id: <E2B1860A-6BDD-4008-B157-901FE4A59005@apple.com>
Based on further discussion with you and Henri, I filed the following:

http://www.w3.org/Bugs/Public/show_bug.cgi?id=7380
"Suggest heuristic detection of UTF-8"

http://www.w3.org/Bugs/Public/show_bug.cgi?id=7381
"Clarify default encoding wording and add some examples for non-latin locales."

Would you be willing to close ISSUE-11 in favor of the above two bugs?

Regards,
Maciej

On Aug 19, 2009, at 9:22 PM, Phillips, Addison wrote:

> Dear HTML5,
>
> The I18N Core WG would like to respond to your issue located here:
>
> http://www.w3.org/html/wg/tracker/issues/11
>
> We remain concerned about the text in Step 7 in this section:
>
> http://dev.w3.org/html5/spec/Overview.html#determining-the-character-encoding
>
> Your current text reads:
>
> --
> Otherwise, return an implementation-defined or user-specified
> default character encoding, with the confidence tentative. In
> non-legacy environments, the more comprehensive UTF-8 encoding is
> recommended. Due to its use in legacy content, windows-1252 is
> recommended as a default in predominantly Western demographics
> instead. Since these encodings can in many cases be distinguished by
> inspection, a user agent may heuristically decide which to use as a
> default.
> --
>
> Our concerns about this text are:
>
> 1. It isn't clear what constitutes a "legacy" or "non-legacy
> environment". We think that, for modern implementations, a bare
> recommendation of UTF-8 would be preferable.
>
> 2. The sentence starting "Since these encodings can {...} be
> distinguished by inspection" is not really accurate. If the user
> agent has performed the optional step (6), then heuristic detection
> has already been applied and failed. If the user agent has not done
> step (6), then the only reasonable encoding that can reliably be
> detected based solely on bit-pattern is UTF-8.
>
> 3. We think your intention is to permit the feature most browsers
> have of allowing the user to configure (from a base default) the
> character encoding to use when displaying a given page. The sentence
> starting "Due to its use..." mentions "predominantly Western
> demographics", which we find troublesome, especially given that it
> is associated with the keyword "recommended".
>
> We would like to request that you reword this paragraph along the
> lines of something like:
>
> --
> Otherwise, return an implementation-defined or user-specified
> default character encoding, with the confidence tentative. The UTF-8
> encoding is recommended as a default. The default may also be set
> according to the expectations and predominant legacy content
> encodings for a given demographic or audience. For example,
> windows-1252 is recommended as the default encoding for Western
> European language environments. Other encodings may also be used.
> For example, "windows-949" might be an appropriate default in a
> Korean language runtime environment.
> --
>
> 4. We suggest adding to step (6) this note:
>
> --
> Note: The UTF-8 encoding has a highly detectable bit pattern.
> Documents that contain bytes > 0x7F which match the UTF-8 pattern
> are very likely to be UTF-8, while documents that do not match it
> definitely are not. While not full autodetection, it may be
> appropriate for a user-agent to search for this common encoding.
> --
>
> Addison (for I18N WG)
>
> Addison Phillips
> Globalization Architect -- Lab126
>
> Internationalization is not a feature.
> It is an architecture.
>
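
The check described in the proposed note for step (6) can be sketched roughly as follows. This is only an illustration in Python, not text from the spec or from any browser's implementation: the function names looks_like_utf8 and guess_default_encoding and the locale_default parameter are hypothetical, and the fallback to "windows-1252" is just the Western European example from the proposed wording.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic from the proposed note: bytes > 0x7F that all fit the
    UTF-8 bit pattern suggest UTF-8; any mismatch rules it out."""
    # Pure ASCII carries no evidence either way, so do not claim UTF-8.
    if not any(b > 0x7F for b in data):
        return False
    try:
        # Strict decoding rejects any byte sequence that is not valid UTF-8.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def guess_default_encoding(data: bytes, locale_default: str = "windows-1252") -> str:
    """Hypothetical helper: prefer UTF-8 when the bytes match its pattern,
    otherwise fall back to an assumed locale-specific default."""
    return "utf-8" if looks_like_utf8(data) else locale_default
```

For instance, guess_default_encoding("café".encode("utf-8")) yields "utf-8", while the same word encoded as windows-1252 falls back to the locale default. One caveat of this sketch: a buffer cut off in the middle of a multi-byte character fails strict decoding, so a real implementation sniffing only a prefix of the document would need to tolerate a truncated final sequence.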
Received on Thursday, 20 August 2009 07:33:41 UTC