- From: Richard Ishida <ishida@w3.org>
- Date: Thu, 20 Aug 2009 10:46:44 +0100
- To: "'Maciej Stachowiak'" <mjs@apple.com>, "'Phillips, Addison'" <addison@amazon.com>
- Cc: <public-html@w3.org>, <public-i18n-core@w3.org>
> http://www.w3.org/Bugs/Public/show_bug.cgi?id=7381
> "Clarify default encoding wording and add some examples for non-latin locales."

So building on that, here's a first comment on 7381:

I think that unless the word 'legacy' is specifically defined for this use in HTML5, we still need to clarify it. (Especially as in Charmod 'legacy' is used to refer to non-Unicode encodings, which may confuse further.)

Building on Henri's explanation, how about this wording:

    Otherwise, return an implementation-defined or user-specified default character encoding, with the confidence tentative. In controlled environments, the more comprehensive UTF-8 encoding is recommended. For the wider Web, the default may be set according to the expectations and predominant content encodings for a given demographic or audience. For example, windows-1252 is recommended as the default encoding for Western European language environments. Other encodings may also be used. For example, "windows-949" might be an appropriate default in a Korean language runtime environment.

[1] We could add to the end ", and UTF-8 would be an appropriate default for scripts in many developing regions." I suggest this, not because I want to see utf-8 go for world wide web domination or because I see it as a global panacea, but because I think it helps for certain demographics or audiences. The situation in these regions is often mired in competing encodings, each with a non-majority user base, that impede general interoperability, and use of utf-8 tends to provide a way forward - not only by superseding other encoding schemes, but also typically by providing useful features that support the use of the script itself. I just don't want it to sound as if you should try to find a local encoding for the default in every circumstance.

[2] I think it may also be worthwhile noting that the default encoding may also be one explicitly set by users in some applications (e.g. Firefox and IE allow you to change the default encoding).

Hope that helps,
RI

============
Richard Ishida
Internationalization Lead
W3C (World Wide Web Consortium)

http://www.w3.org/International/
http://rishida.net/

From: public-i18n-core-request@w3.org [mailto:public-i18n-core-request@w3.org] On Behalf Of Maciej Stachowiak
Sent: 20 August 2009 08:33
To: Phillips, Addison
Cc: public-html@w3.org; public-i18n-core@w3.org
Subject: Re: HTML5 Issue 11 (encoding detection): I18N WG response...

Based on further discussion with you and Henri, I filed the following:

http://www.w3.org/Bugs/Public/show_bug.cgi?id=7380
"Suggest heuristic detection of UTF-8"

http://www.w3.org/Bugs/Public/show_bug.cgi?id=7381
"Clarify default encoding wording and add some examples for non-latin locales."

Would you be willing to close ISSUE-11 in favor of the above two bugs?

Regards,
Maciej

On Aug 19, 2009, at 9:22 PM, Phillips, Addison wrote:

Dear HTML5,

The I18N Core WG would like to respond to your issue located here:

http://www.w3.org/html/wg/tracker/issues/11

We remain concerned about the text in Step 7 in this section:

http://dev.w3.org/html5/spec/Overview.html#determining-the-character-encoding

Your current text reads:

--
Otherwise, return an implementation-defined or user-specified default character encoding, with the confidence tentative. In non-legacy environments, the more comprehensive UTF-8 encoding is recommended. Due to its use in legacy content, windows-1252 is recommended as a default in predominantly Western demographics instead. Since these encodings can in many cases be distinguished by inspection, a user agent may heuristically decide which to use as a default.
--

Our concerns about this text are:

1. It isn't clear what constitutes a "legacy" or "non-legacy environment". We think that, for modern implementations, a bare recommendation of UTF-8 would be preferable.

2. The sentence starting "Since these encodings can {...} be distinguished by inspection" is not really accurate. If the user agent has performed the optional step (6), then heuristic detection has already been applied and failed. If the user agent has not done step (6), then the only reasonable encoding that can reliably be detected based solely on bit-pattern is UTF-8.

3. We think your intention is to permit the feature most browsers have of allowing the user to configure (from a base default) the character encoding to use when displaying a given page. The sentence starting "Due to its use..." mentions "predominantly Western demographics", which we find troublesome, especially given that it is associated with the keyword "recommended".

We would like to request that you reword this paragraph along the lines of something like:

--
Otherwise, return an implementation-defined or user-specified default character encoding, with the confidence tentative. The UTF-8 encoding is recommended as a default. The default may also be set according to the expectations and predominant legacy content encodings for a given demographic or audience. For example, windows-1252 is recommended as the default encoding for Western European language environments. Other encodings may also be used. For example, "windows-949" might be an appropriate default in a Korean language runtime environment.
--

4. We suggest adding to step (6) this note:

--
Note: The UTF-8 encoding has a highly detectable bit pattern. Documents that contain bytes > 0x7F which match the UTF-8 pattern are very likely to be UTF-8, while documents that do not match it definitely are not. While not full autodetection, it may be appropriate for a user-agent to search for this common encoding.
--

Addison (for I18N WG)

Addison Phillips
Globalization Architect -- Lab126

Internationalization is not a feature.
It is an architecture.
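
[Illustration] The note proposed for step (6) and the locale-dependent default proposed for step (7) can be sketched roughly as follows. This is a minimal Python sketch, not text from the HTML5 draft or from any browser: the function names (sniff_utf8, default_encoding, pick_encoding) and the Western European language list are hypothetical, and falling back to UTF-8 for other locales is an assumption taken from the wording proposed in this thread. It also assumes the whole byte stream is available; sniffing a truncated prefix could cut a multi-byte sequence and misreport it.

```python
# Hypothetical sketch of UTF-8 sniffing plus a locale-based default.
# Encoding labels are the ones used in the thread, not Python codec names.

# Illustrative only: a small sample of Western European language tags.
WESTERN_EUROPEAN = {"en", "fr", "de", "es", "it", "pt", "nl"}


def sniff_utf8(data: bytes) -> str:
    """Classify bytes as 'ascii', 'utf-8', or 'not-utf-8'."""
    try:
        # Strict decoding rejects any byte sequence that breaks the
        # UTF-8 bit pattern (bad lead or continuation bytes, overlongs).
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "not-utf-8"
    # Valid UTF-8 containing bytes > 0x7F is very likely real UTF-8;
    # pure ASCII is valid too, but tells us nothing about the encoding.
    if any(b > 0x7F for b in data):
        return "utf-8"
    return "ascii"


def default_encoding(locale: str) -> str:
    """Pick a default from the language part of a tag like 'ko-KR'."""
    lang = locale.split("-")[0].lower()
    if lang == "ko":
        return "windows-949"
    if lang in WESTERN_EUROPEAN:
        return "windows-1252"
    # Assumption: UTF-8 as the bare default, as proposed in the thread.
    return "utf-8"


def pick_encoding(data: bytes, locale: str) -> str:
    """Try the UTF-8 check first, then fall back to the locale default."""
    if sniff_utf8(data) == "utf-8":
        return "utf-8"
    return default_encoding(locale)
```

For example, pick_encoding(b"caf\xe9", "fr-FR") falls through to the locale default because the lone 0xE9 byte is not valid UTF-8, whereas the UTF-8 encoding of the same word is detected regardless of locale.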
Received on Thursday, 20 August 2009 09:46:58 UTC