- From: Addison Phillips <addison@yahoo-inc.com>
- Date: Sat, 03 Nov 2007 14:39:51 -0700
- To: Dan Connolly <connolly@w3.org>
- CC: public-i18n-core@w3.org, www-archive@w3.org, Chris.Wilson@microsoft.com
Hi Dan,

The Internationalization Core WG was concerned by the thread in the subject line (see [1]) about encoding detection in HTML5 and discussed it in this week's teleconference. This note represents the WG's position at this time. Please copy to your WG lists as appropriate.

We do note your exchange with Martin Dürst earlier this week [2]. We have not yet discussed all of the points Martin raised (personally, I note that I am sympathetic to a number of his points). We also reviewed the HTML5 editor's copy directly before responding.

Basically, we agree with Martin's point that the email thread is poorly titled. The body of section 8 in HTML5 does not make windows-1252 the default encoding for HTML. However, the text quoted in [1] did bother us sufficiently to comment now. The quoted text is:

--
Otherwise, return an implementation-defined or user-specified default character encoding, with the confidence tentative. Due to its use in legacy content, windows-1252 is recommended as a default in predominantly Western demographics. In non-legacy environments, the more comprehensive UTF-8 encoding is recommended instead. Since these encodings can in many cases be distinguished by inspection, a user agent may heuristically decide which to use as a default.
--

Our comment is that this is a pretty weak recommendation. It is difficult to say what a "Western demographic" means in this context. We think we know why this is here: untagged HTML4 documents have a default character encoding of ISO 8859-1, so it is unsurprising to assume its common superset encoding when no other encoding can be guessed. However, we would like to see several things happen here:

1. The text never actually says anywhere why windows-1252 must be used instead of ISO 8859-1. This might not be a job for HTML5 (perhaps our WG should publish a WG Note on the topic that HTML5 could reference?), but it does bear mentioning somewhere. People who don't know the relationship between these encodings might find the sudden appearance of windows-1252 "out of the blue" mystifying.

2. As quoted, it seems to (but does not actually) favor 1252 over UTF-8. Since UTF-8 is highly detectable and also the best long-term general default, we'd prefer if the emphasis were reversed, dropping the reference to "Western demographics". For example:

--
Otherwise, return an implementation-defined or user-specified default character encoding, with the confidence tentative. UTF-8 is recommended as a default encoding in most cases. Due to its use in legacy content, windows-1252 is also recommended as a default. Since these encodings can usually be distinguished by inspection, a user agent may heuristically decide which to use as a default.
--

3. Possibly something should be said (elsewhere, not in this paragraph) about using other "superset" encodings in preference to the explicitly named encoding. That is, other encodings bear the same relationship to their named counterparts as windows-1252 does to ISO 8859-1, and user agents actually use these encodings to interpret pages and/or encode data in forms, etc.

In researching this note, I personally noted this section with some satisfaction, although I'm sure we might develop other comments as the WG digests it.

As always, if I can help facilitate discussion between our groups or clarify in any way, please let me know. Several of us are available at this coming week's TPAC, in case your WG would like to discuss these issues directly.

Best Regards (for the I18N Core WG),

Addison

--
Addison Phillips
Globalization Architect -- Yahoo! Inc.
Chair -- W3C Internationalization Core WG

Internationalization is an architecture.
It is not a feature.

[1] http://lists.w3.org/Archives/Public/public-i18n-core/2007OctDec/0063.html
[2] http://lists.w3.org/Archives/Public/public-i18n-core/2007OctDec/0072.html
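
Illustrative note on points 1 and 2: the sketch below is not taken from the HTML5 draft or from any user agent. It is a minimal Python illustration, under the assumption that "distinguished by inspection" can be approximated by attempting a strict UTF-8 decode and falling back to windows-1252 when that fails. The function name guess_default_encoding is hypothetical. The last lines show why windows-1252 is treated as the practical superset of ISO 8859-1: bytes in the 0x80-0x9F range are printable characters in windows-1252 but C1 control codes in ISO 8859-1.

```python
def guess_default_encoding(data: bytes) -> str:
    """Hypothetical helper: pick a tentative default encoding for untagged bytes.

    Assumption: strict UTF-8 validity is used as the whole heuristic.
    Real user agents may apply additional checks.
    """
    try:
        # Non-ASCII text in a legacy single-byte encoding is rarely valid
        # UTF-8, so a successful strict decode is strong evidence for UTF-8.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "windows-1252"
    # Pure ASCII input also lands here; ASCII is a subset of both encodings.
    return "utf-8"


# The same name encoded two ways yields different guesses.
utf8_bytes = "Dürst".encode("utf-8")
legacy_bytes = "Dürst".encode("cp1252")
print(guess_default_encoding(utf8_bytes))
print(guess_default_encoding(legacy_bytes))

# Byte 0x93 is a left curly quote in windows-1252 but a C1 control
# character in ISO 8859-1.
curly_quote = b"\x93".decode("cp1252")
c1_control = b"\x93".decode("latin-1")
```

A check like this returns only a tentative answer, in keeping with the "confidence tentative" wording in the quoted text; it does not replace an explicit encoding declaration.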
Received on Saturday, 3 November 2007 21:40:16 UTC