Forwarded with permission...
A quick read suggests these are editorial suggestions and not
substantive design change requests...
--
Dan Connolly, W3C http://www.w3.org/People/Connolly/
Forwarded message 1
Hi Dan,
The Internationalization Core WG was concerned by the thread in the
subject line (see [1]) about encoding detection in HTML5 and discussed
it in this week's teleconference. This note represents the WG's position
at this time. Please copy to your WG lists as appropriate.
We do note your exchange with Martin Dürst earlier this week [2]. We
have not yet discussed all of the points Martin raised (personally, I
note that I am sympathetic to a number of his points). We also reviewed
the HTML5 editor's copy directly before responding.
Basically, we agree with Martin's point that the email thread is poorly
titled. The body of section 8 in HTML5 does not make windows-1252 the
default encoding for HTML. However, the text quoted in [1] did bother us
sufficiently to comment now.
The quoted text is:
--
Otherwise, return an implementation-defined or user-specified default
character encoding, with the confidence tentative. Due to its use in
legacy content, windows-1252 is recommended as a default in
predominantly Western demographics. In non-legacy environments, the more
comprehensive UTF-8 encoding is recommended instead. Since these
encodings can in many cases be distinguished by inspection, a user agent
may heuristically decide which to use as a default.
--
Our comment is that this is a pretty weak recommendation. It is
difficult to say what a "Western demographic" means in this context. We
think we know why this is here: untagged HTML4 documents have a default
character encoding of ISO 8859-1, so it is unsurprising to assume its
common superset encoding when no other encoding can be guessed.
However, we would like to see several things happen here:
1. The text never actually says anywhere why windows-1252 must be used instead
of ISO 8859-1. This might not be a job for HTML5 (perhaps our WG should
publish a WG Note on the topic that HTML5 could reference??), but it
does bear mentioning somewhere. People who don't know the relationship
between these encodings might find the sudden appearance of windows-1252
"out of the blue" mystifying.
2. As quoted, it seems to (but does not actually) favor 1252 over UTF-8.
Since UTF-8 is highly detectable and also the best long-term general
default, we'd prefer if the emphasis were reversed, dropping the
reference to "Western demographics". For example:
--
Otherwise, return an implementation-defined or user-specified default
character encoding, with the confidence tentative. UTF-8 is recommended
as a default encoding in most cases. Due to its use in legacy content,
windows-1252 is also recommended as a default. Since these encodings can
usually be distinguished by inspection, a user agent may heuristically
decide which to use as a default.
--
3. Possibly something should be said (elsewhere, not in this paragraph)
about using other "superset" encodings in preference to the explicitly
named encoding (that is, other encodings bear the same relationship as
windows-1252 does to ISO 8859-1 and user-agents actually use these
encodings to interpret pages and/or encode data in forms, etc.).
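
As an illustration of points 1 and 2 above, here is a minimal sketch of
the kind of heuristic the quoted text permits. It is not part of the WG's
position or of the HTML5 algorithm: the function name
guess_default_encoding is hypothetical, and a strict UTF-8 decode is only
one possible form of "inspection". It assumes Python's built-in "utf-8",
"cp1252" and "latin-1" codecs.

```python
def guess_default_encoding(data: bytes) -> str:
    """Return a tentative default encoding for untagged content.

    Hypothetical helper: prefer UTF-8 when the bytes are well-formed
    UTF-8, otherwise fall back to the legacy windows-1252 default.
    """
    try:
        # Strict decoding rejects any malformed UTF-8 sequence.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "windows-1252"
    return "utf-8"


# The same name encoded two ways is told apart by inspection.
utf8_bytes = "Dürst".encode("utf-8")      # b'D\xc3\xbcrst'
legacy_bytes = "Dürst".encode("cp1252")   # b'D\xfcrst'
print(guess_default_encoding(utf8_bytes))
print(guess_default_encoding(legacy_bytes))

# Why windows-1252 rather than ISO 8859-1: bytes 0x80-0x9F are C1
# control codes in ISO 8859-1 but printable characters (for example
# curly quotes) in windows-1252.
quoted = b"\x93quoted\x94"
print(repr(quoted.decode("cp1252")))
print(repr(quoted.decode("latin-1")))
```

Pure ASCII input is reported as UTF-8 here, which is harmless because
ASCII is a subset of both encodings. A real user agent would likely
inspect only a prefix of the stream and weigh other signals as well.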
In researching this note, I personally noted this section with some
satisfaction, although I'm sure we might develop other comments as the
WG digests it.
As always, if I can help facilitate discussion between our groups or
clarify in any way, please let me know. Several of us are available at
this coming week's TPAC, in case your WG would like to discuss these
issues directly.
Best Regards (for the I18N Core WG),
Addison
--
Addison Phillips
Globalization Architect -- Yahoo! Inc.
Chair -- W3C Internationalization Core WG
Internationalization is an architecture.
It is not a feature.
[1]
http://lists.w3.org/Archives/Public/public-i18n-core/2007OctDec/0063.html
[2]
http://lists.w3.org/Archives/Public/public-i18n-core/2007OctDec/0072.html