I18N Core WG response to: RE: HTML 5 defaults to Windows-1252 (etc.)

Hi Dan,

The Internationalization Core WG was concerned by the thread in the 
subject line (see [1]) about encoding detection in HTML5 and discussed 
it in this week's teleconference. This note represents the WG's position 
at this time. Please copy to your WG lists as appropriate.

We do note your exchange with Martin Dürst earlier this week [2]. We 
have not yet discussed all of the points Martin raised (personally, I 
am sympathetic to a number of them). We also reviewed the HTML5 
editor's copy directly before responding.

Basically, we agree with Martin's point that the email thread is poorly 
titled. The body of section 8 in HTML5 does not make windows-1252 the 
default encoding for HTML. However, the text quoted in [1] did bother us 
sufficiently to comment now.

The quoted text is:

-- 
Otherwise, return an implementation-defined or user-specified default 
character encoding, with the confidence tentative. Due to its use in 
legacy content, windows-1252 is recommended as a default in 
predominantly Western demographics. In non-legacy environments, the more 
comprehensive UTF-8 encoding is recommended instead. Since these 
encodings can in many cases be distinguished by inspection, a user agent 
may heuristically decide which to use as a default.
-- 

Our comment is that this is a pretty weak recommendation. It is 
difficult to say what a "Western demographic" means in this context. We 
think we know why this is here: untagged HTML4 documents have a default 
character encoding of ISO 8859-1, so it is unsurprising that its common 
superset encoding, windows-1252, is assumed when no other encoding can 
be guessed.
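
As a rough illustration of that superset relationship (a Python sketch 
with sample bytes of our own choosing, not anything taken from the HTML5 
draft): the two encodings agree on bytes 0xA0-0xFF and differ in 
0x80-0x9F, where ISO 8859-1 has C1 control codes and windows-1252 has 
printable characters such as curly quotes.

```python
# Illustrative bytes only: 0x93/0x94 are curly quotes in windows-1252
# but C1 control codes in ISO 8859-1; 0xE9 is "é" in both.
sample = b"\x93caf\xe9\x94"

as_1252 = sample.decode("cp1252")     # Python's codec name for windows-1252
as_8859_1 = sample.decode("latin-1")  # Python's codec name for ISO 8859-1

# The 0xA0-0xFF range decodes identically in both encodings.
same_upper = b"\xe9".decode("cp1252") == b"\xe9".decode("latin-1")

print(repr(as_1252))
print(repr(as_8859_1))
print(same_upper)
```

A page labeled ISO 8859-1 that actually contains bytes such as 0x93 was 
almost certainly authored as windows-1252, which is presumably why user 
agents read it that way.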

However, we would like to see several things happen here:

1. The text never actually says why windows-1252 must be used instead 
of ISO 8859-1. This might not be a job for HTML5 (perhaps our WG should 
publish a WG Note on the topic that HTML5 could reference?), but it 
does bear mentioning somewhere. People who don't know the relationship 
between these encodings might find the sudden appearance of windows-1252 
"out of the blue" mystifying.

2. As quoted, it seems to (but does not actually) favor 1252 over UTF-8. 
Since UTF-8 is highly detectable and also the best long-term general 
default, we'd prefer if the emphasis were reversed, dropping the 
reference to "Western demographics". For example:

-- 
Otherwise, return an implementation-defined or user-specified default 
character encoding, with the confidence tentative. UTF-8 is recommended 
as a default encoding in most cases. Due to its use in legacy content, 
windows-1252 is also recommended as a default. Since these encodings can 
usually be distinguished by inspection, a user agent may heuristically 
decide which to use as a default.
-- 

3. Possibly something should be said (elsewhere, not in this paragraph) 
about using other "superset" encodings in preference to the explicitly 
named encoding. That is, other encodings bear the same relationship as 
windows-1252 does to ISO 8859-1, and user agents actually use these 
encodings to interpret pages and/or encode data in forms, etc.
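
To make the "distinguished by inspection" idea in item 2 concrete, here 
is a minimal Python sketch. It is only one possible heuristic, not the 
HTML5 algorithm, and the function name guess_default_encoding is 
hypothetical. It assumes that bytes which decode as strict UTF-8 are 
UTF-8, and falls back to windows-1252 otherwise.

```python
def guess_default_encoding(data: bytes) -> str:
    """Hypothetical helper: pick a tentative default encoding for untagged bytes."""
    try:
        # Strict decoding rejects malformed sequences; legacy windows-1252
        # text with non-ASCII bytes very rarely forms valid UTF-8.
        data.decode("utf-8", errors="strict")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumed fallback for legacy content, per the suggested wording.
        return "windows-1252"


utf8_guess = guess_default_encoding("café".encode("utf-8"))
legacy_guess = guess_default_encoding("café".encode("cp1252"))
ascii_guess = guess_default_encoding(b"plain ascii")
print(utf8_guess, legacy_guess, ascii_guess)
```

Pure ASCII input is valid in both encodings, so the sketch reports UTF-8 
for it; a real user agent would look at only a prefix of the stream and 
would keep the confidence tentative.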

In researching this note, I personally noted this section with some 
satisfaction, although I'm sure we might develop other comments as the 
WG digests it.

As always, if I can help facilitate discussion between our groups or 
clarify in any way, please let me know. Several of us are available at 
this coming week's TPAC, in case your WG would like to discuss these 
issues directly.

Best Regards (for the I18N Core WG),

Addison

-- 
Addison Phillips
Globalization Architect -- Yahoo! Inc.
Chair -- W3C Internationalization Core WG

Internationalization is an architecture.
It is not a feature.

[1] 
http://lists.w3.org/Archives/Public/public-i18n-core/2007OctDec/0063.html
[2] 
http://lists.w3.org/Archives/Public/public-i18n-core/2007OctDec/0072.html

Received on Saturday, 3 November 2007 21:40:18 UTC