- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Mon, 28 Apr 2008 19:14:17 +0300
- To: olivier Thereaux <ot@w3.org>
- Cc: W3C Validator Community <www-validator@w3.org>
On Apr 28, 2008, at 04:43, olivier Thereaux wrote:

> On 24-Apr-08, at 5:10 PM, Henri Sivonen wrote:
>> More precisely for text/html:
>> http://www.w3.org/html/wg/html5/#determining
>>
>> Step 7. defines Windows-1252 as the general default which can be
>> different in non-Western browser installations. Global online apps
>> like validators should probably stick to Windows-1252.
>
> Henri, this is an interesting and important statement in the HTML5
> spec. How does the group feel about the inconsistency this created
> between the spec and defaults stated by other specifications, such as
>
> http://www.ietf.org/rfc/rfc2854.txt
> “ Section 3.7.1, defines that "media subtypes of the 'text' type are
> defined to have a default charset value of 'ISO-8859-1'".”
> (ditto RFC 2616)
>
> This is the inconsistency at the core of the issue, isn't it.
>
> I heard that the group working on HTTPbis had considered changing
> the default, but had not managed to reach consensus yet.

I'd rather not speak for the HTML WG as a group. However, my own take
on this is that what HTML 5 now says closely reflects what browsers
already do. Specs that say something notably different will in most
likelihood end up being irrelevant to writing software for consuming
text/html content for non-validation purposes. I think it isn't useful
for validators to diverge from other text/html consumers on this point.

> Is the HTML WG considering updating rfc2854?

Not to my knowledge, although the WG probably should in due course.

>> (The mention of UTF-8 there is a token gesture; the Web is a legacy
>> system, so UTF-8 for non-legacy does not apply.)
>
> This sounds rather like a subjective statement, which I would be
> wary of. Of course, the HTML5 spec is here to fix things in a
> backward-compatible way, but specifications are forward looking, not
> just back - and checkers are here in part to help move the landscape
> futureward. Or, at least, so am I told all the time by the likes of
> timbl :).
>
> I also note in the HTML5 specification:
> “Authors are encouraged to use UTF-8. Conformance checkers may
> advise against authors using legacy encodings.”
>
> So is this a question of a future-looking default (utf8) versus
> conservative default (win1252)? If so, I would argue that a checker
> should favor utf8 first, and fallback to win1252 second, no?

I think futureward is *declared* UTF-8. Indeed HTML5 encourages authors
to use UTF-8 but not by relying on defaulting without declaration.

For *general* (i.e. non-validator) HTML consumption advice, defaulting
to UTF-8 is a rather bad idea given existing content. Windows-1252,
GBK, Big5, Shift_JIS, EUC-KR, etc. depending on context are all better
default guesses when the encoding has not been declared.

Anyway, I think the crux of what HTML5 says on this issue for
validation is that non-declared non-ASCII is an error regardless of
what ASCII-superset default was used to get far enough to detect that
situation.

--
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
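
To illustrate the closing point of the message above (undeclared non-ASCII is an error whichever ASCII-superset default is used), here is a minimal Python sketch. It is not taken from any validator; the function name `check_undeclared_non_ascii`, its parameters, and the error wording are all hypothetical, and it assumes the caller has already looked for a declared encoding (HTTP header, BOM, meta element) and passes `None` when none was found.

```python
from typing import List, Optional, Tuple


def check_undeclared_non_ascii(
    body: bytes,
    declared_encoding: Optional[str],
    fallback: str = "windows-1252",  # hypothetical default; any ASCII superset works
) -> Tuple[str, List[str]]:
    """Decode `body` and report an error if non-ASCII bytes appear undeclared."""
    errors: List[str] = []
    if declared_encoding is None:
        # The check runs on raw bytes, so its outcome does not depend on
        # which ASCII-superset fallback is later used for decoding.
        for offset, byte in enumerate(body):
            if byte >= 0x80:
                errors.append(
                    f"non-ASCII byte 0x{byte:02X} at offset {offset} "
                    "but no character encoding was declared"
                )
                break  # one report is enough for this sketch
        encoding = fallback
    else:
        encoding = declared_encoding
    # errors="replace" keeps the sketch from raising on bytes that the
    # chosen encoding cannot map.
    text = body.decode(encoding, errors="replace")
    return text, errors
```

With this shape, swapping the fallback from Windows-1252 to, say, Shift_JIS changes the decoded text but not the reported error, which is the behaviour the message describes for validation.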
Received on Monday, 28 April 2008 16:15:02 UTC