- From: Leif Halvard Silli <lhs@malform.no>
- Date: Tue, 02 Jun 2009 18:17:33 +0200
- To: Henri Sivonen <hsivonen@iki.fi>
- CC: Larry Masinter <masinter@adobe.com>, Chris Wilson <Chris.Wilson@microsoft.com>, Maciej Stachowiak <mjs@apple.com>, "M.T. Carrasco Benitez" <mtcarrascob@yahoo.com>, Travis Leithead <Travis.Leithead@microsoft.com>, Erik van der Poel <erikv@google.com>, "public-html@w3.org" <public-html@w3.org>, "www-international@w3.org" <www-international@w3.org>, Richard Ishida <ishida@w3.org>, Ian Hickson <ian@hixie.ch>, Harley Rosnow <Harley.Rosnow@microsoft.com>
Henri Sivonen On 09-06-02 10.10:
> On Jun 1, 2009, at 20:44, Larry Masinter wrote:
>
>> Chris, in your note below you claim that the "current de
>> facto" value was "Win1252" which seems to contradict what I
>> thought was claimed in another message that the "de facto"
>> default was "unknown" (which was my understanding, i.e., that
>> browsers used a wide variety of heuristics to determine
>> charset).
>
> The de facto default is Windows-1252 except for locales where
> it isn't.

[...]

>> I'm interested in reducing ambiguity and making web
>> transactions more reliable, and associating a new version
>> indicator (DOCTYPE) with a more constrained default (charset
>> default UTF8, rather than 'unknown') is reasonable, while I
>> also would be opposed to making an incompatible change with
>> actual current behavior.
>
> We already have 3 reliable version indicators for encoding axis
> of versioning:
> charset=utf-8 on the HTTP layer
> charset=utf-8 in <meta>
> the UTF-8 BOM
>
> We don't need a new indicator that wouldn't be as compatible
> with existing user agents as the indicators we already have.
> (Consider the Degrade Gracefully principle.)

Like several others, your reply does not incorporate the authoring
tools perspective that Larry contributed to this thread[1].

UTF-8 as the default encoding for HTML 5 documents already has wide
support[2][3][4][5]. The question is how to actually bring this into
the draft. It has to be more than a half-hearted recommendation. It
should be more in the direction of how utf-8/-16 is the default for
XML - a conformance requirement.

In several places, the draft sets a specification (for authors), only
to then tell how "real world" content should be treated (by browser
applications). This same approach should be possible w.r.t.
specifying UTF-8 as the default HTML 5 document encoding.
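
To make the split between "what authors declare" and "what a consumer falls back to" concrete, here is a minimal Python sketch. It checks the three indicators quoted above (UTF-8 BOM, HTTP charset, <meta> charset) and only then applies a default. The function name pick_encoding, the regular expressions, the 1024-byte prescan window and the precedence order are all assumptions made for illustration; this is not the sniffing algorithm from the draft and not what Validator.nu or any browser actually does.

```python
import codecs
import re
from typing import Optional, Tuple

# Simplified patterns (assumptions): real parsers handle many more cases.
HEADER_CHARSET = re.compile(r'charset\s*=\s*"?([A-Za-z0-9_.:-]+)', re.IGNORECASE)
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def pick_encoding(
    body: bytes,
    content_type: Optional[str] = None,
    fallback: str = "utf-8",
) -> Tuple[str, str]:
    """Return (encoding, source) for a document.

    source is one of "bom", "http", "meta" or "default".
    The order BOM -> HTTP -> <meta> is an assumption of this sketch.
    """
    # 1. UTF-8 byte order mark at the very start of the document.
    if body.startswith(codecs.BOM_UTF8):
        return "utf-8", "bom"

    # 2. charset parameter of the HTTP Content-Type header, if any.
    if content_type:
        match = HEADER_CHARSET.search(content_type)
        if match:
            return match.group(1).lower(), "http"

    # 3. <meta> declaration near the top of the document.
    match = META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower(), "meta"

    # 4. Nothing declared: fall back to the configured default.
    return fallback, "default"
```

A tool that follows the proposal would call pick_encoding(body, content_type) and get UTF-8 whenever nothing is declared, while a consumer of legacy content could pass fallback="windows-1252". The returned source value lets a checker tell a declared encoding apart from an assumed one.
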
The goal should be that authors, when they choose to create an HTML 5
document, can take for granted that (conforming) tools default to
UTF-8, unless the author actively selects something else.

And since you are working on an authoring tool: Reflecting such a
requirement in validators is a challenge. Validator.nu currently

 - for "Text Field" validation, does not give any kind of reaction
   if authors fail to insert encoding information. This should be
   dealt with.
 - for "Address" validation,
    * if there are no non-ASCII characters, and the encoding has
      not been declared, Validator.nu displays a warning but no
      error message. [6]
    * if such pages do contain non-ASCII characters, Validator.nu
      displays an error _and_ informs that it has assumed Windows
      1252.[7]

Authors do not need to know, when validating, that browsers assume
Windows 1252. Even less so if UTF-8 is defined as the default HTML 5
encoding. Validator.nu should not behave like a browser here (which
it doesn't do anyhow, as it does not seem to operate with locale
defaults); doing so would not bring any improvements.

Instead, Validator.nu should - for documents with the HTML 5 doctype
- assume the proposed _default_ charset - UTF-8. Lack of any encoding
info anywhere should always count as an error. Specifically, lack of
a meta element with charset info should probably count as an error.

Bob [2], Maciej [4], anything to add?

> On Jun 2, 2009, at 03:48, Leif Halvard Silli wrote:
>
>> Is it the choice of UTF-8 as default you don't understand? If
>> so, then I'd like to quote the "Support World Languages"
>> principle.
>
> The Support World Languages principle is satisfied by HTML5
> allowing authors easily to opt in to UTF-8. It has to be opt in
> due to the Support Existing Content and Degrade Gracefully
> principles.

Here you touch on the author perspective. However, HTML 4 already
does "Support World Languages", including Unicode.
If this principle is to have any meaning, then it must be applied to
/extend/ the support for world languages, including the "mixing of
text in different languages" - as called for in that principle. We
care for _real world support_, and real world support calls for the
definition of UTF-8 as the default charset.

[1] http://lists.w3.org/Archives/Public/public-html/2009Jun/0069
[2] http://lists.w3.org/Archives/Public/public-html/2009Jun/0068
[3] http://lists.w3.org/Archives/Public/public-html/2009Jun/0067
[4] http://lists.w3.org/Archives/Public/public-html/2009Jun/0066
[5] http://lists.w3.org/Archives/Public/public-html/2009Jun/0099
[6] http://validator.nu/?doc=http%3A%2F%2Fmalform.no%2Fhtml5%2FnoNonASCII.html&parser=html5
[7] http://validator.nu/?doc=http%3A%2F%2Fmalform.no%2Fhtml5%2FsomeNonASCII.html&parser=html5
--
leif halvard silli
Received on Tuesday, 2 June 2009 16:18:16 UTC