- From: Jonathan Rosenne <rosennej@qsm.co.il>
- Date: Mon, 1 Jun 2009 11:08:18 +0300
- To: "'Henri Sivonen'" <hsivonen@iki.fi>, "'M.T.Carrasco Benitez'" <mtcarrascob@yahoo.com>
- Cc: "'Travis Leithead'" <Travis.Leithead@microsoft.com>, "'Erik van der Poel'" <erikv@google.com>, <public-html@w3.org>, <www-international@w3.org>, "'Richard Ishida'" <ishida@w3.org>, "'Ian Hickson'" <ian@hixie.ch>, "'Chris Wilson'" <Chris.Wilson@microsoft.com>, "'Harley Rosnow'" <Harley.Rosnow@microsoft.com>
Not only CJK and Cyrillic, also Hebrew and I suppose many other non-Latin languages.

Jony

-----Original Message-----
From: www-international-request@w3.org [mailto:www-international-request@w3.org] On Behalf Of Henri Sivonen
Sent: Monday, June 01, 2009 10:49 AM
To: M.T.Carrasco Benitez
Cc: Travis Leithead; Erik van der Poel; public-html@w3.org; www-international@w3.org; Richard Ishida; Ian Hickson; Chris Wilson; Harley Rosnow
Subject: Re: Auto-detect and encodings in HTML5

On May 31, 2009, at 11:18, M.T. Carrasco Benitez wrote:

> Near to Erik, but UTF8 in worse case:
>
> 1) Best: HTTP charset; unambiguous and "external"
> 2) Agree on ONE public detection algorithm
> 3) Mandatory declaration as near to the top as possible; if in META,
> the first in HEAD; within a certain range of bytes (e.g., 512)

We tried "first in HEAD" as a document conformance requirement, and it was way too annoying with validator messages when updating old sites. "Within a certain range of bytes" strikes the right balance between performance and existing authoring practices.

> 4) Default UTF8 could be part of the algorithm; perhaps the last
> option

This is not feasible considering our "support existing content" design principle. If there's only a single last-resort default, it must be Windows-1252 to have the best world-wide coverage of existing content. (Future non-Latin and Latin content can explicitly opt into UTF-8.)

Unfortunately, having only Windows-1252 as the default without having a sniffing algorithm that makes CJK and Cyrillic content not reach the default would be bad for market share in CJK and Cyrillic locales.

If we want to get rid of the locale-dependent variability of the last-resort default, we need to have a single normative heuristic detection algorithm that is so good that CJK and Cyrillic encodings are guessed right from the first 512 to 1024 bytes (i.e. mostly <title>).
And then we'd need a desktop browser vendor who is willing to be the first one to remove the UI for setting the last-resort default encoding--for all modes that the browser has for text/html.

UTF-8 never makes sense as the last resort for text/html, because UTF-8 in text/html has always been opt-in for authors, so logically there should be much less unlabeled existing content that assumes UTF-8 than that assumes a legacy encoding.

> 5) No BOM or similar

Compatibility with existing implementations requires the UTF-16 BOMs and the UTF-8 BOM to be treated as encoding signatures whose authority ranks higher than <meta>.

On May 31, 2009, at 20:37, Erik van der Poel wrote:

> I agree that it would be interesting if major HTML5 implementers and
> (the) HTML5 spec writer(s) would agree on a UTF-8 default charset.
>
> Just to make the HTML5 "version indicator" a bit more explicit, might
> this be something like the following HTTP response header?
>
> Content-Type: text/html; version=5; charset=gb2312

Content-Type: text/html; version=5 doesn't default to UTF-8 in existing user agents, but if you can set headers, you can already do Content-Type: text/html; charset=utf-8.

The HTML WG discussed versioning at length in March and April 2007. I suggest not reopening that debate.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
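
For readers following the thread, the precedence discussed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not any browser's or the spec's actual algorithm: the function name `pick_encoding`, the regular expression, and the 1024-byte limit are hypothetical, and placing the HTTP charset ahead of the BOM is an assumption (the message only says that BOMs outrank <meta>). No heuristic sniffing step is shown.

```python
import re
from typing import Optional

# Byte order marks treated as encoding signatures that outrank <meta>.
BOMS = [
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]

# Deliberately simplified <meta> matcher (an assumption for illustration);
# a real prescan has to handle many more authoring variations.
META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def pick_encoding(
    body: bytes,
    http_charset: Optional[str] = None,
    prescan_limit: int = 1024,
    last_resort: str = "windows-1252",
) -> str:
    """Return an encoding label for an unparsed text/html byte stream."""
    # 1) Charset from the HTTP Content-Type header, if the server sent one.
    #    (Ranking this above the BOM is an assumption of this sketch.)
    if http_charset:
        return http_charset.lower()
    # 2) A BOM at the very start of the stream.
    for bom, name in BOMS:
        if body.startswith(bom):
            return name
    # 3) A <meta> declaration, but only within the first prescan_limit bytes.
    match = META_CHARSET.search(body[:prescan_limit])
    if match:
        return match.group(1).decode("ascii").lower()
    # 4) Nothing declared: fall back to the single last-resort default.
    return last_resort
```

For example, `pick_encoding(b"<title>x</title>")` falls through every step and returns the last-resort default, while a <meta> declaration placed after the first 1024 bytes is ignored in the same way.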
Received on Monday, 1 June 2009 08:10:00 UTC