RE: Auto-detect and encodings in HTML5

Not only CJK and Cyrillic, but also Hebrew and, I suppose, many other
non-Latin languages.

Jony

-----Original Message-----
From: www-international-request@w3.org
[mailto:www-international-request@w3.org] On Behalf Of Henri Sivonen
Sent: Monday, June 01, 2009 10:49 AM
To: M.T.Carrasco Benitez
Cc: Travis Leithead; Erik van der Poel; public-html@w3.org;
www-international@w3.org; Richard Ishida; Ian Hickson; Chris Wilson; Harley
Rosnow
Subject: Re: Auto-detect and encodings in HTML5

On May 31, 2009, at 11:18, M.T. Carrasco Benitez wrote:

> Near to Erik, but UTF8 in worse case:
>
> 1) Best: HTTP charset; unambiguous and "external"
> 2) Agree on ONE public detection algorithm
> 3) Mandatory declaration as near to the top as possible; if in META,  
> the first in HEAD; within a certain range of bytes (e.g., 512)

We tried "first in HEAD" as a document conformance requirement, and it  
was way too annoying with validator messages when updating old sites.  
"Within a certain range of bytes" strikes the right balance between  
performance and existing authoring practices.
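
As a rough illustration of the "within a certain range of bytes" check, here
is a minimal Python sketch. The function name, the regular expression, and the
512-byte default are assumptions made for this example; it is a simplified
scan, not the prescan algorithm from the HTML5 draft.

```python
import re

# Simplified pattern: finds charset=... inside a <meta ...> tag.
# Assumption for illustration only; real parsers handle many more cases.
_META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)

def declared_charset(data: bytes, limit: int = 512):
    """Return the <meta> charset found in the first `limit` bytes, or None."""
    match = _META_CHARSET.search(data[:limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()
```

A declaration that starts beyond the byte limit is simply not seen, which is
the trade-off the byte range makes: bounded work for the parser, with no
requirement that <meta> be the very first element in HEAD.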

> 4) Default UTF8 could be part of the algorithm; perhaps the last  
> option

This is not feasible considering our "support existing content" design  
principle. If there's only a single last-resort default, it must be  
Windows-1252 to have the best world-wide coverage of existing content.  
(Future non-Latin and Latin content can explicitly opt into UTF-8.)

Unfortunately, making Windows-1252 the only default, without a sniffing
algorithm that keeps CJK and Cyrillic content from reaching that default,
would be bad for market share in CJK and Cyrillic locales. If we want to
get rid of the locale-dependent variability of the last-resort default,
we need a single normative heuristic detection algorithm that is good
enough to guess CJK and Cyrillic encodings correctly from the first 512
to 1024 bytes (i.e. mostly <title>).

And then we'd need a desktop browser vendor who is willing to be the  
first one to remove the UI for setting the last-resort default  
encoding--for all modes that the browser has for text/html.

UTF-8 never makes sense as the last resort for text/html, because  
UTF-8 in text/html has always been opt-in for authors, so logically  
there should be much less unlabeled existing content that assumes  
UTF-8 than that assumes a legacy encoding.

> 5) No BOM or similar


Compatibility with existing implementations requires the UTF-16 BOMs  
and the UTF-8 BOMs to be treated as encoding signatures whose  
authority ranks higher than <meta>.
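
To make the ranking concrete, here is a minimal Python sketch of BOM sniffing
that takes precedence over a <meta> declaration. The helper names and the
windows-1252 fallback parameter are assumptions for illustration; this is not
spec text and it ignores the HTTP charset level entirely.

```python
import codecs

# BOM byte sequences and the encodings they signal.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),         # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16be"),  # FE FF
)

def sniff_bom(data: bytes):
    """Return the encoding signalled by a leading BOM, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None

def pick_encoding(data: bytes, meta_charset=None, fallback="windows-1252"):
    """Hypothetical ranking: BOM first, then <meta>, then the last resort."""
    return sniff_bom(data) or meta_charset or fallback
```

With this ordering, a document that starts with FE FF is treated as UTF-16BE
even if its <meta> claims something else.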

On May 31, 2009, at 20:37, Erik van der Poel wrote:

> I agree that it would be interesting if major HTML5 implementers and
> (the) HTML5 spec writer(s) would agree on a UTF-8 default charset.
>
> Just to make the HTML5 "version indicator" a bit more explicit, might
> this be something like the following HTTP response header?
>
> Content-Type: text/html; version=5; charset=gb2312


Content-Type: text/html; version=5 doesn't default to UTF-8 in
existing user agents, but if you can set headers, you can already send
Content-Type: text/html; charset=utf-8.
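
A small Python sketch of the point, using the standard library's header
parser: a header-parameter parser simply reports whatever charset parameter is
present, and a version parameter alone yields none. The function name is an
assumption for this example, and this is not how any particular browser
parses the header.

```python
from email.message import Message

def charset_from_content_type(value: str):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() lowercases the value and returns None if absent.
    return msg.get_content_charset()

print(charset_from_content_type("text/html; version=5"))
print(charset_from_content_type("text/html; charset=utf-8"))
```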

The HTML WG discussed versioning at length in March and April 2007. I  
suggest not reopening that debate.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Monday, 1 June 2009 08:10:00 UTC