
Re: Auto-detect and encodings in HTML5

From: Boris Zbarsky <bzbarsky@MIT.EDU>
Date: Mon, 01 Jun 2009 21:07:11 -0400
Message-ID: <4A247B3F.6050300@mit.edu>
To: Leif Halvard Silli <lhs@malform.no>
CC: "public-html@w3.org" <public-html@w3.org>, "www-international@w3.org" <www-international@w3.org>
Leif Halvard Silli wrote:
> Maciej Stachowiak On 09-06-02 00.38:
>> Making the doctype switch the default from Windows-1252 to UTF-8 will 
>> mean only ASCII documents work correctly in both older and newer user 
>> agents, unless the author explicitly declares an encoding.
[etc]

> There is one aspect that you are - again - forgetting, and that is 
>  authoring tools and web servers.

I don't think Maciej forgot anything like that.  He's talking about the 
proposal that was made: that HTML consumers (not producers) default to 
UTF-8 whenever they see "<!DOCTYPE html>".  He is clearly talking about 
the case "unless the author explicitly declares an encoding", where 
"author" is anything that's producing HTML.  "declares an encoding" 
could take the form of an HTTP header or a <meta> tag in the HTML.
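
As a rough illustration of the consumer-side proposal, here is a minimal Python sketch of the precedence being discussed: an explicit declaration (HTTP header first, then <meta>) wins, and only an undeclared document falls back to a doctype-dependent default. The function name, constants, and regular expressions are hypothetical simplifications, not any browser's actual algorithm; BOM handling and heuristic detection are deliberately left out.

```python
import re

# Hypothetical names; a simplified sketch of the proposal, not a real browser algorithm.
LEGACY_DEFAULT = "windows-1252"
PROPOSED_DEFAULT = "utf-8"

_CHARSET_IN_HEADER = re.compile(r"charset\s*=\s*\"?([\w.:-]+)", re.IGNORECASE)
_META_CHARSET = re.compile(rb"<meta[^>]*charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)
_HTML5_DOCTYPE = re.compile(rb"\s*<!DOCTYPE\s+html\s*>", re.IGNORECASE)


def choose_encoding(content_type, body):
    """Pick an encoding name for `body` (bytes) under the proposed doctype switch."""
    # 1. An explicit charset in the HTTP Content-Type header wins.
    if content_type:
        match = _CHARSET_IN_HEADER.search(content_type)
        if match:
            return match.group(1).lower()
    # 2. Otherwise look for a <meta> declaration near the start of the document.
    match = _META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. Nothing declared: the doctype decides which default applies.
    if _HTML5_DOCTYPE.match(body):
        return PROPOSED_DEFAULT
    return LEGACY_DEFAULT
```

Step 3 is where older and newer consumers would disagree. A Windows-1252 body such as b"caf\xe9" decodes cleanly under the legacy default, but Python's strict UTF-8 decoder rejects it, which is the "only ASCII documents work in both" concern.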

> If complying authoring tools had to default to UTF-8 whenever someone 
> selects to create an HTML 5 document (much the same way that XML defaults 
> to UTF-8/-16), then that would be a bonus and simplification and 
> _motivation_ for using HTML 5.

Presumably by "default" you mean encode it as UTF-8 and then include the 
appropriate <meta> tag?  That sounds like a pretty good idea to me.
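
For concreteness, a minimal Python sketch of what "encode it as UTF-8 and include the <meta> tag" could look like in an authoring tool's "new document" step. The function name and template are hypothetical; the short `<meta charset>` form is the HTML5 draft spelling, and a tool could just as well emit the older http-equiv form.

```python
import html


def new_html5_document(title, body_html):
    """Return the bytes a (hypothetical) authoring tool would save for a new document."""
    text = (
        "<!DOCTYPE html>\n"
        "<html>\n"
        "<head>\n"
        # Declare the encoding before any text that might contain non-ASCII characters.
        '<meta charset="utf-8">\n'
        f"<title>{html.escape(title)}</title>\n"
        "</head>\n"
        "<body>\n"
        f"{body_html}\n"
        "</body>\n"
        "</html>\n"
    )
    # The declared encoding and the bytes actually written must agree.
    return text.encode("utf-8")
```

The point of doing both steps together is that the declaration is only useful if the saved bytes really are UTF-8; a tool that adds the tag but saves in a legacy encoding would make things worse.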

> 
> The next level should be that web servers default to sending a charset 
> header saying "UTF-8" whenever they see the HTML 5 doctype.

Very few web servers look inside the document content when deciding on 
headers.  I don't believe the two most common ones (Apache and IIS) do 
so by default....
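
To show what that would involve, here is a minimal Python sketch of a server-side hook that peeks at the start of the body before choosing the Content-Type header. Everything here (function name, the 256-byte window, the regular expression) is a hypothetical illustration; it is not a feature or configuration option of Apache or IIS.

```python
import re

# Hypothetical doctype check; real documents may have a BOM or comments first.
_HTML5_DOCTYPE = re.compile(rb"\s*<!DOCTYPE\s+html\s*>", re.IGNORECASE)


def content_type_for(body, configured="text/html"):
    """Return a Content-Type value, adding charset=utf-8 for HTML5-doctype bodies."""
    # An explicit charset in the server configuration is left alone.
    if "charset=" in configured.lower():
        return configured
    # Look only at the first few bytes of the content.
    if _HTML5_DOCTYPE.match(body[:256]):
        return configured + "; charset=utf-8"
    return configured
```

Note that the header takes precedence over any <meta> tag in the document, so a server doing this would mislabel HTML5-doctype files that were saved in another encoding.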

> Thus we could leave the Web browser behaviour as drafted, but require 
> utf-8 as default from servers and authoring tools.

I doubt you'll hear any browser developers complaining about this!  I 
certainly have no objections to it.  If authoring tools do in fact 
behave this way, then maybe at some point (decades from now, I suspect) 
we'll get to a world where we can start dropping support for encodings 
that are no longer in use because the documents have been transcoded to 
UTF-8 in the meantime.... Would be nice.

-Boris
Received on Tuesday, 2 June 2009 01:08:06 GMT
