
Re: Auto-detect and encodings in HTML5

From: Leif Halvard Silli <lhs@malform.no>
Date: Tue, 02 Jun 2009 03:34:18 +0200
Message-ID: <4A24819A.4000802@malform.no>
To: Boris Zbarsky <bzbarsky@MIT.EDU>
CC: "public-html@w3.org" <public-html@w3.org>, "www-international@w3.org" <www-international@w3.org>

Boris Zbarsky On 09-06-02 03.07:
> Leif Halvard Silli wrote:
>> Maciej Stachowiak On 09-06-02 00.38:
>>> Making the doctype switch the default from Windows-1252 to UTF-8 will 
>>> mean only ASCII documents work correctly in both older and newer user 
>>> agents, unless the author explicitly declares an encoding.
> [etc]
> 
>> There is one aspect that you are - again - forgetting, and that is 
>> authoring tools and web servers.
> 
> I don't think Maciej forgot anything like that.  He's talking about the 
> proposal that was made: that HTML consumers (not producers) default to 
> UTF-8 whenever they see "<!DOCTYPE html>".  He is clearly talking about 
> the case "unless the author explicitly declares an encoding", where 
> "author" is anything that's producing HTML.  "declares an encoding" 
> could take the form of an HTTP header or a <meta> tag in the HTML.

My comment was related to what Larry said [1]:

    >If there were other reasons for having a version
    >indicator (e.g., to support authoring requirements),
    >the version indicator could also indicate default
    >charset UTF8.

Larry has repeatedly spoken about the needs of authoring tools, 
e.g. with respect to versioning.
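
For reference, the consumer-side proposal under discussion (default to 
UTF-8 on seeing the HTML 5 doctype, unless an encoding is declared) could 
be sketched roughly as below. This is only an illustration: the function 
name, the regular expressions and the 1024-byte window are my own 
assumptions, not taken from the draft or from any browser, and BOM 
handling is left out for brevity.

```python
import re
from typing import Optional

# Hypothetical patterns for illustration; real consumers use a more careful prescan.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)
HTML5_DOCTYPE = re.compile(rb'^\s*<!DOCTYPE\s+html\s*>', re.IGNORECASE)


def guess_encoding(body: bytes, http_charset: Optional[str] = None) -> str:
    """Explicit declarations win; otherwise the doctype picks the default."""
    if http_charset:                     # charset from the HTTP Content-Type header
        return http_charset.lower()
    head = body[:1024]                   # only look at the start of the document
    match = META_CHARSET.search(head)
    if match:                            # <meta> declaration inside the HTML
        return match.group(1).decode("ascii").lower()
    if HTML5_DOCTYPE.match(head):        # the proposed new default
        return "utf-8"
    return "windows-1252"                # the legacy default
```

Under this sketch, a non-ASCII document with no declaration would be read 
as UTF-8 by a newer consumer and as Windows-1252 by an older one, which is 
the incompatibility Maciej points out.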

>> If conforming authoring tools had to default to UTF-8 whenever someone 
>> chose to create an HTML 5 document (much the same way that XML defaults 
>> to UTF-8/-16), then that would be a bonus, a simplification and a 
>> _motivation_ for using HTML 5.
> 
> Presumably by "default" you mean encode it as UTF-8 and then include the 
> appropriate <meta> tag?  That sounds like a pretty good idea to me.

Yes, indeed. As Larry said[2]: "Yes, supplying explicit charset is 
preferable, but ..."

The spec also talks about relying on a BOM as an alternative - I 
guess /that/ should be conforming/required authoring tool 
behaviour as well?
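
To make the authoring tool side concrete, here is a minimal sketch of a 
"save as HTML 5" step that always encodes as UTF-8, includes the <meta> 
declaration, and optionally prepends a BOM. The function name, its 
parameters and the page skeleton are hypothetical, chosen for 
illustration only.

```python
import codecs


def save_html5(body_html: str, title: str = "Untitled", with_bom: bool = False) -> bytes:
    """Serialize a page as UTF-8 with an explicit encoding declaration."""
    page = (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        '<meta charset="utf-8">\n'       # explicit declaration, placed early in <head>
        f"<title>{title}</title>\n"
        "</head>\n<body>\n"
        f"{body_html}\n"
        "</body>\n</html>\n"
    )
    data = page.encode("utf-8")          # the tool's default output encoding
    # The BOM is the alternative signal; here it is added on top of the <meta>.
    return codecs.BOM_UTF8 + data if with_bom else data
```

With both signals present, the document declares its encoding regardless 
of what the server sends or what the consumer would otherwise default to.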

>> The next level should be that web servers default to sending a 
>> charset header saying "UTF-8" whenever they see the HTML 5 doctype.
> 
> Very few web servers look inside the document content when deciding on 
> headers.  I don't believe the two most common ones (Apache and IIS) do 
> so by default....

Perhaps Sam or Roy or someone from Microsoft can enlighten us as to 
whether such a thing would be possible in Apache and IIS?
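
What I have in mind is something like the following sketch, in which the 
server peeks at the first bytes of the file before choosing the 
Content-Type header. This is not Apache or IIS configuration, and I am 
not claiming either server can do it; the function name and the 256-byte 
window are assumptions made for illustration.

```python
import re

# Matches the HTML 5 doctype at the very start, optionally preceded by a UTF-8 BOM.
HTML5_DOCTYPE = re.compile(rb'^(?:\xef\xbb\xbf)?\s*<!DOCTYPE\s+html\s*>', re.IGNORECASE)


def content_type_for(body: bytes) -> str:
    """Choose a Content-Type header value by sniffing the doctype."""
    if HTML5_DOCTYPE.match(body[:256]):
        return "text/html; charset=utf-8"
    return "text/html"                   # no charset parameter: leave it to the consumer
```

Note that such a server would mislabel an HTML 5 document that is stored 
in another encoding and relies on its own <meta> declaration, since the 
HTTP header would take precedence.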

>> Thus we could leave the Web browser behaviour as drafted, but require 
>> UTF-8 as the default from servers and authoring tools.
> 
> I doubt you'll hear any browser developers complaining about this!  I 
> certainly have no objections to it.  If authoring tools do in fact 
> behave this way, then maybe at some point (decades from now, I suspect) 
> we'll get to a world where we can start dropping support for encodings 
> that are no longer in use because the documents have been transcoded to 
> UTF-8 in the meantime.... Would be nice.

Indeed. :-)

[1] http://lists.w3.org/Archives/Public/public-html/2009May/0654
[2] http://lists.w3.org/Archives/Public/public-html/2009Jun/0036
-- 
leif halvard silli
Received on Tuesday, 2 June 2009 01:35:00 GMT
