RE: Auto-detect and encodings in HTML5

Changing the default charset from *something
well known* to *something else* would be a bad
idea -- that would be "default charset switching".

But changing the charset from "unknown, please guess"
to "UTF-8" doesn't seem like "default charset
switching"; it's "default charset setting".

Setting the default charset may not be
a good reason for a version indicator, but
it's a supporting reason.

If there were other reasons for having a version
indicator (e.g., to support authoring requirements),
the version indicator could also indicate default
charset UTF-8.

Larry
--
http://larry.masinter.net


-----Original Message-----
From: Maciej Stachowiak [mailto:mjs@apple.com] 
Sent: Sunday, May 31, 2009 3:35 PM
To: Larry Masinter
Cc: M.T. Carrasco Benitez; Travis Leithead; Erik van der Poel; public-html@w3.org; www-international@w3.org; Richard Ishida; Ian Hickson; Chris Wilson; Harley Rosnow
Subject: Re: Auto-detect and encodings in HTML5


On May 31, 2009, at 8:05 AM, Larry Masinter wrote:

> I believe the stance of most of the participants in the
> HTML working group is that no "version indicator" for
> HTML5 is necessary, and there is no specific
> "HTML5 doctype", against which newer, or stricter,
> behavior can be keyed.
>
> If charset defaulting is a reason for having a specific
> HTML5 version indicator, in order to trigger a stricter
> interpretation, say, of the default charset, that would
> be interesting.

I think it would be pretty poor if some indicator of the document
version (e.g. the doctype or, as suggested by someone else, a version
parameter in the Content-Type header) changed the default charset.
There are two reasons I say this:

1) It goes against our desire to allow for gradual adoption. If  
changing your doctype declaration could have the side effect of  
changing your charset from Windows-1252 ("Windows Latin-1") to UTF-8,  
that would be a serious risk of breaking upgraded documents.

2) Doctype and Content-type parameter are both opt-in mechanisms. But
there are already explicit ways to opt in to UTF-8: the charset
parameter on Content-type, or a <meta> tag in the document. Explicit
opt-in seems better to me than implicit, since it's more likely the  
author will be making a change intentionally.
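
To make the explicit opt-ins in point 2 concrete, here is a minimal
sketch of the precedence they imply: a charset parameter on
Content-Type wins, then a <meta> declaration, and only then a default.
The function name pick_encoding, the LEGACY_DEFAULT constant, and the
1024-byte prescan window are assumptions made for illustration; this
is not the encoding-sniffing algorithm from any specification.

```python
import re

# Assumed fallback for this sketch: the legacy default discussed in the thread.
LEGACY_DEFAULT = "windows-1252"

# Matches charset=... inside a Content-Type header value.
_HEADER_CHARSET = re.compile(r'charset\s*=\s*"?([A-Za-z0-9._:-]+)"?', re.IGNORECASE)
# Matches both <meta charset="..."> and the http-equiv form with content="...; charset=...".
_META_CHARSET = re.compile(rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.IGNORECASE)


def pick_encoding(content_type, body):
    """Return (encoding, source), preferring explicit opt-ins over the default."""
    if content_type:
        match = _HEADER_CHARSET.search(content_type)
        if match:
            return match.group(1).lower(), "http-header"
    # Only look near the start of the document (hypothetical window size).
    match = _META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower(), "meta"
    return LEGACY_DEFAULT, "default"


print(pick_encoding("text/html; charset=UTF-8", b"<p>hi</p>"))
print(pick_encoding("text/html", b'<meta charset="utf-8"><p>hi</p>'))
print(pick_encoding("text/html", b"<p>hi</p>"))
```

In this sketch a doctype or version parameter plays no part in the
decision, which is the behavior argued for above: the author has to
state the charset on purpose.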

It would be convenient if UTF-8 could be the default character set,  
but we can't safely apply that to legacy content, so we can't do it.  
Having it be the default under an opt-in doesn't really make it the  
default, it just adds a way to ask for UTF-8, though a subtle and  
implicit one. And the benefit does not seem great enough to add an  
additional implicit opt-in. WinLatin1 is not a broken encoding, and  
opting in to UTF-8 is already quite simple.
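
The risk to legacy content can be shown with a few bytes. The sketch
below, with a hypothetical helper named reinterpret and the sample
word "café" chosen only for illustration, saves text in one encoding
and reads it back under the other assumption.

```python
def reinterpret(text, written_as, read_as):
    """Encode text the way the author saved it, then decode it the way
    the browser assumes. Undecodable bytes become U+FFFD."""
    return text.encode(written_as).decode(read_as, errors="replace")


# A Windows-1252 page read as UTF-8: the accented letter is lost.
print(repr(reinterpret("café", "windows-1252", "utf-8")))

# A UTF-8 page read as Windows-1252: the accented letter turns into two characters.
print(repr(reinterpret("café", "utf-8", "windows-1252")))

# Pure ASCII survives either way, which is why such breakage can go unnoticed.
print(repr(reinterpret("cafe", "windows-1252", "utf-8")))
```

Either mismatch corrupts non-ASCII text, so silently switching the
default for existing documents would damage pages that are correct
today.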

Regards,
Maciej

Received on Sunday, 31 May 2009 22:46:05 UTC