Re: HTML5 vs content type sniffing

Henrik Nordström wrote:

> I also support removing the strict default ISO-8859-1 charset
> from HTTP text/* types, downgrading it to just a mere suggestion
> that if there is no charset information available then a good
> guess for the text/* types is ISO-8859-1 for historical reasons.

Apparently all agreed that the "strict default Latin-1" should go.
One way to handle this situation is to do whatever MIME and
the specification of the Content-Type (if given) offer, and for
historical reasons "Latin-1" can be a good guess.

In theory.  But unsurprisingly the HTML5 draft tells us that it
is in practice often not good enough.  Often "wannabe Latin-1"
turns out to be "windows-1252".  If you want to suggest a "best
guess" in the HTTP spec. for historical reasons please mention
windows-1252, it is an important difference for some documents:

Latin-1 C1 controls may not be permitted, and windows-1252 0x80
is the only (*) backwards compatible way to say €.
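
To make the difference concrete, here is a minimal Python sketch; it
is not taken from the HTML5 draft or the HTTP spec. The helper names
latin1_can_encode and decode_text_body are hypothetical, and trying
windows-1252 before Latin-1 is only one possible way to apply such a
"best guess" when no charset is given.

```python
# Byte 0x80 means different things in the two charsets.
raw = b"\x80"
as_latin1 = raw.decode("iso-8859-1")    # U+0080, a C1 control
as_cp1252 = raw.decode("windows-1252")  # U+20AC, the euro sign


def latin1_can_encode(text):
    """Return True if every character of text exists in ISO-8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


def decode_text_body(body, charset=None):
    """Hypothetical helper: honor a declared charset, otherwise guess.

    Assumption: with no charset information, try windows-1252 first
    and fall back to ISO-8859-1, which accepts every byte value.
    """
    if charset:
        return body.decode(charset)
    try:
        return body.decode("windows-1252")
    except UnicodeDecodeError:
        # Python's windows-1252 codec leaves 0x81, 0x8D, 0x8F, 0x90
        # and 0x9D undefined, so those bytes end up here.
        return body.decode("iso-8859-1")


if __name__ == "__main__":
    print(hex(ord(as_latin1)), hex(ord(as_cp1252)))
    print(latin1_can_encode("\u20ac"))
    print(decode_text_body(b"100 \x80"))
```

The fallback branch exists only because Python's codec rejects the
five unassigned windows-1252 bytes; other decoders may treat those
bytes differently.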

While that's IMO irrelevant for HTTP, if you decide to talk about
it anyway let's get it right:  Latin-1 is a "historical" charset,
windows-1252 is the "real" legacy.

 Frank

*: Agents knowing Latin-9 etc. would also know Unicode or €

Received on Tuesday, 5 February 2008 16:31:04 UTC