Re: HTML5 discussions regarding charset determination and sniffing from Bjoern Hoehrmann on 2010-09-30 (www-tag@w3.org from September 2010)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Fri, 01 Oct 2010 01:07:52 +0200
To: Julian Reschke <julian.reschke@gmx.de>
Cc: "www-tag@w3.org" <www-tag@w3.org>
Message-ID: <0m1aa6h9gvgog9ui2l9c9am7b4404i6659@hive.bjoern.hoehrmann.de>

* Julian Reschke wrote:
>The background is that HTML5 specifies an algorithm for extracting the 
>charset from content type information, which (1) requires accepting 
>invalid forms (single quotes), and (2) requires not to properly handle 
>escapes in quoted strings.

When you decide to ignore the standard and make your own rules, you
usually introduce subtle problems that you had not thought about. As I
noted some time ago in
http://lists.w3.org/Archives/Public/ietf-http-wg/2009AprJun/0504.html
the algorithm Ian proposes is inconsistent with the HTTP specification
and with HTTP implementations such as browsers in its handling of
strings like

  text/plain;whatever="charset=iso-8859-2";charset=iso-8859-3

as the algorithm does not handle quoted strings at all and just does a
stateless scan for "charset". That of course concerned processing the
HTTP Content-Type header. The <meta> element is different: the HTML
specification could only violate HTTP there if it pretended that <meta>
has much to do with HTTP. If you compare the case above at the HTTP
level and at the <meta> level, you should find that some browsers use
different parsers for the two.
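
To make the difference concrete, here is a rough Python sketch. It is
neither the text of the proposed algorithm nor any browser's code; the
function names naive_charset and quoted_aware_charset are made up for
illustration, and both functions are simplified. The first does a
stateless scan for "charset"; the second splits parameters only at
semicolons outside quoted strings and honors backslash escapes.

```python
import re


def naive_charset(value):
    """Stateless scan: use whatever follows the first 'charset=' found,
    even if it sits inside another parameter's quoted string."""
    match = re.search(r'charset\s*=\s*["\']?([^"\';\s]+)', value, re.IGNORECASE)
    return match.group(1) if match else None


def quoted_aware_charset(value):
    """Split on ';' only outside double-quoted strings, honoring
    backslash escapes, then look for a parameter named 'charset'."""
    params = []
    current = []
    in_quotes = False
    escaped = False
    for ch in value:
        if escaped:
            # Character after a backslash is taken literally.
            current.append(ch)
            escaped = False
        elif in_quotes and ch == "\\":
            escaped = True
        elif ch == '"':
            in_quotes = not in_quotes
        elif ch == ";" and not in_quotes:
            params.append("".join(current))
            current = []
        else:
            current.append(ch)
    params.append("".join(current))

    # params[0] is the media type itself; the rest are parameters.
    for param in params[1:]:
        name, sep, val = param.partition("=")
        if sep and name.strip().lower() == "charset":
            return val.strip()
    return None


header = 'text/plain;whatever="charset=iso-8859-2";charset=iso-8859-3'
print(naive_charset(header))         # iso-8859-2
print(quoted_aware_charset(header))  # iso-8859-3
```

The stateless scan picks up the value hidden inside the quoted
"whatever" parameter, while the quote-aware parser reports the real
charset parameter.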
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Received on Thursday, 30 September 2010 23:08:29 UTC