Re: HTML5 discussions regarding charset determination and sniffing

On 01.10.2010 01:07, Bjoern Hoehrmann wrote:
> * Julian Reschke wrote:
>> The background is that HTML5 specifies an algorithm for extracting the
>> charset from content type information, which (1) requires accepting
>> invalid forms (single quotes), and (2) requires not to properly handle
>> escapes in quoted strings.
>
> Usually, if you decide to ignore the standard and make your own rules,
> you introduce subtle problems that you had not thought about. As I noted
> some time ago in
> http://lists.w3.org/Archives/Public/ietf-http-wg/2009AprJun/0504.html
> the algorithm Ian proposes is inconsistent with the HTTP specification
> and with HTTP implementations such as browsers in its handling of
> strings like
>
>    text/plain;whatever="charset=iso-8859-2";charset=iso-8859-3

That algorithm has been in the spec since spring, and I raised a bug in
late April. It was finally processed last month. The fuzzy matching is
now out (even though the author claimed he had rejected my bug report).
So apparently this was a "willful violation" that wasn't based on
evidence, just on sloppiness.

There are two more cases left (see the currently open tracker issues).

> as the algorithm does not handle quoted strings at all and just does a
> stateless scan for "charset". That of course concerned processing the
> HTTP Content-Type header, the <meta> element is different and the HTML
> specification could only violate HTTP there if it pretended that <meta>
> has much to do with HTTP. If you compare the case above at the HTTP and
> at the <meta> level you should find some browsers use different parsers
> for them.

I suspect that as well. What's needed is proper testing (both for <meta> 
and the HTTP header). With the current state of the spec, we may end up 
with broken parsers for <meta> leaking out into HTTP header parsing.
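
As a rough illustration of the kind of test meant here, the sketch below
runs the header quoted above through two toy parsers. It is only a sketch:
the function names are made up for this example, `naive_charset` is one
guess at what a stateless scan for "charset" might look like (it is not
the HTML5 algorithm text), and `quote_aware_charset` is a simplified
reading of HTTP parameter syntax, not any browser's actual parser.

```python
import re

def naive_charset(value):
    # Hypothetical stateless scan: grab whatever follows the first
    # "charset=" anywhere in the string, ignoring quoted-string context.
    m = re.search(r'charset\s*=\s*["\']?([^"\';\s]+)', value, re.IGNORECASE)
    return m.group(1).lower() if m else None

def split_params(value):
    # Split on ';' only outside double-quoted strings, and treat a
    # backslash inside quotes as escaping the next character.
    parts, buf, in_quotes, escaped = [], [], False, False
    for ch in value:
        if escaped:
            buf.append(ch)
            escaped = False
        elif in_quotes and ch == "\\":
            buf.append(ch)
            escaped = True
        elif ch == '"':
            buf.append(ch)
            in_quotes = not in_quotes
        elif ch == ";" and not in_quotes:
            parts.append("".join(buf))
            buf = []
        else:
            buf.append(ch)
    parts.append("".join(buf))
    return parts

def unquote(s):
    # Strip surrounding double quotes and resolve backslash escapes.
    if len(s) >= 2 and s[0] == '"' and s[-1] == '"':
        return re.sub(r"\\(.)", r"\1", s[1:-1])
    return s

def quote_aware_charset(value):
    # Skip the media type itself, then look for a parameter whose
    # *name* is "charset" (first match wins in this sketch).
    for part in split_params(value)[1:]:
        name, sep, val = part.partition("=")
        if sep and name.strip().lower() == "charset":
            return unquote(val.strip()).lower()
    return None

header = 'text/plain;whatever="charset=iso-8859-2";charset=iso-8859-3'
print(naive_charset(header))        # picks up the text inside the quoted value
print(quote_aware_charset(header))  # picks up the real charset parameter
```

The two functions disagree on that header (iso-8859-2 versus iso-8859-3),
which is exactly the inconsistency under discussion. Feeding the same
strings through both the HTTP header path and the <meta> path of a
browser would show whether they share a parser.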

Best regards, Julian

Received on Friday, 1 October 2010 08:19:48 UTC