Re: Content Sniffing impact on HTTPbis - #155 from Bjoern Hoehrmann on 2009-06-05 (ietf-http-wg@w3.org from April to June 2009)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Fri, 05 Jun 2009 18:14:46 +0200
To: Adam Barth <w3c@adambarth.com>
Cc: HTTP Working Group <ietf-http-wg@w3.org>
Message-ID: <fjdi25pvldvgagnkuofrfo4hjjtimma852@hive.bjoern.hoehrmann.de>

* Adam Barth wrote:
>For which parts would you like a more detailed rationale?  It's hard
>for me to guess which parts you think are obscure.

I've already mentioned the encoding extraction algorithm, but to add
some others: in draft-abarth-mime-sniff-01 section 3, step 3's special
handling of very particular sequences; the handling of unregistered
and malformed values in step 5; the special handling of XML types in
step 6; and the relevance of the implementation supporting particular
types in step 7.

In section 4: why implementations may decide to pick any number of bytes
between 0 and 512; why step 3 only applies when you have at least three
bytes and then only compares two bytes; why the UTF-32 BOM is not being
detected; and why step four has those bytes and not others. In section 6:
the special handling of image/svg+xml. In section 7: why the UTF-16 BOM
is ignored.

>The document defines algorithms for extracting information from the
>Content-Type.  That algorithm, in particular, extracts the charset
>attribute from the Content-Type header.  The algorithm is intended to
>be referenced by other specifications, such as HTML 5, which need to
>determine the charset attribute of the Content-Type header in a manner
>compatible with existing web content.

I see no justification for having a special algorithm for the charset
parameter; you extract the parameter just like any other. I also don't
know of any implementation that processes the header value like that;
if you have

  text/plain;whatever="charset=iso-8859-2";charset=iso-8859-3

then the result of your algorithm is iso-8859-2" (including the trailing
quote mark), while the correct behavior yields iso-8859-3, which is also
what IE6, FF 3.x, Opera 9, and various non-browser applications use. The
same goes for the simpler:

  text/plain;whatever="charset";charset=iso-8859-3

Here your algorithm returns nothing, while implementations follow the
correct behavior, which yields iso-8859-3. There also appears to be no
need to process escape sequences within quoted strings incorrectly; for
instance, Opera 9 seems to implement that properly, and so does my own
code.
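
To make the difference concrete, here is a rough Python sketch of the two
approaches. It is not taken from the draft or from any browser: the names
naive_charset and parse_params are hypothetical, naive_charset is only a
simplified substring scan that reproduces the two results described above,
and keeping the first value when a parameter name is repeated is an
arbitrary assumption.

```python
from typing import Dict, Optional


def naive_charset(value: str) -> Optional[str]:
    """Simplified substring scan (illustrative only, not the draft's text)."""
    pos = value.lower().find("charset")
    if pos < 0:
        return None
    rest = value[pos + len("charset"):].lstrip()
    if not rest.startswith("="):
        return None  # e.g. the match was followed by a quote mark
    rest = rest[1:].lstrip()
    end = 0
    while end < len(rest) and rest[end] not in " \t;":
        end += 1
    return rest[:end] or None


def parse_params(value: str) -> Dict[str, str]:
    """Walk the parameter list, honouring quoted strings and backslash escapes."""
    params: Dict[str, str] = {}
    n = len(value)
    i = value.find(";")
    while i != -1 and i < n:
        i += 1  # skip the ';' that introduces this parameter
        start = i
        while i < n and value[i] not in "=;":
            i += 1
        name = value[start:i].strip().lower()
        if i >= n or value[i] == ";":
            continue  # parameter without a value; ignore it
        i += 1  # skip '='
        while i < n and value[i] in " \t":
            i += 1
        if i < n and value[i] == '"':
            i += 1  # skip the opening quote
            chars = []
            while i < n and value[i] != '"':
                if value[i] == "\\" and i + 1 < n:
                    i += 1  # quoted-pair: take the next character literally
                chars.append(value[i])
                i += 1
            i += 1  # skip the closing quote
            val = "".join(chars)
            while i < n and value[i] != ";":
                i += 1  # ignore anything between the quote and the next ';'
        else:
            start = i
            while i < n and value[i] != ";":
                i += 1
            val = value[start:i].strip()
        if name and name not in params:  # assumption: first occurrence wins
            params[name] = val
    return params


first = 'text/plain;whatever="charset=iso-8859-2";charset=iso-8859-3'
second = 'text/plain;whatever="charset";charset=iso-8859-3'
```

With these definitions, parse_params(first).get("charset") and
parse_params(second).get("charset") both pick up the real charset
parameter, whereas naive_charset is misled by the quoted string in both
cases.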
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Received on Friday, 5 June 2009 16:15:22 UTC