Re: HTML5 vs content type sniffing from Henrik Nordström on 2008-02-05 (ietf-http-wg@w3.org from January to March 2008)

From: Henrik Nordström <henrik@henriknordstrom.net>
Date: Tue, 05 Feb 2008 15:45:22 +0100
To: Julian Reschke <julian.reschke@gmx.de>
Cc: HTTP Working Group <ietf-http-wg@w3.org>
Message-Id: <1202222722.17924.101.camel@hlaptop>

On Sat 2008-02-02 at 12:46 +0100, Julian Reschke wrote:
> The spec currently requires sniffing for "text/plain; 
> charset=iso-8859-1" and "text/plain; charset=ISO-8859-1", assuming that 
> those servers that do send an incorrect default content type always send 
> it with a very specific character set name. It appears that some servers 
> sometimes ship with other defaults, thus more character sets would need 
> to be considered 
> (<http://lists.w3.org/Archives/Public/public-html/2008Jan/0239.html>). 
> Where do you draw the line?

I follow the small crowd who likes what the HTTP RFC says. If the server
says something then it's explicit, and sniffing should only be allowed
when there is no explicit information to go on.

I.e. there are perfectly valid reasons for a server to say that a file is
of type text/plain instead of text/html. Having a user agent guess that
the content should be displayed as HTML only because it looks like
HTML is plain wrong. It's equally valid for a server to say it's
ISO-8859-1 even if it looks like the content may be UTF-8, as there may
be ISO-8859-1 code sequences in there, or perhaps the purpose is simply
to let the user know how odd it looks when rendered in the wrong
character set.

Having user agents actively work around server misconfigurations is just
wrong. All this does is delay getting the actual problem fixed, and
move the burden of fixing it from the server / webapp
maintainer who caused the problem to the user agent vendors.

Once you start going down the road of routinely second-guessing the
intentions of the server or webapp, you enter a never-ending road,
ensuring that these problems will stay forever and never get fixed.

So to summarise my preferences:

Content-Type guessing MAY be performed IF and ONLY IF there is no
Content-Type specified. (already a MUST-level criterion in the RFC)

charset parameter guessing MAY be performed IF and ONLY IF there is no
charset parameter specified. (currently a MUST NOT in the RFC; charset
guessing is currently never allowed)
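
As a rough sketch of the two rules above, and nothing more than that: the
Python below only guesses what the server left unsaid. The function names
and the two callbacks (sniff_type, guess_charset) are hypothetical
placeholders, not taken from the RFC or from HTML5, and the header
parsing is deliberately simplistic.

```python
from typing import Callable, Optional, Tuple


def parse_content_type(header: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Split a Content-Type value into (media_type, charset); None where absent."""
    if header is None or not header.strip():
        return None, None
    media_type, *params = [part.strip() for part in header.split(";")]
    charset = None
    for param in params:
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset" and value.strip():
            # Keep the server's spelling; only drop optional quotes.
            charset = value.strip().strip('"')
    return (media_type.lower() or None), charset


def resolve(
    header: Optional[str],
    body: bytes,
    sniff_type: Callable[[bytes], str],      # hypothetical content sniffer
    guess_charset: Callable[[bytes], str],   # hypothetical charset guesser
) -> Tuple[str, str]:
    """Guess only what the server did not state explicitly."""
    media_type, charset = parse_content_type(header)
    if media_type is None:       # no Content-Type at all: sniffing allowed
        media_type = sniff_type(body)
    if charset is None:          # no charset parameter: guessing allowed
        charset = guess_charset(body)
    return media_type, charset
```

With this, a response labelled "text/plain; charset=ISO-8859-1" is taken
at its word even if the body looks like HTML in UTF-8, while a response
with no Content-Type header falls through to both callbacks. Note that
the sketch does not restrict charset guessing to text/* types; that
would be a further policy choice.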


Related to this, I also support removing the strict default ISO-8859-1
charset from HTTP text/* types, downgrading it to a mere suggestion
that if there is no charset information available then a good guess for
the text/* types is ISO-8859-1 for historical reasons.

> 4) other type of sniffing
> 
> HTML5 defines other types of sniffing (such as unknown -> PDF) that 
> aren't covered by these tests, and haven't been discussed within this 
> thread.

Already in the definition of Type. Not much to discuss.

  "If
   and only if the media type is not given by a Content-Type field, the
   recipient MAY attempt to guess the media type via inspection of its
   content and/or the name extension(s) of the URI used to identify the
   resource."

Regards
Henrik

Received on Tuesday, 5 February 2008 14:47:29 UTC