[whatwg] Content type sniffing

From: Adam Barth <whatwg@adambarth.com>
Date: Mon, 12 Jan 2009 09:12:31 -0800
Message-ID: <7789133a0901120912jda5a8bfp9dd1447f38bcf8c@mail.gmail.com>
On Mon, Jan 12, 2009 at 7:54 AM, Boris Zbarsky <bzbarsky at mit.edu> wrote:
> I'm not quite sure what to make of this, actually.  Specifically, where is
> the "22.19%" number for "HTML Tags" coming from?  22.19% of what? The magic
> numbers stuff actually adds up to 100%, but of what?

Sorry, the % signs were confusing.  I've removed them.  These tables
show the relative frequency with which those rules fire in the
content sniffer.  I probably should have scaled each table to sum to
100 (or to 1), but it was more convenient to leave them scaled to the
totals that I used.

>> I'm sympathetic to adding more HTML tags to the list, but I'm not sure
>> how far down the tail we should go.  In Chrome, we went for 99.999%
>> compatibility, which might be a bit far down the tail.
>
> Doesn't seem that way to me, given the number of web pages out there.

I don't think it makes sense to compare that percentage to the total
number of web pages.  Instead, imagine a user who views 100 pages a
day.  At 99.999% compatibility, that's one failure per 100,000 page
views, so that user will, in a crude "average" sense, come across a
broken web page about once every three years.
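The back-of-the-envelope arithmetic, as a sketch (the 100-pages-a-day
browsing rate is just an illustrative assumption):

```python
# Estimate how often a user hits a page that a 99.999%-compatible
# sniffer mishandles.  The browsing rate is an illustrative assumption.
compat = 0.99999          # fraction of pages handled compatibly
pages_per_day = 100       # assumed browsing rate

failure_rate = 1 - compat                        # ~1e-5 per page view
days_between_failures = 1 / (failure_rate * pages_per_day)
years_between_failures = days_between_failures / 365

print(round(days_between_failures))      # 1000 days
print(round(years_between_failures, 1))  # 2.7 years
```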

>> http://src.chromium.org/viewvc/chrome/trunk/src/net/base/mime_sniffer.cc?view=markup
>
> Ah, ok.  The relevant Gecko code is
> <http://hg.mozilla.org/mozilla-central/annotate/9f82199fdb9c/netwerk/streamconv/converters/nsUnknownDecoder.cpp#l477>.

Yes, I've examined that code in detail.  :)  Here is a web page that
will let you compare the sniffing algorithms used by four popular
browsers:

http://webblaze.cs.berkeley.edu/2009/content-sniffing/

> I'd probably be fine with trimming that list down a bit, but I'm not quite
> sure what the downsides of having more tags in it are here.

Most of the cost is complexity (which leads to security
vulnerabilities).  People who let users upload content and who build
firewalls that filter content at the application layer (for example,
to look for malware) need to understand browser content sniffing
algorithms in order to build secure products.  There is a huge
complexity win for standardizing the algorithm across multiple
implementations, and there is a small complexity loss for each
sniffing heuristic we add.
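To make the kind of heuristic under discussion concrete, here is a
minimal sketch in the spirit of the browser code linked above.  The
tag list and the matching rules are illustrative assumptions, not the
actual Chrome, Gecko, or spec lists:

```python
# Minimal sketch of an HTML-tag sniffing heuristic.  The tag list
# below is an illustrative assumption, not any browser's real list.
HTML_TAGS = (b"<!doctype html", b"<html", b"<head", b"<script",
             b"<iframe", b"<h1", b"<div", b"<font", b"<table",
             b"<style", b"<title", b"<body", b"<br", b"<p")

def sniffs_as_html(data: bytes) -> bool:
    """Return True if the leading bytes look like an HTML tag."""
    # Skip leading whitespace and compare case-insensitively, as the
    # real sniffers do.  (A real sniffer also checks the byte after
    # the tag name, e.g. '>' or whitespace, to avoid false prefixes.)
    prefix = data.lstrip(b" \t\r\n").lower()
    return any(prefix.startswith(tag) for tag in HTML_TAGS)
```

Every tag added to such a list is another case that upload filters
and application-layer firewalls must reproduce exactly, which is the
complexity cost described above.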

One plan for going forward is to resolve
https://bugzilla.mozilla.org/show_bug.cgi?id=465007 and then open
another bug for harmonizing the HTML heuristic (with the expectation
that harmonization will probably involve changing both the spec and
the implementation).

Adam