Re: [whatwg] [mimesniff] Complete MIME type parsing algorithm for section 5 from Gordon P. Hemsley on 2013-05-28 (public-whatwg-archive@w3.org from May 2013)

From: Gordon P. Hemsley <gphemsley@gmail.com>
Date: Tue, 28 May 2013 17:21:38 -0400
To: Peter Occil <poccil14@gmail.com>
Cc: WHATWG <whatwg@whatwg.org>
Message-ID: <CAH4e3M4UrKhNbe3xHBu4fhDxUPUmriOCxxBB2XTpmfRJ7XYqmQ@mail.gmail.com>

Peter,

The main reason I haven't yet responded to your e-mails is that I'm
still actively working on improving and testing the algorithm.

But I do want you to know that your comments are valuable to me,
because they point out the areas I need to consider and test.

And while you should continue to bring inconsistencies with RFCs to my
attention, you should keep in mind that some of these inconsistencies
may be "willful violations".

The IETF has the power to restrict the format of the MIME types that
are formally registered, but they have little power over what winds up
deployed in the wild.

Browsers, on the other hand, need to know how to handle all sorts of
things that the IETF would consider invalid—and in many cases existing
browsers do things in violation of the RFCs.

Since one of the main goals of this spec, and of the WHATWG as a whole,
is to improve interoperability, consistency with the majority of
browsers takes priority over consistency with existing RFCs.

One specific comment I have about your latest e-mail: I think you
should read the algorithm again, because I'm fairly sure that it does
guard against empty values for type, subtype, and parameter names.
(But I'll check again.)

Regards,
Gordon

On Tue, May 28, 2013 at 4:25 PM, Peter Occil <poccil14@gmail.com> wrote:
>
> I see you've updated the MIME sniffing algorithm in response to my feedback.
> Here I'll go over the differences, and I'd like you to comment on them.
>
> 1. I assume the term "whitespace character" means the same as a
> "whitespace byte" under the MIME Sniffing spec.  If so, that term is
> inadequate for the following reasons.
>
>   * A whitespace character includes 0x0C, form feed (FF), which is not
>      considered whitespace in either HTTP or the Internet Message Format
>      (IMF, RFC5322).
>
>      For example, the following would not be well-formed under HTTP or IMF:
>
>      text/plain{FF}; charset=utf-8
>
>      But the current algorithm would consider that string well-formed
>      anyway.
>
>   * All steps in the document that are the same as step 7 skip all
>      whitespace characters, even if the whitespace isn't well-formed
>      under HTTP or IMF.  For example, a bare carriage return (CR) or
>      line feed character (LF) is not allowed, and a CR-LF pair not
>      followed by either SPACE or TAB is also not allowed. IMF also
>      allows comments within whitespace.
>
>      For example, the following would not be well-formed under HTTP or IMF:
>
>      text/plain;{CR} charset=utf-8
>      text/plain;{LF} charset=utf-8
>      text/plain;{CR}{LF}charset=utf-8
>
>      (Note the lack of space in the last example. Note also that folding
>      whitespace is deprecated under the current HTTP draft.)
>
>      And the following examples would be allowed under IMF, but not HTTP:
>
>      (comment) text/plain; charset=utf-8
>      text/plain; (comment) charset=utf-8
>      text/plain; (comment (nested)) charset=utf-8
>      text/plain; charset=utf-8 (comment)
>      text/plain; {CR}{LF} (comment) charset=utf-8
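
A minimal Python sketch of the whitespace mismatch described in point 1.
It assumes the five "whitespace bytes" are TAB, LF, FF, CR and SPACE, and
that HTTP optional whitespace is only SPACE and TAB. It also assumes any
obs-fold has already been removed, so every remaining CR or LF is
reported. IMF comments are not handled. The function name is hypothetical
and not taken from the spec.

```python
# Assumption: the spec's "whitespace bytes" are TAB, LF, FF, CR, SPACE.
WHITESPACE_BYTES = frozenset(b"\t\n\x0c\r ")
# HTTP optional whitespace (OWS) is SPACE and TAB only.
HTTP_OWS = frozenset(b"\t ")


def non_http_whitespace(value: bytes) -> list[int]:
    """Return the bytes that a whitespace-byte skip would accept but HTTP would not."""
    return [b for b in value if b in WHITESPACE_BYTES and b not in HTTP_OWS]
```

An empty result means the whitespace in the value is acceptable to both
definitions; a non-empty result lists the FF, CR or LF bytes that only the
broader definition would skip.
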
>
> 2. While the type, subtype, and parameter name are checked for their
>  length, the other rules for well-formedness are not checked in your
>  version, namely, that they must not be empty, contain a byte that isn't
>  a MIME type byte (see my original message), or begin with a byte that
>  isn't an ASCII alphanumeric.
>
>  For example, the following would not be well-formed under RFC6838:
>
>  te*xt/plain;charset=utf-8
>  text/pl*ain;charset=utf-8
>  text/plain;ch*arset=utf-8
>  text/plain;=utf-8
>  text/;charset=utf-8
>  /plain;charset=utf-8
>
>  The first three examples are ill-formed because "*" isn't a MIME type
>  byte; the last three because the parameter name, subtype, or type is
>  empty.
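
A rough Python sketch of the name checks in point 2. It assumes that
"MIME type byte" corresponds to the restricted-name grammar of RFC 6838
(an ASCII alphanumeric first character, at most 127 characters in total),
since the original definition is not quoted in this message. The helper
names are hypothetical, and the naive split on ";" ignores quoted
parameter values.

```python
import re

# Assumption: RFC 6838 restricted-name, 1 to 127 characters.
RESTRICTED_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9!#$&\-^_.+]{0,126}")


def is_restricted_name(name: str) -> bool:
    """True if name is non-empty, starts with an alphanumeric, and uses only allowed characters."""
    return RESTRICTED_NAME.fullmatch(name) is not None


def names_are_well_formed(mime: str) -> bool:
    """Check the type, subtype and parameter names only; values are not validated."""
    essence, _, rest = mime.partition(";")
    type_, slash, subtype = essence.partition("/")
    if not slash:
        return False
    names = [type_, subtype.rstrip(" \t")]
    if rest:
        # Naive split: a quoted value containing ";" would be mis-split.
        for param in rest.split(";"):
            name, equals, _value = param.partition("=")
            if not equals:
                return False
            names.append(name.lstrip(" \t"))
    return all(is_restricted_name(n) for n in names)
```

With these assumptions, all six examples in point 2 are rejected, while
"text/plain;charset=utf-8" is accepted.
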
>
>
> 3. Unquoted parameter values are not checked to ensure that they are not
>  empty and do not contain a byte that isn't a parameter value byte (see
>  my original message).
>
>  For example, the following would not be well-formed under HTTP or MIME:
>
>  text/plain;charset=ut?f-8
>  text/plain;charset=utf=8
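
A small Python sketch of the unquoted-value check in point 3. It assumes
that a "parameter value byte" is the same as an HTTP token character
(tchar); the original definition is not quoted in this message, so treat
the character set and the function name as assumptions.

```python
import string

# Assumption: unquoted parameter values follow the HTTP "token" grammar.
TCHAR = frozenset(string.ascii_letters + string.digits + "!#$%&'*+-.^_`|~")


def is_http_token(value: str) -> bool:
    """True if value is non-empty and made up only of token characters."""
    return len(value) > 0 and all(c in TCHAR for c in value)
```

Under this definition "utf-8" passes, while "ut?f-8", "utf=8" and the
empty string do not.
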
>
> 4. Quoted parameter values are not checked to ensure that they do not
>  contain a 0x7F byte or a byte other than TAB (0x09) that is less than
>  0x20.
>
>  For example, the following would not be well-formed under HTTP or MIME:
>
>  text/plain;charset="utf{LF}-8"
>  text/plain;charset="utf{0x7F}-8"
>  text/plain;charset="utf\{LF}-8"
>  text/plain;charset="utf\{0x7F}-8"
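
A Python sketch of the quoted-value check in point 4: reject 0x7F and any
byte below 0x20 other than TAB, whether or not it follows a backslash.
The function name is hypothetical. Allowing bytes of 0x80 and above is an
assumption, as is the requirement that the input include both surrounding
quotes.

```python
def quoted_value_is_well_formed(quoted: bytes) -> bool:
    """Validate a complete quoted string, including its surrounding quotes."""
    if len(quoted) < 2 or quoted[:1] != b'"' or quoted[-1:] != b'"':
        return False
    body = quoted[1:-1]
    i = 0
    while i < len(body):
        b = body[i]
        if b == 0x5C:  # backslash: examine the escaped byte instead
            i += 1
            if i >= len(body):
                return False  # the closing quote was itself escaped
            b = body[i]
        elif b == 0x22:  # unescaped quote inside the value
            return False
        # Control bytes other than TAB are rejected, escaped or not.
        if b == 0x7F or (b < 0x20 and b != 0x09):
            return False
        i += 1
    return True
```

The four examples in point 4 are rejected by this check, while a value
containing an escaped quote or a literal TAB is accepted.
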
>
> Please give your comments.
>
> --Peter
>
>
> -----Original Message----- From: Gordon P. Hemsley
> Sent: Saturday, May 25, 2013 1:26 PM
>
> To: Peter Occil
> Cc: WHATWG
> Subject: Re: [whatwg] [mimesniff] Complete MIME type parsing algorithm for
> section 5
>
> On Sat, May 25, 2013 at 12:46 PM, Peter Occil <poccil14@gmail.com> wrote:
>>
>> My algorithm skips only SPACE and TAB instead of all whitespace characters
>> because it assumes that the field value was already extracted from
>> Content-Type according to the HTTP/HTTPbis spec (0x0C, form feed, is never
>> considered whitespace in HTTP headers). In particular, it assumes that
>> folding whitespace (obs-fold) was replaced with spaces (or the message
>> with obs-fold rejected) before the Content-Type value was interpreted.
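
The preprocessing assumed in the paragraph above could be sketched in
Python as follows. It treats obs-fold as CR LF followed by one or more
SPACE or TAB bytes and replaces each occurrence with a single space. Both
function names are hypothetical, and rejecting the message instead is the
other option the paragraph mentions.

```python
import re

# obs-fold: CR LF followed by at least one SPACE or TAB.
OBS_FOLD = re.compile(rb"\r\n[ \t]+")


def unfold(value: bytes) -> bytes:
    """Replace each obs-fold with a single space; a bare CR LF is left untouched."""
    return OBS_FOLD.sub(b" ", value)


def skip_space_and_tab(value: bytes, pos: int) -> int:
    """Advance pos past SPACE and TAB only, as the quoted algorithm does."""
    while pos < len(value) and value[pos] in (0x20, 0x09):
        pos += 1
    return pos
```

After unfolding, a skip over SPACE and TAB alone is enough to reach the
next parameter, which is why the narrower whitespace set suffices for
values taken from an HTTP header.
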
>
>
> Thanks for your detailed explanation.
>
> It'll take me a little while to evaluate what you've proposed here,
> but in the meantime: Keep in mind that the Content-Type header is not
> the only source for a MIME type. This algorithm needs to consider MIME
> types from all possible sources.
>
> --
> Gordon P. Hemsley
> me@gphemsley.org
> http://gphemsley.org/ • http://gphemsley.org/blog/



-- 
Gordon P. Hemsley
me@gphemsley.org
http://gphemsley.org/ • http://gphemsley.org/blog/
Received on Tuesday, 28 May 2013 21:22:24 UTC