Re: [whatwg] How to determine content-type of file: protocol from Gordon P. Hemsley on 2014-07-28 (public-whatwg-archive@w3.org from July 2014)

From: Gordon P. Hemsley <me@gphemsley.org>
Date: Mon, 28 Jul 2014 10:08:00 -0400
To: duanyao <duanyao@ustc.edu>, whatwg@whatwg.org
Message-ID: <53D65940.9020905@gphemsley.org>

On 07/28/2014 08:01 AM, duanyao wrote:
> On 07/28/2014 06:34, Gordon P. Hemsley wrote:
>> Sorry for the delay in responding. Your message fell through the
>> cracks in my e-mail filters.
>>
>> On 07/17/2014 08:26 AM, duanyao wrote:
>>> Hi,
>>>
>>> My first question is about a rule in the MIME Sniffing specification
>>> (http://mimesniff.spec.whatwg.org):
>>>
>>>     5.1 Interpreting the resource metadata
>>>     ...
>>>     If the resource is retrieved directly from the file system, set
>>> supplied-type to the MIME type
>>>     provided by the file system.
>>>
>>> As far as I know, no mainstream file systems record MIME types for
>>> files. Does the spec actually mean to say "provided by the operating
>>> system" or
>>> "provided by the file name extension"?
>>
>> Yeah, you've hit a known (though apparently unrecorded) bug in the
>> spec, originally pointed out to me by Boris Zbarsky via IRC many
>> months ago. The intent here is basically just "whatever the computer
>> says it is"—whether that be via the file system, the operating system,
>> or whatever, and whether it uses magic bytes, file extensions, or
>> whatever.
>>
>> In other words, feel free to read that as "the correct behavior is
>> undefined/unknown" at this point.
> Thanks for the explanation.
>
> Recently, the file: protocol has become more and more important due to
> the popularity of packaged web applications, including PhoneGap apps,
> Chrome apps, Firefox OS apps, Windows 8 HTML apps, etc. (not all of them
> use the file: protocol directly, but the underlying mechanisms are similar).
> So if we can't specify an interoperable way to determine a local file's
> MIME type, porting packaged web applications can be problematic in
> some situations (actually my team already hit this).
>
> I know that currently there is no standard way to determine a local
> file's MIME type; this may be one of the reasons that the mimesniff spec
> has not defined a behavior here.

Well, the most basic reason is that I never delved into how it
actually works, because I was primarily concerned with HTTP connections.

It's possible that there is no interoperable way to determine a local 
file's MIME type, but see below.

> I'd like to propose a simple way to resolve this problem:
> For MIME types that have already been standardized by IANA and are used
> in web standards, determine a local file's supplied-type according to its
> file extension.
> This list could include htm, html, xhtml, xml, svg, css, js, jpeg, jpg,
> png, mp4, webm, woff, etc. Otherwise, UAs can determine supplied-type by
> any means.
>
> I think this rule should resolve most of the interoperability problems,
> and largely maintain compatibility with current UAs' implementations.
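
As a rough illustration of the extension-based rule proposed above, here
is a minimal Python sketch. The table EXTENSION_TYPES and the function
supplied_type_from_extension are hypothetical names, and the table covers
only a subset of the extensions listed in the proposal. Nothing here is
taken from the mimesniff spec or from any browser.

```python
import os

# Hypothetical table for illustration: a subset of the extensions named in
# the proposal, mapped to commonly used MIME types. Not part of any spec.
EXTENSION_TYPES = {
    "htm": "text/html",
    "html": "text/html",
    "xhtml": "application/xhtml+xml",
    "xml": "application/xml",
    "svg": "image/svg+xml",
    "css": "text/css",
    "jpeg": "image/jpeg",
    "jpg": "image/jpeg",
    "png": "image/png",
    "mp4": "video/mp4",
}


def supplied_type_from_extension(path, fallback=None):
    """Return a supplied-type for a local file based only on its extension.

    Extensions outside the table yield `fallback`, standing in for the
    "UAs can determine supplied-type by any means" part of the proposal.
    """
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    return EXTENSION_TYPES.get(ext, fallback)
```

With this sketch, "index.HTM" and "index.html" both map to text/html
regardless of platform, while an unlisted extension such as ".bin" falls
through to whatever the UA would otherwise do.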

There is already a "standard" in place to detect file types on the 
operating system level:

http://www.freedesktop.org/wiki/Specifications/shared-mime-info-spec/
http://cgit.freedesktop.org/xdg/shared-mime-info/

I could just refer to that and be done with it. Do you think that would 
work? (That specification has complex rules for detecting files, 
including magic bytes and whatnot, and is already used on a number of 
Linux distros and probably other operating systems.)
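
To give a sense of what a magic-byte check looks like, here is a minimal
Python sketch. It is not the shared-mime-info algorithm, which has its own
database format and priority rules. The names MAGIC_SIGNATURES, sniff_magic
and sniff_file are hypothetical, and the table holds only a few well-known
image signatures.

```python
# Hypothetical signature table for illustration only: leading byte
# sequences of a few common image formats.
MAGIC_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]


def sniff_magic(head):
    """Return a MIME type if `head` starts with a known signature, else None."""
    for signature, mime_type in MAGIC_SIGNATURES:
        if head.startswith(signature):
            return mime_type
    return None


def sniff_file(path, max_bytes=16):
    """Read the first bytes of a local file and apply sniff_magic to them."""
    with open(path, "rb") as f:
        return sniff_magic(f.read(max_bytes))
```

A real detector would combine a check like this with file name patterns
and decide which one wins when they disagree; that ordering is exactly the
part that would need to be specified for interoperability.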

>>> My second question is: does the above rule apply equally to both
>>> fetching static resources (top level, iframe, img, etc.) and XMLHttpRequest?
>>>
>>> It seems all browsers try to figure out actual type for local static
>>> resources, so that .htm and .xhtml files are rendered as HTML and
>>> XHTML respectively,
>>> so far so good.
>>>
>>> But when it comes to XHR, things are different.
>>>
>>> Firefox (31) sets the Content-Type header to 'application/xml' for local
>>> files of any type; and if xhr.responseType = 'document' is set, the
>>> response is parsed as XML;
>>> also, if xhr.responseType = 'blob' is set, blob.type is always
>>> 'application/xml'. This differs significantly from the static fetching
>>> behavior.
>>>
>>> Chromium (34) sets the Content-Type header to null for local files of
>>> any type; but if xhr.responseType = 'document' is set, the response is
>>> parsed according to its actual type,
>>> i.e. .htm as HTML and .xhtml as XHTML; and if
>>> xhr.responseType = 'blob' is set, blob.type is the file's actual type, i.e.
>>> 'text/html' for .htm and 'application/xhtml+xml'
>>> for .xhtml. This is similar to the static fetching behavior; however,
>>> the Content-Type header is missing.
>>>
>>> I think rule 5.1 should be applied to both static fetching and XHR
>>> consistently. Browsers should set the Content-Type header to the local
>>> file's actual type for XHR, and interpret
>>> it accordingly. But Firefox developers think this would break some
>>> existing code that already relies on Firefox's behavior
>>> (see https://bugzilla.mozilla.org/show_bug.cgi?id=1037762).
>>>
>>> What do you think?
>>>
>>> Regards,
>>>      Duan Yao.
>>>
>>>
>>
>> Anne's the person to ask about XHR first, I think. I don't want to
>> make any judgements or claims until I hear his view on the situation.
>>
>> That being said, I created the Contexts wiki article [1] and began
>> splitting up the mimesniff spec according to contexts [2] in an effort
>> to clarify this situation and make sure that all bases were covered.
>> It's still a work in progress, awaiting feedback from implementers and
>> other spec writers.
>>
>> I agree that there's a hole in how mimesniff, XHR, and Contexts
>> intersect, and I'll be happy to update mimesniff to fill it, if that's
>> determined to be the best course of action.
>>
>> HTH,
>> Gordon
>>
>> [1] http://wiki.whatwg.org/wiki/Contexts
>> [2] http://mimesniff.spec.whatwg.org/#context-specific-sniffing
>>
> I note that in the Contexts wiki article, the "connection" context (which
> XHR belongs to) has no sniffing algorithm specified.
> Does this mean UAs should not sniff in the case of XHR, or does it just
> mean the algorithm has not been specified yet?

Blank means not yet specified/determined. There should always be an 
algorithm for sniffing in a particular context, even if there's only a 
single step for that context. However, if there indeed were none, the 
cell would have an em dash (—).

> Personally, I'd like the "connection" context to use the same algorithm
> as the "browsing" context, because client JS code isn't always
> sure about the MIME types sent via XHR, much like the "browsing" context.

That's probably something to discuss with Anne and/or Ian.

Regards,
Gordon

-- 
Gordon P. Hemsley
me@gphemsley.org
http://gphemsley.org/
Received on Monday, 28 July 2014 14:08:26 UTC