Re: [whatwg] How to determine content-type of file: protocol from 段垚 on 2014-08-15 (public-whatwg-archive@w3.org from August 2014)

From: 段垚 <duanyao@ustc.edu>
Date: Fri, 15 Aug 2014 17:36:35 +0800
To: Nils Dagsson Moskopp <nils@dieweltistgarnichtso.net>, "Gordon P. Hemsley" <me@gphemsley.org>, whatwg@whatwg.org
Message-ID: <53EDD4A3.8040906@ustc.edu>
On 2014/8/14 21:23, Nils Dagsson Moskopp wrote:
> duanyao <duanyao@ustc.edu> writes:
>
>> On 07/28/2014 22:08, Gordon P. Hemsley wrote:
>>> On 07/28/2014 08:01 AM, duanyao wrote:
>>>> On 07/28/2014 06:34, Gordon P. Hemsley wrote:
>>>>> Sorry for the delay in responding. Your message fell through the
>>>>> cracks in my e-mail filters.
>>>>>
>>>>> On 07/17/2014 08:26 AM, duanyao wrote:
>>>>>> Hi,
>>>>>>
>>>>>> My first question is about a rule in MIME Sniffing specification
>>>>>> (http://mimesniff.spec.whatwg.org):
>>>>>>
>>>>>>      5.1 Interpreting the resource metadata
>>>>>>      ...
>>>>>>      If the resource is retrieved directly from the file system, set
>>>>>> supplied-type to the MIME type
>>>>>>      provided by the file system.
>>>>>>
>>>>>> As far as I know, no main-stream file systems record MIME type for
>>>>>> files. Does the spec actually want to say "provided by the operating
>>>>>> system" or
>>>>>> "provided by the file name extension"?
>>>>> Yeah, you've hit a known (though apparently unrecorded) bug in the
>>>>> spec, originally pointed out to me by Boris Zbarsky via IRC many
>>>>> months ago. The intent here is basically just "whatever the computer
>>>>> says it is"—whether that be via the file system, the operating system,
>>>>> or whatever, and whether it uses magic bytes, file extensions, or
>>>>> whatever.
>>>>>
>>>>> In other words, feel free to read that as "the correct behavior is
>>>>> undefined/unknown" at this point.
>>>> Thanks for the explanation.
>>>>
>>>> Recently, the file: protocol has become more and more important due to
>>>> the popularity of packaged web applications, including PhoneGap apps,
>>>> Chrome apps, Firefox OS apps, Windows 8 HTML apps, etc. (not all of them
>>>> use the file: protocol directly, but the underlying mechanisms are
>>>> similar).
>>>> So if we can't specify an interoperable way to determine a local file's
>>>> MIME type, porting packaged web applications can be problematic in
>>>> some situations (my team has actually already hit this).
>>>>
>>>> I know that there is currently no standard way to determine a local
>>>> file's MIME type; this may be one of the reasons that the mimesniff
>>>> spec has not defined a behavior here.
>>> Well, the most basic reason is because I never delved into how it
>>> actually works, because I was primarily concerned with HTTP connections.
>>>
>>> It's possible that there is no interoperable way to determine a local
>>> file's MIME type, but see below.
>>>
>>>> I'd like to propose a simple way to resolve this problem:
>>>> For MIME types that have already been standardized by IANA and are used
>>>> in web standards, determine a local file's supplied-type according to
>>>> its file extension.
>>>> This list could include htm, html, xhtml, xml, svg, css, js, jpeg, jpg,
>>>> png, mp4, webm, woff, etc. Otherwise, UAs can determine supplied-type by
>>>> any means.
>>>>
>>>> I think this rule should resolve most of the interoperability problems,
>>>> and largely maintain compatibility with current UAs' implementations.
>>> There is already a "standard" in place to detect file types on the
>>> operating system level:
>>>
>>> http://www.freedesktop.org/wiki/Specifications/shared-mime-info-spec/
>>> http://cgit.freedesktop.org/xdg/shared-mime-info/
>>>
>>> I could just refer to that and be done with it. Do you think that
>>> would work? (That specification has complex rules for detecting files,
>>> including magic bytes and whatnot, and is already used on a number of
>>> Linux distros and probably other operating systems.)
>>>
>> Maybe not.
>> (1) it's a standard for *nix desktops, and I doubt MS Windows will adopt it,
> I see this as pure speculation.
MS Windows has never had a mechanism comparable to freedesktop's.
It can only determine the MIME type from the filename extension, not from
the file content, and the mapping between extensions and types is not even
shipped with Windows itself: it relies on installed applications to
register extensions and MIME types.
See
http://stackoverflow.com/questions/3442607/mime-types-in-the-windows-registry

Do you have any indication that Windows will change this in the near future?
>> and maybe it's a bit heavy for mobile OSes;
> Widely used mobile operating systems are based on Unix (e.g. iOS,
> Android). Based on your measurements, how long does file(1) take?
Android does have a MIME type database and can guess the MIME type from
both the extension and the content, e.g. via
  java.net.URLConnection.guessContentTypeFromName(String filename)
  java.net.URLConnection.guessContentTypeFromStream(java.io.InputStream in)

However, iOS has nothing comparable and can only guess from the
extension. See
http://stackoverflow.com/questions/1363813/how-can-you-read-a-files-mime-type-in-objective-c

Not to mention Windows Phone.

Sniffing the MIME type from file content with a MIME type database is
always much slower than guessing from the extension, because much more
data has to be read from disk and many more CPU cycles are needed to
analyze that data.
This is why most web servers guess types from extensions only.
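
To make the extension-based rule concrete, here is a minimal sketch in
Python. The extensions are the ones I listed in my proposal; the MIME type
strings and the names EXTENSION_TYPES and supplied_type_from_extension are
only my assumptions for illustration, not text from any spec. Note that it
never opens the file, which is the whole point of the cost argument.

```python
import os
from typing import Optional

# Hypothetical table: extensions come from the proposal in this thread;
# the MIME type strings are assumptions chosen for illustration.
EXTENSION_TYPES = {
    "htm": "text/html",
    "html": "text/html",
    "xhtml": "application/xhtml+xml",
    "xml": "application/xml",
    "svg": "image/svg+xml",
    "css": "text/css",
    "js": "application/javascript",
    "jpeg": "image/jpeg",
    "jpg": "image/jpeg",
    "png": "image/png",
    "mp4": "video/mp4",
    "webm": "video/webm",
    "woff": "application/font-woff",
}


def supplied_type_from_extension(path: str) -> Optional[str]:
    """Return a supplied-type from the file name alone, or None if unknown."""
    extension = os.path.splitext(path)[1]  # e.g. ".HTML", or "" if none
    # Drop the leading dot and compare case-insensitively.
    return EXTENSION_TYPES.get(extension[1:].lower())
```

For an unknown or missing extension the sketch returns None, which is
where "UAs can determine supplied-type by any means" would apply.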

Also, because browsers have already implemented MIME type sniffing, it
would be a waste to do it twice.

However, if most mobile OSes ship with a MIME type database in the future
and browsers are willing to use it, I'm OK with that.

>> (2) many packaged web apps are ported from (and share code with) normal
>> web apps, and most web servers simply deduce the MIME type from the file
>> extension, so doing the same thing in UAs probably results in better
>> compatibility.
> It may not be possible to deduce the media type from the file extension
> alone, since there can be parameters to the media type like “charset” or
> “codecs”, e.g. “text/html; charset=UTF-8” or “audio/ogg; codecs=vorbis”.
Sure. But HTML and XML already have mechanisms to declare their own
encoding (<meta charset=...>, <?xml encoding=...?>),
and video/audio files also contain their codec information. If you want to
extract codec info from video/audio files, maybe we need another web API.
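
As a rough illustration of reading the encoding from the content itself,
here is a small Python sketch. It is a simplified assumption of mine, not
the prescan algorithm browsers actually implement: the function name
declared_encoding, the two regular expressions, and the 1024-byte limit
are all illustrative choices.

```python
import re
from typing import Optional

# Simplified patterns (assumptions for illustration, not a real HTML/XML parser).
_XML_ENCODING = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._\-]+)', re.IGNORECASE)
_META_CHARSET = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?([A-Za-z0-9._\-]+)', re.IGNORECASE)


def declared_encoding(head: bytes) -> Optional[str]:
    """Return the encoding label declared near the start of a document."""
    head = head[:1024]  # only look at the beginning of the file
    for pattern in (_XML_ENCODING, _META_CHARSET):
        match = pattern.search(head)
        if match:
            return match.group(1).decode("ascii").lower()
    return None  # nothing declared; the caller has to guess or ask the user
```

The meta pattern also picks up the older
<meta http-equiv="Content-Type" content="text/html; charset=..."> form,
since it only looks for "charset=" inside a meta tag.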

For plain text files (and HTML that is missing encoding info), there is no
standard and reliable way to determine the encoding. Browsers and text
editors are usually equipped with various algorithms to sniff encodings,
and also let users choose an encoding. So I think it is not yet time to
standardize an encoding sniffing algorithm.
>> (3) UAs are already required to do mime type sniffing, which should be
>> enough to correct most wrong supplied-type.
> Is this interoperable enough yet for the purpose at hand?
>
I think so. MIME type sniffing has been more or less implemented by all
major browsers and has become a standard, see

http://www.adambarth.com/papers/2009/barth-caballero-song.pdf
http://mimesniff.spec.whatwg.org
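
To show what I mean by sniffing correcting a wrong supplied-type, here is
a much simplified Python sketch for images only. The byte signatures for
PNG, JPEG and GIF are the well-known ones; the function names and the rule
"only sniff when the supplied type is missing or an image type" are my own
simplification, not the full algorithm from the spec.

```python
from typing import Optional

# Well-known leading byte signatures for three image formats.
IMAGE_SIGNATURES = (
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
)


def sniff_image_type(header: bytes) -> Optional[str]:
    """Match the first bytes of a resource against the image signatures."""
    for signature, mime_type in IMAGE_SIGNATURES:
        if header.startswith(signature):
            return mime_type
    return None


def computed_type(supplied_type: Optional[str], header: bytes) -> Optional[str]:
    """Simplified: sniffing may correct a missing or image supplied-type only."""
    if supplied_type is None or supplied_type.startswith("image/"):
        return sniff_image_type(header) or supplied_type
    return supplied_type
```

So a PNG file wrongly named photo.jpg would get supplied-type image/jpeg
from its extension, and the sniffing step would then correct it to
image/png.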
Received on Friday, 15 August 2014 09:37:39 UTC