Re: [widgets] Potential bug in Rule for Identifying the Media Type of a File from Marcos Caceres on 2009-10-12 (public-webapps@w3.org from October to December 2009)

From: Marcos Caceres <marcosc@opera.com>
Date: Mon, 12 Oct 2009 22:35:30 +0200
To: Marcin Hanclik <Marcin.Hanclik@access-company.com>
Cc: public-webapps <public-webapps@w3.org>
Message-ID: <b21a10670910121335q622faa52u78f025933f4756e7@mail.gmail.com>

>
>>>2. If file has a file-extension, attempt to match the file-extension
>>>to one in the file extensions column in the file identification table.
>>>If there is a match, then return the media type value. (returns
>>>"image/jpeg")
> I think file-extension would not be matched, but only base-name.
>
> I think the grammar is not ambiguous with regard to which rules would be matched.
> The problem is that at present in case of .jpg, there would be no file extension.
> A greedy parser would only match base-name and leave file-extension empty, since it is optional.
> So we need to modify the grammar to clearly specify what the extension is.
> With the current grammar, there is also a problem that "." is also allowed in the file-extension as part of the allowed-char.
> Therefore any parser may be confused which dot is the "." from the file-extension rule (I am not sure whether a parser can be developed at all).
> And thus, file-extension has problems. I assume that file extensions do not have dots, dot is to be the delimiter.
>
> What about modifying the ABNF to:
>
> file-name                 = file-name-with-extension | file-name-no-extension
>
> file-name-with-extension  = base-name file-extension
>
> base-name                 = *allowed-char
>
> file-extension            = "." 1*allowed-char-no-dot
>
> allowed-char-no-dot       = safe-char-no-dot / utf8-char
>
> safe-char-no-dot          = ALPHA / DIGIT / SP / "$" / "%"
>                           / "'" / "-" / "_" / "@"
>                           / "~" / "(" / ")" / "&" / "+"
>                           / "," / "." / "=" / "[" / "]"
>
> file-name-no-extension    = base-name-no-ext
>
> base-name-no-ext          = 1*allowed-char-no-dot
>
> This would make the base-name optional.
> .jpg is a valid file name, specifically on Linux platforms.
> Then, .jpg would have (only) a file extension and probably the prose of P&C would not need to be changed.
>

As part of this discussion I spend some time fine tuning the ABNF. I
merged in all the external refs and pumped out a few thousand test
cases for analysis using abnfgen [1]. Works great in MacOS X. I also
updated the spec to cover the following use cases [3]:

1. "noextension" > send to [SNIFF] spec.
2. "some.ext" > try to recognize extension. If fail, send to [SNIFF] spec.
3. ".something" > send to SNIFF spec.
4.  ".something.ext" > try to recognize extension. If fail, send to SNIFF spec.

New ABNF:

Zip-rel-path   = [locale-folder] [*folder-name] file-name/
                        [locale-folder] 1*folder-name
locale-folder  = %x6C %x6F %x63 %x61 %x6C %x65 %x73
          "/" language-range "/"
folder-name    = file-name "/"
file-name      = base-name [ file-extension ]
base-name      = 1*allowed-char
file-extension = "." 1*allowed-char
allowed-char   = safe-char / zip-UTF8-char
zip-UTF8-char  = UTF8-2 / UTF8-3 / UTF8-4
safe-char      = ALPHA / DIGIT / SP / "$" / "%"
     / "'" / "-" / "_" / "@"
     / "~" / "(" / ")" / "&" / "+"
     / "," / "=" / "[" / "]" / "."
UTF8-2         = %xC2-DF UTF8-tail
UTF8-3         = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
                 %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
UTF8-4         = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
                 %xF4 %x80-8F 2( UTF8-tail )
UTF8-tail      = %x80-BF
language-range = (1*8low-alpha / "*") *("-" (1*8alphanum / "*"))
alphanum       = low-alpha  / DIGIT
low-alpha      = %x61-71

[1] http://www.quut.com/abnfgen/
(using abnfgen path.abnf | xargs  mkdir -p )

[SNIFF]
http://tools.ietf.org/html/draft-abarth-mime-sniff-03

[3]
http://dev.w3.org/2006/waf/widgets/Overview_TSE.html#default-icons-table
-- 
Marcos Caceres
http://datadriven.com.au

Received on Monday, 12 October 2009 20:36:20 UTC