RE: [widgets] Potential bug in Rule for Identifying the Media Type of a File from Marcin Hanclik on 2009-10-16 (public-webapps@w3.org from October to December 2009)

From: Marcin Hanclik <Marcin.Hanclik@access-company.com>
Date: Fri, 16 Oct 2009 12:06:57 +0200
To: "marcosc@opera.com" <marcosc@opera.com>
CC: public-webapps <public-webapps@w3.org>
Message-ID: <FAA1D89C5BAF1142A74AF116630A9F2C2890D48A62@OBEEX01.obe.access-company.com>
Hi Marcos,

These are my remarks as discussed yesterday on the call.

Comment a)

6.A.If all characters in the extension are outside the two ranges, then go to step 5 in this algorithm.

Should be

6.A.If any of the characters in the extension is outside the two ranges, then go to step 5 in this algorithm.

But this is also problematic, since in that case the algorithm would loop infinitely.
So it should be:

6.A.If any of the characters in the extension is outside the two ranges, then go to step 7 in this algorithm.

Another comment to 6.A:
It seems that the whole algorithm assumes that the File Identification Table is constant.
E.g. if any vendor would like to add some extension with a character outside of the given ranges (or we in W3C would like to do this in the future), then we would need to rewrite the algorithm.

So what about this (we do not need the ranges IMHO):
6.  Attempt to case-insensitively match the value of extension to one of the values in the file extension column in the file identification table. If there is a match, then return the corresponding value from the media type column and terminate this algorithm.
And remove 6.A and 6.B as they were.
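
A minimal sketch of the proposed step 6 in Python, assuming the file identification table is available as a simple mapping. The table entries and the names FILE_IDENTIFICATION_TABLE and match_extension are illustrative assumptions, not taken from the spec, and "case-insensitive" is approximated here by a plain lowercase comparison. The point is only that the lookup needs no character-range check and keeps working if new extensions are added to the table.

```python
# Illustrative subset of a file identification table (assumed entries).
FILE_IDENTIFICATION_TABLE = {
    ".html": "text/html",
    ".jpg": "image/jpeg",
    ".png": "image/png",
}


def match_extension(extension, table=FILE_IDENTIFICATION_TABLE):
    """Return the media type matching extension, or None if there is no match."""
    wanted = extension.lower()
    for known, media_type in table.items():
        # Compare both sides in lowercase so ".JPG" matches ".jpg".
        if known.lower() == wanted:
            return media_type
    return None


print(match_extension(".JPG"))  # matches the ".jpg" row
print(match_extension(".xyz"))  # no row matches, so the caller would fall through to step 7
```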

*****************
Comment b)

4. If the first character of the name is a U+002E 'FULL STOP' character, and the file name contains no other U+002E 'FULL STOP' character then go to step 7 of this algorithm.

What about ".jpg"?
Do you assume that this is a file name and not a file extension?

What about this:
4. If the first character of the name is a U+002E 'FULL STOP' character, and the file name contains no other U+002E 'FULL STOP' character then let extension be name and go to step 6 of this algorithm.

*****************
Comment c)

Given that the processing model is developed in prose, I think we MUST fix the ambiguity of the grammar anyway.
Thus I suggest the following change from:

file-name      = base-name [ file-extension ]
base-name      = 1*allowed-char
file-extension = "." 1*allowed-char

to:

file-name      = 1*allowed-char

(i.e. remove base-name and file-extension).
The removal of ambiguity is motivated by the dependency of the WURI/WUS spec on P&C in this particular detail, so it is better to keep it right, I think.
File extension does not play any role in WURI/WUS anyway.
I think either the above change or the one in my mail below has to be implemented in the spec.
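
The ambiguity can be illustrated with a small Python sketch that transcribes the three rules into regular expressions. ALLOWED is a simplified stand-in for allowed-char (an assumption made for brevity); what matters is that it includes ".", as the real rule does. The names GREEDY, LAZY and split are hypothetical helpers for this illustration only.

```python
import re

# Simplified stand-in for allowed-char (assumption); note that it includes ".".
ALLOWED = r"[A-Za-z0-9._-]"

# file-name = base-name [ file-extension ], with base-name matched greedily.
GREEDY = re.compile(rf"^(?P<base>{ALLOWED}+)(?P<ext>\.{ALLOWED}+)?$")
# The same grammar, but with base-name matched as short as possible.
LAZY = re.compile(rf"^(?P<base>{ALLOWED}+?)(?P<ext>\.{ALLOWED}+)?$")


def split(pattern, name):
    """Return (base-name, file-extension) as found by pattern, or None."""
    m = pattern.match(name)
    return (m.group("base"), m.group("ext")) if m else None


print(split(GREEDY, "cat.tar.gz"))  # ('cat.tar.gz', None)
print(split(LAZY, "cat.tar.gz"))    # ('cat', '.tar.gz')
print(split(GREEDY, ".jpg"))        # ('.jpg', None)
```

Both patterns accept the same strings but disagree on where the extension starts, and neither yields ".gz" for "cat.tar.gz". For ".jpg" there is no extension at all, since base-name must consume at least one character.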

*****************
Comment d)

We need to somehow derive the extension if the grammar is modified as in comment c) [i.e. removal of two rules].
Therefore I suggest the change from:

3. If the first character of the name is not a U+002E 'FULL STOP' character and the name has a file-extension  component, let extension be value of the file-extension component.

To:

3. Let "extension" be an empty string. If the first character of the name is not a U+002E 'FULL STOP' character and the file name contains a U+002E 'FULL STOP' character, then let extension be the sequence of characters from the last U+002E 'FULL STOP' (inclusive) to the end of name and go to step 6 of this algorithm (as proposed in comment a) [no ranges etc.]).

SUMMARY
   1. Let file be the file to be processed.

   2. Let name be the file-name string component of the zip relative path that identifies the file.

   3. Let extension be an empty string. If the first character of the name is not a U+002E 'FULL STOP' character and the file name contains a U+002E 'FULL STOP' character, then let extension be the sequence of characters from the last U+002E 'FULL STOP' (inclusive) to the end of name and go to step 6 of this algorithm.

      For example, the extension of the file name "cat.html" would be ".html".

   4. If the first character of the name is a U+002E 'FULL STOP' character, and the file name contains no other U+002E 'FULL STOP' character, then let extension be name and go to step 6 of this algorithm.

      REMOVE: For example, if the name is ".htaccess", jump to step 7 and derive the mime type using the [SNIFF] specification.
      ADD: For example, if the name is ".jpg", jump to step 6 and match image/jpeg.

   5. If the first character of the name is a U+002E 'FULL STOP' character, and the file name contains another U+002E 'FULL STOP' character, then let extension be the sequence of characters from the last U+002E 'FULL STOP' (inclusive) to the end of name.

      For example, if the name is ".myhidden.html", then the extension would be ".html".

   6. Attempt to case-insensitively match the value of extension to one of the values in the file extension column in the file identification table. If there is a match, then return the corresponding value from the media type column and terminate this algorithm.

   7. Return the result of processing file through the [SNIFF] specification.
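
The steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: identify_media_type is a hypothetical name, table is any mapping from extension to media type, and sniff is a caller-supplied callable standing in for [SNIFF] processing (it receives the name here, whereas the spec would process the file itself).

```python
def identify_media_type(name, table, sniff):
    """Sketch of steps 3 to 7 of the summary for a given file name."""
    extension = ""  # step 3: start with an empty string
    if not name.startswith("."):
        if "." in name:
            # Step 3: everything from the last full stop (inclusive).
            extension = name[name.rindex("."):]
    elif "." not in name[1:]:
        # Step 4: leading full stop and no other one, e.g. ".jpg".
        extension = name
    else:
        # Step 5: leading full stop plus another one, e.g. ".myhidden.html".
        extension = name[name.rindex("."):]

    # Step 6: case-insensitive match (approximated by lowercasing).
    wanted = extension.lower()
    for known, media_type in table.items():
        if known.lower() == wanted:
            return media_type

    # Step 7: no match, hand over to the sniffing stand-in.
    return sniff(name)


table = {".html": "text/html", ".jpg": "image/jpeg"}  # assumed entries
for example in ("cat.html", ".jpg", ".myhidden.html", "noextension"):
    print(example, identify_media_type(example, table, lambda n: "<sniffed>"))
```

Written this way, steps 3, 4 and 5 all produce the same value whenever the name contains a full stop, namely the substring from the last full stop to the end of the name.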

Thanks,
Marcin

Marcin Hanclik
ACCESS Systems Germany GmbH
Tel: +49-208-8290-6452  |  Fax: +49-208-8290-6465
Mobile: +49-163-8290-646
E-Mail: marcin.hanclik@access-company.com

-----Original Message-----
From: marcosscaceres@gmail.com [mailto:marcosscaceres@gmail.com] On Behalf Of Marcos Caceres
Sent: Monday, October 12, 2009 10:36 PM
To: Marcin Hanclik
Cc: public-webapps
Subject: Re: [widgets] Potential bug in Rule for Identifying the Media Type of a File

>
>>>2. If file has a file-extension, attempt to match the file-extension
>>>to one in the file extensions column in the file identification table.
>>>If there is a match, then return the media type value. (returns
>>>"image/jpeg")
> I think file-extension would not be matched, but only base-name.
>
> I think the grammar is not ambiguous with regard to which rules would be matched.
> The problem is that at present in case of .jpg, there would be no file extension.
> A greedy parser would only match base-name and leave file-extension empty, since it is optional.
> So we need to modify the grammar to clearly specify what the extension is.
> With the current grammar, there is also a problem that "." is also allowed in the file-extension as part of the allowed-char.
> Therefore any parser may be confused which dot is the "." from the file-extension rule (I am not sure whether a parser can be developed at all).
> And thus, file-extension has problems. I assume that file extensions do not have dots, dot is to be the delimiter.
>
> What about modifying the ABNF to:
>
> file-name                 = file-name-with-extension / file-name-no-extension
>
> file-name-with-extension  = base-name file-extension
>
> base-name                 = *allowed-char
>
> file-extension            = "." 1*allowed-char-no-dot
>
> allowed-char-no-dot       = safe-char-no-dot / utf8-char
>
> safe-char-no-dot          = ALPHA / DIGIT / SP / "$" / "%"
>                           / "'" / "-" / "_" / "@"
>                           / "~" / "(" / ")" / "&" / "+"
>                           / "," / "=" / "[" / "]"
>
> file-name-no-extension    = base-name-no-ext
>
> base-name-no-ext          = 1*allowed-char-no-dot
>
> This would make the base-name optional.
> .jpg is a valid file name, specifically on Linux platforms.
> Then, .jpg would have (only) a file extension and probably the prose of P&C would not need to be changed.
>

As part of this discussion I spent some time fine-tuning the ABNF. I
merged in all the external refs and pumped out a few thousand test
cases for analysis using abnfgen [1]. Works great on Mac OS X. I also
updated the spec to cover the following use cases [3]:

1. "noextension" > send to [SNIFF] spec.
2. "some.ext" > try to recognize extension. If fail, send to [SNIFF] spec.
3. ".something" > send to [SNIFF] spec.
4. ".something.ext" > try to recognize extension. If fail, send to [SNIFF] spec.

New ABNF:

Zip-rel-path   = [locale-folder] [*folder-name] file-name/
                        [locale-folder] 1*folder-name
locale-folder  = %x6C %x6F %x63 %x61 %x6C %x65 %x73
                        "/" language-range "/"
folder-name    = file-name "/"
file-name      = base-name [ file-extension ]
base-name      = 1*allowed-char
file-extension = "." 1*allowed-char
allowed-char   = safe-char / zip-UTF8-char
zip-UTF8-char  = UTF8-2 / UTF8-3 / UTF8-4
safe-char      = ALPHA / DIGIT / SP / "$" / "%"
                                        / "'" / "-" / "_" / "@"
                                        / "~" / "(" / ")" / "&" / "+"
                                        / "," / "=" / "[" / "]" / "."
UTF8-2         = %xC2-DF UTF8-tail
UTF8-3         = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
                 %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
UTF8-4         = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
                 %xF4 %x80-8F 2( UTF8-tail )
UTF8-tail      = %x80-BF
language-range = (1*8low-alpha / "*") *("-" (1*8alphanum / "*"))
alphanum       = low-alpha  / DIGIT
low-alpha      = %x61-7A

[1] http://www.quut.com/abnfgen/

(using abnfgen path.abnf | xargs  mkdir -p )

[SNIFF]
http://tools.ietf.org/html/draft-abarth-mime-sniff-03


[3]
http://dev.w3.org/2006/waf/widgets/Overview_TSE.html#default-icons-table

--
Marcos Caceres
http://datadriven.com.au


Received on Friday, 16 October 2009 10:07:58 UTC