RE: [widgets] Potential bug in Rule for Identifying the Media Type of a File

Hi Marcos,

>>To be clear: All we want to do is check if the file extension of a
>>file case-insensitively matches one of the extensions in the File
>>Identification Table. If you can't match it, then the MIME type gets
>>resolved with SNIFF.
Ok, I understand the intention of this section.

The ranges are an implementation detail (an optimization for efficiency in some implementations, not a MUST for all).
So in general the comments about Unicode comparison and its difficulty are irrelevant to the algorithm.
The same applies to the ranges themselves.

The only really disputable point is then whether ".jpg" should be sniffed (your proposal) or interpreted as a pure file extension (my proposal).
To support the latter interpretation, I showed in my earlier mail that on *nix/*inux systems ".jpg" is a file extension.

The suggestion to remove the ranges aims at making the spec easier to extend. E.g. if we wanted to add ".p12" or a Unicode extension to the File Identification Table, we should only have to add it there and not change the processing algorithm.
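
A rough sketch of such a table-driven lookup, in Python and purely illustrative. The table entries are a made-up subset rather than a quote from the spec, `sniff` is a hypothetical placeholder for the SNIFF step, and `application/x-pkcs12` is an assumed value for ".p12". The sketch follows the "pure file extension" reading, so a file named ".jpg" is matched against the table.

```python
# Illustrative subset only; these entries are assumptions, not the spec's table.
FILE_IDENTIFICATION_TABLE = {
    ".gif": "image/gif",
    ".jpg": "image/jpeg",
    ".png": "image/png",
    ".html": "text/html",
}


def sniff(file_name):
    """Hypothetical placeholder for SNIFF (content sniffing is out of scope here)."""
    return None


def identify_media_type(file_name, table=FILE_IDENTIFICATION_TABLE):
    """Match the extension against the table; fall back to SNIFF if there is no match."""
    dot = file_name.rfind(".")
    if dot == -1:
        return sniff(file_name)  # no dot at all, so nothing to look up
    # str.lower() is a simplification of "case-insensitive"; no character
    # ranges are checked, an unknown extension simply misses the table.
    extension = file_name[dot:].lower()
    media_type = table.get(extension)
    return media_type if media_type is not None else sniff(file_name)


# Extending the table is the only change needed to support a new extension.
extended = {**FILE_IDENTIFICATION_TABLE, ".p12": "application/x-pkcs12"}
print(identify_media_type("img.JPG"))             # matched via the table
print(identify_media_type("cert.p12"))            # not in the table, goes to sniff()
print(identify_media_type("cert.p12", extended))  # matched after the table edit
```

Adding ".p12" touches only the table; `identify_media_type` itself stays unchanged.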

Thanks,
Marcin


Marcin Hanclik
ACCESS Systems Germany GmbH
Tel: +49-208-8290-6452  |  Fax: +49-208-8290-6465
Mobile: +49-163-8290-646
E-Mail: marcin.hanclik@access-company.com

-----Original Message-----
From: marcosscaceres@gmail.com [mailto:marcosscaceres@gmail.com] On Behalf Of Marcos Caceres
Sent: Thursday, October 22, 2009 5:25 PM
To: Marcin Hanclik
Cc: public-webapps
Subject: Re: [widgets] Potential bug in Rule for Identifying the Media Type of a File

2009/10/22 Marcin Hanclik <Marcin.Hanclik@access-company.com>:
> Hi Marcos, All,
>
>>>If any character in the extension is outside the U+0041-U+005A range
>>>and the U+0061-U+007A range, then go to step 7 in this algorithm.
> Unfortunately I disagree with that.
> Motivation:
> a) only ASCII characters are listed
> b) no digits are listed. What about file extensions that include digits, e.g. .p12 (PKCS#12 certificate)?

I don't see that file format in the "File Identification Table".

> c) internationalization is currently a key topic in many circles, and I do not understand why we should restrict file extensions in the 21st century.
>

Because we are trying to find stuff in the "File Identification Table"
(i.e., the algorithm is limited just to those file names). We are not
writing a general algorithm for extension to MIME mapping! That's what
SNIFF does.

> d) there exist proprietary widget specifications and it seems none of them restricts the file extensions.

I don't know what you mean here. We don't restrict anything. We have
the most common types defined, and the ones we don't define are
handled by SNIFF. I don't see the problem.

> Proposed actions:
> Drop ranges and limits.
> Eventually also contact I18N group and ask their opinion.

I think you've misunderstood the intention of the specification wrt
this section.

>>>That is not possible because trying to do Unicode case comparisons is
>>>a nightmare (or so I'm told).
> I think we should distinguish between possibility and difficulty.

This is totally irrelevant for this algorithm.

> The whole filenames are to be compared (as per P&C) in many cases, and suddenly file extensions cannot be compared.
>

This is just for efficiency.

> E.g.
> "A default start file is a reserved start file at the root of the widget package or at the root of a locale folder whose file name case-sensitively and exactly matches a file name given in the file name column of the default start files table, and whose media type matches the media type given in the media type column of the table."
>
>>>That is correct. This is the behavior on *nix systems (including Mac OS X).
>>>This is not consistent with the behavior of the operating systems I
>>>have tested.
> I disagree.
> Could you please publish your tests?

I created the files in the Finder on Mac OS X (Snow Leopard). I prefer
not to send a screenshot to the mailing list.

> In general I think that there is no standard for the term "file extension". It seems that P&C actually standardizes it.
> On *nix/*inux systems it does not seem to exist as such; it can only be handled somewhat artificially by an application (shell etc., see below).
> Here is my test (executed on Ubuntu and Debian):
> host:~$ mkdir test
> host:~$ touch test/.jpg
> host:~$ touch test/img.jpg
> host:~$ touch test/.gif
> host:~$ touch test/img.gif
> host:~$ ls -laX test/
> total 8
> drwxr-xr-x 2 user user 4096 2009-10-22 15:33 .
> drwxr-xr-x 5 user user 4096 2009-10-22 15:33 ..
> -rw-r--r-- 1 user user    0 2009-10-22 15:33 .gif
> -rw-r--r-- 1 user user    0 2009-10-22 15:33 img.gif
> -rw-r--r-- 1 user user    0 2009-10-22 15:33 img.jpg
> -rw-r--r-- 1 user user    0 2009-10-22 15:33 .jpg
> //It seems that shell is confused, or?
> host:~$ cd test/
> host:~/test$ ls -laX
> total 8
> drwxr-xr-x 2 user user 4096 2009-10-22 15:33 .
> drwxr-xr-x 5 user user 4096 2009-10-22 15:33 ..
> -rw-r--r-- 1 user user    0 2009-10-22 15:33 .gif
> -rw-r--r-- 1 user user    0 2009-10-22 15:33 img.gif
> -rw-r--r-- 1 user user    0 2009-10-22 15:33 img.jpg
> -rw-r--r-- 1 user user    0 2009-10-22 15:33 .jpg
> //It seems that shell is confused, or?
> host:~/test$ basename .jpg
> .jpg
> host:~/test$ cd ..
> host:~$ basename test/.jpg
> .jpg
> host:~$ basename test/.jpg .jpg
> .jpg
> host:~$ basename test/img.jpg .jpg
> img
> host:~$ basename test/img.jpg
> img.jpg
> host:~$ basename test/img.jpg pg
> img.j
> //this test actually proves that the basename app is looking for the [SUFFIX] string in the file name. File extension is ARTIFICIAL!!
>

We know this already. Basename does not exist in the spec anymore; you
made me take it out. That's why we have the prose.

> host:~$
>
> Further comments:
> [1] gives the following guidelines for media type registration:
> "Various sorts of optional information SHOULD be included in the
> specification of a media type if it is available:
> ...
>   o  File name extension(s) commonly used on one or more platforms to
>      indicate that some file contains a given media type.
>
>   o  Mac OS File Type code(s) (4 octets) used to label files containing
>      a given media type."
> The term file (name) extension is not defined. The Mac OS File Type code does not seem to be equivalent to a file extension (which stems more from the Windows world).
>

Is this even relevant now? Or is this some legacy thing from a previous
version of Mac OS?

> Historically Windows worked with 3 characters and Mac with 4 characters.
>
> Therefore in P&C we should assume that a file extension is any sequence of characters that occurs after the last dot (U+002E FULL STOP), including that dot.
>
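
The definition proposed in the quoted paragraph above can be sketched as follows (Python, illustrative only; `extension_after_last_dot` is a hypothetical helper name, not something from the spec). For contrast, the standard library's `os.path.splitext` treats a leading dot as part of the base name and so reports no extension for ".jpg", which shows that tools do not agree on the term.

```python
import os.path


def extension_after_last_dot(file_name):
    """Return the last U+002E FULL STOP and everything after it, or "" if there is no dot."""
    dot = file_name.rfind(".")
    return file_name[dot:] if dot != -1 else ""


# Compare the proposed definition with os.path.splitext for a few names.
for name in ("img.jpg", ".jpg", "archive.tar.gz", "README"):
    proposed = extension_after_last_dot(name)
    stdlib = os.path.splitext(name)[1]
    print(f"{name!r}: proposed={proposed!r} splitext={stdlib!r}")
```

The two only differ for names that start with a dot and contain no other dot, which is exactly the ".jpg" case under dispute.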

I really don't understand what you are intending to solve or what you
think the spec does here?

To be clear: All we want to do is check if the file extension of a
file case-insensitively matches one of the extensions in the File
Identification Table. If you can't match it, then the MIME type gets
resolved with SNIFF.




--
Marcos Caceres
http://datadriven.com.au



Received on Thursday, 22 October 2009 15:46:30 UTC