RE: [widgets] Potential bug in Rule for Identifying the Media Type of a File from Marcin Hanclik on 2009-10-22 (public-webapps@w3.org from October to December 2009)

From: Marcin Hanclik <Marcin.Hanclik@access-company.com>
Date: Thu, 22 Oct 2009 16:13:27 +0200
To: Marcos Caceres <marcosc@opera.com>
CC: public-webapps <public-webapps@w3.org>
Message-ID: <FAA1D89C5BAF1142A74AF116630A9F2C2890D48C7F@OBEEX01.obe.access-company.com>

Hi Marcos, All,

>>If any character in the extension is outside the U+0041-U+005A range
>>and the U+0061-U+007A range, then go to step 7 in this algorithm.
Unfortunately, I disagree with that.
Motivation:
a) Only ASCII characters are listed.
b) No digits are listed. What about file extensions that include digits, e.g. .p12 (PKCS#12 certificate)?
c) Internationalization is currently a key topic in many circles, and I do not understand why we should restrict file extensions in the 21st century.
d) There are proprietary widget specifications, and it seems none of them restricts file extensions.
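
To illustrate point b), here is a minimal Python sketch of the range check as quoted above. The function name is hypothetical, and I assume the extension is passed without its leading dot; this is only an illustration, not spec text.

```python
def has_char_outside_letter_ranges(extension: str) -> bool:
    """Return True if any character lies outside both
    U+0041-U+005A (A-Z) and U+0061-U+007A (a-z)."""
    return any(
        not ("\u0041" <= ch <= "\u005A" or "\u0061" <= ch <= "\u007A")
        for ch in extension
    )


# "p12" contains digits, so the quoted rule would send it to step 7
# and the extension would never be looked up in the table.
print(has_char_outside_letter_ranges("jpg"))
print(has_char_outside_letter_ranges("p12"))
```

Under the quoted rule, any extension with a digit or a non-ASCII letter is rejected before the table lookup takes place.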

Proposed actions:
Drop the ranges and limits.
Possibly also contact the I18N group and ask for their opinion.

>>That is not possible because trying to do Unicode case comparisons is
>>a nightmare (or so I'm told).
I think we should distinguish between possibility and difficulty.
Whole file names already have to be compared (as per P&C) in many cases, yet suddenly file extensions cannot be compared.
E.g.
"A default start file is a reserved start file at the root of the widget package or at the root of a locale folder whose file name case-sensitively and exactly matches a file name given in the file name column of the default start files table, and whose media type matches the media type given in the media type column of the table."
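
As an illustration of "possible, though more work", here is a minimal sketch in Python, whose built-in str.casefold() applies Unicode case folding. The function names are hypothetical, and nothing here is proposed spec text; a real implementation would probably also need to consider Unicode normalization, which this sketch ignores.

```python
def ascii_lower(s: str) -> str:
    """Lowercase only U+0041-U+005A; leave every other character untouched."""
    return "".join(chr(ord(ch) + 0x20) if "A" <= ch <= "Z" else ch for ch in s)


def match_ascii_case_insensitive(a: str, b: str) -> bool:
    """Case-insensitive comparison restricted to the ASCII letters."""
    return ascii_lower(a) == ascii_lower(b)


def match_unicode_case_insensitive(a: str, b: str) -> bool:
    """Case-insensitive comparison using Unicode case folding."""
    return a.casefold() == b.casefold()


# Both approaches agree for ASCII extensions ...
print(match_ascii_case_insensitive(".JPG", ".jpg"))
print(match_unicode_case_insensitive(".JPG", ".jpg"))
# ... but only the Unicode variant matches non-ASCII letters such as U+017B/U+017C.
print(match_ascii_case_insensitive(".\u017bIP", ".\u017cip"))
print(match_unicode_case_insensitive(".\u017bIP", ".\u017cip"))
```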

>>That is correct. This is the behavior on *nix systems (including Mac OS X).
>>This is not consistent with the behavior of the operating systems I
>>have tested.
I disagree.
Could you please publish your tests?
In general, I think there is no standard for the term "file extension". It seems that P&C is actually standardizing it.
On *nix/Linux systems the concept does not seem to exist as such; it can only be handled somewhat artificially by individual applications (the shell etc., see below).
Here is my test (executed on Ubuntu and Debian):
host:~$ mkdir test
host:~$ touch test/.jpg
host:~$ touch test/img.jpg
host:~$ touch test/.gif
host:~$ touch test/img.gif
host:~$ ls -laX test/
total 8
drwxr-xr-x 2 user user 4096 2009-10-22 15:33 .
drwxr-xr-x 5 user user 4096 2009-10-22 15:33 ..
-rw-r--r-- 1 user user    0 2009-10-22 15:33 .gif
-rw-r--r-- 1 user user    0 2009-10-22 15:33 img.gif
-rw-r--r-- 1 user user    0 2009-10-22 15:33 img.jpg
-rw-r--r-- 1 user user    0 2009-10-22 15:33 .jpg
// It seems that the shell is confused, doesn't it?
host:~$ cd test/
host:~/test$ ls -laX
total 8
drwxr-xr-x 2 user user 4096 2009-10-22 15:33 .
drwxr-xr-x 5 user user 4096 2009-10-22 15:33 ..
-rw-r--r-- 1 user user    0 2009-10-22 15:33 .gif
-rw-r--r-- 1 user user    0 2009-10-22 15:33 img.gif
-rw-r--r-- 1 user user    0 2009-10-22 15:33 img.jpg
-rw-r--r-- 1 user user    0 2009-10-22 15:33 .jpg
// It seems that the shell is confused, doesn't it?
host:~/test$ basename .jpg
.jpg
host:~/test$ cd ..
host:~$ basename test/.jpg
.jpg
host:~$ basename test/.jpg .jpg
.jpg
host:~$ basename test/img.jpg .jpg
img
host:~$ basename test/img.jpg
img.jpg
host:~$ basename test/img.jpg pg
img.j
// This test actually shows that the basename utility merely looks for the [SUFFIX] string at the end of the file name. The file extension is ARTIFICIAL!!
host:~$
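
The basename behavior seen in the transcript above can be sketched in a few lines of Python. This is my own approximation for illustration, not the coreutils source: the function name mirrors the utility, trailing slashes are simply stripped, and a path consisting only of "/" is not handled.

```python
import posixpath


def basename(path: str, suffix: str = "") -> str:
    """Approximate `basename NAME [SUFFIX]`: take the last path component,
    then strip SUFFIX if the name ends with it and is not identical to it."""
    name = posixpath.basename(path.rstrip("/"))
    if suffix and name != suffix and name.endswith(suffix):
        name = name[: -len(suffix)]
    return name


# Same calls as in the transcript; the suffix is plain string matching,
# with no notion of a "file extension".
print(basename("test/.jpg", ".jpg"))
print(basename("test/img.jpg", ".jpg"))
print(basename("test/img.jpg", "pg"))
```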

Further comments:
[1] gives the following guidelines for media type registration:
"Various sorts of optional information SHOULD be included in the
specification of a media type if it is available:
...
   o  File name extension(s) commonly used on one or more platforms to
      indicate that some file contains a given media type.

   o  Mac OS File Type code(s) (4 octets) used to label files containing
      a given media type."
The term "file (name) extension" is not defined. The Mac OS File Type code does not seem to be equivalent to a file extension (which stems more from the Windows world).
Historically, Windows worked with 3 characters and Mac with 4 characters.

Therefore, in P&C we should assume that a file extension is simply any sequence of characters that occurs after the last dot (U+002E FULL STOP), including that dot.
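
A minimal Python sketch of this definition follows. The function name is hypothetical, and returning an empty string for names without a dot is my assumption. For contrast, Python's own posixpath.splitext treats a leading dot as part of the base name, which shows once more that applications define "extension" differently.

```python
import posixpath


def extension_after_last_dot(file_name: str) -> str:
    """Return everything from the last U+002E FULL STOP onward (dot included),
    or an empty string if the name contains no dot (assumed behavior)."""
    dot = file_name.rfind(".")
    return file_name[dot:] if dot != -1 else ""


for name in (".jpg", "img.jpg", "archive.tar.gz", "README"):
    # Left: the definition suggested above; right: Python's splitext convention.
    print(repr(name), repr(extension_after_last_dot(name)), repr(posixpath.splitext(name)[1]))
```

With this definition, ".jpg" has the extension ".jpg", whereas posixpath.splitext reports no extension for it.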

Thanks,
Marcin

[1] http://tools.ietf.org/html/rfc4288#section-4.11


Marcin Hanclik
ACCESS Systems Germany GmbH
Tel: +49-208-8290-6452  |  Fax: +49-208-8290-6465
Mobile: +49-163-8290-646
E-Mail: marcin.hanclik@access-company.com

-----Original Message-----
From: marcosscaceres@gmail.com [mailto:marcosscaceres@gmail.com] On Behalf Of Marcos Caceres
Sent: Thursday, October 22, 2009 1:56 PM
To: Marcin Hanclik
Cc: public-webapps
Subject: Re: [widgets] Potential bug in Rule for Identifying the Media Type of a File

On Fri, Oct 16, 2009 at 12:06 PM, Marcin Hanclik
<Marcin.Hanclik@access-company.com> wrote:
> Hi Marcos,
>
> These are my remarks as discussed yesterday on the call.
>
> Comment a)
>
> 6.A.If all characters in the extension are outside the two ranges, then go to step 5 in this algorithm.
>
> Should be
>
> 6.A.If any of the characters in the extension is outside the two ranges, then go to step 5 in this algorithm.
>
> But this is also problematic since it infinitely loops the algorithm in this given case.
> So it should be:
>
> 6.A.If any of the characters in the extension is outside the two ranges, then go to step 7 in this algorithm.

I changed it to:

If any character in the extension is outside the U+0041-U+005A range
and the U+0061-U+007A range, then go to step 7 in this algorithm.


> Another comment to 6.A:
> It seems that the whole algorithm assumes that the File Identification Table is constant.
> E.g. if any vendor would like to add some extension with a character outside of the given ranges (or we in W3C would like to do this in the future), then we would need to rewrite the algorithm.
>

> So what about this (we do not need the ranges IMHO):
> 6.  Attempt to case-insensitively match the value of extension to one of the values in the file extension column in the file identification table. If there is a match, then return the corresponding value from the media type column and terminate this algorithm.
>

That is not possible because trying to do Unicode case comparisons is
a nightmare (or so I'm told).  This is why we restrict to just
checking for ASCII. I find it highly unlikely that we will see
standardized file extensions outside the ASCII range - none exists to
date and there is no evidence to suggest that they will exist in the
future.

> And remove 6.A and 6.B as they were.
>
> *****************
> Comment b)
>
> 4. If the first character of the name is a U+002E 'FULL STOP' character, and the file name contains no other U+002E 'FULL STOP' character then go to step 7 of this algorithm.
>
> What about ".jpg"?
> Do you assume that this is filename and not file extension?

That is correct. This is the behavior on *nix systems (including Mac OS X).

> What about this:
> 4. If the first character of the name is a U+002E 'FULL STOP' character, and the file name contains no other U+002E 'FULL STOP' character then let extension be name and go to step 6 of this algorithm.
>

This is not consistent with the behavior of the operating systems I
have tested.

> *****************
> Comment c)
>
> Given that the processing model is developed in prose, I think we MUST fix the ambiguity of the grammar anyway.
>
> Thus I suggest the following change from:
>
> file-name      = base-name [ file-extension ]
> base-name      = 1*allowed-char
> file-extension = "." 1*allowed-char
>
> to:
>
> file-name      = 1*allowed-char
>
> (i.e. remove base-name and file-extension).
>
> The removal of ambiguity is motivated by the dependency of the WURI/WUS spec on P&C in this particular detail, so it is better to keep it right, I think.
> File extension does not play any role in WURI/WUS anyway.
> I think either the above change or the one in my mail below has to be implemented in the spec.

Ok, removed it.


--
Marcos Caceres
http://datadriven.com.au


Received on Thursday, 22 October 2009 14:14:30 UTC