Re: Content-Disposition next steps from Maciej Stachowiak on 2010-12-13 (ietf-http-wg@w3.org from October to December 2010)

From: Maciej Stachowiak <mjs@apple.com>
Date: Mon, 13 Dec 2010 01:06:13 -0800
To: Adam Barth <ietf@adambarth.com>
Cc: Mark Nottingham <mnot@mnot.net>, HTTP Working Group <ietf-http-wg@w3.org>
Message-id: <8E2A2EFA-631D-474D-BC56-9524FAA8730B@apple.com>
Here are some comments from my colleague Alexey Proskuryakov on your proposal. I know these may have been outpaced by the considerable discussion since that point, but they still seem like they could be useful.

> I only know about file name decoding - all parsing is of course in CFNetwork, and most logic is in Launch Services, I think.
> 
> Adam's proposal is a step forward in that it acknowledges the need to process raw non-ASCII bytes in filename, which is the only encoding style that matters. He also describes the proper algorithm, acknowledging that Chrome doesn't fully implement it. Unsurprisingly, that part was met with resistance from the "we always told you it was ISO-8859-1" crowd.
> 
> I agree that RFC2047 style encoding shouldn't be supported, and I'm ambivalent about RFC5987. RFC2231/5987 is a step in the wrong direction (opaque encoding for something that doesn't need it), but given that IETF won't cease pushing it, we might as well implement it and be more compatible with Firefox, if not the Web.
> 
> - WBR, Alexey Proskuryakov




>> Do the details of filename-decoding seem reasonable? They don't seem to quite match what we do, in particular I don't think they have Latin1 as the final fallback.
>> 
>>  - Maciej

> Yes, besides the lack of latin-1 fallback, the described sequence matches ours.
> 
>     // Always try UTF-8. If that fails, try frame encoding (if any) and then the default.
>     // For a newly opened frame with an empty URL, encoding() should not be used, because this method asks the decoder, which uses ISO-8859-1.
>     Settings* settings = m_frame->settings();
>     request.setResponseContentDispositionEncodingFallbackArray("UTF-8", writer()->deprecatedFrameEncoding(), settings ? settings->defaultTextEncodingName() : String());
> 
> I guess I should have mentioned the implied latin-1 fallback in the comment...

> 
> - WBR, Alexey Proskuryakov


In case anyone is wondering, the reason we use Latin1 as the final fallback is that it never fails to decode, so it's guaranteed to give something even if all our other attempts have failed.
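
For illustration, that fallback chain could be sketched roughly as follows. This is Python rather than the actual WebKit or CFNetwork code, and the function name and its parameters are made up for the example:

```python
def decode_filename_bytes(raw, frame_encoding=None, default_encoding=None):
    """Try UTF-8, then the frame encoding, then the default, then Latin-1."""
    for name in ("utf-8", frame_encoding, default_encoding):
        if not name:
            continue  # no frame or default encoding available
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            continue  # bytes invalid in this encoding, or unknown encoding name
    # Latin-1 maps every byte value to a character, so this cannot fail.
    return raw.decode("latin-1")
```

Because every byte value is a valid Latin-1 character, the last line always returns a string, although not necessarily the one the sender intended.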

Regards,
Maciej


On Dec 1, 2010, at 11:50 AM, Adam Barth wrote:

> On Wed, Dec 1, 2010 at 3:12 AM, Mark Nottingham <mnot@mnot.net> wrote:
>> Adam, do you have a proposal?
> 
> Yeah.  Please find my proposal below.  It's certainly not beautiful,
> and it likely needs more polish, but it should be a starting point.
> 
> I tried to be as "grammatical" as I could, but couldn't quite figure
> out how to avoid all the algorithmic aspects.  The proposal is based on
> what Chrome does, but cleaned up slightly.  There's some sadness I
> couldn't quite figure out how to avoid, but I'm certainly open to
> talking about it more.
> 
> The rules for determining the disposition-type are particularly goofy.
> I wanted to do more homework to figure out if we can make those more
> aesthetic, but I ran out of time.
> 
> One of the ground rules was that my proposal should only differ from
> the current draft in error-handling cases.  I believe that's the case,
> but I'm not 100% sure.  Please let me know if I've screwed that up.
> 
> Adam
> 
> 
> == Extracting Parameter Values From Header Fields ==
> 
> To extract the value for a given parameter-name from an unparsed-string, parse
> the unparsed-string using the following grammar:
> 
>  unparsed-string = *CHAR name *LWS "=" value [ ";" *CHAR ]
>  value           = <CHAR, except ";">
> 
> where the name production is a grammatical production that is a case-insensitive
> match for the given parameter-name.  If the unparsed-string can be parsed by
> the grammar in multiple ways, choose the one in which name appears as close to
> the beginning of the string as possible.  If the unparsed-string cannot be
> parsed by the grammar above, return the empty string.
> 
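One possible reading of the extraction rule above, sketched in Python. The function name is hypothetical, LWS is simplified to space, tab, CR and LF, and value is assumed to mean the whole run of characters up to the next ";":

```python
import re

def extract_parameter(unparsed: str, parameter_name: str) -> str:
    # name *LWS "=" value, matched case-insensitively.
    pattern = re.compile(
        re.escape(parameter_name) + r"[ \t\r\n]*=([^;]*)",
        re.IGNORECASE,
    )
    # search() returns the leftmost match, i.e. the parse in which
    # name appears as close to the beginning of the string as possible.
    match = pattern.search(unparsed)
    return match.group(1) if match else ""
```

Note that, read literally, the grammar allows arbitrary characters before name, so in this sketch a request for "name" also matches the tail of "filename".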
> 
> == Decoding the File Name ==
> 
> To filename-decode an encoded-string, parse the encoded-string using the
> following grammar:
> 
>  encoded-string = word *( 1*delimiter word )
>  delimiter      = LWS
>  word           = <CHAR, except delimiter>
> 
> Consider each grammatical element (either a delimiter or a word) in the order
> they appear in the encoded-string:
> 
>  1) If the grammatical element is a delimiter, process the element as follows:
> 
>       a) If the previous grammatical element was an RFC2047-value, ignore this
>          grammatical element.
> 
>       b) Otherwise, emit a SP character.
> 
>  2) If the grammatical element is a word, process the element as follows:
> 
>       a) If the word contains non-ASCII characters, process the element as
>          follows:
> 
>            i)  If the word is a well-formed UTF-8 string, emit the word
>                (decoded as UTF-8) and proceed to the next grammatical element.
> 
>            ii) Otherwise, *sadness*.  Apparently what we're supposed to do
>                here is to use the "referrer" charset, if we have one.
>                Otherwise, we fall back to the OS codepage.
> 
>       b) If the word is an RFC2047-value, emit the RFC2047 decoding of the
>          word and proceed to the next grammatical element.
> 
>       c) Let the url-unescaped-word be the word %-unescaped.
> 
>       d) Emit the url-unescaped-word (decoded as UTF-8) and proceed to the
>          next grammatical element.  (There's actually more sadness here if
>          the url-unescaped-word isn't valid UTF-8.)
> 
> The emitted characters are the decoded file name.
> 
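A rough Python sketch of the filename-decoding steps above, under several assumptions that are not in the proposal: the input is raw bytes, a word is a maximal run of non-whitespace, a single fallback_charset parameter stands in for the "referrer" charset and OS codepage, and invalid UTF-8 after %-unescaping is replaced with U+FFFD. All names are hypothetical.

```python
import re
from email.header import decode_header
from urllib.parse import unquote_to_bytes

RFC2047_WORD = re.compile(rb"=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=")

def filename_decode(encoded: bytes, fallback_charset: str = "latin-1") -> str:
    out = []
    previous_was_rfc2047 = False
    # Splitting on a capturing group alternates word, delimiter, word, ...
    for index, element in enumerate(re.split(rb"([ \t\r\n]+)", encoded)):
        if index % 2 == 1:
            # Step 1: delimiter. Dropped after an RFC2047-value, else one SP.
            if not previous_was_rfc2047:
                out.append(" ")
            continue
        if not element:
            continue  # leading or trailing whitespace yields an empty word
        previous_was_rfc2047 = False
        if any(byte >= 0x80 for byte in element):
            # Step 2a: raw non-ASCII bytes. UTF-8 first, then the fallback.
            try:
                out.append(element.decode("utf-8"))
            except UnicodeDecodeError:
                out.append(element.decode(fallback_charset, errors="replace"))
        elif RFC2047_WORD.fullmatch(element):
            # Step 2b: RFC 2047 encoded-word.
            for data, charset in decode_header(element.decode("ascii")):
                if isinstance(data, bytes):
                    data = data.decode(charset or "ascii", errors="replace")
                out.append(data)
            previous_was_rfc2047 = True
        else:
            # Steps 2c and 2d: %-unescape, then decode as UTF-8.
            out.append(unquote_to_bytes(element).decode("utf-8", errors="replace"))
    return "".join(out)
```

Following step 1a literally, whitespace after an RFC2047-value is dropped even when the next word is plain text, which this sketch reproduces.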
> 
> == Determining the File Name ==
> 
> To determine the file name indicated by a Content-Disposition header field, use
> the following algorithm:
> 
>  1) Let filename-star be the value extracted from the Content-Disposition
>     header field for the "filename*" parameter.
> 
>  2) If filename-star parses as an RFC5987-value, return the RFC5987-value of
>     filename-star and abort these steps.
> 
>  3) Let filename be the value extracted from the Content-Disposition header
>     field for the "filename" parameter.
> 
>  4) If filename is empty, instead let filename be the value extracted from the
>     Content-Disposition header field for the "name" parameter.
> 
>  5) If filename is empty, return the empty string and abort these steps.
> 
>  6) Return the filename-decoding of filename.
> 
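The precedence order above (filename*, then filename, then name) might look like this in Python. This is only a sketch: extract and filename_decode are passed in as callables standing for the two procedures defined earlier, the RFC 5987 parser accepts only the UTF-8 and ISO-8859-1 charsets, and stripping surrounding whitespace from the extracted value is an added assumption. All names are hypothetical.

```python
import re
from urllib.parse import unquote_to_bytes

# charset "'" [ language ] "'" value-chars, limited to two charsets here.
RFC5987_VALUE = re.compile(r"(utf-8|iso-8859-1)'([a-z0-9-]*)'(\S*)", re.IGNORECASE)

def parse_rfc5987(value):
    """Return the decoded value, or None if value is not an RFC5987-value."""
    match = RFC5987_VALUE.fullmatch(value.strip())
    if match is None:
        return None
    charset, _language, encoded = match.groups()
    try:
        return unquote_to_bytes(encoded).decode(charset)
    except UnicodeDecodeError:
        return None

def determine_filename(header, extract, filename_decode):
    # Steps 1 and 2: a well-formed filename* wins outright.
    star = parse_rfc5987(extract(header, "filename*"))
    if star is not None:
        return star
    # Steps 3 and 4: filename, falling back to name when filename is empty.
    filename = extract(header, "filename") or extract(header, "name")
    # Step 5: nothing usable in the header.
    if not filename:
        return ""
    # Step 6: apply the filename-decoding procedure.
    return filename_decode(filename)
```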
> 
> == Determining the Disposition ==
> 
> To determine the disposition-type, parse the Content-Disposition header field
> using the following grammar:
> 
>  unparsed-string  = *LWS nominal-type *CHAR
>  nominal-type = "inline" / "filename" / "name" / ";"
> 
> If the Content-Disposition header field parser fails to parse, then the
> disposition type is "attachment".  Otherwise, the disposition-type is "inline".
> 
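A minimal Python sketch of the disposition rule above, assuming the quoted literals match case-insensitively (as ABNF string literals normally do) and simplifying LWS to space, tab, CR and LF. The function name is hypothetical.

```python
import re

# *LWS nominal-type *CHAR: only the start of the header value matters.
NOMINAL_TYPE = re.compile(r"[ \t\r\n]*(inline|filename|name|;)", re.IGNORECASE)

def disposition_type(header_value: str) -> str:
    # A successful parse means "inline"; anything else is "attachment".
    return "inline" if NOMINAL_TYPE.match(header_value) else "attachment"
```

Under this rule a header value that begins with a bare parameter such as "filename=a.txt" is treated as inline, while any unrecognized leading token, and the empty string, yields "attachment".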
> 
> == Processing the Content-Disposition Header Field ==
> 
> To process the Content-Disposition header field, use the following algorithm:
> 
>  1) Determine the disposition-type.
> 
>  2) If the disposition-type is "inline", then ...
> 
>  3) If the disposition-type is "attachment", then let filename be the file
>     name indicated by the header field.  ...
>
Received on Monday, 13 December 2010 09:06:56 UTC