Re: Content-Disposition next steps from Julian Reschke on 2010-12-04 (ietf-http-wg@w3.org from October to December 2010)

From: Julian Reschke <julian.reschke@gmx.de>
Date: Sat, 04 Dec 2010 20:59:29 +0100
To: Adam Barth <ietf@adambarth.com>
CC: Bjoern Hoehrmann <derhoermi@gmx.net>, Mark Nottingham <mnot@mnot.net>, HTTP Working Group <ietf-http-wg@w3.org>
Message-ID: <4CFA9DA1.8010007@gmx.de>
On 04.12.2010 19:16, Adam Barth wrote:
> ...
>> The specs are clear on it, and all UAs except FF and Chrome get this right.
>> This seems to be a case where it's clear what's "better" (even if we may
>> disagree on the metrics for "better").
>
> I'm not sure it's so clear-cut.  I'll dig around a bit and see what
> more information I can find.
> ...

What more information do you need?

> ...
>>>> <http://greenbytes.de/tech/tc2231/#dispextbadfn>
>>>>
>>>> failing in Chrome only.
>>>
>>> Oh good.  I'll update the wiki.  With a more elaborate grammar.
>>
>> See also <http://code.google.com/p/chromium/issues/detail?id=65276>.
>
> Thanks for filing these bugs.  That's helpful.  My plan is to change
> Chrome to match the spec once the spec has settled down.
> ...

The applicable spec hasn't really changed since it was published many 
years ago. The Internet Draft we're discussing is just clarifying and 
profiling.

> ...
>> The specification says how to consume *valid* header instances (I know you
>> know that, but I'm repeating it for people who might not have read all of
>> this thread).
>>
>> What's up for discussion is whether we want to talk about handling invalid
>> headers, how to do that, and what kind of conformance comes with that.
>
> The path we're going down is to provide optional instructions for how
> to consume the Content-Disposition header.  We have two competing sets
> of constraints:
>
> 1) These instructions are limited to things that user agents are
> willing to implement.

Not necessarily. I'm not aware of an IETF rule for that. As a matter of 
fact, what "UAs are willing to implement" may change over time.

> 2) These instructions are limited to aligning with the requirements on
> valid header fields.
>
> Where we're having difficulty agreeing is where these two constraints
> are in conflict.  There are three options:
>
> A) Relax one of the constraints.
> B) Change what user agents are willing to implement.
> C) Change the definition of valid header fields to exclude syntax
> where user agents aren't willing to implement the required behavior.

D) Live with the fact that some user agents are doing the wrong thing, 
and document it. Warn about what can happen. Suggest alternatives when 
available.

> ...
>> So, is there any *hope* that we'll see that feedback from the other browser
>> vendors? Maciej? Anne? Eric? Robert?
>
> I suspect asking for their input in the middle of this giant thread
> isn't the most likely way to get feedback.  :)

All of them (except maybe Robert, who was simply the first Firefox person 
who came to mind and who should be on this list) are aware of this discussion.

> ...
>>> Personally, I don't feel that strongly about it.  However, I do feel
>>> strongly about keeping the %-decoding in the UA Appendix.  If you're
>>> fine with having both the warning and the %-decoding in the appendix,
>>> that's a workable solution.  If you feel these are in conflict, then
>>> I'd rather change the warning to an error and keep the %-decoding in
>>> the appendix than remove the %-decoding from the appendix.
>>
>> I think they are in conflict, and wouldn't want to see any recommendation to
>> do %-unescaping. It would make the spec incompatible with the previous spec
>> (a conformance change), and it's only implemented in two out of six UAs I'm
>> testing.
>
> Ok, then it sounds like we should make including a % syntactically
> invalid.  Is there another solution?
> ...

Yes, we can just re-state what has been true given the existing specs. 
It's a bug to do percent-unescaping here, but some UAs do it nevertheless.
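To make the distinction concrete, here is a minimal Python sketch, assuming the RFC 5987 form of ext-value (charset'language'percent-encoded-value). The function names and sample values are made up for illustration and are not taken from any UA. Percent-decoding applies only to the "filename*" form; a plain "filename" value is taken literally.

```python
from urllib.parse import unquote

def decode_ext_value(ext_value: str) -> str:
    """Decode a filename* value of the form charset'language'pct-encoded."""
    charset, _language, encoded = ext_value.split("'", 2)
    # Percent-decoding is part of the ext-value syntax, so it is correct here.
    return unquote(encoded, encoding=charset, errors="strict")

def plain_filename_value(value: str) -> str:
    """Return a plain filename value unchanged: '%' is an ordinary character."""
    return value

print(decode_ext_value("UTF-8''foo-%c3%a4.html"))
print(plain_filename_value("foo-%41.html"))
```

A UA that percent-unescapes the second value would turn it into "foo-A.html", which is the behavior I'm calling a bug.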

> ...
>> By all means please provide the complete text for the introduction of the
>> Appendix. That's essential to understand what the expectations on
>> implementations are.
>>
>> If there aren't any, such as "we just think this is a good idea" I'd propose
>> to have it in a separate document which may or may not be a WG work item.
>
> From my perspective, the introductory text is less important than the
> technical content.  If you'd like to propose introductory text that
> makes you happy, that might be a more productive starting point than
> me guessing wildly.
> ...

OK, here's my proposal:

"The User Agent behavior outlined below is derived from one specific 
implementation, with some aspects that cause invalid behavior for valid 
header fields subsequently removed. We observe that UAs differ in their 
handling of invalid headers, and that there's currently no uniform 
algorithm that could be specified."

> ...
>>> I'm not sure what you mean by "fixed."  It's unclear whether user
>>> agents want to do \-decoding on the file name, especially because \ is
>>> a common directory separator on some operating systems.
>>
>> "Fixing" means "changing things to work as specified".
>
> Alternatively, we could change the specification to match what user
> agents are willing to implement.

Alternatively, we could observe that some UAs get this right, and 
conclude that there's no agreement that breaking the specs is required.

>> So the question here is whether it would break things because there are
>> servers sending unescaped backslashes. As far as I can tell, sending path
>> separators in the filename indicates a bug in the sender, or an attempt to
>> trick the user agent to do something it's not supposed to do.
>>
>> So the "harm" of actually doing the unescaping would be that for a filename
>> that needs to be postprocessed anyway, the problematic character would be
>> filtered in a different way.
>>
>> Starting with
>>
>>   filename="a\bc"
>>
>> the broken implementation sees "a" and "bc" separated by a path separator,
>> and will post-process this to "abc", "a_bc" or "bc" (where _ could be a
>> different replacement character).
>>
>> A correct implementation sees "abc".
>>
>> I don't think there's a problem here.
>
> None of the user agents do \-decoding.  I don't see any value in them starting.

Konqueror does. Opera does. Mozilla has a proposed patch 
(<https://bugzilla.mozilla.org/show_bug.cgi?id=584418>).
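
For the filename="a\bc" example quoted above, the two behaviors can be sketched in Python. This is only an illustration: the function names are hypothetical, and the second function models just one of the possible outcomes ("a_bc"), not any particular UA.

```python
import re

def unquote_per_spec(quoted: str) -> str:
    """Undo quoted-string escaping: a backslash quotes the next character."""
    inner = quoted[1:-1]  # strip the surrounding double quotes
    return re.sub(r"\\(.)", r"\1", inner)

def unquote_without_unescaping(quoted: str) -> str:
    """Hypothetical model of a UA that keeps the backslash and then
    sanitizes it as if it were a path separator."""
    inner = quoted[1:-1]
    return inner.replace("\\", "_").replace("/", "_")

header_value = '"a\\bc"'  # the characters  "a\bc"  as sent on the wire
print(unquote_per_spec(header_value))
print(unquote_without_unescaping(header_value))
```

Either way the backslash never reaches the file system, which is why I don't see harm in doing the unescaping.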

> ...
>>>>> If the Content-Disposition header field is non-empty and fails to parse,
>>>>> then the disposition type is "attachment". Otherwise, the
>>>>> disposition-type
>>>>> is "inline".
>>>>
>>>> Neither "filename" nor "name" are disposition types.
>>>
>>> Indeed.
>>
>> It's confusing and makes reviewing the text harder than it needs to be.
>
> Feel free to propose improvements.  I'm not sure which part is confusing you.

It would be less confusing if the grammar actually contained the legal 
values for "disposition".

> ...
>> Note that fixing the quoted-string handling already has a proposed patch in
>> Mozilla.
>
> Bug number?
> ...

<https://bugzilla.mozilla.org/show_bug.cgi?id=584418>

> ...
>> Writing tests doesn't come for free, and every test I add needs to be
>> maintained and re-run. So I'd prefer to understand whether it's worth the
>> time beforehand.
>
> Well, then I can answer your question.  Yes.

Added as <http://greenbytes.de/tech/tc2231/#attwithfilenamepctandiso>. 
Chrome behaves as described by you, IE does not (the others do not 
percent-unescape anyway).

> ...
>> So can you convince Safari, Opera, Firefox, and Konqueror to adopt this
>> handling as well? Otherwise I don't think we'll make progress. Two UAs do
>> something funny, the other four do not. I don't want the specification to
>> reflect those implementation bugs -- even if they can't be realistically
>> removed from these UAs anytime soon.
>
> I think it's more likely that these user agents will adopt %-decoding
> than that the other user agents will drop %-decoding.  Now, that's a
> matter of opinion and reflects a judgement call.
> ...

Well, let's disagree on that.

The discussion has been going on for a very long time now, and I see no 
change.

> ...
>> Before we invest too much time into this: do you have evidence that using a
>> grammar other than the one already defined is really needed?
>
> Yes.  The one currently in the document doesn't match every sequence
> of input characters.  Therefore, it does not define a complete parser.
>   We desire a complete specification of how to parse the header field.
>
>> Maybe you could list certain invalid header instances you want to handle,
>> and what the result should be?
>
> Content-Disposition: attachment; filename="bar
>
> Result =>  "bar

See <http://greenbytes.de/tech/tc2231/#attbrokenquotedfn2>.

Only Safari does what you say.

> Content-Disposition: attachment; filename=foo"bar;baz"qux
>
> Result =>  foo"bar;baz"qux

See <http://greenbytes.de/tech/tc2231/#attbrokenquotedfn3>.

Only Opera and Safari do what you say.

> Content-Disposition: "foo; filename=bar;baz"; filename=qux
>
> Result =>  qux

See <http://greenbytes.de/tech/tc2231/#attmissingdisposition3>.

None of the UAs does what you say.

> etc

Yes, we can invent all kinds of new examples. What we see is that 
because the UAs all have different parsers, they fail in different ways 
(or, in some cases, pass; thanks, Konqueror and Opera).

There is no interop for this kind of garbage, so I see no incentive to 
define something other than rejecting those values.

>> If we want to keep handling valid header fields properly, we *probably*
>> had better start from the base grammar, and relax certain constructs.
>
> Feel free to propose a modification to the base grammar that matches
> every sequence of bytes and computes the needed segmentation.  It's
> not at all obvious to me how to approach constructing the grammar from
> that starting point.

It does not need to match every sequence. It's completely ok to state what 
it means not to match the grammar.

So let's start with:

   content-disposition = "Content-Disposition" ":"
                          disposition-type *( ";" disposition-parm )

   disposition-type    = "inline" | "attachment" | disp-ext-type
                       ; case-insensitive
   disp-ext-type       = token

   disposition-parm    = filename-parm | disp-ext-parm

   filename-parm       = "filename" "=" value
                       | "filename*" "=" ext-value

   disp-ext-parm       = token "=" value
                       | ext-token "=" ext-value
   ext-token           = <the characters in token, followed by "*">

We could relax requirements on ordering and cardinality of 
disposition-types:

   content-disposition = "Content-Disposition" ":"
                         ( disposition-parm | disposition-type )
                         *( ";" ( disposition-parm | disposition-type ) )

and then relax the grammars for "token" and "quoted-string" to allow 
more characters that are not needed for detecting names and values. I 
think that would reduce the risk of defining something that breaks valid 
headers.
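
As a rough illustration of "stating what it means not to match the grammar", here is a Python sketch of a strict parser for the base grammar that simply rejects anything that doesn't match. It assumes the RFC 2616 definitions of token and quoted-string, does not validate ext-value separately, and picks an arbitrary treatment for repeated parameters. All names are hypothetical.

```python
import re

# RFC 2616 token: any CHAR except CTLs and separators.
TOKEN = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"
# RFC 2616 quoted-string, with backslash as the quoting character.
QUOTED = r'"(?:[^"\\]|\\.)*"'

TYPE = re.compile(rf"\s*({TOKEN})")
PARAM = re.compile(rf"\s*;\s*({TOKEN})\s*=\s*({TOKEN}|{QUOTED})")

def parse_content_disposition(value):
    """Return (disposition_type, params), or None if the value is invalid."""
    m = TYPE.match(value)
    if m is None:
        return None
    disposition_type = m.group(1).lower()
    params = {}
    pos = m.end()
    while pos < len(value):
        m = PARAM.match(value, pos)
        if m is None:
            # Only trailing whitespace is tolerated; anything else is invalid.
            return (disposition_type, params) if not value[pos:].strip() else None
        name, raw = m.group(1).lower(), m.group(2)
        if raw.startswith('"'):
            raw = re.sub(r"\\(.)", r"\1", raw[1:-1])  # undo quoted-pair escaping
        # The grammar does not settle repeated parameters; keeping the
        # first occurrence is an arbitrary choice made for this sketch.
        params.setdefault(name, raw)
        pos = m.end()
    return disposition_type, params

for example in ('attachment; filename="bar',
                'attachment; filename=foo"bar;baz"qux',
                '"foo; filename=bar;baz"; filename=qux',
                'attachment; filename="a\\bc"'):
    print(repr(example), "=>", parse_content_disposition(example))
```

With this approach the three invalid examples from earlier in this mail are all rejected, and the valid one yields the filename "abc". The relaxation steps described above would then loosen TYPE, TOKEN and QUOTED rather than replace the grammar wholesale.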

Best regards, Julian
Received on Saturday, 4 December 2010 20:00:08 UTC