Re: Content-Disposition next steps

On Sat, Dec 4, 2010 at 3:18 AM, Julian Reschke <julian.reschke@gmx.de> wrote:
> On 04.12.2010 00:56, Adam Barth wrote:
>> On Thu, Dec 2, 2010 at 4:26 AM, Julian Reschke <julian.reschke@gmx.de>
>> wrote:
>>>
>>> I added
>>>
>>> <http://greenbytes.de/tech/tc2231/#attmissingdisposition2>
>>>
>>> which fails for FF3/Chrome/Chrome9 (I see shared bugs :-),
>>
>> Hum...  This one sounds a bit tricky.  It's not clear to me which
>> option is better.
>
> The specs are clear on it, and all UAs except FF and Chrome get this right.
> This seems to be a case where it's clear what's "better" (even if we may
> disagree on the metrics for "better").

I'm not sure it's so clear-cut.  I'll dig around a bit and see what
more information I can find.

>>> I'm not totally sure what exactly to test; please elaborate.
>>
>> Content-Disposition: xfilename=foo.txt
>
> I have added a test for
>
>  Content-Disposition: attachment; xfilename=foo.txt
>
> (the one you proposed would have been invalid anyway). This is
>
>  <http://greenbytes.de/tech/tc2231/#attconfusedparam>
>
> and it fails for Chrome only (reported as
> <http://code.google.com/p/chromium/issues/detail?id=65423>).

Great.  Fixed.

>>> <http://greenbytes.de/tech/tc2231/#dispextbadfn>
>>>
>>> failing in Chrome only.
>>
>> Oh good.  I'll update the wiki.  With a more elaborate grammar.
>
> See also <http://code.google.com/p/chromium/issues/detail?id=65276>.

Thanks for filing these bugs.  That's helpful.  My plan is to change
Chrome to match the spec once the spec has settled down.

>> From my perspective, I'd like there to be a specification of how a
>> user agent should consume the Content-Disposition header.  I started
>> with Chrome's behavior because I'm most familiar with it and because
>> there's evidence that at least one implementor is willing to ship that
>> behavior.
>
> The specification says how to consume *valid* header instances (I know you
> know that, but I'm repeating it for people who might not have read all of
> this thread).
>
> What's up for discussion is whether we want to talk about handling invalid
> headers, how to do that, and what kind of conformance comes with that.

The path we're going down is to provide optional instructions for how
to consume the Content-Disposition header.  We have two competing sets
of constraints:

1) These instructions are limited to things that user agents are
willing to implement.
2) These instructions are limited to aligning with the requirements on
valid header fields.

Where we're having difficulty agreeing is where these two constraints
are in conflict.  There are three options:

A) Relax one of the constraints.
B) Change what user agents are willing to implement.
C) Change the definition of valid header fields to exclude syntax
where user agents aren't willing to implement the required behavior.

>> Ideally, we'd get feedback from other user agent implementors about
>> what they'd like the specification to say.  We'd then have an easier
>> time polishing away the more exotic behaviors.  Instead, we're relying
>> on our collective judgement.
>
> As I said before, I'm not very interested in getting more interop for broken
> headers. What concerns me much more is interop for valid headers, and
> delaying the spec again and again doesn't help here.

Indeed.  We have different goals and priorities.

> So, is there any *hope* that we'll see that feedback from the other browser
> vendors? Maciej? Anne? Eric? Robert?

I suspect asking for their input in the middle of this giant thread
isn't the most likely way to get feedback.  :)

>>> Do you want UAs to converge on that behavior?
>>
>> Yes.
>
> Then I'd propose that you add an introduction to the Wiki clearly saying how
> you want it to appear in the spec.
>
> Personally, I don't think it's a useful exercise for this Working Group, as
> the observed behavior for broken headers differs a lot (from proper parsing
> as in Konqueror to naive substring matching in some other UAs :-).
>
> This is different, from, for instance, handling broken HTML (where we
> actually have evidence that UAs need to do this to stay in business).

Again, we have different perspectives on what's useful.

>>> Even those who currently reject invalid header fields?
>>
>> If a UA wants to reject invalid header fields, that sounds fine to me.
>>  What I'd like to avoid is there being N different ways of consuming
>> Content-Disposition, where N is the number of user agent
>> implementations.
>
> There should be one way to handle valid characters. I'm less interested in
> consistent behavior for invalid headers *unless* there's a related security
> risk.

Fortunately we control the definition of what's valid so we can make
this the case by picking option (C) above.

>>> The point is that we already have spec text, which is a warning. Do you
>>> want it to change?
>>
>> Personally, I don't feel that strongly about it.  However, I do feel
>> strongly about keeping the %-decoding in the UA Appendix.  If you're
>> fine with having both the warning and the %-decoding in the appendix,
>> that's a workable solution.  If you feel these are in conflict, then
>> I'd rather change the warning to an error and keep the %-decoding in
>> the appendix than remove the %-decoding from the appendix.
>
> I think they are in conflict, and wouldn't want to see any recommendation to
> do %-unescaping. It would make the spec incompatible with the previous spec
> (a conformance change), and it's only implemented in two out of six UAs I'm
> testing.

Ok, then it sounds like we should make including a % syntactically
invalid.  Is there another solution?

>>> See above, I'm struggling to understand what the proposal actually is.
>>> (such as: placement, introduction, implication on conformance, ...).
>>
>> We've been talking about putting it in the appendix.  I'm not sure
>> whether you need to reference it from the introduction.  It doesn't
>> affect conformance for any conformance class.
>
> By all means please provide the complete text for the introduction of the
> Appendix. That's essential to understand what the expectations on
> implementations are.
>
> If there aren't any, such as "we just think this is a good idea" I'd propose
> to have it in a separate document which may or may not be a WG work item.

From my perspective, the introductory text is less important than the
technical content.  If you'd like to propose introductory text that
makes you happy, that might be a more productive starting point than
me guessing wildly.

>>> I'm less concerned about processing invalid messages, but I'll say again
>>> that there's little interoperability for those messages, so I just don't
>>> see why we care.
>>
>> We care because we want there to be more interoperability in the
>> future.  The goal of writing standards is to improve interoperability.
>
> Yes, but we usually draw a line between things we care about and things we
> don't. We happen to disagree on where to draw that line.

Indeed.

>>>> You write:
>>>>
>>>> fail (saves "oo.html" (what's going on here?, see Chrome Issue 52577))
>>>>
>>>> what's going on is that the "\" is being treated as a directory
>>>> separator and Chrome is giving you the "leaf" name of the path.
>>>
>>> OK, so it fails to do the unescaping on quoted-string. It would be great
>>> if this could be fixed.
>>
>> I'm not sure what you mean by "fixed."  It's unclear whether user
>> agents want to do \-decoding on the file name, especially because \ is
>> a common directory separator on some operating systems.
>
> "Fixing" means "changing things to work as specified".

Alternatively, we could change the specification to match what user
agents are willing to implement.

> So the question here is whether it would break things because there are
> servers sending unescaped backslashes. As far as I can tell, sending path
> separators in the filename indicates a bug in the sender, or an attempt to
> trick the user agent to do something it's not supposed to do.
>
> So the "harm" of actually doing the unescaping would be that for a filename
> that needs to be postprocessed anyway, the problematic character would be
> filtered in a different way.
>
> Starting with
>
>  filename="a\bc"
>
> the broken implementation sees "a" and "bc" separated by a path separator,
> and will post-process this to "abc", "a_bc" or "bc" (where _ could be a
> different replacement character).
>
> A correct implementation sees "abc".
>
> I don't think there's a problem here.

None of the user agents do \-decoding.  I don't see any value in them starting.
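
For anyone following along, here is a minimal Python sketch of the two
readings of filename="a\bc" discussed above.  The function names are
hypothetical, and the leaf-name variant shows only one of the
post-processing outcomes Julian lists ("bc"); it is an illustration, not
code taken from any user agent.

```python
import re


def unescape_quoted_string(inner: str) -> str:
    """Apply quoted-pair unescaping to the text between the quotes.

    Each backslash followed by a character is replaced by that
    character; a lone trailing backslash is left untouched.
    """
    return re.sub(r"\\(.)", r"\1", inner, flags=re.DOTALL)


def leaf_name(inner: str) -> str:
    """Treat backslash as a directory separator and keep the last part.

    Assumption: this mirrors the "leaf name" behavior described for
    Chrome in this thread, nothing more.
    """
    return inner.replace("\\", "/").rsplit("/", 1)[-1]


if __name__ == "__main__":
    raw = r"a\bc"  # the four characters a, \, b, c
    print(unescape_quoted_string(raw))  # abc
    print(leaf_name(raw))               # bc
```

The two functions disagree only when the value contains a backslash,
which is the case under dispute.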

>>> Looking at
>>>
>>> <http://trac.tools.ietf.org/wg/httpbis/trac/wiki/ContentDispositionErrorHandling?version=7>:
>>>
>>>> Determining the Disposition
>>>>
>>>> To determine the disposition-type, parse the Content-Disposition header
>>>> field using the following grammar:
>>>>
>>>> unparsed-string = *LWS nominal-type *OCTET
>>>> nominal-type    = "inline" / "filename" / "name" / ";"
>>>>
>>>> If the Content-Disposition header field is non-empty and fails to parse,
>>>> then the disposition type is "attachment". Otherwise, the
>>>> disposition-type is "inline".
>>>
>>> Neither "filename" nor "name" are disposition types.
>>
>> Indeed.
>
> It's confusing and makes reviewing the text harder than it needs to be.

Feel free to propose improvements.  I'm not sure which part is confusing you.
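
In case it helps the review, the quoted "Determining the Disposition"
rule can be read as the following rough Python sketch.  The function name
is hypothetical, and two details are assumptions on my part rather than
wiki text: the literals are matched case-insensitively, and LWS is
approximated as space, tab, CR, and LF.

```python
# Characters treated as linear white space (an approximation, see above).
LWS = " \t\r\n"

# The alternatives of nominal-type from the quoted grammar.
NOMINAL_TYPES = ("inline", "filename", "name", ";")


def disposition_type(header_value: str) -> str:
    """Return "inline" or "attachment" for a raw header field value."""
    if header_value == "":
        # An empty field never "fails to parse" under the quoted rule.
        return "inline"
    rest = header_value.lstrip(LWS)
    # *LWS nominal-type *OCTET: after optional white space, the value
    # only has to start with one of the nominal types.
    if rest.lower().startswith(NOMINAL_TYPES):
        return "inline"
    return "attachment"


if __name__ == "__main__":
    for value in ("inline", "attachment; filename=foo.txt",
                  "filename=foo.txt", "; foo", "bogus"):
        print(repr(value), "=>", disposition_type(value))
```

Written this way, it is visible that "filename" and "name" act only as
prefixes that select the inline disposition, not as disposition types
themselves.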

>> How would you like quoted strings to be handled?  According to your
>> tests, what we should do is strip off matching leading and trailing "
>> characters and be careful to capture ; inside of ".  However, your
>> tests show that we should not \-decode the value.  I'm happy to make
>> that change.
>
> Quoted strings should be handled as specified, removing the quotes and
> performing \-unescaping. The tests show that indeed a majority of UAs get
> this wrong but that doesn't make it magically right.

I disagree that we should do \-unescaping.

> I'd prefer that we invest our time to reduce the bugs in the UAs, instead of
> documenting them.

Again, we have different priorities.

> Note that fixing the quoted-string handling has already a proposed patch in
> Mozilla.

Bug number?

>>>> Decoding the File Name
>>>>
>>>> To filename-decode an encoded-string, use the following algorithm:
>>>>
>>>>   1. If the encoded-string contains non-ASCII characters, emit the
>>>> encoded-string (decoded as ISO-8859-1) and abort these steps.
>>>
>>> So by adding a non-ASCII character I can prevent percent-unescaping? Is
>>> this implemented anywhere?
>>
>> I'd encourage you to write a test and find out.  :)
>
> Writing tests doesn't come for free, and every test I add needs to be
> maintained and re-run. So I'd prefer to understand it's worth the time
> before.

Well, then I can answer your question.  Yes.

>>> <permathread>Why would we recommend something that only Chrome and IE do
>>> (and IE only does for some locales)</permathread>
>>
>> As you indicate, we've discussed this issue at length.  If you can
>> convince IE to remove this behavior, then we might be able to remove
>> it from this document.  Otherwise, we'd like to compete with IE in
>> this respect.
>
> So can you convince Safari, Opera, Firefox, and Konqueror to adopt this
> handling as well? Otherwise I don't think we'll make progress. Two UAs do
> something funny, the other four do not. I don't want the specification to
> reflect those implementation bugs -- even if they can't be realistically
> removed from these UAs anytime soon.

I think it's more likely that these user agents will adopt %-decoding
than that the other user agents will drop %-decoding.  Now, that's a
matter of opinion and reflects a judgement call.

>> I've taken a stab at writing this requirement with a grammatical
>> presentation.  I'd appreciate some help with the grammar as it's not
>> my strong suit.  Here's what I've got (also on the wiki):
>>
>> [[
>> unparsed-string = unbalanced-block /
>>                   block *( ";" block ) [ ";" unbalanced-block ]
>> block = *run
>> unbalanced-block = *run unbalanced-run
>> run = unquoted-run / quoted-run
>> unquoted-run = non-quote *boring-octet
>> quoted-run = <"> *non-quote <">
>> unbalanced-run = <"> *non-quote
>> non-quote = <OCTET, except <">>
>> boring-octet = <OCTET, except <"> and ";">
>> ]]
>>
>> I'm not overly confident in the first grammar.  What I'm trying to
>> encode is a parser with two states: initial and in-quote, together
>> with the following transition rules:
>>
>> == State: initial ==
>> <">  =>  Buffer the current character, switch to the in-quote state, and
>> proceed to the next character.
>> ";" =>  Emit the buffered characters, if any, and proceed to the next
>> character.
>> EOF =>  Emit the buffered characters, if any, and terminate.
>> OTHER =>  Buffer the current character and proceed to the next character.
>>
>> == State: in-quote ==
>> <">  =>  Buffer the current character, switch to the initial state, and
>> proceed to the next character.
>> EOF =>  Emit the buffered characters, if any, and terminate.
>> OTHER =>  Buffer the current character and proceed to the next character.
>>
>> My inclination is to write the state machine into the wiki, but I'm
>> hesitating because my understanding is that you all would prefer a
>> grammatical presentation.
>
> Before we invest too much time into this: do you have evidence that using a
> grammar other than the one already defined is really needed?

Yes.  The one currently in the document doesn't match every sequence
of input characters.  Therefore, it does not define a complete parser.
 We desire a complete specification of how to parse the header field.

> Maybe you could list certain invalid header instances you want to handle,
> and what the result should be?

Content-Disposition: attachment; filename="bar

Result => "bar

Content-Disposition: attachment; filename=foo"bar;baz"qux

Result => foo"bar;baz"qux

Content-Disposition: "foo; filename=bar;baz"; filename=qux

Result => qux

etc
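
To make these concrete, here is a rough Python sketch of the two-state
parser (initial / in-quote) quoted earlier in this message, run against
the three examples above.  split_segments follows the transition rules as
written.  filename_of is a hypothetical helper that is NOT part of the
wiki text: it assumes the first segment of the form filename=VALUE wins
and returns VALUE untouched (no quote stripping, no \-decoding, no
%-decoding).

```python
from typing import List, Optional


def split_segments(header_value: str) -> List[str]:
    """Split a header value on ";" characters that are outside quotes."""
    segments: List[str] = []
    buffered: List[str] = []
    in_quote = False
    for ch in header_value:
        if ch == '"':
            # Buffer the quote and switch between the two states.
            buffered.append(ch)
            in_quote = not in_quote
        elif ch == ";" and not in_quote:
            # Emit the buffered characters, if any.
            if buffered:
                segments.append("".join(buffered))
            buffered = []
        else:
            buffered.append(ch)
    # EOF: emit the buffered characters, if any, in either state.
    if buffered:
        segments.append("".join(buffered))
    return segments


def filename_of(header_value: str) -> Optional[str]:
    """Hypothetical helper: raw value of the first filename= segment."""
    for segment in split_segments(header_value):
        name, sep, value = segment.strip().partition("=")
        if sep and name.strip().lower() == "filename":
            return value
    return None


if __name__ == "__main__":
    examples = [
        'attachment; filename="bar',
        'attachment; filename=foo"bar;baz"qux',
        '"foo; filename=bar;baz"; filename=qux',
    ]
    for example in examples:
        print(split_segments(example), "=>", filename_of(example))
```

Every input string produces some segmentation, which is the property I'm
after: there is no byte sequence for which the parser is undefined.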

> If we want to keep handling valid header fields properly, we *probably*
> better start from the base grammar, and relax certain constructs.

Feel free to propose a modification to the base grammar that matches
every sequence of bytes and computes the needed segmentation.  It's
not at all obvious to me how to approach constructing the grammar from
that starting point.

Adam

Received on Saturday, 4 December 2010 18:17:28 UTC