Re: Content-Disposition next steps from Adam Barth on 2010-12-01 (ietf-http-wg@w3.org from October to December 2010)

From: Adam Barth <ietf@adambarth.com>
Date: Wed, 1 Dec 2010 14:17:01 -0800
To: Julian Reschke <julian.reschke@gmx.de>
Cc: Mark Nottingham <mnot@mnot.net>, HTTP Working Group <ietf-http-wg@w3.org>
Message-ID: <AANLkTikgX6=T17SoFya9TYGVSxPj9Qm8mZa8TO_V9iEd@mail.gmail.com>
On Wed, Dec 1, 2010 at 1:44 PM, Julian Reschke <julian.reschke@gmx.de> wrote:
> But your proposal is already in the grey area because it conflicts with the
> literal reading of the grammar, by applying certain decoding operations on
> something that is not *supposed* to be encoded, or by using a charset
> default which isn't backed by the specs.
>
> This *could* be justified by claiming that the filename is advisory only,
> but it's really not satisfying, in particular when we have evidence that
> many UAs get away without it.

Hum...  There are two parts to this issue:

1) How we expect user agents to behave.
2) What we write in the document.

IMHO, the user agents that are interested in this information are not
going to stop performing this decoding (issue 1), and the document
should tell the truth about what these user agents are going to do
(issue 2).

>> If it's helpful for you to think about this information as recommended
>> error recovery, that's fine with me.  When consuming the
>
> Error recovery implies a detectable error. A big part of what you propose
> changes the interpretation of valid fields.

Perhaps the term "error recovery" isn't helpful?  Another option, of
course, is to adjust the definition of what is valid to exclude these
pieces of syntax.

>> Content-Disposition header, it's not especially important to know
>> whether the header is well-formed (i.e., generated in accordance with
>> the generation requirements).
>
> Unless you want to ignore it otherwise.

That is certainly one option for consumers.  However, in this
appendix, we're concerning ourselves with user agents who wish to
process ill-formed headers.

>>> I note that you have handling of RFC2047-style encoding in there. That's
>>> something only Chrome and Firefox are doing, so I'd like to understand why
>>> you think it's needed, and whether you think Opera/Safari/Konqueror/IE
>>> should implement that (given the fact that it changes the semantics of
>>> values that are valid).
>>
>> Yeah, I wasn't sure whether to include the RFC2047 encoding.  I
>> certainly wouldn't recommend that servers generate Content-Disposition
>> headers using that encoding.  However, if I were writing a new user
>> agent today, I might well include RFC2047 support.  It boils down to a
>> cost/benefit analysis.  Some comments in the Chrome code indicate that
>> there are servers that do generate RFC2047-encoded Content-Disposition
>> headers, so there's at least some benefit.
>
> But there's also damage, because there's a small risk of misinterpreting a
> value that just happens to look 2047-encoded.

I'd put that into the "cost" column in the cost/benefit analysis.
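
To make that trade-off concrete, here is a rough sketch in Python of
what RFC2047 decoding of a filename value might look like. It uses the
standard library's email.header.decode_header; the function name
decode_rfc2047_filename and the sample values are hypothetical, and
this is not what Chrome or Firefox actually run.

```python
from email.header import decode_header


def decode_rfc2047_filename(value: str) -> str:
    """Decode RFC2047 encoded-words in a filename value (illustrative only)."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Encoded words come back as bytes plus their declared charset.
            # Falling back to ASCII when no charset is given is an assumption.
            parts.append(chunk.decode(charset or "ascii", errors="replace"))
        else:
            # Values without any encoded-word are returned unchanged as str.
            parts.append(chunk)
    return "".join(parts)


# Benefit: a server that sends an RFC2047-encoded filename gets decoded.
encoded = decode_rfc2047_filename("=?UTF-8?Q?r=C3=A9sum=C3=A9.txt?=")

# Cost: a file whose literal name merely looks encoded is rewritten as well.
lookalike = decode_rfc2047_filename("=?UTF-8?Q?a?=")
```

The second call is the "cost" column: a value that happens to match
the encoded-word syntax loses its literal meaning.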

> The same is true for the other recoveries you propose:
>
>>            i)  If the word is a well-formed UTF-8 string, emit the word
>>                (decoded as UTF-8) and proceed to the next grammatical
>>                element.
>
> According to RFC2616, the default is ISO-8859-1, and IE/Opera/Konqueror do
> exactly that (at least in my locale):
> <http://greenbytes.de/tech/tc2231/#attwithutf8fnplain>

According to your tests, Firefox, Chrome, and Safari use UTF-8.  Given
a free choice of UTF-8 or ISO-8859-1, I'd pick UTF-8, as I've done
here.
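
As a minimal sketch of the difference, in Python: the same raw bytes
give different filenames under the two defaults. The function name
decode_raw_filename is hypothetical, and the ISO-8859-1 fallback for
bytes that are not well-formed UTF-8 is my assumption for this
example, not something step (i) specifies.

```python
def decode_raw_filename(raw: bytes) -> str:
    """Prefer UTF-8 when the bytes are well-formed UTF-8 (illustrative only)."""
    try:
        # Step (i): the word is well-formed UTF-8, so emit it decoded as UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: otherwise fall back to the RFC2616 default, ISO-8859-1,
        # which maps every byte to a character and therefore cannot fail.
        return raw.decode("iso-8859-1")


raw = b"f\xc3\xbc.txt"                    # "fü.txt" as sent by a UTF-8 server
as_utf8 = decode_raw_filename(raw)        # what Firefox/Chrome/Safari would show
as_latin1 = raw.decode("iso-8859-1")      # what IE/Opera/Konqueror would show
```

Under ISO-8859-1 the two UTF-8 bytes for "ü" become two separate
characters, which is the mojibake users see today.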

>>        c) Let the url-unescaped-word be the word %-unescaped.
>>
>>        d) Emit the url-unescaped-word (decoded as UTF-8) and proceed to the
>>           next grammatical element.  (There's actually more sadness here if
>>           the url-unescaped-word isn't valid UTF-8.)
>
> That overloads the syntax of the parameter, and it is not done in
> FF/Opera/Safari/Konqueror:
> <http://greenbytes.de/tech/tc2231/#attwithfnrawpctenca>

Indeed.  We've discussed this issue at length.  For senders, I suspect
that the optimal way of generating Content-Disposition headers is to
avoid using the % character because that character is interpreted
differently by different user agents.  For receivers, I suspect that
the optimal way of consuming Content-Disposition headers is to
%-decode them, as described above.
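
Roughly, steps (c) and (d) for a receiver could be sketched in Python
as follows. The function name percent_decode_filename is hypothetical,
and returning the word unchanged when the unescaped bytes are not
valid UTF-8 is an assumption of this sketch; the proposal leaves that
case open.

```python
from urllib.parse import unquote_to_bytes


def percent_decode_filename(word: str) -> str:
    """%-unescape a filename word, then decode it as UTF-8 (illustrative only)."""
    # Step (c): %-unescape the word. Malformed escapes such as a bare "%"
    # are left in place by unquote_to_bytes.
    unescaped = unquote_to_bytes(word)
    try:
        # Step (d): emit the url-unescaped-word decoded as UTF-8.
        return unescaped.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: keep the original word if the result is not valid UTF-8.
        return word


decoded = percent_decode_filename("foo%20bar%C3%A4.html")
literal = percent_decode_filename("50%25off.txt")
```

The second call shows why senders are better off avoiding "%": a user
agent that skips this step would save the file as "50%25off.txt",
while one that performs it would save "50%off.txt".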

Perhaps the best outcome, then, is to forbid servers from generating
the % character?  That way the syntax won't be "overloaded."

> Yes, the current landscape is a mess, but it's a *different* mess in each of
> the various UAs. Your recommendation appears to merge all the bad
> workarounds. My recommendation would be to try to slowly get rid of them.

I suspect our different approaches reflect our different experiences.

> Statistics on which of these workarounds are *really* used would be useful.

Indeed.

>> One thing that might make sense is to demarcate those instructions as
>> again optional, that is an optional piece of the optional error
>> recovery, if you like.
>
> That would apply to all of RFC2047, UTF-8 defaulting, and
> percent-unescaping.
>
> Are you willing to rephrase the proposal accordingly?

If rephrasing the proposal would be helpful, I'm happy to do that.
What specifically would you like rephrased?

Adam
Received on Wednesday, 1 December 2010 22:18:07 UTC