Re: Content-Disposition handling ? from Julian Reschke on 2009-07-24 (ietf-http-wg@w3.org from July to September 2009)

From: Julian Reschke <julian.reschke@gmx.de>
Date: Sat, 25 Jul 2009 00:21:02 +0200
To: "Jungshik Shin (신정식, 申政湜)" <jungshik@google.com>
CC: HTTP Working Group <ietf-http-wg@w3.org>
Message-ID: <4A6A33CE.6080209@gmx.de>
Jungshik Shin (신정식, 申政湜) wrote:
> Hi,
> 
> I'm sorry to 'resurrect' the old topic. I skimmed over the thread last 
> August (with 50+ emails), but I'm not sure of the conclusion although it 
> seems that the WG leaned toward RFC 2231 without continuation line 
> support.   

Indeed. We discussed this last year in Dublin (IETF-72), and later on, I 
decided to work on that profile, which is draft-reschke-rfc2231-in-http 
(<http://greenbytes.de/tech/webdav/draft-reschke-rfc2231-in-http-02.html>). 
I also started work on a test suite, see 
<http://greenbytes.de/tech/tc2231/>.

> I'm writing this hoping that it's still not too late to give 
> Chrome/Firefox-side of the story. 
> 
> When Chrome was released in September, there is a "bug" report on Chrome 
> about decoding %-encoded UTF-8. I replied that it's not a bug per se 

Yes, I raised that one.

> because Chrome just tried to be compatible with many formats emitted by 
> web servers in the wild.  Then, I wanted to join this mailing list and 
> explain what Chrome tries to do. 
> 
> I wrote the filename parameter handling code for both Firefox (necko) 
> and Chrome. (the test cases I used are 
> at http://www.i18nl10n.com/moztest/download.html )
> 
> There's a difference between two as noticed by some of you back in 
> September.  The most prominent is that Firefox support RFC 2231 (even 
> multi-line ones) while Chrome does not.
> Firefox supports the full form of RFC 2231 not because I found that 
> there's any web server emitting RFC 2231-style C-D header (which does 
> not mean that there is not any, obviously) but because that part of the 

I personally encountered the problem of internationalizing filename 
parameters many years ago when working on SAP's Knowledge Management 
product, and back then discovered that

(a) there's an IETF standards track document providing a solution,

(b) Firefox and Opera supported it, and

(c) IE didn't, but then, there was no reliable way to do it in IE at 
all, as the "solution" proposed by Microsoft depends (or at least 
back then depended) on local settings of IE.

We thus decided to default to RFC 2231 encoding (with UTF-8 charset), and 
only use the IE workaround for IE (based on the User Agent); and we also 
had to tell Asian customers that it wouldn't work with their default 
configuration. (Note this predates both Safari and Chrome).
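
As an illustration of that server-side choice, here is a minimal sketch in 
Python. It assumes that the IE workaround means percent-encoded UTF-8 in the 
plain "filename" parameter and that IE is detected by looking for "MSIE" in 
the User-Agent. The function name and the detection rule are hypothetical, 
not what the product actually did.

```python
from urllib.parse import quote


def content_disposition(filename: str, user_agent: str = "") -> str:
    """Build a Content-Disposition value for a download (illustrative only)."""
    # Percent-encode the UTF-8 bytes of the name; safe="" also encodes "/".
    encoded = quote(filename, safe="")
    if "MSIE" in user_agent:
        # Assumed IE workaround: percent-encoded UTF-8 in the plain parameter.
        # Whether IE decodes it as intended depends on local settings.
        return 'attachment; filename="{}"'.format(encoded)
    # RFC 2231 style: charset, empty language tag, percent-encoded value.
    return "attachment; filename*=UTF-8''{}".format(encoded)


print(content_disposition("€ rates.txt"))
```

The RFC 2231 branch labels the charset explicitly, so the recipient does 
not have to guess it.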

> code is shared with Thunderbird. For email clients, it's essential that 
> RFC 2231 is supported (including continuation line and 'lang' param). 
> When I wrote the C-D handling code for Chrome, I didn't include RFC 2231 
> support because I haven't seen any increase in the use of RFC 2231 
> format by web servers in the wild. Again, this is not backed by any 
> systematic survey.

Such a survey would be interesting. (Note to volunteers: keep in 
mind that some servers vary the encoding mechanism based on the 
User-Agent header).

>  Another is that Chrome does decode %-encoded byte sequences (UTF-8 only 
> at the moment, but given the IE's behavior and a lot of web servers that 
> depend on it especially in East Asia [1], it seems that I have to expand 
> that to cover non-UTF-8 %-encoded strings as well. 

The question is: how does the negotiation work?

I think it's a much better approach to stick to the only solution that 
actually works predictably and is unambiguous. While working on the 
aforementioned draft I talked to representatives of some of the 
companies that currently do not support RFC 2231, and their feedback was 
that having a well-defined simple profile would be a step in the right 
direction.
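
To make the ambiguity concrete, here is a small Python sketch. The 
percent-encoded value and the helper name are made up for illustration. It 
decodes the same octets under three candidate charsets; since nothing in 
the header says which one the server meant, the recipient can only guess.

```python
from urllib.parse import unquote_to_bytes

# A percent-encoded value as it might appear in a plain "filename" parameter.
# Only octets travel on the wire; no charset label comes with them.
raw = unquote_to_bytes("%C7%D1%B1%DB")


def try_decode(data: bytes, charset: str):
    """Return the decoded text, or None if the octets are invalid in charset."""
    try:
        return data.decode(charset)
    except UnicodeDecodeError:
        return None


as_utf8 = try_decode(raw, "utf-8")     # not valid UTF-8
as_euc_kr = try_decode(raw, "euc-kr")  # a Korean word
as_gbk = try_decode(raw, "gbk")        # also decodes, to unrelated Chinese characters

print(as_utf8, as_euc_kr, as_gbk)
```

Both legacy charsets accept the octets, so a decoder that falls back from 
UTF-8 has to pick one based on something outside the message, such as the 
locale or the referring page.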

> There are also common behaviors, too. Both supports RFC 2047 and both 
> support raw-8byte-sequences in UTF-8. If a raw 8byte sequence cannot be 

We discussed RFC 2047, and came to the conclusion it's not allowed for 
parameter values.

> interpreted as UTF-8, both try to interpret it as the encoding of a 
> referring page (Firefox implementation is more complete while Chrome's 
> only works with 'Save As'). [2]  Safari tries to do something similar 
> for raw 8bit byte sequences, but currently it does not work. Note that 
> neither tries to interpret it as ISO-8859-1/Windows-1252, which leads to 
> my next point. 
> 
> In the thread on the topic in August, I also saw a lot of mentioning of 
> ISO-8859-1 being the default. I'm afraid that's rather 
> Western-Euro-centric view. Even though the current HTTP spec says 

It is, but it's nothing we can change.

> ISO-8859-1 be the default, in practice, a lot of local legacy encodings 
> have been used by web servers/ web server-side programs (see 
> http://crbug.com/1148 ). The same is true of the old HTML spec, 
> according to which ISO-8859-1 is to be assumed in the absence of the 
> charset specification. However, in practice, web browsers allow users to 
> set the default charset to assume in such cases (and the default value 
> out of the box is locale-dependent. Korean Firefox/IE/Chrome have it set 
> to EUC-KR, Simplified Chinese versions have it set to GBK, French 
> versions have it set to ISO-8859-1/Windows-1252, etc). 
> 
> Unless I missed something, in the thread in last August, I couldn't find 
> discussion about what web servers (various server-side programs) emit. 
> As you know well, web browsers are rather on the receiving end (when it 
> comes to decoding Content-Disposition header) and have to support 
> formats in widespread use by web servers. 
> 
> I think a systematic survey of what's actually emitted by webservers in 
> the wild is necessary if it's not done yet. I meant to do this for a 
> while, but haven't managed to do it, yet. 
> ...

Again, that data would be interesting.

I imagine that server implementors that encounter this issue, such as 
web interfaces to document management systems that allow "arbitrary" 
filenames, usually end up:

- ignoring the problem (as there's no interoperable answer), or

- just supporting IE (I guess that's the case for Sharepoint), or

- sniffing the UA and special-casing Internet Explorer.

My understanding though is that using percent encoding or raw bytes in 
the message does not interoperate between user agents and breaks the spec.

On the other hand, RFC 2231 based encoding can at least be deployed 
right now, as clients that do not understand the "name*" notation just 
ignore it (well, except Konqueror, but that's a different story).

The transition would be smooth if UAs that support RFC 2231 encoding 
selected the RFC 2231 variant when both formats are present. 
Unfortunately that's not the case today (see 
<http://greenbytes.de/tech/tc2231/#attfnboth>).
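
A rough Python sketch of that selection rule on the receiving side follows. 
It is deliberately simplified: it splits on ";" and does not handle 
semicolons inside quoted strings, continuations, or escaped quotes. The 
function name and the sample header are hypothetical and not taken from 
the test suite.

```python
from urllib.parse import unquote


def pick_filename(header_value: str):
    """Prefer the RFC 2231 'filename*' parameter over plain 'filename'."""
    params = {}
    # Skip the disposition type, then collect name=value pairs.
    for part in header_value.split(";")[1:]:
        name, sep, value = part.strip().partition("=")
        if sep:
            params[name.strip().lower()] = value.strip()

    ext = params.get("filename*")
    if ext is not None:
        # Extended value layout: charset ' language ' percent-encoded octets
        charset, _, rest = ext.partition("'")
        _lang, _, encoded = rest.partition("'")
        try:
            return unquote(encoded, encoding=charset, errors="strict")
        except (LookupError, UnicodeDecodeError):
            pass  # unknown charset or bad octets: fall back to plain form

    plain = params.get("filename")
    if plain is not None:
        return plain.strip('"')
    return None


header = "attachment; filename=\"EURO rates.txt\"; filename*=UTF-8''%e2%82%ac%20rates.txt"
print(pick_filename(header))
```

With this rule a server can send both parameters: recipients that 
understand "filename*" get the internationalized name, while the others 
fall back to the ASCII one.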

BR, Julian
Received on Friday, 24 July 2009 22:21:48 UTC