Content-Disposition handling?

Hi,
I'm sorry to 'resurrect' the old topic. I skimmed over the thread last
August (with 50+ emails), but I'm not sure of the conclusion although it
seems that the WG leaned toward RFC 2231 without continuation line support.


I'm writing this hoping that it's still not too late to give the
Chrome/Firefox side of the story.

When Chrome was released in September, there was a "bug" report against
Chrome about decoding %-encoded UTF-8. I replied that it's not a bug per se
because Chrome just tries to be compatible with the many formats emitted by
web servers in the wild.  After that, I wanted to join this mailing list and
explain what Chrome tries to do.

I wrote the filename parameter handling code for both Firefox (necko) and
Chrome. (The test cases I used are at
http://www.i18nl10n.com/moztest/download.html )

There are differences between the two, as some of you noticed back in
September.  The most prominent is that Firefox supports RFC 2231 (even
multi-line values) while Chrome does not.
Firefox supports the full form of RFC 2231 not because I found any web
server emitting RFC 2231-style C-D headers (which does not mean that there
are none, obviously) but because that part of the code is shared with
Thunderbird. For email clients, it's essential that RFC 2231 is supported
(including continuation lines and the 'lang' param). When I wrote the C-D
handling code for Chrome, I didn't include RFC 2231 support because I hadn't
seen any increase in the use of the RFC 2231 format by web servers in the
wild. Again, this is not backed by any systematic survey.
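
To make the RFC 2231 forms concrete (single extended parameter, and
continuations), here is a minimal Python sketch. It is not the necko or
Chrome code; the helper name rfc2231_filename is hypothetical, the parameter
splitting is deliberately naive, and the 'lang' field is parsed but ignored.

```python
import re
from urllib.parse import unquote_to_bytes

# Accepted name forms: filename*  /  filename*0  /  filename*0*
_NAME = re.compile(r"^filename(?:\*(?P<num>\d+))?(?P<ext>\*)?$", re.IGNORECASE)


def rfc2231_filename(header_value):
    """Hypothetical helper: decode an RFC 2231 filename, or return None."""
    sections = {}  # section number -> (raw value, is percent-encoded)
    # Naive split on ';' (breaks on quoted semicolons); skip the disposition type.
    for part in header_value.split(";")[1:]:
        name, sep, value = part.strip().partition("=")
        m = _NAME.match(name.strip())
        if not sep or not m or (m.group("num") is None and not m.group("ext")):
            continue  # plain filename= or unrelated parameter
        sections[int(m.group("num") or 0)] = (
            value.strip().strip('"'),
            bool(m.group("ext")),
        )
    if not sections:
        return None
    charset, data = "us-ascii", b""
    for num in sorted(sections):
        value, extended = sections[num]
        if num == 0 and extended:
            # The first section carries charset'lang'value.
            fields = value.split("'", 2)
            if len(fields) == 3:
                charset, value = fields[0] or "us-ascii", fields[2]
        if extended:
            data += unquote_to_bytes(value)
        else:
            data += value.encode("ascii", "replace")
    # An unknown charset label raises LookupError; a real client would handle it.
    return data.decode(charset, errors="replace")
```

For example, both "attachment; filename*=UTF-8''%E2%82%AC%20rates.txt" and
the two-section form with filename*0* and filename*1 decode to the same
name, while a plain filename= parameter returns None.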

Another is that Chrome does decode %-encoded byte sequences (UTF-8 only at
the moment, but given IE's behavior and the many web servers that depend on
it, especially in East Asia [1], it seems that I have to expand that to
cover non-UTF-8 %-encoded strings as well).
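
As a rough illustration of that %-decoding step, the following Python sketch
tries UTF-8 first and then a list of legacy charsets. The function name and
the choice of EUC-KR as the fallback are assumptions for the example, not
Chrome's actual logic.

```python
from urllib.parse import unquote_to_bytes


def decode_percent_encoded(value, fallback_charsets=("euc-kr",)):
    """Hypothetical helper: %-decode a filename, trying UTF-8 first.

    Returns None when no candidate charset can decode the bytes.
    """
    raw = unquote_to_bytes(value)
    for charset in ("utf-8", *fallback_charsets):
        try:
            return raw.decode(charset)
        except UnicodeDecodeError:
            continue  # not valid in this charset; try the next one
    return None
```

Trying UTF-8 first is reasonably safe because text in a legacy multi-byte
encoding is rarely also valid UTF-8; the reverse order would misread many
UTF-8 names.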

There are common behaviors, too. Both support RFC 2047 and both support raw
8-bit byte sequences in UTF-8. If a raw 8-bit byte sequence cannot be
interpreted as UTF-8, both try to interpret it in the encoding of the
referring page (Firefox's implementation is more complete while Chrome's
only works with 'Save As'). [2]  Safari tries to do something similar for raw
8-bit byte sequences, but currently it does not work. Note that neither tries
to interpret it as ISO-8859-1/Windows-1252, which leads to my next point.
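
The shared fallback order described above can be sketched in Python roughly
as follows. This is only an approximation under stated assumptions: the
function name is hypothetical, the referring page's charset is passed in as
a plain argument, and there is intentionally no ISO-8859-1 step at the end.

```python
from email.errors import HeaderParseError
from email.header import decode_header


def decode_filename_bytes(raw, referrer_charset=None):
    """Hypothetical helper: decode a raw filename value (bytes), or None."""
    # 1. RFC 2047 encoded-words are pure ASCII and start with "=?".
    if raw.startswith(b"=?"):
        try:
            return "".join(
                chunk.decode(charset or "ascii") if isinstance(chunk, bytes) else chunk
                for chunk, charset in decode_header(raw.decode("ascii"))
            )
        except (UnicodeDecodeError, LookupError, HeaderParseError):
            pass  # malformed or unknown charset; fall through to raw bytes
    # 2. Raw bytes as UTF-8, then 3. the referring page's charset, if known.
    candidates = ["utf-8"]
    if referrer_charset:
        candidates.append(referrer_charset)
    for charset in candidates:
        try:
            return raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            continue
    return None  # deliberately no ISO-8859-1/Windows-1252 guess
```

With this sketch, EUC-KR bytes decode only when the caller supplies
"euc-kr" as the referrer charset; without it the result is None rather than
a Latin-1 guess.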

In the thread on the topic in August, I also saw many mentions of
ISO-8859-1 being the default. I'm afraid that's a rather Western-Euro-centric
view. Even though the current HTTP spec says ISO-8859-1 is the default, in
practice, a lot of local legacy encodings have been used by web servers and
server-side programs (see http://crbug.com/1148 ). The same is true of the
old HTML spec, according to which ISO-8859-1 is to be assumed in the absence
of a charset specification. However, in practice, web browsers allow users
to set the default charset to assume in such cases, and the default value
out of the box is locale-dependent: Korean Firefox/IE/Chrome have it set to
EUC-KR, Simplified Chinese versions have it set to GBK, French versions have
it set to ISO-8859-1/Windows-1252, etc.
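
A small Python sketch shows why a fixed ISO-8859-1 default is a problem. The
table only contains the three examples from this message; the locale keys,
the names, and the Windows-1252 choice for unlisted locales are assumptions
made for the illustration.

```python
# Assumed out-of-the-box defaults, limited to the examples given above.
DEFAULT_CHARSET_BY_LOCALE = {
    "ko": "euc-kr",
    "zh-CN": "gbk",
    "fr": "windows-1252",
}


def decode_with_locale_default(raw, ui_locale):
    """Hypothetical helper: decode unlabeled bytes using the locale default."""
    # Assumption: unlisted locales fall back to Windows-1252.
    charset = DEFAULT_CHARSET_BY_LOCALE.get(ui_locale, "windows-1252")
    return raw.decode(charset, errors="replace")
```

The same EUC-KR bytes come out as the intended Korean text under the "ko"
default and as unrelated accented Latin characters under the "fr" default.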

Unless I missed something, I couldn't find any discussion in last August's
thread about what web servers (various server-side programs) actually emit.
As you know well, web browsers are on the receiving end (when it comes to
decoding the Content-Disposition header) and have to support the formats in
widespread use by web servers.

I think a systematic survey of what's actually emitted by web servers in the
wild is necessary if it hasn't been done yet. I have meant to do this for a
while, but haven't managed to yet.

Thank you for reading my email,

Jungshik


[1]
http://markmail.org/message/yi4lzgjf4rhby7k7#query:http-wg%20content-disposition+page:1+mid:xqygbmcwo4b44jrm+state:results
http://crbug.com/17676

[2] http://crbug.com/1148

Received on Friday, 24 July 2009 18:38:33 UTC