- From: Julian Reschke <julian.reschke@gmx.de>
- Date: Sat, 25 Jul 2009 00:21:02 +0200
- To: "Jungshik Shin (신정식, 申政湜)" <jungshik@google.com>
- CC: HTTP Working Group <ietf-http-wg@w3.org>
Jungshik Shin (신정식, 申政湜) wrote:
> Hi,
>
> I'm sorry to 'resurrect' the old topic. I skimmed over the thread last
> August (with 50+ emails), but I'm not sure of the conclusion although it
> seems that the WG leaned toward RFC 2231 without continuation line
> support.

Indeed. We discussed this last year in Dublin (IETF-72), and later on, I
decided to work on that profile, which is draft-reschke-rfc2231-in-http
(<http://greenbytes.de/tech/webdav/draft-reschke-rfc2231-in-http-02.html>).
I also started work on a test suite, see
<http://greenbytes.de/tech/tc2231/>.

> I'm writing this hoping that it's still not too late to give
> Chrome/Firefox-side of the story.
>
> When Chrome was released in September, there is a "bug" report on Chrome
> about decoding %-encoded UTF-8. I replied that it's not a bug per se

Yes, I raised that one.

> because Chrome just tried to be compatible with many formats emitted by
> web servers in the wild. Then, I wanted to join this mailing list and
> explain what Chrome tries to do.
>
> I wrote the filename parameter handling code for both Firefox (necko)
> and Chrome. (the test cases I used are
> at http://www.i18nl10n.com/moztest/download.html )
>
> There's a difference between two as noticed by some of you back in
> September. The most prominent is that Firefox support RFC 2231 (even
> multi-line ones) while Chrome does not.
> Firefox supports the full form of RFC 2231 not because I found that
> there's any web server emitting RFC 2231-style C-D header (which does
> not mean that there is not any, obviously) but because that part of the

I personally encountered the problem of internationalizing filename
parameters many years ago when working on SAP's Knowledge Management
product, and back then discovered that (a) there's an IETF standards
track document providing a solution, (b) Firefox and Opera supported it,
and (c) IE didn't, but then, there was no reliable way to do it in IE at
all, as the "solution" proposed by Microsoft depends (or at least back
then depended) on local settings of IE. We thus decided to default to
RFC 2231 encoding (with the UTF-8 charset), and only use the IE
workaround for IE (based on the User-Agent); and we also had to tell
Asian customers that it wouldn't work with their default configuration.
(Note that this predates both Safari and Chrome.)

> code is shared with Thunderbird. For email clients, it's essential that
> RFC 2231 is supported (including continuation line and 'lang' param).
> When I wrote the C-D handling code for Chrome, I didn't include RFC 2231
> support because I haven't seen any increase in the use of RFC 2231
> format by web servers in the wild. Again, this is not backed by any
> systematic survey.

Having that survey would be interesting. (Note to volunteers: keep in
mind that some servers vary the encoding mechanism based on the
User-Agent header.)

> Another is that Chrome does decode %-encoded byte sequences (UTF-8 only
> at the moment, but given the IE's behavior and a lot of web servers that
> depend on it especially in East Asia [1], it seems that I have to expand
> that to cover non-UTF-8 %-encoded strings as well.

The question is: how does the negotiation work? I think it's a much
better approach to stick to the only solution that actually works
predictably and is unambiguous.
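
To make the RFC 2231 form concrete, here is a minimal sketch of how a
server might produce the extended parameter with the UTF-8 charset and
no continuation lines. The function names are hypothetical, and the set
of characters left unescaped is a conservative assumption on my part
rather than a quote from the draft:

```python
from urllib.parse import quote

# Assumption: a conservative set of punctuation that may appear unescaped.
# Letters, digits, and "_.-~" are always left alone by quote().
ATTR_CHARS = "!#$&+-.^_`|~"


def encode_ext_value(value: str, lang: str = "") -> str:
    """Return charset'lang'pct-encoded-octets with the UTF-8 charset."""
    return "UTF-8'" + lang + "'" + quote(value, safe=ATTR_CHARS, encoding="utf-8")


def content_disposition(filename: str) -> str:
    """Build a header value carrying only the extended (name*) parameter."""
    return "attachment; filename*=" + encode_ext_value(filename)


print(content_disposition("€ rates.txt"))
```

The charset is stated in the value itself, so the recipient does not
have to guess it from locale settings or from the referring page.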
While working on the aforementioned draft I talked to representatives of
some of the companies that currently do not support RFC 2231, and their
feedback was that having a well-defined simple profile would be a step
in the right direction.

> There are also common behaviors, too. Both supports RFC 2047 and both
> support raw-8byte-sequences in UTF-8. If a raw 8byte sequence cannot be

We discussed RFC 2047, and came to the conclusion that it's not allowed
for parameter values.

> interpreted as UTF-8, both try to interpret it as the encoding of a
> referring page (Firefox implementation is more complete while Chrome's
> only works with 'Save As'). [2] Safari tries to do something similar
> for raw 8bit byte sequences, but currently it does not work. Note that
> neither tries to interpret it as ISO-8859-1/Windows-1252, which leads to
> my next point.
>
> In the thread on the topic in August, I also saw a lot of mentioning of
> ISO-8859-1 being the default. I'm afraid that's rather
> Western-Euro-centric view. Even though the current HTTP spec says

It is, but it's nothing we can change.

> ISO-8859-1 be the default, in practice, a lot of local legacy encodings
> have been used by web servers/ web server-side programs (see
> http://crbug.com/1148 ). The same is true of the old HTML spec,
> according to which ISO-8859-1 is to be assumed in the absence of the
> charset specification. However, in practice, web browsers allow users to
> set the default charset to assume in such cases (and the default value
> out of the box is locale-dependent. Korean Firefox/IE/Chrome have it set
> to EUC-KR, Simplified Chinese versions have it set to GBK, French
> versions have it set to ISO-8859-1/Windows-1252, etc).
>
> Unless I missed something, in the thread in last August, I couldn't find
> discussion about what web servers (various server-side programs) emit.
> As you know well, web browsers are rather on the receiving end (when it
> comes to decoding Content-Disposition header) and have to support
> formats in widespread use by web servers.
>
> I think a systematic survey of what's actually emitted by webservers in
> the wild is necessary if it's not done yet. I meant to do this for a
> while, but haven't managed to do it, yet.
> ...

Again, that data would be interesting.

I imagine that server implementors who encounter this issue, such as
those building web interfaces to document management systems that allow
"arbitrary" filenames, usually end up:

- ignoring the problem (as there's no interoperable answer), or
- supporting only IE (I guess that's the case for Sharepoint), or
- sniffing the UA and special-casing Internet Explorer.

My understanding though is that using percent encoding or raw bytes in
the message does not interoperate between user agents and breaks the
spec.

On the other hand, RFC 2231 based encoding can at least be deployed
right now, as clients that do not understand the "name*" notation just
ignore it (well, except Konqueror, but that's a different story).

The transition would be smooth if UAs that support RFC 2231 encoding
selected the RFC 2231 variant when both formats are present.
Unfortunately that's not the case today (see
<http://greenbytes.de/tech/tc2231/#attfnboth>).

BR, Julian
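
PS: the selection rule for the case where both formats are present can
be sketched as below. This is only an illustration under assumptions:
pick_filename is a hypothetical helper, the header is assumed to be
already split into a parameter mapping, and accepting only UTF-8 and
ISO-8859-1 as charsets is my choice for the sketch.

```python
from typing import Mapping, Optional
from urllib.parse import unquote

# Assumption: the only charsets this sketch is willing to decode.
SUPPORTED_CHARSETS = ("utf-8", "iso-8859-1")


def pick_filename(params: Mapping[str, str]) -> Optional[str]:
    """Prefer the RFC 2231 'filename*' parameter over plain 'filename'."""
    ext = params.get("filename*")
    if ext is not None:
        # Extended value layout: charset'language'percent-encoded-octets
        charset, _, rest = ext.partition("'")
        _lang, _, encoded = rest.partition("'")
        if charset.lower() in SUPPORTED_CHARSETS:
            try:
                return unquote(encoded, encoding=charset, errors="strict")
            except UnicodeDecodeError:
                pass  # undecodable octets: fall back to the plain parameter
    return params.get("filename")


print(pick_filename({
    "filename": "EURO rates.txt",
    "filename*": "UTF-8''%e2%82%ac%20rates.txt",
}))
```

A client that does not know "filename*" simply never looks it up and
uses the plain parameter, which is what makes sending both forms
deployable.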
Received on Friday, 24 July 2009 22:21:48 UTC