- From: 신정식, 申政湜 <jungshik@google.com>
- Date: Fri, 24 Jul 2009 11:37:52 -0700
- To: HTTP Working Group <ietf-http-wg@w3.org>
- Message-ID: <72299d310907241137o52072786o4bf535c3b3f54008@mail.gmail.com>
Hi,

I'm sorry to 'resurrect' the old topic. I skimmed the thread from last August (50+ emails), but I'm not sure of the conclusion, although it seems that the WG leaned toward RFC 2231 without continuation line support. I'm writing this hoping that it's still not too late to give the Chrome/Firefox side of the story.

When Chrome was released in September, there was a "bug" report on Chrome about decoding %-encoded UTF-8. I replied that it's not a bug per se because Chrome just tried to be compatible with the many formats emitted by web servers in the wild. Then I wanted to join this mailing list and explain what Chrome tries to do.

I wrote the filename parameter handling code for both Firefox (necko) and Chrome. (The test cases I used are at http://www.i18nl10n.com/moztest/download.html )

There are differences between the two, as some of you noticed back in September. The most prominent is that Firefox supports RFC 2231 (even multi-line values) while Chrome does not. Firefox supports the full form of RFC 2231 not because I found any web server emitting RFC 2231-style C-D headers (which does not mean that there are none, obviously) but because that part of the code is shared with Thunderbird. For email clients, it's essential that RFC 2231 is supported (including continuation lines and the 'lang' param). When I wrote the C-D handling code for Chrome, I didn't include RFC 2231 support because I hadn't seen any increase in the use of the RFC 2231 format by web servers in the wild. Again, this is not backed by any systematic survey.

Another difference is that Chrome does decode %-encoded byte sequences. It handles UTF-8 only at the moment, but given IE's behavior and the many web servers that depend on it, especially in East Asia [1], it seems that I have to expand that to cover non-UTF-8 %-encoded strings as well.

There are common behaviors, too. Both support RFC 2047 and both support raw 8-bit byte sequences in UTF-8.
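To make the three forms above concrete, here is a minimal sketch in Python. It is not the Chrome or necko code; the function name decode_filename and the order in which the forms are tried are assumptions made only for illustration. It accepts an RFC 2047 encoded-word, a %-encoded UTF-8 byte sequence, or raw 8-bit UTF-8 bytes, and gives up (returns None) otherwise.

```python
from email.header import decode_header
from urllib.parse import unquote_to_bytes


def decode_filename(raw: bytes):
    """Hypothetical helper: decode a filename parameter value (raw header bytes).

    Returns the decoded name, or None when none of the three forms applies.
    """
    # Map bytes 1:1 to characters so the value can be inspected as text.
    text = raw.decode("latin-1")

    # 1. RFC 2047 encoded-word, e.g. =?UTF-8?Q?...?=
    if text.startswith("=?") and text.endswith("?="):
        return "".join(
            part.decode(charset or "ascii", "replace")
            if isinstance(part, bytes) else part
            for part, charset in decode_header(text)
        )

    # 2. %-encoded bytes, accepted only if the result is valid UTF-8.
    if "%" in text:
        try:
            return unquote_to_bytes(raw).decode("utf-8")
        except UnicodeDecodeError:
            pass

    # 3. Raw 8-bit bytes interpreted as UTF-8.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None  # not UTF-8; the caller has to decide what to try next
```

For example, decode_filename(b"%ED%95%9C%EA%B8%80.txt") and decode_filename("한글.txt".encode("utf-8")) both return "한글.txt", while the same name sent as raw EUC-KR bytes returns None.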
If a raw 8-bit byte sequence cannot be interpreted as UTF-8, both try to interpret it in the encoding of the referring page (the Firefox implementation is more complete, while Chrome's only works with 'Save As'). [2] Safari tries to do something similar for raw 8-bit byte sequences, but currently it does not work. Note that neither tries to interpret it as ISO-8859-1/Windows-1252, which leads to my next point.

In the thread on the topic in August, I also saw ISO-8859-1 mentioned many times as the default. I'm afraid that's a rather Western-Euro-centric view. Even though the current HTTP spec says ISO-8859-1 is the default, in practice a lot of local legacy encodings have been used by web servers and server-side programs (see http://crbug.com/1148 ). The same is true of the old HTML spec, according to which ISO-8859-1 is to be assumed in the absence of a charset specification. In practice, however, web browsers allow users to set the default charset to assume in such cases, and the out-of-the-box default is locale-dependent: Korean Firefox/IE/Chrome have it set to EUC-KR, Simplified Chinese versions have it set to GBK, French versions have it set to ISO-8859-1/Windows-1252, etc.

Unless I missed something, I couldn't find any discussion in last August's thread about what web servers (various server-side programs) emit. As you know well, web browsers are on the receiving end when it comes to decoding the Content-Disposition header, and they have to support the formats in widespread use by web servers. I think a systematic survey of what's actually emitted by web servers in the wild is necessary if it has not been done yet. I have meant to do this for a while, but haven't managed to do it yet.

Thank you for reading my email,

Jungshik

[1] http://markmail.org/message/yi4lzgjf4rhby7k7#query:http-wg%20content-disposition+page:1+mid:xqygbmcwo4b44jrm+state:results
    http://crbug.com/17676
[2] http://crbug.com/1148
Received on Friday, 24 July 2009 18:38:33 UTC