Re: Content-Disposition next steps from Adam Barth on 2010-12-16 (ietf-http-wg@w3.org from October to December 2010)

From: Adam Barth <ietf@adambarth.com>
Date: Thu, 16 Dec 2010 12:10:04 -0800
To: Jungshik Shin (신정식, 申政湜) <jungshik@google.com>
Cc: Maciej Stachowiak <mjs@apple.com>, Julian Reschke <julian.reschke@gmx.de>, Mark Nottingham <mnot@mnot.net>, HTTP Working Group <ietf-http-wg@w3.org>
Message-ID: <AANLkTinwRL6x+R+gnFf-6wX=5HYh0zC3LaDGCvaNEzXf@mail.gmail.com>
2010/12/16 Jungshik Shin (신정식, 申政湜) <jungshik@google.com>:
> On Mon, Dec 13, 2010 at 3:00 AM, Adam Barth <ietf@adambarth.com> wrote:
>> On Mon, Dec 13, 2010 at 2:54 AM, Maciej Stachowiak <mjs@apple.com> wrote:
>> > I wanted to add, for clarity, that we don't see any major problems in
>> > Adam's proposal as-is, though we'd suggest adding ISO-8859-1 as a final
>> > fallback.
>>
>> Done.
>>
>> > If there is a test suite that matches the expectations of Adam's
>> > proposal and is easy to run, I'll try to get someone to run it. Or if this
>> > testing has already been done, I can comment on the ways it diverges from
>> > Safari behavior and whether we are likely to care.
>>
>> Julian has written a nice test suite.  We'll just need to set the
>> expectations.
>>
>> > On Dec 13, 2010, at 1:28 AM, Julian Reschke wrote:
>> >
>> >> Hi Maciej,
>> >>
>> >> thanks for forwarding.
>> >>
>> >> On 13.12.2010 10:06, Maciej Stachowiak wrote:
>> >>> Here are some comments from my colleague Alexey Proskuryakov on your
>> >>> proposal. I know these may have been outpaced by the considerable
>> >>> discussion since that point, but they still seem like they could be
>> >>> useful.
>> >>>
>> >>>> I only know about file name decoding - all parsing is of course in
>> >>>> CFNetwork, and most logic is in Launch Services, I think.
>> >>>>
>> >>>> Adam's proposal is a step forward in that it acknowledges the need to
>> >>>> process raw non-ASCII bytes in filename, which is the only encoding
>> >>
>> >> That's incorrect in that the base spec already says it's ISO-8859-1
>> >> (although this may be hard to find in the published specs as opposed to the
>> >> Internet Draft we're discussing).
>> >>
>> >> (maybe this is a case where Alexey looked at an old proposal?)
>> >
>> > I suspect he looked at the existing published RFC. In any case, treating
>> > everything as Latin1 is likely not acceptable to us. We came up with our
>> > (somewhat complicated) rule through a long process of trial and error based
>> > on bug reports and user requirements.
>
> Yes, just assuming ISO-8859-1 is rather Western-Europe-centric (it's like
> assuming ISO-8859-1 for unlabelled HTML pages. ISO-8859-1 is supposed to be
> assumed for unlabelled HTML pages, but none of the major web browsers did that
> because that simply does not work for non-Western-European web pages) and is
> not acceptable for web compatibility. There are way too many web sites that
> emit raw 8-bit byte sequences in other encodings (that are usually in the
> encoding used by 'referring' pages, but not always).
>
>> The rule in my proposal isn't quite as complicated as what Safari
>> implements.  In particular, the proposal doesn't take the current
>> frame encoding into account.
>
> Huh? In your initial proposal, you mentioned trying 'referrer charset' after
> UTF-8. I've just checked your latest draft and that step was dropped.
> I think it does not reflect the reality. As I wrote above, there are way too
> many web servers that just emit raw 8-bit byte sequences in 'legacy'
> (non-UTF-8) encodings. Their numbers will get smaller as more pages switch
> to UTF-8, but it'll be a long while before they're all gone (either
> switching to UTF-8 and emitting raw UTF-8 bytes or becoming compliant with RFC
> 5987, the former of which is more likely for 'small' web sites).
>
> Chrome's implementation has an argument for that and it works most of the
> time (Firefox has done the same for almost a decade and its implementation
> works in more cases). 'Referrer charset' is meant to be the character
> encoding of a frame that refers to a file being downloaded. Even when that
> fails, one can finally resort to the legacy character encoding most widely
> used for the UI language of a UA.
> I saw in the WebKit tree that Safari passes a list of fallback charsets
> (including the frame charset) to try, but the last time I tried it
> with my test pages, it didn't seem to work as expected. That may have
> changed since.
> In the case of IE, the interpretation of raw byte sequences depends on the
> default system code page of the Windows installation where IE is running. A
> lot of Chinese and Korean web sites can get away with emitting raw byte
> sequences in Windows-936 (GBK), Windows-950 (Big5), and Windows-949
> (extended EUC-KR) because IE is dominant in those markets (and falling back
> to the referer charset, as supported by Firefox, works most of the time). Of
> course, this does not work if IE runs on Windows with its default legacy
> code page set to one different from that assumed by web sites (e.g. IE on
> Windows with the default code page set to windows-1252). To test this, one
> can change the default OS (legacy) code page on Windows (in Control Panel -
> Regional and language settings - Advanced - "languages to use for
> non-Unicode application"), after which Windows has to be rebooted.
>
>> >>>> style that matters. He also describes the proper algorithm,
>> >>>> acknowledging that Chrome doesn't fully implement it. Unsurprisingly,
>> >>>> that part was met with resistance from the "we always told you it was
>> >>>> ISO-8859-1" crowd.
>> >>>>
>> >>>> I agree that RFC2047 style encoding shouldn't be supported, and I'm
>> >>>> ambivalent about RFC5987. RFC2231/5987 is a step in the wrong
>> >>>> direction (opaque encoding for something that doesn't need it), but
>> >>>> given that IETF won't cease pushing it, we might as well implement it
>> >>>> and be more compatible with Firefox, if not the Web.
>> >>>>
>> >>>> - WBR, Alexey Proskuryakov
>> >>
>> >> In a perfect world we could declare that the HTTP header encoding is
>> >> UTF-8. But it isn't.
>> >>
>> >> If we *tried* to change the default encoding just for C-D/filename, we
>> >> will still break existing code.
>> >>
>> >> So I'm not sure what the "doesn't need it" refers to.
>> >
>> > Treating filenames in Content-Disposition as Latin1 will in any case
>> > break existing code. I think the WG's only choices here are which code to
>> > break, and by how much.
>>
>> I've added back UTF-8 as the first encoding to try based on your and
>> Alexey's input.
>
> I think the charset of a referer has to be tried after UTF-8.
> Jungshik

Thanks Jungshik.  I'm very glad you're participating in this discussion.

I suspect adding the Referer charset to the charset array will make
various members of this working group sad, but I'd rather we write an
honest specification than optimize for the happiness of the working
group.
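
For concreteness, here is a rough sketch of the charset-array fallback
discussed in this thread: try UTF-8 first, then the Referer charset, then
(optionally) a legacy charset tied to the UI language, with ISO-8859-1 as
the final fallback. It is only an illustration, not the wiki text and not
any browser's actual code; the function name decode_filename and its
parameters are hypothetical.

```python
from typing import Optional


def decode_filename(
    raw: bytes,
    referrer_charset: Optional[str] = None,
    ui_legacy_charset: Optional[str] = None,
) -> str:
    """Decode raw filename bytes by trying each candidate charset in order.

    Hypothetical helper: the order (UTF-8, Referer charset, UI-language
    legacy charset, ISO-8859-1) follows the discussion in this thread.
    """
    candidates = ["utf-8", referrer_charset, ui_legacy_charset]
    for charset in candidates:
        if not charset:
            continue  # this candidate was not supplied by the caller
        try:
            # Strict decoding: any invalid byte sequence rejects the charset.
            return raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            # Invalid bytes for this charset, or a charset name that the
            # runtime does not know; move on to the next candidate.
            continue
    # ISO-8859-1 maps every byte to a code point, so this never fails.
    return raw.decode("iso-8859-1")
```

For example, decode_filename("한글.txt".encode("euc-kr"),
referrer_charset="euc-kr") recovers the Korean name, because those bytes
are not valid UTF-8 and so fall through to the Referer charset. A single
strict decode per candidate is a simplification: some legacy byte
sequences also happen to be valid UTF-8, and a real implementation would
have to decide how to treat them.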

I'll make this change next time I edit the wiki.  Before Julian and
others bite my head off, I'm aware that this will likely mean you
won't find this text acceptable to include in the current document.
If that's the case, I can expand this text to a full-length
specification, and we can publish it either through this working group
or as an individual submission.

Adam
Received on Thursday, 16 December 2010 20:11:15 UTC