Re: Content-Disposition next steps

On Mon, Dec 13, 2010 at 3:00 AM, Adam Barth <ietf@adambarth.com> wrote:

> On Mon, Dec 13, 2010 at 2:54 AM, Maciej Stachowiak <mjs@apple.com> wrote:
> > I wanted to add, for clarity, that we don't see any major problems in
> Adam's proposal as-is, though we'd suggest adding ISO-8859-1 as a final
> fallback.
>
> Done.
>
> > If there is a test suite that matches the expectations of Adam's proposal
> and is easy to run, I'll try to get someone to run it. Or if this testing
> has already been done, I can comment on the ways it diverges from Safari
> behavior and whether we are likely to care.
>
> Julian has written a nice test suite.  We'll just need to set the
> expectations.
>
> > On Dec 13, 2010, at 1:28 AM, Julian Reschke wrote:
> >
> >> Hi Maciej,
> >>
> >> thanks for forwarding.
> >>
> >> On 13.12.2010 10:06, Maciej Stachowiak wrote:
> >>> Here are some comments from my colleague Alexey Proskuryakov on your
> >>> proposal. I know these may have been outpaced by the considerable
> >>> discussion since that point, but they still seem like they could be
> useful.
> >>>
> >>>> I only know about file name decoding - all parsing is of course in
> >>>> CFNetwork, and most logic is in Launch Services, I think.
> >>>>
> >>>> Adam's proposal is a step forward in that it acknowledges the need to
> >>>> process raw non-ASCII bytes in filename, which is the only encoding
> >>
> >> That's incorrect in that the base spec already says it's ISO-8859-1
> (although this may be hard to find in the published specs as opposed to the
> Internet Draft we're discussing).
> >>
> >> (maybe this is a case where Alexey looked at an old proposal?)
> >
> > I suspect he looked at the existing published RFC. In any case, treating
> everything as Latin1 is likely not acceptable to us. We came up with our
> (somewhat complicated) rule through a long process of trial and error based
> on bug reports and user requirements.
>
>
Yes, just assuming ISO-8859-1 is rather Western-Europe-centric. It's like
assuming ISO-8859-1 for unlabelled HTML pages: ISO-8859-1 is supposed to be
the default for unlabelled HTML pages, but none of the major web browsers did
that, because it simply does not work for non-Western-European pages. So it is
not acceptable for web compatibility. There are way too many web sites that
emit raw 8-bit byte sequences in other encodings (usually the encoding used by
the 'referring' page, but not always).




> The rule in my proposal isn't quite as complicated as what Safari
> implements.  In particular, the proposal doesn't take the current
> frame encoding into account.
>

Huh? In your initial proposal, you mentioned trying the 'referrer charset'
after UTF-8. I've just checked your latest draft, and that step was dropped.

I don't think that reflects reality. As I wrote above, there are way too many
web servers that just emit raw 8-bit byte sequences in 'legacy' (non-UTF-8)
encodings. Their numbers will shrink as more pages switch to UTF-8, but it
will be a long while before they're all gone (either by switching to UTF-8 and
emitting raw UTF-8 bytes, or by becoming compliant with RFC 5987; the former
is more likely for 'small' web sites).
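
For illustration, with a hypothetical Korean filename, the two forms look
roughly like this on the wire (the percent-encoding is just the UTF-8 bytes
of the name):

  Content-Disposition: attachment; filename="문서.pdf"
      (raw, unencoded UTF-8 or legacy-encoded bytes)
  Content-Disposition: attachment; filename*=UTF-8''%EB%AC%B8%EC%84%9C.pdf
      (RFC 5987)

The first form is what 'small' sites overwhelmingly emit today.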


Chrome's implementation takes the referrer charset as an argument for exactly
this fallback, and it works most of the time. Firefox has done the same for
almost a decade, and its implementation works in more cases. The 'referrer
charset' here means the character encoding of the frame that refers to the
file being downloaded. Even when that fails, one can finally resort to the
legacy character encoding most widely used for the UI language of the UA.
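
As a rough sketch of that chain (this is not Chrome's or Firefox's actual
code; the function and parameter names are made up for illustration):

def decode_filename(raw_bytes, referrer_charset=None, ui_legacy_charset=None):
    # Decode a raw Content-Disposition filename byte sequence.
    candidates = ['utf-8']                    # 1. try UTF-8 first
    if referrer_charset:
        candidates.append(referrer_charset)   # 2. charset of the referring frame
    if ui_legacy_charset:
        candidates.append(ui_legacy_charset)  # 3. legacy charset for the UA's UI language
    candidates.append('iso-8859-1')           # 4. final fallback; never fails
    for charset in candidates:
        try:
            return raw_bytes.decode(charset)
        except (UnicodeDecodeError, LookupError):
            continue

# e.g. raw EUC-KR bytes from a page whose referring frame is labelled euc-kr:
#   decode_filename(b'\xb9\xae\xbc\xad.txt', referrer_charset='euc-kr')
#   UTF-8 decoding fails, the euc-kr fallback succeeds and yields '문서.txt'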

I saw in the WebKit tree that Safari passes a list of fallback charsets to
try (including the frame's charset), but the last time I tried it with my
test pages, it didn't seem to work as expected. That may have changed since.

In the case of IE, the interpretation of raw byte sequences differs depending
on the default system code page of the Windows installation it runs on. A lot
of Chinese and Korean web sites can get away with emitting raw byte sequences
in Windows-936 (GBK), Windows-950 (Big5), and Windows-949 (extended EUC-KR)
because IE is dominant in those markets (and the fallback to the referer
charset supported by Firefox works most of the time). Of course, this does not
work if IE runs on Windows with its default legacy code page set to one
different from the one assumed by the web site (e.g. IE on Windows with the
default code page set to windows-1252). To test this, one can change the
default OS (legacy) code page on Windows (in Control Panel - Regional and
language settings - Advanced - "language to use for non-Unicode
applications"), after which Windows has to be rebooted.
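
To make the code-page dependence concrete, here is a quick illustration
reusing the EUC-KR bytes from the sketch above (a toy example, not anything
IE actually runs):

raw = b'\xb9\xae\xbc\xad.txt'   # a Korean filename in EUC-KR / Windows-949
for codepage in ('cp949', 'cp936', 'cp950', 'cp1252'):
    print(codepage, raw.decode(codepage, errors='replace'))
# Only cp949 yields the intended name; the other code pages silently produce
# mojibake, which is exactly what users on a mismatched system code page see.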




> >>>> style that matters. He also describes the proper algorithm,
> >>>> acknowledging that Chrome doesn't fully implement it. Unsurprisingly,
> >>>> that part was met with resistance from the "we always told you it was
> >>>> ISO-8859-1" crowd.
> >>>>
> >>>> I agree that RFC2047 style encoding shouldn't be supported, and I'm
> >>>> ambivalent about RFC5987. RFC2231/5987 is a step in the wrong
> >>>> direction (opaque encoding for something that doesn't need it), but
> >>>> given that IETF won't cease pushing it, we might as well implement it
> >>>> and be more compatible with Firefox, if not the Web.
> >>>>
> >>>> - WBR, Alexey Proskuryakov
> >>
> >> In a perfect world we could declare that the HTTP header encoding is
> UTF-8. But it isn't.
> >>
> >> If we *tried* to change the default encoding just for C-D/filename, we
> will still break existing code.
> >>
> >> So I'm not sure what the "doesn't need it" refers to.
> >
> > Treating filenames in Content-Disposition as Latin1 will in any case
> break existing code. I think the WG's only choices here are which code to
> break, and by how much.
>
> I've added back UTF-8 as the first encoding to try based on your and
> Alexey's input.
>
>
I think the charset of a referer has to be tried after UTF-8.

Jungshik


> Adam
>
>

Received on Thursday, 16 December 2010 18:19:32 UTC