\-decoding filename parameters [was: TICKET 259: 'treat as invalid' not defined] from Mark Nottingham on 2011-02-03 (ietf-http-wg@w3.org from January to March 2011)

From: Mark Nottingham <mnot@mnot.net>
Date: Thu, 3 Feb 2011 14:49:58 +1100
To: Julian Reschke <julian.reschke@gmx.de>
Cc: Adam Barth <ietf@adambarth.com>, httpbis <ietf-http-wg@w3.org>
Message-Id: <C47DCEA0-C9AC-467D-9B20-72FBF0C3D826@mnot.net>

Having had some time away from this, and re-reading the thread, I notice that HTTPbis already disallows the production of many \-encoded characters:

   Producers SHOULD NOT escape characters that do not require escaping
   (i.e., other than DQUOTE and the backslash character).

(p1, 1.2.2)

So, really, we're already talking about error handling when we talk about things like \b, etc.
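
To make the producer side concrete, here is a minimal sketch of a sender that follows the SHOULD NOT quoted above: it escapes only DQUOTE and the backslash and leaves everything else alone. The function name quote_string is hypothetical, and the sketch assumes the value is plain ASCII with no control characters, which it does not check.

```python
def quote_string(value: str) -> str:
    """Serialize value as a quoted-string.

    Only the backslash and DQUOTE are escaped, per the producer advice
    quoted above. Assumption: value contains no control characters or
    non-ASCII text; this sketch does not validate that.
    """
    # Escape backslashes first so the backslashes added for DQUOTE
    # are not escaped a second time.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


# A name containing both special characters: a"b\c.txt
print("filename=" + quote_string('a"b\\c.txt'))
```

A producer written this way never emits sequences like \b or \n, so whatever a recipient does with those only matters for senders that are already off-spec.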

Furthermore, AFAICT neither 2616 nor bis really talks about the semantics and handling of "unusual" \-encoded characters. According to the current definition, one could plausibly decode "\n" to be a newline, since all that's really said about it is:

   The backslash character ("\") MAY be used as a single-character
   quoting mechanism only within quoted-string and comment constructs.

(although I doubt that happens IRL).
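
For comparison, a sketch of the reading I'd expect most people to assume, namely that a backslash simply makes the next character literal, so "\n" decodes to the letter n rather than a newline. That interpretation is an assumption on my part, not something the text above spells out, and unquote_string is a hypothetical name.

```python
def unquote_string(quoted: str) -> str:
    """Decode a quoted-string, treating backslash as a single-character quote.

    Assumption: "\\x" stands for the literal character x, whatever x is.
    A lone backslash at the very end of the body is kept unchanged.
    """
    if len(quoted) < 2 or quoted[0] != '"' or quoted[-1] != '"':
        raise ValueError("not a quoted-string")
    body = quoted[1:-1]
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body):
            # Skip the backslash and take the following character as-is.
            i += 1
            ch = body[i]
        out.append(ch)
        i += 1
    return "".join(out)


print(unquote_string('"a\\bc"'))   # the a\bc case from the quoted thread below
print(unquote_string('"\\n"'))     # the letter n, not a newline
```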

At the least, then, we should continue to discourage the use of \-escaping for things other than "\" and <">. 

If we were to write error-handling advice for it, it seems that we could give *weak* advice to replace with "_" or "-" (based upon <http://greenbytes.de/tech/tc2231/#attwithasciifnescapedchar>). We should probably also consider that question for BIS, but that can wait for now.
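
One way such weak advice could look in code, purely as a sketch: \\ and \" are decoded normally, and a backslash in front of any other character is swapped for the replacement character while the following character is kept. Replacing only the backslash (rather than the whole pair) is my assumption about what "replace" would mean here, and decode_with_replacement is a hypothetical name. The input is the content between the quotes.

```python
def decode_with_replacement(body: str, replacement: str = "_") -> str:
    """Decode the inside of a quoted-string with lenient error handling.

    Backslash-backslash and backslash-DQUOTE are decoded to the escaped
    character. Assumption: any other backslash is replaced by
    `replacement` and the character after it is kept as-is.
    """
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            nxt = body[i + 1] if i + 1 < len(body) else ""
            if nxt in ("\\", '"'):
                # A legitimate escape: emit the escaped character.
                out.append(nxt)
                i += 2
                continue
            # An "unusual" escape (or a trailing backslash): substitute.
            out.append(replacement)
            i += 1
            continue
        out.append(ch)
        i += 1
    return "".join(out)


print(decode_with_replacement("f\\oo.html"))
print(decode_with_replacement("a\\bc", "-"))
```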

To me the currently relevant question is whether implementers will eventually support escaping "\" and <">. Right now a few do (see <http://greenbytes.de/tech/tc2231/#attwithasciifnescapedquote>), but many don't. However, since this is a "soft" failure / interop problem (i.e., it affects how a file is named when saved on disk, but doesn't prevent it from being saved or named), I don't see that as a reason to not specify it.

Thoughts?


On 13/12/2010, at 7:36 PM, Julian Reschke wrote:

> On 13.12.2010 00:36, Adam Barth wrote:
>> ...
>> The only browsers in his test that \-decode the filename parameter are
>> Opera and Konqueror.  The other browsers, representing some 99% of the
>> market, do not \-decode.
>> 
>> Based on this data alone, I'd be quite hesitant to implement \-decoding
>> in my hypothetical user agent.  Worse, the \ character is actually
>> quite commonly used in file paths because it is the path separator on
>> Windows.  It seems entirely likely that some number of servers send
>> absolute paths rather than file names,
>> 
>> Content-Disposition: attachment; filename="C:\foo\newsheet.html"
>> 
>> or even just relative paths, like "foo\bar.html".
>> 
>> Now, I haven't gone out and measured the prevalence of the \ character
>> in the headers on the web.  If someone would like to run that experiment,
>> I'd certainly be open to considering that data.  However, in the
>> absence of such data, I think it's unlikely that browsers will change
>> their behavior.
>> ...
> 
> We are going in circles, and have been for months now.
> 
> We discussed this just two 10 days ago. See <http://lists.w3.org/Archives/Public/ietf-http-wg/2010OctDec/0536.html>, where I said:
> 
> -- snip --
> "Fixing" means "changing things to work as specified".
> 
> So the question here is whether it would break things because there are
> servers sending unescaped backslashes. As far as I can tell, sending
> path separators in the filename indicates a bug in the sender, or an
> attempt to trick the user agent into doing something it's not supposed to do.
> 
> So the "harm" of actually doing the unescaping would be that for a
> filename that needs to be postprocessed anyway, the problematic
> character would be filtered in a different way.
> 
> Starting with
> 
>   filename="a\bc"
> 
> the broken implementation sees "a" and "bc" separated by a path
> separator, and will post-process this to "abc", "a_bc" or "bc" (where _ could be a different replacement character).
> 
> A correct implementation sees "abc".
> 
> I don't think there's a problem here.
> -- snip --
> 
> You did reply to that in <http://lists.w3.org/Archives/Public/ietf-http-wg/2010OctDec/0538.html> with
> 
> "None of the user agents do \-decoding.  I don't see any value in them starting."
> 
> ...which I quite frankly didn't find helpful. It appears that we have differing opinions on whether it's useful to have consistent handling of specific syntactical constructs.
> 
> I believe having different parsers for quoted-string depending on the context they appear in is both a bad idea in general, and also not *needed* here.
> 
>> Finally, I don't think we should require user agents to implement
>> behavior we have reason to believe will not be implemented by a number
>> of major implementations.  Every time we do that, we take one more
> 
> On the other hand, we also shouldn't require user agents to break the spec that has been around for ~10 years when there's no compelling reason to do so.
> 
>> step down the path to irrelevance.  I'd like to see more engagement
>> between the browser vendors and the IETF, not less.
> 
> Yes, so do I. That's why I'd like to hear from the other browser makers (not necessarily vendors), and also observe how far we'll get with fixing the C-D bugs in Firefox once they are past the FF4 release.
> 
>> ...
> 
> Best regards, Julian
> 

--
Mark Nottingham   http://www.mnot.net/
Received on Thursday, 3 February 2011 03:50:32 UTC