Re: CSP reports: `script-sample` from Neil Matatall on 2017-02-21 (public-webappsec@w3.org from February 2017)

From: Neil Matatall <oreoshake@github.com>
Date: Tue, 21 Feb 2017 19:39:52 +0000
To: Artur Janc <aaj@google.com>, Brad Hill <hillbrad@gmail.com>
Cc: Craig Francis <craig.francis@gmail.com>, Krzysztof Kotowicz <kkotowicz@gmail.com>, Mike West <mkwst@google.com>, Devdatta Akhawe <dev.akhawe@gmail.com>, "public-webappsec@w3.org" <public-webappsec@w3.org>, Christoph Kerschbaumer <ckerschbaumer@mozilla.com>, Frederik Braun <fbraun@mozilla.com>, Scott Helme <scotthelme@hotmail.com>, Lukas Weichselbaum <lwe@google.com>, Michele Spagnuolo <mikispag@google.com>, Jochen Eisinger <eisinger@google.com>
Message-ID: <CAASU7Q5QJNRE1aQCSwrA-68nLEt1ca9-BPaiJYdqLFvGid=45g@mail.gmail.com>
:wave: hello again friends,

I've been snoozing this thread for almost 6 months and I'd like to
resurrect this conversation after a recent twitter flurry of action and +1s
(https://twitter.com/mikewest/status/834081437473185792). I don't think
anyone has called out _why_ we need this other than "moar data" but I think
it's important to think about two use cases: report filtering and actually
capturing attack payloads. 5 years ago (
https://lists.w3.org/Archives/Public/public-webappsec/2012Dec/0012.html) I
thought script-sample was for capturing attack payloads. Today, I think
script-sample is more important for report filtering.

2.5 years ago the topic of "inline and eval reports look the same" was
discussed (https://github.com/w3c/webappsec/issues/52). This is a serious
problem for analyzing reports.

.... slight tangent on why filtering is so important

CSP reporting without script-sample is not very useful for filtering
garbage from plugins. Having script-sample data from firefox has been the
basis of filtering garbage. Without this data, we cannot filter that
specific type of garbage. Filtering out garbage is critical for building
the case for going from report-only to enforce mode. This logic is spread
out across the internet, e.g.
https://oreoshake.github.io/csp/twitter/2014/07/25/twitters-csp-report-collector-design.html
, https://blogs.dropbox.com/tech/2015/09/on-csp-reporting-and-filtering/,
etc. Funny enough, superfish.com was a standard "un-actionable report"
filter that has been passed down :) For every blog post on CSP reporting,
the conversation always devolves to one thing: how are you filtering
reports (https://mathiasbynens.be/notes/csp-reports). For every CSP
reporting collector, filtering is reimplemented (
https://github.com/jacobbednarz/go-csp-collector/blob/master/csp_collector.go#L58
 and
https://github.com/nico3333fr/CSP-useful/blob/61106b31683928a0f3dde3d312eaa257d0740914/report-uri/csp-parser-enhanced.php#L17
). Sentry even added CSP report collection, and guess what, they also do
filtering
https://github.com/getsentry/sentry/commit/0b9e124b702183a70002635ffd252e27d11fbe97
.

Scott Helme says he filters out 64% of his reports (though this number
includes things that are not csp reports). That's about what I can recall
from days where I had to collect reports. With script-sample from all
browsers, I suspect that number will jump up significantly since it will be
easier to identify plugins/other garbage.

... my point is

Most people implementing CSP expect the reports to be useful. Alas, they
are not without some serious filtering. Many are turned away by the high
quantity of "un-actionable" reports. While I'd be super happy if browser
vendors came up with magic to make filtering obsolete (all have acknowledge
plugin noise is a bug), I'd be happier if we could just do better filtering
today. Script sample helps me accomplish better filtering.

... but

Data leakage. This is not to be taken lightly. I think everything that can
be said about this already has. I also anecdotally agree that we're more
likely to leak something in an URL than in a script sample. The scripts
samples were crucial to filtering out lastpass violations which were easily
identifiable. Adding an opt-in flag while keeping the 40 character limit
would be great. Requiring TLD+1 matching would also be acceptable in my
mind.

JSONP. Ugh. It's still pretty widespread and it's unfortunate that the
internet is still terrible. I'm all for not breaking the web, but I'm more
for not being held back.







On Wed, Oct 19, 2016 at 9:49 AM Artur Janc <aaj@google.com> wrote:

> On Wed, Oct 19, 2016 at 7:14 PM, Brad Hill <hillbrad@gmail.com> wrote:
>
> Just to add my comment after discussion on today's teleconference:
>
> I'm sympathetic to the argument that "you must trust your reporting
> endpoint" for inline scripts and event handlers.
>
> I'm concerned about it for non-same origin external scripts.  There are
> lots of "guarded" JSONP like endpoints that I imagine have PII in the first
> 40 chars, e.g., something like the following where u is a userid:
>
> for(;;);{u:21482823,  ...
>
> Allowing any cross-origin endpoint to request this response and dump it to
> a reporting endpoint of their choice would be very bad.
>
>
> Yes, totally! What we were talking about earlier is reporting script
> samples for "inline" script violations such as event handlers and inline
> <script> blocks -- this is data which the developer can already access by
> inspecting the DOM (though in some cases it can be difficult if the element
> has already been removed). The main benefit of having it done by the user
> agent is robustness and not requiring the developer to load a "CSP
> debugging" library everywhere on their site.
>
> FWIW I'd be strongly opposed to doing this for external scripts because of
> the problem you're talking about. Luckily, we generally don't need this for
> external scripts because they can already be identified by the blocked-uri,
> unlike inline violations.
>
>
> -Brad
>
> On Wed, Oct 19, 2016 at 8:21 AM Craig Francis <craig.francis@gmail.com>
> wrote:
>
> On 19 Oct 2016, at 15:34, Krzysztof Kotowicz <kkotowicz@gmail.com> wrote:
>
> URLs should not contain sensitive data, precisely because they are often
> relayed to third parties
>
>
>
> Hi Krzysztof,
>
> While I agree, the classic example is a password reset.
>
> As in, a link is sent to the user via email, and that link contains a
> sensitive token that allows someone to change the account password.
>
> That said, the token should not last long to deduce the risk of it being
> exposed to 3rd parties (e.g. expiring it after use, or a certain period of
> time).
>
> Craig
>
>
>
>
>
>
> On 19 Oct 2016, at 15:34, Krzysztof Kotowicz <kkotowicz@gmail.com> wrote:
>
>
>
> 2016-10-19 2:16 GMT+02:00 Artur Janc <aaj@google.com>:
>
> On Tue, Oct 18, 2016 at 10:05 AM, Mike West <mkwst@google.com> wrote:
>
> On Tue, Oct 18, 2016 at 1:03 AM, Artur Janc <aaj@google.com> wrote:
>
> On Mon, Oct 17, 2016 at 7:15 PM, Devdatta Akhawe <dev.akhawe@gmail.com>
> wrote:
>
> Hey
>
> In the case of a third-party script having an error, what are example
> leaks you are worried about?
>
>
> The same kinds of issues that lead us to sanitize script errors for things
> loaded as CORS cross-origin scripts:
> https://html.spec.whatwg.org/#muted-errors. If the resource hasn't
> opted-in to being same-origin with you, script errors leak data you
> wouldn't otherwise have access to.
>
>
> Thanks for the summary, Mike! It's a good overview of the issue, but I'd
> like to expand on the reasoning for why including the prefix of an inline
> script doesn't sound particularly scary to me.
>
>
> Thanks for fleshing out the counterpoints, Artur!
>
>
> Basically, in order for this to be a concern, all of the following
> conditions need to be met:
>
> 1. The application has to use untrusted report collection infrastructure.
> If that is the case, the application is already leaking sensitive data from
> page/referrer URLs to its collector.
>
>
> "trusted" to receive URLs doesn't seem to directly equate to "trusted" to
> store sensitive data. If you're sure that you don't have sensitive data on
> your pages, great. But you were also presumably "sure" that you didn't have
> inline script on your pages, right? :)
>
>
> Keep in mind that URLs are sensitive data for most applications and they
> are currently being sent in violation reports.
>
>
> URLs should not contain sensitive data, precisely because they are often
> relayed to third parties in e.g. referrer. If they do, that's usually
> considered a vulnerability by e.g. OWASP, especially if the URL is
> capability bearing (IDOR
> <https://www.owasp.org/index.php/Top_10_2013-A4-Insecure_Direct_Object_References>
> is even in OWASP Top 10). I agree that some applications disclose sensitive
> information in URLs, but that's should not be the majority of them. I think
> that for a lot of applications URLs reported through e.g. CSP violation
> reports are still subject to regular access control, whereas we just don't
> know yet if sensitive tokens are not included in script samples. It's
> likely some IDs are present in the inline scripts <a href=#
> oncllick=a.select(123)>
>
> The password reset example Craig is talking about is a fairly ubiquitous
> feature which generally needs to pass sensitive data in the URL for
> compatibility with email clients. Also, capability URLs are a thing in a
> lot of apps (e.g. Google Docs), and so are various features which reveal
> the identity of the current user (/profile/koto).
>
> But even without direct leaks of PII and SPII, URLs contain a lot of data
> about the current user, the data they have in the given application, and
> their interactions with the app. For example, a search engine is likely to
> disclose your queries, a social network will leak the IDs of your friends
> and groups you belong to, a mapping site will have your location, and even
> a news site will disclose the articles you read, which are sensitive in
> certain contexts.
>
> This is difficult to definitively prove, but I'd say that claiming that an
> application doesn't have any interesting/sensitive data in its URLs would
> be an exception rather than the norm. But I'm not sure how to convince you
> other than letting you pick some interesting apps and trying to find
> interesting stuff in their URLs ;-)
>
> Cheers,
> -A
>
>
>
>
> I'm having a difficult time imagining a case where an application is okay
> with disclosing their URLs to a third-party for the purpose of debugging
> violation reports, and is not okay with disclosing script prefixes for the
> same purpose, given that:
> 1) Almost all applications have sensitive data in URLs, compared to a
> certainly real, but less specific risk of having inline scripts with
> sensitive data in its prefix, assuming it's limited to a reasonable length.
>
>
> Citation needed (for the "almost all" claim). I agree the risk of leaking
> sensitive date might be mitigated by adding a reasonable length limit.
>
> 2) URLs are disclosed much more frequently than script samples would be,
> because they are sent with every report (not just "inline" script-src
> violations). In the `referrer` field, the UA is also sending a URL of
> another, unrelated page, increasing the likelihood that sensitive data will
> appear in the report.
>
>
> Which is why it's a best practice not to have sensitive data in URLs, but
> instead e.g. using cookies or POST parameters to transfer them.
>
>
> 3) There is no sanitization of URL parameters in violation reports,
> compared to the prefixing logic we're considering for script samples.
>
>
> In fact, I'd be much more worried about URLs than script prefixes, because
> URLs leak on *any* violation (not just for script-src) and URLs frequently
> contain PII or authorization/capability-bearing tokens e.g for password
> reset functionality.
>
>
> We've talked a bit about URL leakage in
> https://github.com/w3c/webappsec-csp/issues/111. I recall that Emily was
> reluctant to apply referrer policy to the page's URL vis a vis the
> reporting endpoint, but I still think it might make sense.
>
>
> 2. The application needs to have a script which includes sensitive user
> data somewhere in the first N characters. FWIW in our small-scale analysis
> of a few hundred thousand reports we saw ~300 inline script samples sent
> by Firefox (with N=40) and haven't found sensitive tokens in any of the
> snippets.
>
>
> Yup. I'm reluctant to draw too many conclusions from that data, given the
> pretty homogeneous character of the sites we're currently applying CSP to
> at Google, but I agree with your characterization of the data.
>
> Scott might have more data from a wider sampling of sites, written by a
> wider variety of engineering teams (though it's not clear that the terms of
> that site would allow any analysis of the data).
>
>
> I completely agree, this data is just what we had readily available -- we
> can certainly do a much larger analysis of script prefixes based on the
> search index. That said, if we're worried about the script-sample approach,
> perhaps not seeing any sensitive data in the first dataset we looked at
> could be a signal that it's worth pursuing further.
>
> 3. The offending script needs to cause a CSP violation, i.e. not have a
> valid nonce, meaning that the application is likely broken if the policy is
> in enforcing mode.
>
>
> 1. Report mode exists.
>
> 2. Embedded enforcement might make it more likely that XSS on a site could
> cause policy to be inadvertantly applied to itself or its dependencies. We
> talked about this briefly last week, and I filed
> https://github.com/w3c/webappsec-csp/issues/126 to ponder it. :)
>
>
> Since CSPs applied by embedded enforcement serve a very different purpose
> than current policies (they don't try to mitigate script injection), it
> would very likely be okay to just not include script-sample data for such
> policies. Also, embedded enforcement is still pretty far off, and the
> reporting problem is an issue for pretty much every site currently
> gathering violation reports; we should probably weigh the value of fixing
> CSP reporting accordingly.
>
>
> As a security engineer, I would consider #1 to be the real security
> boundary -- a developer should use a CSP collector she trusts because
> otherwise, even without script-sample, reports contain data that can
> compromise the application.
>
>
> That sounds like an argument for reducing the amount of data in reports,
> not for increasing it. I think it's somewhat rational to believe that
> reporting endpoints are going to have longer retention times and laxer
> retention policies than application databases. Data leaking from the latter
> into the former seems like a real risk. I agree that the URL itself already
> presents risks, but I don't understand how that's a justification for
> accepting more risk.
>
>
> It is an argument for using trusted infrastructure when building your
> application ;-) Developers are already accustomed to deciding whether to
> place trust in various components of their apps, whether it's the hosting
> platform and OS, server-side modules and libraries, or JS widgets and other
> embedded resources. A CSP violation endpoint is currently a
> security-critical part of an application because it receives URLs; people
> who don't trust their collection infrastructure already have insecure
> applications and adding script-sample to reports does little to change
> this. (Note that this wouldn't hold for applications which have nothing
> sensitive in URLs and embed sensitive data at the beginning of inline
> scripts, but this doesn't seem like a common pattern.)
>
> Basically, the reluctance to include relevant debugging information in the
> violation report seems to be somewhat of a misplaced concern to me, because
> it ignores the trust relationship the application owner must already have
> with their report collection endpoint.
>
> Perhaps it's pertinent to take a step back and think about the reason to
> have reporting functionality in CSP in the first place -- after all, the
> mechanism could certainly work only via throwing SecurityPolicyViolation
> events and requiring developers to write their own logging code. The fact
> that this capability exists in UAs, and is not restricted to sending
> reports to the same origin or same "base domain" (contrary to the original
> proposals, e.g. in http://research.sidstamm.com/papers/csp-www2010.pdf)
> indicates that CSP wants to be flexible and give developers ultimate
> control over the reporting functionality. Given this design choice, it
> seems okay to trust the developer to pick the right report URI for their
> application and include useful debugging data if the developer wants it; in
> a way, the status quo is the worst of both worlds, because it already
> requires the developer to fully trust the collector, but doesn't give her
> enough useful data to track down causes of violations.
>
> In case it helps: Lukas ran a quick analysis of the report-uri values
> we've seen in the wild, and e.g. for the domains with CSP in Alexa 100,000
> we see the following:
> - 49% don't set a report-uri
> - 29% have a report-uri pointing to a relative path (/foo)
> - 10% have a report-uri pointing to the same origin, with another 1% using
> a sibling subdomain (foo.example.org reporting to csp.example.org)
>
> Out of the remaining ~10% which send violations to external URLs, about
> half point to report-uri.io and a couple of other logging services, and
> the rest seems to use another domain owned by the same person/organization,
> e.g. vine.co sends reports to twitter.com. The data for all domains in
> our set isn't substantially different (66% without report-uri; 24%
> reporting to own domain; 10% externally). This data doesn't include all the
> Google ccTLDs and a couple of other big providers, and I'm sure it's
> missing some other domains, e.g. ones with CSP in parts of the site
> requiring authentication, but AFAIK it shouldn't have a systematic bias
> otherwise.
>
> I can easily imagine scripts that violate conditions #2 and #3, but at the
> same time we have not seen many examples of such scripts so far, nor have
> people complained about the script-sample data already being included by
> Firefox (AFAIK).
>
>
> People are generally unlikely to complain about getting more data,
> especially when the data's helpful and valuable. That can justify pretty
> much anything, though: lots of people think CORS is pretty restrictive, for
> instance, and probably wouldn't be sad if we relaxed it in various ways.
>
>
> Overall, I don't see the gathering of script samples as qualitatively
> different to the collection of URLs. However, if we are indeed particularly
> worried about script snippets, we could make this opt-in and enable the
> functionality only in the presence of a new keyword (report-uri /foo
> 'report-script-samples') and add warnings in the spec to explain the
> pitfalls. This way even if I'm wrong about all of the above we would not
> expose any data from existing applications.
>
>
> I suspect that such an option would simply be copy-pased into new
> policies, but yes, it seems like a reasonable approach.
>
>
> For some background about why we're even talking about this: currently
> violation reports are all but useless for both debugging and detection of
> the exploitation of XSS due to the noise generated by browser extensions.
>
>
> I agree that this is a problem that we should solve. One way of solving it
> is to add data to the reports. Another is to invest more in cleaning up the
> reports that you get so that there's less noise. I wish browser vendors
> (including Chrome) spent more time on the latter, as we're actively harming
> users by not doing so.
>
>
> Yes, fixing the blocking and reporting of extension-injected scripts would
> certainly help (although note that "less noise" likely isn't sufficient, it
> really has to be zero noise), but IIRC prior discussions we've had about
> the topic indicated that this is an almost intractable problem, so it would
> be great to find alternative solutions.
>
> The script sample approach also has several important advantages because
> even without extension-related false positives, developers would have very
> little information about the actual cause of inline script violations
> (which are the majority of possible CSP problems in nonce-based policies).
> Sending some of the script text not only makes it possible to discard all
> spurious reports, but also gives the developer the crucial bit of data to
> find and fix actual site errors; it seems like a workable solution to the
> current reporting problems faced by many sites.
>
> Cheers,
> -Artur
>
>
>
>
> --
> Best regards,
> Krzysztof Kotowicz
>
>
>
Received on Tuesday, 21 February 2017 19:40:45 UTC