- From: Artur Janc <aaj@google.com>
- Date: Fri, 3 Dec 2021 18:03:11 +0100
- To: Samuel Weiler <weiler@w3.org>
- Cc: WebAppSec WG <public-webappsec@w3.org>
- Message-ID: <CAPYVjqo03Txv4FvqKz2js9vy3__OWUt27hAM-7XOws4yN2W6Ww@mail.gmail.com>
Hi Sam,

Thanks for sharing the link to the slides and the dataset. I think for similar web-level analysis several folks have used HTTP Archive (https://httparchive.org/; see also https://httparchive.org/reports) in combination with browser telemetry (e.g. Chrome's UseCounters: https://chromium.googlesource.com/chromium/src/+/HEAD/docs/use_counter_wiki.md). But having another data source can definitely be useful, and at least I wasn't aware of Common Crawl before.

Cheers,
-Artur

On Thu, Dec 2, 2021 at 9:35 PM Samuel Weiler <weiler@w3.org> wrote:

> James Richards at Nominet just did a DNS-related study using computed
> metadata (WAT) from the Common Crawl datasets.
>
> The WAT format contains both HTTP headers and a catalog of all links
> on the page. I wonder if this dataset might be useful for estimating
> the effects of COEP and similar changes, perhaps in lieu of or in
> advance of origin trials and similar live mechanisms.
>
> Info on the data format:
> https://commoncrawl.org/the-data/get-started/#WAT-Format
>
> The presentation that led me here:
> https://indico.dns-oarc.net/event/40/contributions/886/attachments/842/1558/oarc-cc-presentation-james-richards.pdf
>
> -- Sam
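For a concrete sense of what such a survey could look like, here is a minimal Python sketch (not from the thread) that counts Cross-Origin-Embedder-Policy / Cross-Origin-Opener-Policy response headers in one WAT file. It assumes the warcio library and the Envelope / Payload-Metadata JSON layout described at the Common Crawl link above; the file path and header selection are illustrative.

#!/usr/bin/env python3
# Sketch: count COEP/COOP response headers across one Common Crawl WAT
# file, assuming the JSON layout documented at
# https://commoncrawl.org/the-data/get-started/#WAT-Format
import json
from collections import Counter

from warcio.archiveiterator import ArchiveIterator

# Matched case-insensitively, since WAT preserves whatever casing the
# server sent.
HEADERS_OF_INTEREST = ("cross-origin-embedder-policy",
                       "cross-origin-opener-policy")


def survey_wat(path):
    """Return (responses scanned, Counter of (header, value) pairs)."""
    counts = Counter()
    responses = 0
    with open(path, "rb") as stream:
        # warcio transparently handles the .gz compression.
        for record in ArchiveIterator(stream):
            # WAT stores the computed metadata as JSON in 'metadata'
            # records, one per entry in the underlying WARC.
            if record.rec_type != "metadata":
                continue
            try:
                envelope = json.loads(record.content_stream().read())["Envelope"]
                meta = envelope["Payload-Metadata"]["HTTP-Response-Metadata"]
            except (KeyError, ValueError):
                continue  # request/warcinfo entries lack response metadata
            responses += 1
            headers = {k.lower(): v
                       for k, v in meta.get("Headers", {}).items()}
            for name in HEADERS_OF_INTEREST:
                if name in headers:
                    counts[(name, headers[name])] += 1
    return responses, counts


if __name__ == "__main__":
    # Illustrative path; real WAT paths are listed in each crawl's
    # wat.paths.gz manifest.
    responses, counts = survey_wat("example-wat-file.warc.wat.gz")
    print(responses, "responses scanned")
    for (name, value), n in counts.most_common():
        print(f"{name}: {value} -> {n}")

The same loop could also read the Envelope's HTML-Metadata/Links catalog that Sam mentions, to estimate how many cross-origin subresources a COEP rollout would affect on each page.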
Received on Friday, 3 December 2021 17:03:36 UTC