- From: Artur Janc <aaj@google.com>
- Date: Fri, 3 Dec 2021 18:03:11 +0100
- To: Samuel Weiler <weiler@w3.org>
- Cc: WebAppSec WG <public-webappsec@w3.org>
- Message-ID: <CAPYVjqo03Txv4FvqKz2js9vy3__OWUt27hAM-7XOws4yN2W6Ww@mail.gmail.com>
Hi Sam,

Thanks for sharing the link to the slides and the dataset. I think for similar web-level analysis several folks have used HTTP Archive (https://httparchive.org/; see also https://httparchive.org/reports) in combination with browser telemetry (e.g. Chrome's UseCounters: https://chromium.googlesource.com/chromium/src/+/HEAD/docs/use_counter_wiki.md). But having another data source can definitely be useful, and at least I wasn't aware of Common Crawl before.

Cheers,
-Artur

On Thu, Dec 2, 2021 at 9:35 PM Samuel Weiler <weiler@w3.org> wrote:

> James Richards at Nominet just did a DNS-related study using computed
> metadata (WAT) from the Common Crawl datasets.
>
> The WAT format contains both HTTP headers and a catalog of all links
> on the page. I wonder if this dataset might be useful for estimating
> the effects of COEP and similar changes, perhaps in lieu of or in
> advance of origin trials and similar live mechanisms.
>
> Info on the data format:
> https://commoncrawl.org/the-data/get-started/#WAT-Format
>
> The presentation that led me here:
> https://indico.dns-oarc.net/event/40/contributions/886/attachments/842/1558/oarc-cc-presentation-james-richards.pdf
>
> -- Sam
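For a concrete sense of what such a survey could look like, here is a minimal Python sketch (not from the thread) that counts Cross-Origin-Embedder-Policy / Cross-Origin-Opener-Policy response headers in one WAT file. It assumes the warcio library and the Envelope / Payload-Metadata JSON layout described at the Common Crawl link above; the file path and header selection are illustrative.

#!/usr/bin/env python3
# Sketch: count COEP/COOP response headers across one Common Crawl WAT
# file, assuming the JSON layout documented at
# https://commoncrawl.org/the-data/get-started/#WAT-Format
import json
from collections import Counter

from warcio.archiveiterator import ArchiveIterator

# Matched case-insensitively, since WAT preserves whatever casing the
# server sent.
HEADERS_OF_INTEREST = ("cross-origin-embedder-policy",
                       "cross-origin-opener-policy")


def survey_wat(path):
    """Return (responses scanned, Counter of (header, value) pairs)."""
    counts = Counter()
    responses = 0
    with open(path, "rb") as stream:
        # warcio transparently handles the .gz compression.
        for record in ArchiveIterator(stream):
            # WAT stores the computed metadata as JSON in 'metadata'
            # records, one per entry in the underlying WARC.
            if record.rec_type != "metadata":
                continue
            try:
                envelope = json.loads(record.content_stream().read())["Envelope"]
                meta = envelope["Payload-Metadata"]["HTTP-Response-Metadata"]
            except (KeyError, ValueError):
                continue  # request/warcinfo entries lack response metadata
            responses += 1
            headers = {k.lower(): v
                       for k, v in meta.get("Headers", {}).items()}
            for name in HEADERS_OF_INTEREST:
                if name in headers:
                    counts[(name, headers[name])] += 1
    return responses, counts


if __name__ == "__main__":
    # Illustrative path; real WAT paths are listed in each crawl's
    # wat.paths.gz manifest.
    responses, counts = survey_wat("example-wat-file.warc.wat.gz")
    print(responses, "responses scanned")
    for (name, value), n in counts.most_common():
        print(f"{name}: {value} -> {n}")

The same loop could also read the Envelope's HTML-Metadata/Links catalog that Sam mentions, to estimate how many cross-origin subresources a COEP rollout would affect on each page.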
Received on Friday, 3 December 2021 17:03:36 UTC