- From: Mark Nottingham <mnot@mnot.net>
- Date: Mon, 7 Jan 2013 11:37:20 +1100
- To: Ilya Grigorik <ilya@igvita.com>
- Cc: Roberto Peon <grmocg@gmail.com>, Martin J. Dürst <duerst@it.aoyama.ac.jp>, "ietf-http-wg@w3.org Group" <ietf-http-wg@w3.org>
On 07/01/2013, at 7:34 AM, Ilya Grigorik <ilya@igvita.com> wrote:

> On Sun, Jan 6, 2013 at 1:55 AM, Roberto Peon <grmocg@gmail.com> wrote:
> Do you have some suggestions Martin?
> The obvious thing in my mind is to get submissions from site owners, but that takes interest on their part first. :/
>
> HTTP Archive is now scanning ~300K top domains (at least according to Alexa). While it's still "top site" biased, I think that's a pretty good sample to work with. I believe we should be able to get the HAR files from it.

That would be one good source, although it's just to the "top" page of each site. If someone wants to own talking to Steve and getting the HARs in suitable shape for a pull request, that'd be much appreciated.

I have a set of about 17 million links (to about 2 million distinct sites) that's more representative; it's sourced from a Wikipedia dump. Generating single-pageview HARs from them should be pretty straightforward, and it looks like getting a HAR from multiple navigations using PhantomJS is very doable: <https://github.com/ariya/phantomjs/wiki/Network-Monitoring>.

Cheers,

--
Mark Nottingham   http://www.mnot.net/
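[Editor's note: a minimal sketch of the PhantomJS approach the linked wiki page describes, assuming the standard onResourceRequested/onResourceReceived callbacks. The script name, HAR field subset, and creator string below are illustrative only; the netsniff.js example shipped with PhantomJS does a fuller HAR 1.2 export.]

```javascript
// har-sketch.js (hypothetical filename): load one URL with PhantomJS,
// record request/response events, and print a bare-bones HAR-like JSON.
var system = require('system');
var page = require('webpage').create();
var url = system.args[1];

var entries = {};  // keyed by resource id

// Record the request side of each resource fetch.
page.onResourceRequested = function (req) {
    entries[req.id] = { request: req, startTime: req.time };
};

// Record the response once the resource has finished loading.
page.onResourceReceived = function (res) {
    if (res.stage === 'end' && entries[res.id]) {
        entries[res.id].response = res;
        entries[res.id].endTime = res.time;
    }
};

page.open(url, function (status) {
    // Note: resources still in flight when onLoadFinished fires are skipped;
    // a real exporter would wait or hook onResourceError as well.
    var har = {
        log: {
            version: '1.2',
            creator: { name: 'har-sketch', version: '0.1' },
            entries: Object.keys(entries).map(function (id) {
                var e = entries[id];
                return {
                    startedDateTime: e.startTime,
                    time: (e.endTime && e.startTime)
                        ? (new Date(e.endTime) - new Date(e.startTime)) : -1,
                    request: { method: e.request.method, url: e.request.url },
                    response: e.response
                        ? { status: e.response.status, bodySize: e.response.bodySize }
                        : {}
                };
            })
        }
    };
    console.log(JSON.stringify(har, undefined, 2));
    phantom.exit(status === 'success' ? 0 : 1);
});
```

Run as `phantomjs har-sketch.js http://example.org/ > page.har`; looping over a link list in the same way would give the single-pageview HARs mentioned above.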
Received on Monday, 7 January 2013 00:37:48 UTC