- From: Boris Zbarsky <bzbarsky@MIT.EDU>
- Date: Thu, 15 Dec 2011 21:52:00 -0500
- To: public-webapps@w3.org
On 12/14/11 4:52 PM, Boris Zbarsky wrote:
> Ok. It's just a simple spider that starts with the list at
> http://code.google.com/p/httparchive/source/browse/trunk/lists/All.txt
> and for each of those urls loads the url itself and then follows all
> same-host links from that page. So loads the front page of the site and
> all the same-host one-level-deep pages.

One more note. The data I have so far is from just looking at 1000 sites, not all 25000-some. John's still working on that last, now that he has this set up on more beefy hardware.

-Boris
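For anyone who wants a concrete picture of the crawl described in the quoted text, here is a rough sketch in Python. It is not the spider that was actually run; the names `same_host_links` and `crawl_plan` and the caller-supplied `fetch` function are hypothetical, and "same-host" is assumed to mean an exact match on the URL's host (and port) component.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, urldefrag


class _LinkCollector(HTMLParser):
    """Collects the raw href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_host_links(page_url, html):
    """Return the distinct same-host URLs linked from html, in document order."""
    # Assumption: "same host" means identical netloc, compared case-insensitively.
    host = urlsplit(page_url).netloc.lower()
    collector = _LinkCollector()
    collector.feed(html)
    seen, result = set(), []
    for href in collector.hrefs:
        # Resolve relative links and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlsplit(absolute)
        if parts.scheme in ("http", "https") and parts.netloc.lower() == host:
            if absolute not in seen:
                seen.add(absolute)
                result.append(absolute)
    return result


def crawl_plan(front_page_url, fetch):
    """List the front page followed by its one-level-deep same-host pages.

    fetch is a hypothetical caller-supplied function mapping a URL to HTML text.
    """
    html = fetch(front_page_url)
    links = same_host_links(front_page_url, html)
    return [front_page_url] + [u for u in links if u != front_page_url]
```

Running `crawl_plan` once per entry in the site list would give the set of pages to load for each site; the sketch deliberately stops at one level and does not follow links found on the second-level pages.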
Received on Friday, 16 December 2011 02:59:56 UTC