- From: David Newton <david@davidnewton.ca>
- Date: Tue, 5 Nov 2013 10:38:21 -0500
- To: Marcos Caceres <w3c@marcosc.com>
- Cc: public-webdevdata@w3.org
On Nov 5, 2013, at 10:32 AM, Marcos Caceres <w3c@marcosc.com> wrote:
>
> On November 5, 2013 at 3:29:33 PM, David Newton (david@davidnewton.ca) wrote:
>>
>> +1 (if they’ll let us)
>> Would we also be able to schedule an automated task on their server to regenerate it periodically?
>
> I doubt it, as it’s fairly brute force what we are doing. However, we could take turns within the group. I’m happy to do the next batch at the end of Nov.
>
> Oh, another thing - we should cap the number of sites that we d/l to 100,000k. That way, we can do proper longitudinal studies of the data.

I’m assuming you mean 100k, not 100,000k. :)

That should be fairly easy to add to the script. Do we want the 100k top sites, which will produce fewer than 100k downloads because of errors, or the first 100k sites we’re able to successfully grab?
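[Editorial note: to make the two capping strategies concrete, here is a minimal sketch in Python. The actual script is not quoted in this thread, so all names (`fetch`, `grab_top_n`, `grab_first_n_successes`, the site-list format) are hypothetical, chosen only to illustrate the difference David is asking about.]

```python
import urllib.request

CAP = 100_000  # the proposed cap on sites to download


def fetch(url, timeout=10):
    """Attempt a single download; return the body, or None on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except Exception:
        return None


def grab_top_n(urls, n=CAP):
    """Option 1: attempt exactly the top n sites.

    Failed downloads are skipped, so the result may hold fewer than n pages.
    """
    return {u: body for u in urls[:n] if (body := fetch(u)) is not None}


def grab_first_n_successes(urls, n=CAP):
    """Option 2: walk down the ranked list until n downloads succeed.

    Always yields n pages (if the list is long enough), but the last sites
    included may rank well below position n.
    """
    results = {}
    for u in urls:
        if len(results) >= n:
            break
        body = fetch(u)
        if body is not None:
            results[u] = body
    return results
```

The trade-off in the sketch mirrors the question in the email: option 1 keeps the sample strictly to the top-ranked sites (better for longitudinal comparison, fewer pages); option 2 guarantees the sample size but lets lower-ranked sites fill in for failures.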
Received on Tuesday, 5 November 2013 15:38:46 UTC