- From: Robin Berjon <robin@w3.org>
- Date: Tue, 05 Nov 2013 16:59:13 +0100
- To: Marcos Caceres <w3c@marcosc.com>
- CC: public-webdevdata@w3.org
On 05/11/2013 16:25, Marcos Caceres wrote:
> I wonder if we should start hosting the dataset on the W3C’s HG
> server. Trying to d/l the latest data set has been really slow for me
> (~1h today, but it was going to take 9h to d/l yesterday - and it’s
> only 700mb). Also, having the data sets on HG means we can keep a
> nice version history.

Not speaking on behalf of the systeam or anything, but while W3C does have nice infrastructure, I'm not sure it's up to this task. Please also note that the HG server is often down.

I'm also not sure it's a good idea to hold the snapshot zips in HG. I don't know how HG does its internal storage, but if it's anything like Git, then *every* single zip snapshot will be kept. At 700MB apiece, that could add up pretty fast. (Plus all the unzipped content too.)

This strikes me as the sort of thing that could get some form of corporate sponsorship - hosting on Google, Akamai, Amazon, or whatever.

One plan I thought of at some point was to load the whole corpus into a MarkLogic DB (you can get it for free for this sort of thing) and allow for querying. It didn't play well with my (admittedly pretty broken) Linux box so I gave up, but it's something that could be quite swell. One thing it would enable is querying the full dataset using XQuery (while still having loaded it as HTML). I know the mention of XWhatever makes people balk, but XQuery is actually very well suited to this sort of task, and the result would be a lot faster (and richer in options) than grepping.

-- 
Robin Berjon - http://berjon.com/ - @robinberjon
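[To illustrate the kind of query meant above: a minimal XQuery sketch, assuming the corpus has been loaded as a collection of HTML documents. The collection name "webdevdata" and the XHTML default namespace are assumptions for illustration, not details from the actual setup.]

    (: count pages in the corpus that declare a viewport meta tag :)
    declare default element namespace "http://www.w3.org/1999/xhtml";
    count(
      collection("webdevdata")//meta[@name = "viewport"]
    )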
Received on Tuesday, 5 November 2013 15:59:22 UTC