- From: Robin Berjon <robin@w3.org>
- Date: Tue, 05 Nov 2013 16:59:13 +0100
- To: Marcos Caceres <w3c@marcosc.com>
- CC: public-webdevdata@w3.org
On 05/11/2013 16:25, Marcos Caceres wrote:
> I wonder if we should start hosting the dataset on the W3C’s HG
> server. Trying to d/l the latest data set has been really slow for me
> (~1h today, but it was going to take 9h to d/l yesterday - and it’s
> only 700mb). Also, having the data sets on HG means we can keep a
> nice version history.

Not speaking on behalf of the systeam or anything, but while W3C does have nice infrastructure, I'm not sure it's up to this task. Please also note that the HG server is often down.

I'm also not sure it's a good idea to hold the snapshot zips in HG. I don't know how HG does its internal storage, but if it's anything like Git, then *every* single zip snapshot will be kept. At 700MB apiece, that could add up pretty fast. (Plus all the unzipped content too.)

This strikes me as the sort of thing that could get some form of corporate sponsorship - hosting on Google, Akamai, Amazon, or whatever.

One plan I thought of at some point was to load the whole corpus into a MarkLogic DB (you can get it for free for this sort of thing) and allow for querying. It didn't play well with my (admittedly pretty broken) Linux box so I gave up, but it's something that could be quite swell. One thing it would enable is querying the full dataset using XQuery (while still having loaded it as HTML). I know the mention of XWhatever makes people balk, but XQuery is actually very well suited to this sort of task, and the result would be a lot faster (and richer in options) than grepping.

-- 
Robin Berjon - http://berjon.com/ - @robinberjon
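[To illustrate the kind of query meant above: a minimal XQuery sketch, assuming the corpus has been loaded as a collection of HTML documents. The collection name "webdevdata" and the XHTML default namespace are assumptions for illustration, not details from the actual setup.]

    (: count pages in the corpus that declare a viewport meta tag :)
    declare default element namespace "http://www.w3.org/1999/xhtml";
    count(
      collection("webdevdata")//meta[@name = "viewport"]
    )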
Received on Tuesday, 5 November 2013 15:59:22 UTC