- From: William Waites <ww@styx.org>
- Date: Fri, 24 Dec 2010 11:34:48 +0100
- To: Michael Brunnbauer <brunni@netestate.de>
- Cc: semantic-web@w3.org
* [2010-12-23 17:51:41 +0100] Michael Brunnbauer <brunni@netestate.de> writes:

] On Thu, Dec 23, 2010 at 05:40:43PM +0100, William Waites wrote:
] > Hi Michael, this is good news. But I have a question: is it possible
] > to point your robot at a dump to prevent it mercilessly crawling large
] > datasets like bnb.bibliographica.org? If so, how?
]
] As we use named graphs for provenance tracking, I see no way to make use of
] a dump. Our crawler waits at least 10 secs between two requests to the same
] site. Of course I can block crawling of bnb.bibliographica.org if you want.
] How many RDFs and pages with RDFa does it have?

The HTML+RDFa pages are just a (slightly abbreviated) rendering of the
corresponding graph, made with Fresnel. For RDF-consuming robots it really
is better to fetch the native version (via content negotiation or by
requesting ${uri}.rdf).

In this case there are about 3 million distinct graphs, and if you crawl
blindly you will also pick up another several million CBDs (concise bounded
descriptions) for authors and publishers. At 10 seconds per request, the
3 million graphs alone come to nearly a year of continuous crawling, so the
whole crawl could take several years to finish...

Cheers,
-w

--
William Waites                http://eris.okfn.org/ww/foaf#i
9C7E F636 52F6 1004 E40A  E565 98E3 BBF3 8320 7664
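A rough sketch of the client side of the suggestion above, assuming Python
and the standard library; the entry URI is a made-up example and
application/rdf+xml as the negotiated type is an assumption, not something
stated in the message:

    import urllib.request

    # Hypothetical entry URI on bnb.bibliographica.org (made-up identifier).
    uri = "http://bnb.bibliographica.org/entry/example"

    # Option 1: content negotiation -- ask for an RDF serialisation
    # instead of the HTML+RDFa rendering of the same graph.
    req = urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})
    with urllib.request.urlopen(req) as resp:
        rdf_data = resp.read()

    # Option 2: request the ${uri}.rdf variant directly.
    with urllib.request.urlopen(uri + ".rdf") as resp:
        rdf_data = resp.read()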
Received on Friday, 24 December 2010 10:35:18 UTC