- From: Karl Dubost <karld@opera.com>
- Date: Sun, 10 Jul 2011 07:29:10 -0400
- To: nicolas@abes.fr
- Cc: Giovanni Tummarello <giovanni.tummarello@deri.org>, public-lod@w3.org, Kingsley Idehen <kidehen@openlinksw.com>, "giulio.cesare@gmail.com" <giulio.solaroli@deri.org>
Bonjour Nicolas,

First of all, very cool. Two comments.

# INITIAL INDEXING

On 9 July 2011 at 19:36, Yann NICOLAS wrote:
>> quite politely e.g. 1 every 5 secs-
> May I suggest that you crawl twice faster ? 1 every 2.5s

On 8 July 2011 at 03:31, Yann NICOLAS wrote:
> Sorry, we don't provide any dump, as the 10 000 000 files are generated on the fly from

That means the crawl will be done in… 289 days. There should be an easier way to handle the initial crawl: a one-off dump for specific search engines, followed by incremental updates based on the date of last modification, especially when there is a sitemap.

# CACHING

The cache policy does not make good use of your HTTP resources.

    % curl -sI -H "Accept:application/rdf+xml" http://www.sudoc.fr/132133520
    HTTP/1.1 200 OK
    Date: Sun, 10 Jul 2011 11:22:05 GMT
    Cache-Control: no-store
    Expires: Thu, 01 Jan 1970 00:00:00 GMT
    Cache-Control: no-cache
    Cache-Control: max-age=0
    Content-Type: application/rdf+xml;charset=UTF-8
    Content-Length: 4105

Repeating the same HTTP request a few seconds later, the date is now:

    Date: Sun, 10 Jul 2011 11:24:29 GMT

These responses are not cached at all. I do not think that is a good idea, and headers like Expires carry wrong information. :)

Maybe it is just because the service is starting up and there are still things to tweak.

-- 
Karl Dubost - http://dev.opera.com/
Developer Relations & Tools, Opera Software
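[Archive note: the 289-day figure in the message can be checked with a quick back-of-the-envelope calculation; the 10,000,000 resources and 2.5 s interval are the numbers quoted in the thread.]

```shell
# Crawl duration at one request every 2.5 seconds for 10,000,000 resources
# (figures from the thread; integer arithmetic, so fractions of a day drop).
total_seconds=$((10000000 * 25 / 10))   # 25,000,000 seconds
days=$((total_seconds / 86400))
echo "${days} days"                     # prints "289 days"
```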
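[Archive note: a minimal sketch of the cache-friendly headers the message is asking for, in place of no-store/no-cache. The one-hour freshness lifetime is an assumed value for illustration, not anything sudoc.fr specifies; the date formatting relies on GNU date.]

```shell
# Emit a consistent Cache-Control/Expires pair a server could send instead
# of the contradictory no-store/no-cache/max-age=0 trio shown above.
# max_age=3600 (one hour) is an assumption chosen for this sketch.
max_age=3600
# GNU date: compute the expiry instant in the HTTP-date format (RFC 1123).
expires=$(date -u -d "+${max_age} seconds" '+%a, %d %b %Y %H:%M:%S GMT')
printf 'Cache-Control: max-age=%s\n' "$max_age"
printf 'Expires: %s\n' "$expires"
```

With headers like these, a repeat request within the hour can be served from cache, and Expires agrees with Cache-Control instead of contradicting it.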
Received on Sunday, 10 July 2011 11:29:59 UTC