- From: Yann Nicolas <nicolas@abes.fr>
- Date: Sun, 10 Jul 2011 14:02:24 +0200 (CEST)
- To: Karl Dubost <karld@opera.com>
- Cc: Giovanni Tummarello <giovanni.tummarello@deri.org>, public-lod@w3.org, Kingsley Idehen <kidehen@openlinksw.com>, "giulio.cesare@gmail.com" <giulio.solaroli@deri.org>
Hi Karl,

pssst: my parents told me that my first name is Yann :) Too late to change ;)

more below

----- Original message -----
> From: "Karl Dubost" <karld@opera.com>
> To: nicolas@abes.fr
> Cc: "Giovanni Tummarello" <giovanni.tummarello@deri.org>, public-lod@w3.org, "Kingsley Idehen"
> <kidehen@openlinksw.com>, "giulio.cesare@gmail.com" <giulio.solaroli@deri.org>
> Sent: Sunday, 10 July 2011 13:29:10
> Subject: Re: ANN: Sudoc bibliographic ans authority data
> Hello Nicolas,
>
> First of all, very cool. Two comments.
>
>
> # INITIAL INDEXING
>
> On 9 Jul 2011, at 19:36, Yann NICOLAS wrote:
> >> quite politely e.g. 1 every 5 secs-
>
> May I suggest that you crawl twice as fast?
>
> 1 every 2.5s
>
> On 8 Jul 2011, at 03:31, Yann NICOLAS wrote:
> > Sorry, we don't provide any dump, as the 10 000 000 files are
> > generated on the fly from
>
> It means the crawl will be done in… 289 days.
> There should be an easier way to do the initial crawl (a one-time
> initial dump for some specific search engines), followed by updates
> based on the last modification date, especially when there is a sitemap.

You're right. Of course. We are already thinking about a dump.

>
>
> # CACHING
>
> The cache policy does not seem to make good use of your HTTP resources.
>
> % curl -sI -H "Accept:application/rdf+xml"
> http://www.sudoc.fr/132133520
> HTTP/1.1 200 OK
> Date: Sun, 10 Jul 2011 11:22:05 GMT
> Cache-Control: no-store
> Expires: Thu, 01 Jan 1970 00:00:00 GMT
> Cache-Control: no-cache
> Cache-Control: max-age=0
> Content-Type: application/rdf+xml;charset=UTF-8
> Content-Length: 4105
>
> Sending the same HTTP request again a few seconds later, the date is now
> Date: Sun, 10 Jul 2011 11:24:29 GMT
>
> These responses are not cached at all. I do not think that is a good
> idea, and it gives wrong information for things like Expires :) Maybe it
> is just because the service is starting and there are still things to tweak.

Thanks! We're going to consider these issues.

Yann

>
>
> --
> Karl Dubost - http://dev.opera.com/
> Developer Relations & Tools, Opera Software
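
The 289-day figure quoted above follows from simple arithmetic: 10 000 000 resources fetched at one request every 2.5 seconds. A minimal Python sketch of that calculation is shown here; the function name `crawl_days` is hypothetical and only illustrates the estimate, ignoring retries, network latency, and parallel crawlers.

```python
SECONDS_PER_DAY = 86_400


def crawl_days(resource_count: int, seconds_per_request: float) -> float:
    """Estimate how many days a sequential crawl takes at a fixed request interval."""
    return resource_count * seconds_per_request / SECONDS_PER_DAY


if __name__ == "__main__":
    # Figures from the thread: 10 000 000 files, at 1 request every 5 s or every 2.5 s.
    for delay in (5.0, 2.5):
        days = crawl_days(10_000_000, delay)
        print(f"1 request every {delay} s: about {days:.1f} days")
```

At the originally proposed rate of one request every 5 seconds the same estimate doubles, which is why a one-time dump is attractive for initial indexing.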
Received on Sunday, 10 July 2011 12:02:52 UTC
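
As an illustration of the caching point raised in the message, the following Python sketch merges the repeated `Cache-Control` header lines from the quoted `curl` output and checks whether a cache could reuse the response without contacting the server. The helper names are hypothetical, and the check is a deliberate simplification: it ignores `Expires`, `s-maxage`, and heuristic freshness.

```python
def cache_directives(headers):
    """Merge all Cache-Control header lines into one dict of directive -> argument."""
    directives = {}
    for name, value in headers:
        if name.lower() != "cache-control":
            continue
        # Several Cache-Control lines are equivalent to one comma-separated list.
        for part in value.split(","):
            part = part.strip().lower()
            if not part:
                continue
            key, _, arg = part.partition("=")
            directives[key.strip()] = arg.strip().strip('"')
    return directives


def reusable_without_revalidation(headers):
    """Simplified check: True only if a positive max-age is set and nothing forbids reuse."""
    directives = cache_directives(headers)
    if "no-store" in directives or "no-cache" in directives:
        return False
    max_age = directives.get("max-age", "")
    return max_age.isdigit() and int(max_age) > 0


# Headers copied from the response quoted in the message.
SUDOC_HEADERS = [
    ("Cache-Control", "no-store"),
    ("Expires", "Thu, 01 Jan 1970 00:00:00 GMT"),
    ("Cache-Control", "no-cache"),
    ("Cache-Control", "max-age=0"),
]

if __name__ == "__main__":
    print(cache_directives(SUDOC_HEADERS))
    print(reusable_without_revalidation(SUDOC_HEADERS))
```

With the quoted headers the check reports that the response cannot be reused, which matches the observation that every request is regenerated by the server.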