Re: ANN: Sudoc bibliographic and authority data

From: Karl Dubost <karld@opera.com>
Date: Sun, 10 Jul 2011 07:29:10 -0400
Message-Id: <3583F5A4-D41E-41E2-B2FC-F9AED3A6B9AC@opera.com>
Cc: Giovanni Tummarello <giovanni.tummarello@deri.org>, public-lod@w3.org, Kingsley Idehen <kidehen@openlinksw.com>, "giulio.cesare@gmail.com" <giulio.solaroli@deri.org>
To: nicolas@abes.fr
Hello Nicolas,

First of all, very cool. Two comments.


On 9 Jul 2011, at 19:36, Yann NICOLAS wrote:
>> quite politely e.g. 1 every 5 secs-
> May I suggest that you crawl twice faster ?

So, 1 request every 2.5 s.

On 8 Jul 2011, at 03:31, Yann NICOLAS wrote:
> Sorry, we don't provide any dump, as the 10 000 000 files are generated on the fly from 

That means the crawl will take… 289 days (10,000,000 files at one request every 2.5 s).
There should be an easier way to handle the initial crawl, such as a one-off dump for some specific search engines, followed by incremental updates based on each record's last update. That is especially true when there is a sitemap.
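
For reference, the 289-day figure can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes a single sequential crawler, exactly 10,000,000 records and no retries or pauses; the function name crawl_days is made up for the illustration.

```python
# Figures quoted in this thread: 10,000,000 records, one request every 2.5 s.
RECORDS = 10_000_000
SECONDS_PER_REQUEST = 2.5
SECONDS_PER_DAY = 24 * 60 * 60


def crawl_days(records: int, seconds_per_request: float) -> float:
    """Days needed to fetch every record sequentially at a fixed rate."""
    return records * seconds_per_request / SECONDS_PER_DAY


days = crawl_days(RECORDS, SECONDS_PER_REQUEST)
print(f"{days:.1f} days")  # 289.4 days
```

At the original rate of one request every 5 seconds, the same calculation gives roughly twice that.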


The cache policy does not seem to make good use of your HTTP resources.

% curl -sI -H "Accept:application/rdf+xml" http://www.sudoc.fr/132133520
HTTP/1.1 200 OK
Date: Sun, 10 Jul 2011 11:22:05 GMT
Cache-Control: no-store
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Cache-Control: no-cache
Cache-Control: max-age=0
Content-Type: application/rdf+xml;charset=UTF-8
Content-Length: 4105

When I repeat the same HTTP request a little later, the date is now
Date: Sun, 10 Jul 2011 11:24:29 GMT

These responses are not cached at all. I do not think that is a good idea, and some headers carry wrong information, such as Expires :) Maybe it is just because the service is new and there are still things to tweak.
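
To make the point concrete, here is a rough Python sketch that reads the Cache-Control lines from the response above and reports whether a cache could reuse the response without contacting the server again. The helper names cache_directives and is_reusable are invented for this example, and the check is a deliberate simplification of the HTTP caching rules (it ignores Expires, s-maxage, private and heuristic freshness).

```python
# Caching-related headers copied from the curl output above.
RAW_HEADERS = """\
Cache-Control: no-store
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Cache-Control: no-cache
Cache-Control: max-age=0
"""


def cache_directives(raw: str) -> dict:
    """Merge every Cache-Control line into one {directive: argument} dict."""
    directives = {}
    for line in raw.splitlines():
        name, _, value = line.partition(":")
        if name.strip().lower() != "cache-control":
            continue  # skip Expires and any other header
        for item in value.split(","):
            key, _, arg = item.strip().partition("=")
            if key:
                directives[key.lower()] = arg or None
    return directives


def is_reusable(directives: dict) -> bool:
    """Simplified check: may a stored copy be served without revalidation?"""
    # no-store forbids storing; no-cache forces revalidation on every use.
    if "no-store" in directives or "no-cache" in directives:
        return False
    max_age = directives.get("max-age")
    return max_age is not None and max_age.isdigit() and int(max_age) > 0


directives = cache_directives(RAW_HEADERS)
print(directives)
print("reusable from cache:", is_reusable(directives))
```

With the headers above the answer is no on three separate counts. A single line such as Cache-Control: max-age=3600 (the value is only an example, not a recommendation for this service) would pass the same check.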

Karl Dubost - http://dev.opera.com/
Developer Relations & Tools, Opera Software
Received on Sunday, 10 July 2011 11:29:59 UTC
