Re: ANN: Sudoc bibliographic and authority data from Karl Dubost on 2011-07-10 (public-lod@w3.org from July 2011)

From: Karl Dubost <karld@opera.com>
Date: Sun, 10 Jul 2011 07:29:10 -0400
To: nicolas@abes.fr
Cc: Giovanni Tummarello <giovanni.tummarello@deri.org>, public-lod@w3.org, Kingsley Idehen <kidehen@openlinksw.com>, "giulio.cesare@gmail.com" <giulio.solaroli@deri.org>
Message-Id: <3583F5A4-D41E-41E2-B2FC-F9AED3A6B9AC@opera.com>

Bonjour Nicolas,

First of all, very cool. Two comments.


# INITIAL INDEXING

On 9 Jul 2011, at 19:36, Yann NICOLAS wrote:
>> quite politely e.g. 1 every 5 secs-
> May I suggest that you crawl twice faster ?

That is 1 request every 2.5 s.

On 8 Jul 2011, at 03:31, Yann NICOLAS wrote:
> Sorry, we don't provide any dump, as the 10 000 000 files are generated on the fly from 

At 1 request every 2.5 s, crawling 10,000,000 files will take about 289 days.
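For reference, the arithmetic behind that figure can be sketched as follows. It only uses the numbers quoted in this thread and assumes a strictly sequential crawl where the delay between requests dominates (fetch time itself is ignored); the function name is just for illustration.

```python
# Back-of-the-envelope crawl duration, using the figures quoted in this thread.
TOTAL_RECORDS = 10_000_000    # files generated on the fly, per Yann's message
SECONDS_PER_REQUEST = 2.5     # twice as fast as 1 request every 5 s
SECONDS_PER_DAY = 24 * 60 * 60


def crawl_days(records: int, delay_s: float) -> float:
    """Days needed to fetch `records` URLs one after another, `delay_s` apart."""
    return records * delay_s / SECONDS_PER_DAY


days_fast = crawl_days(TOTAL_RECORDS, SECONDS_PER_REQUEST)
days_slow = crawl_days(TOTAL_RECORDS, 5.0)
print(f"{days_fast:.0f} days at 2.5 s, {days_slow:.0f} days at 5 s")
```

Even at the faster rate, a full crawl takes most of a year, which is why a one-time dump matters for the initial indexing.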
There should be an easier way to do the initial crawl (for example a one-time initial dump for some specific search engines), followed by updates based on each record's last modification date, especially when there is a sitemap.
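As a rough sketch of the "update depending on the last update" idea: a crawler that already holds an initial dump could read the sitemap and refetch only the entries modified since its last pass. This assumes the sitemap follows the sitemaps.org format and carries `<lastmod>` values, which I have not checked for Sudoc. The sample document, the second URL, and both dates are invented for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap excerpt: the second URL and both dates are made up.
SAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.sudoc.fr/132133520</loc><lastmod>2011-07-09</lastmod></url>
  <url><loc>http://www.sudoc.fr/EXAMPLE-ID</loc><lastmod>2011-06-01</lastmod></url>
</urlset>
""".strip()


def urls_modified_since(sitemap_xml: str, last_crawl: date) -> list:
    """Return the <loc> of every entry whose <lastmod> is on or after last_crawl."""
    root = ET.fromstring(sitemap_xml)
    changed = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if loc is None or lastmod is None:
            continue  # without a date we cannot tell; a real crawler might refetch
        # lastmod may be a full W3C datetime; the first 10 characters are the date.
        if date.fromisoformat(lastmod.strip()[:10]) >= last_crawl:
            changed.append(loc.strip())
    return changed


print(urls_modified_since(SAMPLE_SITEMAP, date(2011, 7, 1)))
```

With this approach the 10,000,000 records are transferred once, and the polite crawl rate only has to keep up with the records that actually change.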


# CACHING

The cache policy does not seem to make good use of your HTTP resources.

% curl -sI -H "Accept:application/rdf+xml" http://www.sudoc.fr/132133520
HTTP/1.1 200 OK
Date: Sun, 10 Jul 2011 11:22:05 GMT
Cache-Control: no-store
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Cache-Control: no-cache
Cache-Control: max-age=0
Content-Type: application/rdf+xml;charset=UTF-8
Content-Length: 4105

Repeating the same HTTP request shortly afterwards, the Date header is now:
Date: Sun, 10 Jul 2011 11:24:29 GMT

So these responses are not cached at all. I do not think that is a good idea, and some of the headers carry wrong information, such as Expires :) Maybe it is just because the service is starting up and there are still things to tweak.
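To make the point concrete, here is a small sketch that reads the Cache-Control fields from the curl output above and applies a deliberately simplified version of the HTTP caching rules. The helper names are mine, and the `max-age=3600` value at the end is only a hypothetical contrast, not a recommendation for Sudoc.

```python
# Header fields copied from the curl output above; repeated fields stay separate.
RESPONSE_HEADERS = [
    ("Cache-Control", "no-store"),
    ("Expires", "Thu, 01 Jan 1970 00:00:00 GMT"),
    ("Cache-Control", "no-cache"),
    ("Cache-Control", "max-age=0"),
]


def cache_directives(headers):
    """Merge every Cache-Control field into one dict: directive -> value or None."""
    directives = {}
    for name, value in headers:
        if name.lower() != "cache-control":
            continue
        for part in value.split(","):
            part = part.strip().lower()
            if not part:
                continue
            key, _, arg = part.partition("=")
            directives[key] = arg or None
    return directives


def may_reuse_without_revalidation(directives):
    """Simplified check: may a cache serve a stored copy without asking the origin?"""
    if "no-store" in directives or "no-cache" in directives:
        return False
    max_age = directives.get("max-age")
    return max_age is not None and max_age.isdigit() and int(max_age) > 0


observed = cache_directives(RESPONSE_HEADERS)
print(observed, may_reuse_without_revalidation(observed))

# Hypothetical alternative: let caches keep a record for one hour.
relaxed = cache_directives([("Cache-Control", "max-age=3600")])
print(relaxed, may_reuse_without_revalidation(relaxed))
```

With the observed headers every request, including every crawler request, goes back to the server that generates the files on the fly. Any positive freshness lifetime would let intermediaries absorb part of that load.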


-- 
Karl Dubost - http://dev.opera.com/
Developer Relations & Tools, Opera Software

Received on Sunday, 10 July 2011 11:29:59 UTC