
Re: ANN: Sudoc bibliographic and authority data

From: Yann Nicolas <nicolas@abes.fr>
Date: Sun, 10 Jul 2011 14:02:24 +0200 (CEST)
To: Karl Dubost <karld@opera.com>
Cc: Giovanni Tummarello <giovanni.tummarello@deri.org>, public-lod@w3.org, Kingsley Idehen <kidehen@openlinksw.com>, "giulio.cesare@gmail.com" <giulio.solaroli@deri.org>
Message-ID: <5749761.95846.1310299344741.JavaMail.root@eole.abes.fr>
Hi Karl,

Psst: my parents told me that my first name is Yann :)
Too late to change ;)


more below

----- Original message -----
> From: "Karl Dubost" <karld@opera.com>
> To: nicolas@abes.fr
> Cc: "Giovanni Tummarello" <giovanni.tummarello@deri.org>, public-lod@w3.org, "Kingsley Idehen"
> <kidehen@openlinksw.com>, "giulio.cesare@gmail.com" <giulio.solaroli@deri.org>
> Sent: Sunday, 10 July 2011 13:29:10
> Subject: Re: ANN: Sudoc bibliographic and authority data
> Hello Nicolas,
> 
> First of all, very cool. Two comments.
> 
> 
> # INITIAL INDEXING
> 
> On 9 Jul 2011, at 19:36, Yann NICOLAS wrote:
> >> quite politely e.g. 1 every 5 secs-
> > May I suggest that you crawl twice as fast?
> 
> 1 every 2.5s
> 
> On 8 Jul 2011, at 03:31, Yann NICOLAS wrote:
> > Sorry, we don't provide any dump, as the 10 000 000 files are
> > generated on the fly from
> 
> It means the crawl will be done in… 289 days.
> There should be an easier way for the initial crawling (an initial
> dump for some specific search engines, once), then update depending on
> the last update. Specifically when there is a sitemap.
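For reference, the 289-day figure follows directly from the numbers quoted above: 10 000 000 records at one request every 2.5 seconds. A quick sketch of that arithmetic in Python (the function name is only illustrative):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def crawl_days(record_count: int, seconds_per_request: float) -> float:
    """Days needed to fetch every record once at a fixed request interval."""
    return record_count * seconds_per_request / SECONDS_PER_DAY


if __name__ == "__main__":
    # Figures from the thread: 10 000 000 records, 1 request every 2.5 s.
    print(round(crawl_days(10_000_000, 2.5)))  # 289
    # The original, more polite rate of 1 request every 5 s.
    print(round(crawl_days(10_000_000, 5)))    # 579
```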


You're right. Of course.
We are already thinking about a dump.
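On the crawler side, the "dump once, then update" idea could look roughly like the sketch below: after the initial load, only records whose sitemap `lastmod` is newer than the previous crawl are fetched again. This assumes the sitemap follows the sitemaps.org protocol and gives full `YYYY-MM-DD` dates in `lastmod`; the sample XML, the example.org URLs and the `changed_since` helper are made up for illustration and are not taken from the Sudoc service.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap excerpt, for illustration only.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.org/record/1</loc><lastmod>2011-05-01</lastmod></url>
  <url><loc>http://example.org/record/2</loc><lastmod>2011-07-09</lastmod></url>
  <url><loc>http://example.org/record/3</loc></url>
</urlset>"""


def changed_since(sitemap_xml: str, last_crawl: date) -> list:
    """Return the URLs that should be fetched again after last_crawl."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        if not loc:
            continue
        # Without a lastmod we cannot tell, so fetch again to be safe.
        if lastmod is None or date.fromisoformat(lastmod.strip()[:10]) > last_crawl:
            urls.append(loc.strip())
    return urls


if __name__ == "__main__":
    for u in changed_since(SAMPLE, date(2011, 6, 1)):
        print(u)
```

With a last crawl on 1 June 2011, only records 2 and 3 of the sample would be requested again.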

> 
> 
> # CACHING
> 
> The cache policy does not seem to make good use of your HTTP resources.
> 
> % curl -sI -H "Accept:application/rdf+xml" http://www.sudoc.fr/132133520
> HTTP/1.1 200 OK
> Date: Sun, 10 Jul 2011 11:22:05 GMT
> Cache-Control: no-store
> Expires: Thu, 01 Jan 1970 00:00:00 GMT
> Cache-Control: no-cache
> Cache-Control: max-age=0
> Content-Type: application/rdf+xml;charset=UTF-8
> Content-Length: 4105
> 
> Repeating the same HTTP request a few seconds later, the date is now
> Date: Sun, 10 Jul 2011 11:24:29 GMT
> 
> These responses are not cached at all. I do not think that is a good
> idea, and headers such as Expires carry wrong information :) Maybe it
> is just because the service is starting and there are still things to tweak.


Thanks!
We're going to consider these issues.
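To illustrate why the headers quoted above rule out any reuse, here is a small sketch that merges the repeated `Cache-Control` lines and checks whether a cache may store the response or serve it without revalidation. It is deliberately simplified (it ignores `Expires`, `s-maxage`, `private`, `Vary` and heuristic freshness), and the function names and the "relaxed" policy are assumptions for illustration, not a recommendation for the Sudoc service.

```python
def parse_cache_control(values):
    """Merge one or more Cache-Control header values into a dict of directives."""
    directives = {}
    for value in values:
        for part in value.split(","):
            part = part.strip()
            if not part:
                continue
            name, _, arg = part.partition("=")
            # Directives without an argument (e.g. no-store) map to None.
            directives[name.strip().lower()] = arg.strip().strip('"') or None
    return directives


def may_store(directives):
    """A cache must not keep the response at all when no-store is present."""
    return "no-store" not in directives


def fresh_without_revalidation(directives):
    """True if a stored copy could be served without contacting the server."""
    if "no-store" in directives or "no-cache" in directives:
        return False
    max_age = directives.get("max-age")
    return max_age is not None and max_age.isdigit() and int(max_age) > 0


# The three Cache-Control lines from the curl output in the thread.
observed = parse_cache_control(["no-store", "no-cache", "max-age=0"])

# A hypothetical policy allowing one day of reuse, for comparison.
relaxed = parse_cache_control(["public, max-age=86400"])

if __name__ == "__main__":
    print(may_store(observed), fresh_without_revalidation(observed))  # False False
    print(may_store(relaxed), fresh_without_revalidation(relaxed))    # True True
```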

Yann



> 
> 
> --
> Karl Dubost - http://dev.opera.com/
> Developer Relations & Tools, Opera Software
Received on Sunday, 10 July 2011 12:02:52 UTC
