Re: ANN: Sudoc bibliographic and authority data

Hi Giovanni,

On 09/07/2011 23:10, Giovanni Tummarello wrote:
> Hi Nicolas,
>
> It's getting into Sindice indeed -

Yes, I have noticed :)

> quite politely, e.g. 1 every 5 secs -
> we'll monitor speed and completeness. If you think it's OK for us to
> crawl faster, please say so via a robots.txt directive or just say so
May I suggest that you crawl twice as fast?

>
> http://sindice.com/search?q=book&nq=&fq=domain%3Awww.sudoc.fr&sortbydate=1&interface=advanced
>
> at the same time i notice something funny in the markup e.g. if you go
> with a browser you get redirected to something that has almost no data
>
> for example the sitemap contains
>
> http://www.sudoc.fr/000000043
>
> if you go there you get redirected to
>
> http://www.sudoc.abes.fr/DB=2.1/SRCH?IKT=12&TRM=000000043
>
> which if you put in the inspector
>
> http://inspector.sindice.com/inspect?url=http%3A%2F%2Fwww.sudoc.abes.fr%2FDB%3D2.1%2FSRCH%3FIKT%3D12%26TRM%3D000000043#TRIPLES
>
> you get very little data
>
> however of course if i use the inspector on
> http://www.sudoc.fr/000000043 i get data
>
> http://inspector.sindice.com/inspect?url=http%3A%2F%2Fwww.sudoc.fr%2F000000043&content=&contentType=auto#TRIPLES
>
> which however is mostly schema.org data!
>
> but in sindice i have lots of RDF data with all sorts of other ontologies
>
> http://sindice.com/search/page?url=http%3A%2F%2Fwww.sudoc.fr%2F000385123
>
> is there any way you could try to normalize all into a single markup
> type? i think it would be easier to debug and ultimately better for
> all..

I will try to explain our intention, our constraints and the mechanism 
we've implemented.

- Intention -

We want to meet several needs:
. providing RDF/XML to semantic-oriented clients like Sindice
. providing HTML + schema.org microdata to traditional search engines 
like Google
. providing an HTML UI to users


- Constraints -

. For various reasons, we can't add microdata to our traditional Sudoc UI,
hence the need for special HTML+microdata pages for search engines. :(
. HTML+microdata pages and RDF pages can't use the same vocabularies:
schema.org /oblige/ (the microdata pages have to stick to schema.org).


- Mechanisms -

Let's start from: http://www.sudoc.fr/132133520

. If the request asks for RDF/XML, we serve RDF/XML content (as if
you had requested http://www.sudoc.fr/132133520.rdf)
     This is what the Sindice crawler is doing and getting: the 55,764
documents found in your index are made of triples extracted
from these RDF/XML pages. This is what we expected. Fine :)

. If our Apache server considers a user agent to be a robot and this
agent does not ask for RDF/XML, we serve special HTML content (as if
you had requested http://www.sudoc.fr/132133520.html)
     It seems to work, as the Google cache contains this kind of HTML +
schema.org microdata page:
http://webcache.googleusercontent.com/search?q=cache:www.sudoc.fr/000000043

. In all other cases, we redirect to our traditional, non-semantic UI:
http://www.sudoc.abes.fr/DB=2.1/SRCH?IKT=12&TRM=000000043
     . NB: we plan to add two <link> elements to this HTML page: <link
rel="alternate" type="application/rdf+xml"
href="http://www.sudoc.fr/000000043.rdf"/> and <link rel="canonical"
href="http://www.sudoc.fr/000000043"/>, to reduce the URL confusion.
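
To make the three cases above concrete, here is a minimal Python sketch of
the dispatch logic. It is only an illustration, not the actual Apache
configuration: the function names, the robot heuristic (ROBOT_MARKERS) and
the reading of "asks for RDF/XML" as "the Accept header lists
application/rdf+xml" are all assumptions.

```python
# Hypothetical sketch of the three-way dispatch described above.
# The robot markers and the Accept-header check are assumptions,
# not the real server rules.

ROBOT_MARKERS = ("bot", "crawler", "spider")  # assumed heuristic


def wants_rdf(accept_header):
    """Return True if the Accept header lists application/rdf+xml."""
    media_types = [part.split(";")[0].strip().lower()
                   for part in accept_header.split(",")]
    return "application/rdf+xml" in media_types


def looks_like_robot(user_agent):
    """Return True if the User-Agent contains one of the assumed markers."""
    ua = user_agent.lower()
    return any(marker in ua for marker in ROBOT_MARKERS)


def dispatch(record_id, accept_header="", user_agent=""):
    """Return (action, url) for a request on http://www.sudoc.fr/<record_id>."""
    base = "http://www.sudoc.fr/" + record_id
    if wants_rdf(accept_header):
        # Case 1: RDF/XML requested, whoever the client is.
        return ("serve", base + ".rdf")
    if looks_like_robot(user_agent):
        # Case 2: a robot that did not ask for RDF/XML gets HTML + microdata.
        return ("serve", base + ".html")
    # Case 3: everyone else is redirected to the traditional UI.
    return ("redirect",
            "http://www.sudoc.abes.fr/DB=2.1/SRCH?IKT=12&TRM=" + record_id)
```

For instance, dispatch("000000043", "text/html", "Mozilla/5.0 Firefox/5.0")
falls through to the third case and yields a redirect, which matches what
you saw with a browser.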


- - - - -

. It is not simple, but it seems to work, i.e. Google, Sindice and users
seem to get what they should.
. Is there a better way to obtain the same results?
. Which side effects are likely?

Thanks for your help and your attention!

Yann

>
> looking forward to support
> Giovanni
> Gio
>
>
> On Fri, Jul 8, 2011 at 1:27 PM, Kingsley Idehen <kidehen@openlinksw.com> wrote:
>> On 7/8/11 8:31 AM, Yann NICOLAS wrote:
>>
>> On 08/07/2011 01:42, Kingsley Idehen wrote:
>>
>> On 7/7/11 10:17 PM, Yann NICOLAS wrote:
>>
>> Hello,
>>
>> Sudoc [1], the French academic union catalogue maintained by ABES [2], has
>> just been released as linked open data.
>>
>> 10 million bibliographic records are now available as RDF/XML.
>>
>> Examples for the Sudoc record whose internal id is 132133520:
>> . Resource URI : http://www.sudoc.fr/132133520/id
>> . Generic document : http://www.sudoc.fr/132133520 (content negotiation is
>> supported)
>>
>>
>> Great job!
>>
>> Is there an RDF dump anywhere?
>>
>>
>> Sorry, we don't provide any dump, as the 10 000 000 files are generated on
>> the fly from Oracle (stored as XML type + some more tables).
>> We provide a complete sitemap at
>> http://www.sudoc.fr/noticesbiblio/sitemap.txt , and we hope that Sindice
>> will crawl the whole stuff.
>> Would that help?
>>
>> Any advice welcome,
>>
>> Yann
>>
>> --
>> --
>> Yann NICOLAS
>> Etudes & Projets
>> ABES
>>
>> Okay, no problem with sitemaps as dump alternatives re. getting data
>> imported into Linked Data hubs such as our LOD cloud cache and Sindice etc.
>>
>>
>> --
>>
>> Regards,
>>
>> Kingsley Idehen
>> President & CEO
>> OpenLink Software
>> Web: http://www.openlinksw.com
>> Weblog: http://www.openlinksw.com/blog/~kidehen
>> Twitter/Identi.ca: kidehen
>>
>>
>>
>>
>>


-- 
--
Yann NICOLAS
Etudes & Projets
ABES

Received on Saturday, 9 July 2011 23:36:49 UTC