Re: ANN: Sudoc bibliographic and authority data from Yann NICOLAS on 2011-07-12 (public-lod@w3.org from July 2011)

From: Yann NICOLAS <nicolas@abes.fr>
Date: Tue, 12 Jul 2011 10:26:01 +0200
To: Giovanni Tummarello <giovanni.tummarello@deri.org>
CC: public-lod@w3.org, Antoine Isaac <aisaac@few.vu.nl>
Message-ID: <4E1C0519.1000500@abes.fr>
Thank you so much for your advice and your praise for *simplicity*.
A tribute to Ockham/Okkam, I guess ;)

Yann

On 11/07/2011 00:29, Giovanni Tummarello wrote:
> Hi Antoine, Yann, all,
>
> My advice is to keep it simple and complete.
>
> Very simple indeed. Please forget about content negotiation. It was a
> horrible idea all along; it doesn't work because it WILL break, since
> no humans are looking at it. Really: anything that redirects and
> changes the URL when you put it in a browser is just so wrong.
>
> Have one single version of the page with RDFa + schema.org. I know they
> say on schema.org not to do that, but they're just being silly: they will
> read the microdata anyway (the schema part). The RDFa part is one line of
> code to extract if they want to; if they don't, who cares. They only care
> about the schema part anyway; let others use the RDFa.
>
> In terms of full crawling, if you allow it, 1 URL per second should be
> sustainable. This way the data would be in within 3 months or so, which
> still seems ridiculous, but that's what search engines do. If you have the
> proper last-updated date set, that's great: the updates will be just
> incremental.
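
As a rough check of the crawl-time estimate above, here is a minimal Python sketch. It assumes the figure of 10 million records given in the original announcement and a perfectly sustained rate; the function name `crawl_days` is only illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def crawl_days(total_urls: int, urls_per_second: float) -> float:
    """Days needed to fetch total_urls at a constant, sustained rate."""
    return total_urls / urls_per_second / SECONDS_PER_DAY


RECORDS = 10_000_000  # record count taken from the announcement

# Rates mentioned in the thread: 1 URL/s, and 1 URL every 5 seconds.
for label, rate in [("1 URL per second", 1.0), ("1 URL every 5 seconds", 0.2)]:
    print(f"{label}: about {crawl_days(RECORDS, rate):.0f} days")
```

At 1 URL per second this comes to about 116 days, so closer to four months than three; at 1 URL every 5 seconds it is roughly 579 days.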
>
> Otherwise, yes, a dump would allow us to ingest everything in full, but it
> is a manual operation between us and you.
>
> That is my advice. That said, I know that one might have several
> ideas, motives, etc. which might differ from what this advice
> suggests. Worry not: whoever consumes data had better get ready to be
> pretty flexible, so we take all you offer, really :)
> Cheers,
>
> Giovanni
>
> On Sun, Jul 10, 2011 at 12:22 PM, Antoine Isaac <aisaac@few.vu.nl> wrote:
>> Yann, Giovanni,
>>
>>
>>> Which side effects are probable ?
>>
>> Giovanni made the same comment about data.europeana.eu a couple of weeks
>> ago. The data we serve there is different from the RDFa mark-up on our web
>> portal.
>> We had some reasons to do this, including, well, that the RDFa data
>> mixes information and non-information resources to make data consumption
>> easier (not necessarily by search engines, btw), and works with URIs that
>> pre-date our linked data service.
>>
>> The RDFa and the RDF obtained with LD-style conneg are also not about the
>> same URIs, which should avoid any confusion.
>> But I can understand that if Sindice tries to fetch both data sources, it
>> may assume the data to be the same. And this assumption could bring a number
>> of undesirable side effects if Sindice merges everything it gets...
>>
>> That being said, perhaps the solution lies in Sindice being less greedy ;-)
>> and just working with the first data source it finds for a given URI.
>> I do like the idea of having several (simple) channels for data publication
>> over the web, which serve different goals.
>> Maybe we need to better articulate the practices and expectations, though...
>>
>> Cheers,
>>
>> Antoine
>>
>>
>>>   Hi Giovanni,
>>>
>>> On 09/07/2011 23:10, Giovanni Tummarello wrote:
>>>> Hi Nicolas,
>>>>
>>>> It's getting into Sindice indeed -
>>> Yes, I have noticed :)
>>>
>>>> quite politely, e.g. 1 every 5 secs -
>>>> we'll monitor speed and completeness. If you think it's OK for us to
>>>> crawl faster, please say so via a robots.txt directive or just say so
>>> May I suggest that you crawl twice as fast?
>>>
>>>>
>>>> http://sindice.com/search?q=book&nq=&fq=domain%3Awww.sudoc.fr&sortbydate=1&interface=advanced
>>>>
>>>> at the same time i notice something funny in the markup e.g. if you go
>>>> with a browser you get redirected to something that has almost no data
>>>>
>>>> for example the sitemap contains
>>>>
>>>> http://www.sudoc.fr/000000043
>>>>
>>>> if you go there you get redirected to
>>>>
>>>> http://www.sudoc.abes.fr/DB=2.1/SRCH?IKT=12&TRM=000000043
>>>>
>>>> which if you put in the inspector
>>>>
>>>>
>>>> http://inspector.sindice.com/inspect?url=http%3A%2F%2Fwww.sudoc.abes.fr%2FDB%3D2.1%2FSRCH%3FIKT%3D12%26TRM%3D000000043#TRIPLES
>>>>
>>>> you get very little data
>>>>
>>>> however of course if i use the inspector on
>>>> http://www.sudoc.fr/000000043  i get data
>>>>
>>>>
>>>> http://inspector.sindice.com/inspect?url=http%3A%2F%2Fwww.sudoc.fr%2F000000043&content=&contentType=auto#TRIPLES
>>>>
>>>> which however is mostly schema.org data!
>>>>
>>>> but in Sindice I have lots of RDF data with all sorts of other ontologies
>>>>
>>>> http://sindice.com/search/page?url=http%3A%2F%2Fwww.sudoc.fr%2F000385123
>>>>
>>>> is there any way you could try to normalize all into a single markup
>>>> type? i think it would be easier to debug and ultimately better for
>>>> all..
>>> I will try to explain our intention, our constraints and the mechanism
>>> we've implemented.
>>>
>>> - Intention -
>>>
>>> We want to meet several needs :
>>> . providing RDF/XML to semantic-oriented clients like Sindice
>>> . providing HTML + schema.org microdata to traditional search engines like
>>> Google
>>> . providing an HTML UI to users
>>>
>>>
>>> - Constraints -
>>>
>>> . For various reasons, we can't add microdata to our traditional Sudoc UI.
>>> Hence the need for special HTML+microdata pages for search engines. :(
>>> . HTML+microdata pages and RDF pages can't support the same vocabularies,
>>> schema.org /oblige/.
>>>
>>>
>>> - Mechanisms -
>>>
>>> Let's start from : http://www.sudoc.fr/132133520
>>>
>>> . If the request asks for RDF/XML, we provide RDF/XML content (as if
>>> you had requested http://www.sudoc.fr/132133520.rdf).
>>> This is what the Sindice crawler is doing and getting: the 55,764 documents
>>> found in your index are composed of triples extracted from this
>>> RDF/XML page. It is what we expected. Fine :)
>>>
>>> . If our Apache server considers a user agent to be a robot and if this
>>> agent does not ask for RDF/XML, we provide special HTML content (as if you
>>> had requested http://www.sudoc.fr/132133520.html)
>>> It seems to work as Google cache contains this kind of HTML + schema.org
>>> microdata pages :
>>> http://webcache.googleusercontent.com/search?q=cache:www.sudoc.fr/000000043
>>>
>>> . In other cases, we redirect to our traditional and non semantic UI :
>>> http://www.sudoc.abes.fr/DB=2.1/SRCH?IKT=12&TRM=000000043
>>> . NB: we plan to add these <link> elements to this HTML page:
>>> <link rel="alternate" type="application/rdf+xml"
>>> href="http://www.sudoc.fr/000000043.rdf"/> and <link rel="canonical"
>>> href="http://www.sudoc.fr/000000043"/> to alleviate the URL confusion.
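
The three cases described under Mechanisms can be sketched as a small dispatch function. This is only an illustration in Python, not the Apache configuration actually used on www.sudoc.fr: the robot-detection rule (`ROBOT_HINTS`) and the function names are hypothetical, and Accept-header handling is simplified to a substring check that ignores q-values.

```python
from typing import Tuple

# Hypothetical rule: the thread does not say how Apache recognises robots.
ROBOT_HINTS = ("bot", "crawler", "spider")


def is_robot(user_agent: str) -> bool:
    """Guess whether a User-Agent string belongs to a robot (assumed rule)."""
    ua = user_agent.lower()
    return any(hint in ua for hint in ROBOT_HINTS)


def choose_response(record_id: str, accept: str, user_agent: str) -> Tuple[str, str]:
    """Return (kind, url) for a request to http://www.sudoc.fr/<record_id>."""
    # Case 1: the client explicitly asks for RDF/XML.
    if "application/rdf+xml" in accept.lower():
        return ("rdf", f"http://www.sudoc.fr/{record_id}.rdf")
    # Case 2: a robot that did not ask for RDF gets HTML + schema.org microdata.
    if is_robot(user_agent):
        return ("html+microdata", f"http://www.sudoc.fr/{record_id}.html")
    # Case 3: everyone else is redirected to the traditional Sudoc UI.
    return ("redirect", f"http://www.sudoc.abes.fr/DB=2.1/SRCH?IKT=12&TRM={record_id}")


print(choose_response("132133520", "application/rdf+xml", "any-client"))
print(choose_response("132133520", "text/html", "Googlebot/2.1"))
print(choose_response("132133520", "text/html", "Mozilla/5.0 Firefox/5.0"))
```

The order of the checks matters: a robot that asks for RDF/XML falls into the first case, which matches the description of what the Sindice crawler receives.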
>>>
>>>
>>> - - - - -
>>>
>>> . It is not simple, but it seems to work, ie Google, Sindice and users
>>> seem to get what they should.
>>> . Is there a better way to obtain the same results ?
>>> . Which side effects are probable ?
>>>
>>> Thanks for your help and your attention !
>>>
>>> Yann
>>>
>>>> looking forward to support
>>>> Giovanni
>>>> Gio
>>>>
>>>>
>>>> On Fri, Jul 8, 2011 at 1:27 PM, Kingsley Idehen <kidehen@openlinksw.com> wrote:
>>>>> On 7/8/11 8:31 AM, Yann NICOLAS wrote:
>>>>>
>>>>> On 08/07/2011 01:42, Kingsley Idehen wrote:
>>>>>
>>>>> On 7/7/11 10:17 PM, Yann NICOLAS wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> Sudoc [1], the French academic union catalogue maintained by ABES [2],
>>>>> has just been released as linked open data.
>>>>>
>>>>> 10 million bibliographic records are now available as RDF/XML.
>>>>>
>>>>> Examples for the Sudoc record whose internal id is 132133520:
>>>>> . Resource URI: http://www.sudoc.fr/132133520/id
>>>>> . Generic document: http://www.sudoc.fr/132133520 (content negotiation
>>>>> is supported)
>>>>>
>>>>>
>>>>> Great job!
>>>>>
>>>>> Is there an RDF dump anywhere?
>>>>>
>>>>>
>>>>> Sorry, we don't provide any dump, as the 10,000,000 files are generated
>>>>> on the fly from Oracle (stored as XML type + some more tables).
>>>>> We provide a complete sitemap at
>>>>> http://www.sudoc.fr/noticesbiblio/sitemap.txt, and we hope that
>>>>> Sindice will crawl the whole thing.
>>>>> Would that help?
>>>>>
>>>>> Any advice welcome,
>>>>>
>>>>> Yann
>>>>>
>>>>> --
>>>>> --
>>>>> Yann NICOLAS
>>>>> Etudes & Projets
>>>>> ABES
>>>>>
>>>>> Okay, no problem with sitemaps as dump alternatives for getting data
>>>>> imported into Linked Data hubs such as our LOD cloud cache, Sindice,
>>>>> etc.
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Regards,
>>>>>
>>>>> Kingsley Idehen
>>>>> President & CEO
>>>>> OpenLink Software
>>>>> Web: http://www.openlinksw.com
>>>>> Weblog: http://www.openlinksw.com/blog/~kidehen
>>>>> Twitter/Identi.ca: kidehen
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>> --
>>> --
>>> Yann NICOLAS
>>> Etudes & Projets
>>> ABES
>>>
>>
>>


-- 
--
Yann NICOLAS
Etudes & Projets
ABES
Received on Tuesday, 12 July 2011 08:26:29 UTC