Re: ANN: Sudoc bibliographic and authority data from Giovanni Tummarello on 2011-07-10 (public-lod@w3.org from July 2011)

From: Giovanni Tummarello <giovanni.tummarello@deri.org>
Date: Mon, 11 Jul 2011 00:29:30 +0200
To: Antoine Isaac <aisaac@few.vu.nl>
Cc: public-lod@w3.org
Message-ID: <CAHHRs7h5_1LqK46J3BLNtD2cpis1X8UrtvxPdrPhuhRXb16ZvA@mail.gmail.com>
Hi Antoine, Yann, all,

my advice is to keep it simple and complete.

Very simple indeed. Please forget about content negotiation. It was a
horrible idea all along: it doesn't work, because it WILL break, since
no humans are looking at the negotiated output. Really: anything that
redirects and changes the URL when you put it in a browser is just so
wrong.
Have one single version of the page with RDFa + schema.org. I know they
say "don't do that" on schema.org, but they're just being silly: they
will read the microdata (the schema.org part) anyway. The RDFa part is
one line of code to extract if they want to do so; if they don't, who
cares. They only care about the schema.org part anyway; let others use
the RDFa.
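
To make the "single page, two markups" idea concrete, here is a minimal
sketch, not a real RDFa or microdata processor. It assumes a page where
each element carries both an itemprop (schema.org microdata) and a
property (RDFa) attribute, and it only reads simple text content. The
sample page, its title and author strings, and the names SAMPLE_PAGE,
DualMarkupCollector and extract are all hypothetical; only the
http://www.sudoc.fr/132133520/id URI comes from this thread.

```python
from html.parser import HTMLParser

# Hypothetical single page carrying both markups on the same elements.
SAMPLE_PAGE = """
<div itemscope itemtype="http://schema.org/Book"
     prefix="dc: http://purl.org/dc/terms/"
     about="http://www.sudoc.fr/132133520/id">
  <span itemprop="name" property="dc:title">Example title</span>
  <span itemprop="author" property="dc:creator">Example author</span>
</div>
"""


class DualMarkupCollector(HTMLParser):
    """Collect (attribute value, text) pairs for itemprop and property."""

    def __init__(self):
        super().__init__()
        self.microdata = []  # pairs found via itemprop
        self.rdfa = []       # pairs found via property
        self._current = []   # markers set by the most recent start tag

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        self._current = []
        if attributes.get("itemprop"):
            self._current.append(("microdata", attributes["itemprop"]))
        if attributes.get("property"):
            self._current.append(("rdfa", attributes["property"]))

    def handle_endtag(self, tag):
        self._current = []

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        for kind, name in self._current:
            target = self.microdata if kind == "microdata" else self.rdfa
            target.append((name, text))
        self._current = []


def extract(html):
    """Return (microdata_pairs, rdfa_pairs) found in one HTML document."""
    collector = DualMarkupCollector()
    collector.feed(html)
    return collector.microdata, collector.rdfa


if __name__ == "__main__":
    microdata, rdfa = extract(SAMPLE_PAGE)
    print("microdata:", microdata)
    print("rdfa:", rdfa)
```

A consumer that only cares about schema.org reads the first list and
ignores the second; an RDF consumer does the opposite. Both see the
same URL and the same page.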

In terms of full crawling: if you allow a sustained rate of 1 URL per
second, the data would be in within 3 to 4 months or so, which still
seems ridiculous, but that's what search engines do. If you have the
proper last-updated date set, that's great: the updates will be just
incremental.
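
A rough sketch of the arithmetic behind that estimate, assuming the 10
million records mentioned earlier in this thread and a crawler that
fetches one URL at a time with a fixed delay. The function name
crawl_days and the 2.5 s/URL row (the "twice as fast" request) are
illustrative assumptions, not anything Sindice actually runs.

```python
SECONDS_PER_DAY = 86_400


def crawl_days(total_urls: int, seconds_per_url: float) -> float:
    """Days needed to fetch total_urls sequentially at a fixed delay."""
    return total_urls * seconds_per_url / SECONDS_PER_DAY


if __name__ == "__main__":
    records = 10_000_000  # record count announced for Sudoc in this thread
    for delay in (5.0, 2.5, 1.0):
        days = crawl_days(records, delay)
        print(f"{delay} s/URL: about {days:.0f} days")
```

At 1 URL per second this comes to roughly 116 days, so close to four
months; at the current 1 URL every 5 seconds it would be well over a
year and a half.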

Otherwise, yes, a dump would allow us to ingest everything in full, but
it is a manual operation between us and you.

That's my advice. That said, I know one might have several
ideas/motives etc. that differ from what this advice suggests.
Worry not: whoever consumes data had better get ready to be
pretty flexible, so we'll take all you offer, really :)
Cheers,

Giovanni

On Sun, Jul 10, 2011 at 12:22 PM, Antoine Isaac <aisaac@few.vu.nl> wrote:
> Yann, Giovanni,
>
>
>> Which side effects are probable ?
>
>
> Giovanni has made the same comment on data.europeana.eu a couple of weeks
> ago. The data we serve there is different from the RDFa mark-up on our web
> portal.
> We had some reasons to do this, including, well, that the RDFa data is
> mixing the info and non-info resources for making easier data consumption
> (not mandatorily by search engines, btw), and working with URIs that
> pre-date our linked data service.
>
> The RDFa and the RDF obtained with LD-style conneg is also not about the
> same URIs, which should avoid any confusion.
> But I can understand that if Sindice tries to fetch both data sources, it
> may assume the data to be the same. And this assumption could bring a number
> of undesirable side effects if Sindice merges all what it gets...
>
> That being said, perhaps the solution lies in Sindice being less greedy ;-)
> and just work with the first data source it finds, for a given URI.
> I do like the idea of having several (simple) channels for data publication
> over the web, which serve different goals.
> Maybe we need to better articulate the practices and expectations, though...
>
> Cheers,
>
> Antoine
>
>
>>  Hi Giovanni,
>>
>> On 09/07/2011 23:10, Giovanni Tummarello wrote:
>>>
>>> Hi Nicolas,
>>>
>>> It's getting into Sindice indeed -
>>
>> Yes, I have noticed :)
>>
>>> quite politely, e.g. 1 every 5 secs -
>>> we'll monitor speed and completeness. If you think it's OK for us to
>>> crawl faster, please say so via a robots.txt directive or just say so
>>
>> May I suggest that you crawl twice as fast?
>>
>>>
>>>
>>> http://sindice.com/search?q=book&nq=&fq=domain%3Awww.sudoc.fr&sortbydate=1&interface=advanced
>>>
>>> at the same time i notice something funny in the markup e.g. if you go
>>> with a browser you get redirected to something that has almost no data
>>>
>>> for example the sitemap contains
>>>
>>> http://www.sudoc.fr/000000043
>>>
>>> if you go there you get redirected to
>>>
>>> http://www.sudoc.abes.fr/DB=2.1/SRCH?IKT=12&TRM=000000043
>>>
>>> which if you put in the inspector
>>>
>>>
>>> http://inspector.sindice.com/inspect?url=http%3A%2F%2Fwww.sudoc.abes.fr%2FDB%3D2.1%2FSRCH%3FIKT%3D12%26TRM%3D000000043#TRIPLES
>>>
>>> you get very little data
>>>
>>> however of course if i use the inspector on
>>> http://www.sudoc.fr/000000043  i get data
>>>
>>>
>>> http://inspector.sindice.com/inspect?url=http%3A%2F%2Fwww.sudoc.fr%2F000000043&content=&contentType=auto#TRIPLES
>>>
>>> which however is mostly schema.org data!
>>>
>>> but in sindice i have lots of RDF data with all sort of other ontologies
>>>
>>> http://sindice.com/search/page?url=http%3A%2F%2Fwww.sudoc.fr%2F000385123
>>>
>>> is there any way you could try to normalize all into a single markup
>>> type? i think it would be easier to debug and ultimately better for
>>> all..
>>
>> I will try to explain our intention, our constraints and the mechanism
>> we've implemented.
>>
>> - Intention -
>>
>> We want to meet several needs :
>> . providing RDF/XML to semantic-oriented clients like Sindice
>> . providing HTML + schema.org microdata to traditional search engines like
>> Google
>> . providing an HTML UI to users
>>
>>
>> - Constraints -
>>
>> . For some reasons, we can't add microdata to our traditional Sudoc UI.
>> Hence the necessity of special HTML+microdata pages for search engines. :(
>> . HTML+microdata pages and RDF pages can't support the same vocabularies,
>> schema.org /oblige/.
>>
>>
>> - Mechanisms -
>>
>> Let's start from : http://www.sudoc.fr/132133520
>>
>> . If RDF/XML is called by the request, we provide RDF/XML content (as if
>> you had requested http://www.sudoc.fr/132133520.rdf)
>> It is what Sindice Crawler is doing and getting : the 55,764 documents
>> that are found in your index are composed of triples extracted from this
>> RDF/XML page. It is what we expected. Fine :)
>>
>> . If our Apache server considers a user agent to be a robot and if this
>> agent does not ask for RDF/XML, we provide special HTML content (as if you
>> had requested http://www.sudoc.fr/132133520.html)
>> It seems to work as Google cache contains this kind of HTML + schema.org
>> microdata pages :
>> http://webcache.googleusercontent.com/search?q=cache:www.sudoc.fr/000000043
>>
>> . In other cases, we redirect to our traditional and non semantic UI :
>> http://www.sudoc.abes.fr/DB=2.1/SRCH?IKT=12&TRM=000000043
>> . NB : we have planned to add this <link> in this HTML page : <link
>> rel="alternate" type="application/rdf+xml"
>> href="http://www.sudoc.fr/000000043.rdf"/> and <link rel="canonical"
>> href="http://www.sudoc.fr/000000043"/> to alleviate the URL confusion.
>>
>>
>> - - - - -
>>
>> . It is not simple, but it seems to work, ie Google, Sindice and users
>> seem to get what they should.
>> . Is there a better way to obtain the same results ?
>> . Which side effects are probable ?
>>
>> Thanks for your help and your attention !
>>
>> Yann
>>
>>>
>>> looking forward to support
>>> Giovanni
>>> Gio
>>>
>>>
>>> On Fri, Jul 8, 2011 at 1:27 PM, Kingsley Idehen <kidehen@openlinksw.com>
>>> wrote:
>>>>
>>>> On 7/8/11 8:31 AM, Yann NICOLAS wrote:
>>>>
>>>> On 08/07/2011 01:42, Kingsley Idehen wrote:
>>>>
>>>> On 7/7/11 10:17 PM, Yann NICOLAS wrote:
>>>>
>>>> Hello,
>>>>
>>>> Sudoc [1], the French academic union catalogue maintained by ABES [2],
>>>> has
>>>> just been released as linked open data.
>>>>
>>>> 10 million bibliographic records are now available as RDF/XML.
>>>>
>>>> Examples for the Sudoc record whose internal id is 132133520 :
>>>> . Resource URI: http://www.sudoc.fr/132133520/id
>>>> . Generic document: http://www.sudoc.fr/132133520 (content negotiation
>>>> is supported)
>>>>
>>>>
>>>> Great job!
>>>>
>>>> Is there an RDF dump anywhere?
>>>>
>>>>
>>>> Sorry, we don't provide any dump, as the 10 000 000 files are generated
>>>> on
>>>> the fly from Oracle (stored as XML type + some more tables).
>>>> We provide a complete sitemap at
>>>> http://www.sudoc.fr/noticesbiblio/sitemap.txt  , and we hope that
>>>> Sindice
>>>> will crawl the whole stuff.
>>>> Would it help ?
>>>>
>>>> Any advice welcome,
>>>>
>>>> Yann
>>>>
>>>> --
>>>> --
>>>> Yann NICOLAS
>>>> Etudes & Projets
>>>> ABES
>>>>
>>>> Okay, no problem with sitemaps as dump alternatives re. getting data
>>>> imported into Linked Data hubs such our LOD cloud cache and Sindice
>>>> etc..
>>>>
>>>>
>>>> --
>>>>
>>>> Regards,
>>>>
>>>> Kingsley Idehen
>>>> President & CEO
>>>> OpenLink Software
>>>> Web:http://www.openlinksw.com
>>>> Weblog:http://www.openlinksw.com/blog/~kidehen
>>>> Twitter/Identi.ca: kidehen
>>>>
>>>>
>>>>
>>>>
>>>>
>>
>>
>> --
>> --
>> Yann NICOLAS
>> Etudes & Projets
>> ABES
>>
>
>
>
Received on Sunday, 10 July 2011 22:30:17 UTC