- From: Yann NICOLAS <nicolas@abes.fr>
- Date: Sun, 10 Jul 2011 01:36:06 +0200
- To: Giovanni Tummarello <giovanni.tummarello@deri.org>
- CC: public-lod@w3.org, Kingsley Idehen <kidehen@openlinksw.com>, "giulio.cesare@gmail.com" <giulio.solaroli@deri.org>
- Message-ID: <4E18E5E6.80703@abes.fr>
Hi Giovanni,

On 09/07/2011 23:10, Giovanni Tummarello wrote:
> Hi Nicolas,
>
> Its getting in Sindice indeed -

Yes, I have noticed :)

> quite politely e.g. 1 every 5 secs-
> we'll monitor speed and completeness. iff you think its ok for us to
> crawl faster please say so via robot.txt directive or just say so

May I suggest that you crawl twice as fast?

>
> http://sindice.com/search?q=book&nq=&fq=domain%3Awww.sudoc.fr&sortbydate=1&interface=advanced
>
> at the same time i notice something funny in the markup e.g. if you go
> with a browser you get redirected to something that has almost no data
>
> for example the sitemap contains
>
> http://www.sudoc.fr/000000043
>
> if you go there you get redirected to
>
> http://www.sudoc.abes.fr/DB=2.1/SRCH?IKT=12&TRM=000000043
>
> which if you put in the inspector
>
> http://inspector.sindice.com/inspect?url=http%3A%2F%2Fwww.sudoc.abes.fr%2FDB%3D2.1%2FSRCH%3FIKT%3D12%26TRM%3D000000043#TRIPLES
>
> you get very little data
>
> however of course if i use the inspector on
> http://www.sudoc.fr/000000043 i get data
>
> http://inspector.sindice.com/inspect?url=http%3A%2F%2Fwww.sudoc.fr%2F000000043&content=&contentType=auto#TRIPLES
>
> which however is mostly schema.org data!
>
> but in sindice i have lots of RDF data with all sort of other ontologies
>
> http://sindice.com/search/page?url=http%3A%2F%2Fwww.sudoc.fr%2F000385123
>
> is there any way you could try to normalize all into a single markup
> type? i think it would be easier to debug and ultimately better for
> all..

I will try to explain our intention, our constraints and the mechanism we've implemented.

- Intention -

We want to meet several needs:
. providing RDF/XML to semantic-oriented clients like Sindice
. providing HTML + schema.org microdata to traditional search engines like Google
. providing an HTML UI to users

- Constraints -

. For some reason, we can't add microdata to our traditional Sudoc UI.
Hence the need for special HTML+microdata pages for search engines. :(
. HTML+microdata pages and RDF pages can't support the same vocabularies, schema.org /oblige/.

- Mechanisms -

Let's start from: http://www.sudoc.fr/132133520

. If the request asks for RDF/XML, we provide RDF/XML content (as if you had requested http://www.sudoc.fr/132133520.rdf). This is what the Sindice crawler is doing and getting: the 55,764 documents found in your index are composed of triples extracted from this RDF/XML page. It is what we expected. Fine :)
. If our Apache server considers a user agent to be a robot and this agent does not ask for RDF/XML, we provide special HTML content (as if you had requested http://www.sudoc.fr/132133520.html). It seems to work, as the Google cache contains this kind of HTML + schema.org microdata page: http://webcache.googleusercontent.com/search?q=cache:www.sudoc.fr/000000043
. In other cases, we redirect to our traditional, non-semantic UI: http://www.sudoc.abes.fr/DB=2.1/SRCH?IKT=12&TRM=000000043
. NB: we plan to add these <link> elements to this HTML page:
<link rel="alternate" type="application/rdf+xml" href="http://www.sudoc.fr/000000043.rdf"/>
and
<link rel="canonical" href="http://www.sudoc.fr/000000043"/>
to alleviate the URL confusion.

- - - - -

. It is not simple, but it seems to work, i.e. Google, Sindice and users seem to get what they should.
. Is there a better way to obtain the same results?
. Which side effects are likely?

Thanks for your help and your attention!

Yann

>
> looking forward to support
> Giovanni
> Gio
>
>
> On Fri, Jul 8, 2011 at 1:27 PM, Kingsley Idehen<kidehen@openlinksw.com> wrote:
>> On 7/8/11 8:31 AM, Yann NICOLAS wrote:
>>
>> On 08/07/2011 01:42, Kingsley Idehen wrote:
>>
>> On 7/7/11 10:17 PM, Yann NICOLAS wrote:
>>
>> Hello,
>>
>> Sudoc [1], the French academic union catalogue maintained by ABES [2], has
>> just been released as linked open data.
>>
>> 10 million bibliographic records are now available as RDF/XML.
>>
>> Examples for the Sudoc record whose internal id is 132133520 :
>> . Resource URI : http://www.sudoc.fr/132133520/id
>> . Generic document : http://www.sudoc.fr/132133520 (content negotiation is
>> supported)
>>
>>
>> Great job!
>>
>> Is there an RDF dump anywhere?
>>
>>
>> Sorry, we don't provide any dump, as the 10 000 000 files are generated on
>> the fly from Oracle (stored as XML type + some more tables).
>> We provide a complete sitemap at
>> http://www.sudoc.fr/noticesbiblio/sitemap.txt , and we hope that Sindice
>> will crawl the whole stuff.
>> Would it help ?
>>
>> Any advice welcome,
>>
>> Yann
>>
>> --
>> --
>> Yann NICOLAS
>> Etudes & Projets
>> ABES
>>
>> Okay, no problem with sitemaps as dump alternatives re. getting data
>> imported into Linked Data hubs such our LOD cloud cache and Sindice etc..
>>
>>
>> --
>>
>> Regards,
>>
>> Kingsley Idehen
>> President & CEO
>> OpenLink Software
>> Web: http://www.openlinksw.com
>> Weblog: http://www.openlinksw.com/blog/~kidehen
>> Twitter/Identi.ca: kidehen
>>
>>
>>
>>
>> --

--
Yann NICOLAS
Etudes & Projets
ABES
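
For readers following the thread, the three dispatch rules described under "Mechanisms" in the message can be sketched roughly as below. This is only an illustration in Python, not the actual Apache configuration, which the message does not show. The function name `dispatch` and the `ROBOT_MARKERS` list are hypothetical; how the server really decides that a user agent is a robot is not stated.

```python
from typing import Tuple

# Assumed, illustrative robot markers; the real detection rule is not given in the message.
ROBOT_MARKERS = ("googlebot", "bingbot", "slurp")


def dispatch(record_id: str, accept: str, user_agent: str) -> Tuple[str, str]:
    """Return (action, target) for a request to http://www.sudoc.fr/<record_id>."""
    base = "http://www.sudoc.fr/" + record_id

    # Rule 1: the client asks for RDF/XML, so it gets the RDF/XML view.
    if "application/rdf+xml" in accept.lower():
        return ("serve", base + ".rdf")

    # Rule 2: a robot that did not ask for RDF/XML gets the HTML + microdata page.
    if any(marker in user_agent.lower() for marker in ROBOT_MARKERS):
        return ("serve", base + ".html")

    # Rule 3: everyone else is redirected to the traditional UI.
    return ("redirect", "http://www.sudoc.abes.fr/DB=2.1/SRCH?IKT=12&TRM=" + record_id)
```

Under these assumptions, a crawler sending `Accept: application/rdf+xml` falls under the first rule, a search-engine robot under the second, and an ordinary browser under the third, which matches the difference Giovanni observed between the inspector and a browser.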
Received on Saturday, 9 July 2011 23:36:49 UTC