- From: Yann NICOLAS <nicolas@abes.fr>
- Date: Tue, 12 Jul 2011 10:26:01 +0200
- To: Giovanni Tummarello <giovanni.tummarello@deri.org>
- CC: public-lod@w3.org, Antoine Isaac <aisaac@few.vu.nl>
Thank you so much for your advice and your praise for *simplicity*.
A tribute to Ockham/Okkam, I guess ;)

Yann

On 11/07/2011 00:29, Giovanni Tummarello wrote:
> Hi Antoine, Yann, all,
>
> My advice is to keep it simple and complete.
>
> Very simple indeed. Please forget about content negotiation. It was a
> horrible idea all along; it doesn't work because it WILL break, since
> no humans are looking at it. Really: anything that redirects and
> changes the URL when you put it in a browser is just so wrong.
>
> Have 1 single version of the page with RDFa + schema.org. I know they
> say "don't do that" on schema.org, but they're just being silly: they
> will read the microdata anyway (the schema part). The RDFa part is 1
> line of code to extract if they want to do so; if they don't, who
> cares? They only care about the schema part anyway; let others use
> the RDFa.
>
> In terms of full crawling, if you allow it, 1 URL per second should be
> sustainable. This way the data would be in in 3 months or so, which
> still seems ridiculous, but that's what search engines do. If you have
> the proper lastupdated set, that's great: the updates will be just
> incremental.
>
> Otherwise, yes, a dump would allow us to ingest everything in full,
> but it is a manual operation between us and you.
>
> This is my advice. That said, I know that one might have several
> ideas/motives etc. which might differ from what this advice suggests.
> Worry not. Whoever consumes data had better get ready to be pretty
> flexible, so we take all you offer, really :)
> Cheers,
>
> Giovanni
>
> On Sun, Jul 10, 2011 at 12:22 PM, Antoine Isaac <aisaac@few.vu.nl> wrote:
>> Yann, Giovanni,
>>
>>> Which side effects are probable?
>>
>> Giovanni made the same comment on data.europeana.eu a couple of weeks
>> ago. The data we serve there is different from the RDFa mark-up on our
>> web portal.
>> We had some reasons to do this, including, well, that the RDFa data
>> mixes the info and non-info resources to make data consumption easier
>> (not necessarily by search engines, btw), and works with URIs that
>> pre-date our linked data service.
>>
>> The RDFa and the RDF obtained with LD-style conneg are also not about
>> the same URIs, which should avoid any confusion.
>> But I can understand that if Sindice tries to fetch both data sources,
>> it may assume the data to be the same. And this assumption could bring
>> a number of undesirable side effects if Sindice merges everything it
>> gets...
>>
>> That being said, perhaps the solution lies in Sindice being less
>> greedy ;-) and just working with the first data source it finds for a
>> given URI.
>> I do like the idea of having several (simple) channels for data
>> publication over the web, which serve different goals.
>> Maybe we need to better articulate the practices and expectations,
>> though...
>>
>> Cheers,
>>
>> Antoine
>>
>>> Hi Giovanni,
>>>
>>> On 09/07/2011 23:10, Giovanni Tummarello wrote:
>>>> Hi Nicolas,
>>>>
>>>> It's getting into Sindice indeed -
>>> Yes, I have noticed :)
>>>
>>>> quite politely, e.g. 1 every 5 secs -
>>>> we'll monitor speed and completeness. If you think it's OK for us to
>>>> crawl faster, please say so via a robots.txt directive or just say so.
>>> May I suggest that you crawl twice as fast?
>>>
>>>> http://sindice.com/search?q=book&nq=&fq=domain%3Awww.sudoc.fr&sortbydate=1&interface=advanced
>>>>
>>>> At the same time I notice something funny in the markup, e.g.
>>>> if you go with a browser you get redirected to something that has
>>>> almost no data.
>>>>
>>>> For example, the sitemap contains
>>>>
>>>> http://www.sudoc.fr/000000043
>>>>
>>>> If you go there you get redirected to
>>>>
>>>> http://www.sudoc.abes.fr/DB=2.1/SRCH?IKT=12&TRM=000000043
>>>>
>>>> which, if you put it in the inspector
>>>>
>>>> http://inspector.sindice.com/inspect?url=http%3A%2F%2Fwww.sudoc.abes.fr%2FDB%3D2.1%2FSRCH%3FIKT%3D12%26TRM%3D000000043#TRIPLES
>>>>
>>>> gives you very little data.
>>>>
>>>> However, of course, if I use the inspector on
>>>> http://www.sudoc.fr/000000043 I get data
>>>>
>>>> http://inspector.sindice.com/inspect?url=http%3A%2F%2Fwww.sudoc.fr%2F000000043&content=&contentType=auto#TRIPLES
>>>>
>>>> which, however, is mostly schema.org data!
>>>>
>>>> But in Sindice I have lots of RDF data with all sorts of other
>>>> ontologies:
>>>>
>>>> http://sindice.com/search/page?url=http%3A%2F%2Fwww.sudoc.fr%2F000385123
>>>>
>>>> Is there any way you could try to normalize everything into a single
>>>> markup type? I think it would be easier to debug and ultimately
>>>> better for all.
>>> I will try to explain our intention, our constraints and the mechanism
>>> we've implemented.
>>>
>>> - Intention -
>>>
>>> We want to meet several needs:
>>> . providing RDF/XML to semantic-oriented clients like Sindice
>>> . providing HTML + schema.org microdata to traditional search engines
>>>   like Google
>>> . providing an HTML UI to users
>>>
>>> - Constraints -
>>>
>>> . For some reasons, we can't add microdata to our traditional Sudoc UI.
>>>   Hence the necessity of special HTML+microdata pages for search
>>>   engines. :(
>>> . HTML+microdata pages and RDF pages can't support the same
>>>   vocabularies, schema.org /oblige/.
>>>
>>> - Mechanisms -
>>>
>>> Let's start from: http://www.sudoc.fr/132133520
>>>
>>> . If RDF/XML is requested, we provide RDF/XML content (as if you had
>>>   requested http://www.sudoc.fr/132133520.rdf).
>>>   This is what the Sindice crawler is doing and getting: the 55,764
>>>   documents found in your index are composed of triples extracted
>>>   from this RDF/XML page. It is what we expected. Fine :)
>>>
>>> . If our Apache server considers a user agent to be a robot and this
>>>   agent does not ask for RDF/XML, we provide special HTML content (as
>>>   if you had requested http://www.sudoc.fr/132133520.html).
>>>   It seems to work, as the Google cache contains this kind of HTML +
>>>   schema.org microdata page:
>>>   http://webcache.googleusercontent.com/search?q=cache:www.sudoc.fr/000000043
>>>
>>> . In other cases, we redirect to our traditional, non-semantic UI:
>>>   http://www.sudoc.abes.fr/DB=2.1/SRCH?IKT=12&TRM=000000043
>>> . NB: we plan to add these <link> elements to this HTML page:
>>>   <link rel="alternate" type="application/rdf+xml"
>>>   href="http://www.sudoc.fr/000000043.rdf"/> and
>>>   <link rel="canonical" href="http://www.sudoc.fr/000000043"/>
>>>   to alleviate the URL confusion.
>>>
>>> - - - - -
>>>
>>> . It is not simple, but it seems to work, i.e. Google, Sindice and
>>>   users seem to get what they should.
>>> . Is there a better way to obtain the same results?
>>> . Which side effects are probable?
>>>
>>> Thanks for your help and your attention!
>>>
>>> Yann
>>>
>>>> looking forward to support
>>>> Giovanni
>>>> Gio
>>>>
>>>> On Fri, Jul 8, 2011 at 1:27 PM, Kingsley Idehen <kidehen@openlinksw.com> wrote:
>>>>> On 7/8/11 8:31 AM, Yann NICOLAS wrote:
>>>>>
>>>>> On 08/07/2011 01:42, Kingsley Idehen wrote:
>>>>>
>>>>> On 7/7/11 10:17 PM, Yann NICOLAS wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> Sudoc [1], the French academic union catalogue maintained by ABES
>>>>> [2], has just been released as linked open data.
>>>>>
>>>>> 10 million bibliographic records are now available as RDF/XML.
>>>>>
>>>>> Examples for the Sudoc record whose internal id is 132133520:
>>>>> . Resource URI: http://www.sudoc.fr/132133520/id
>>>>> . Generic document: http://www.sudoc.fr/132133520 (content
>>>>>   negotiation is supported)
>>>>>
>>>>> Great job!
>>>>>
>>>>> Is there an RDF dump anywhere?
>>>>>
>>>>> Sorry, we don't provide any dump, as the 10 000 000 files are
>>>>> generated on the fly from Oracle (stored as XML type + some more
>>>>> tables).
>>>>> We provide a complete sitemap at
>>>>> http://www.sudoc.fr/noticesbiblio/sitemap.txt , and we hope that
>>>>> Sindice will crawl the whole stuff.
>>>>> Would it help?
>>>>>
>>>>> Any advice welcome,
>>>>>
>>>>> Yann
>>>>>
>>>>> --
>>>>> Yann NICOLAS
>>>>> Etudes & Projets
>>>>> ABES
>>>>>
>>>>> Okay, no problem with sitemaps as dump alternatives re. getting data
>>>>> imported into Linked Data hubs such as our LOD cloud cache and
>>>>> Sindice etc.
>>>>>
>>>>> --
>>>>>
>>>>> Regards,
>>>>>
>>>>> Kingsley Idehen
>>>>> President & CEO
>>>>> OpenLink Software
>>>>> Web: http://www.openlinksw.com
>>>>> Weblog: http://www.openlinksw.com/blog/~kidehen
>>>>> Twitter/Identi.ca: kidehen
>>>>>
>>>
>>> --
>>> Yann NICOLAS
>>> Etudes & Projets
>>> ABES
>>>
>>

--
Yann NICOLAS
Etudes & Projets
ABES
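
For readers following the thread, the three-way dispatch described under "Mechanisms" can be sketched roughly as below. This is only an illustration, not the actual Apache configuration used at ABES: the robot markers, the 302 status code, the simplified Accept parsing (q-values are ignored) and all function names are assumptions made for the example. Only the URLs and the order of the three cases come from the thread.

```python
from dataclasses import dataclass

# Assumption: the thread does not say how the server recognises robots.
ROBOT_MARKERS = ("bot", "crawler", "spider")

# Traditional, non-semantic UI that other clients are redirected to.
LEGACY_UI = "http://www.sudoc.abes.fr/DB=2.1/SRCH?IKT=12&TRM={record_id}"


@dataclass
class Response:
    status: int
    content_type: str = ""  # media type served in place, if any
    location: str = ""      # redirect target, if any


def wants_rdf(accept: str) -> bool:
    """True if the Accept header lists application/rdf+xml (q-values ignored)."""
    media_types = [part.split(";")[0].strip().lower() for part in accept.split(",")]
    return "application/rdf+xml" in media_types


def looks_like_robot(user_agent: str) -> bool:
    """Crude user-agent test; a stand-in for whatever the real server does."""
    ua = user_agent.lower()
    return any(marker in ua for marker in ROBOT_MARKERS)


def dispatch(record_id: str, accept: str, user_agent: str) -> Response:
    # Case 1: RDF/XML is requested -> serve RDF/XML (like <id>.rdf).
    if wants_rdf(accept):
        return Response(200, content_type="application/rdf+xml")
    # Case 2: a robot that did not ask for RDF -> HTML + microdata (like <id>.html).
    if looks_like_robot(user_agent):
        return Response(200, content_type="text/html")
    # Case 3: everyone else -> redirect to the traditional UI
    # (302 is an assumption; the thread only says "redirect").
    return Response(302, location=LEGACY_UI.format(record_id=record_id))
```

Called with a browser-like user agent and no RDF in the Accept header, `dispatch("000000043", ...)` yields the redirect that Giovanni observed; the same URL fetched with `Accept: application/rdf+xml` is served RDF/XML in place, which is why the two clients see different data for one URL.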
Received on Tuesday, 12 July 2011 08:26:29 UTC
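
As a rough check on the crawl durations discussed in the thread, here is a back-of-the-envelope sketch. It assumes a constant fetch rate with no retries or pauses, and takes the 10 million record count from the original announcement; the function name is made up for the example.

```python
SECONDS_PER_DAY = 86_400


def crawl_days(records: int, seconds_per_url: float) -> float:
    """Days needed to fetch `records` URLs at one URL every `seconds_per_url` seconds."""
    return records * seconds_per_url / SECONDS_PER_DAY


RECORDS = 10_000_000  # figure quoted in the thread

for label, delay in [
    ("1 URL every 5 s (current rate)", 5.0),
    ("twice as fast", 2.5),
    ("1 URL per second", 1.0),
]:
    print(f"{label}: about {crawl_days(RECORDS, delay):.0f} days")
```

Under these assumptions, one URL per second comes to about 116 days, which is in line with the "3 months or so" mentioned above, while the current rate of one URL every 5 seconds would take about 579 days.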