- From: Giovanni Tummarello <giovanni.tummarello@deri.org>
- Date: Mon, 11 Jul 2011 00:29:30 +0200
- To: Antoine Isaac <aisaac@few.vu.nl>
- Cc: public-lod@w3.org
Hi Antoine, Yann, all,

My advice is to keep it simple and complete. Very simple indeed.

Please forget about content negotiation. It was a horrible idea all along: it doesn't work because it WILL break, since no humans are looking at it. Really, anything that redirects and changes the URL when you put it in a browser is just so wrong.

Have one single version of the page, with RDFa + schema.org. I know they say on schema.org not to do that, but they're just being silly: they will read the microdata anyway (the schema part). As for the RDFa part, it's one line of code to extract if they want to do so; if they don't, who cares. They only care about the schema part anyway; let others use the RDFa.

In terms of full crawling: if you allow it, 1 URL per second should be sustainable. This way the data would be in in 3 months or so, which still seems ridiculous, but that's what search engines do. If you have the proper lastupdated set, that's great: the updates will be just incremental. Otherwise yes, a dump would allow us to ingest everything in full, but it is a manual operation between us and you.

This is my advice. That said, I know that one might have several ideas, motives, etc. which might differ from what this advice suggests. Worry not: whoever consumes data had better get ready to be pretty flexible, so we take all you offer, really :)

Cheers,
Giovanni

On Sun, Jul 10, 2011 at 12:22 PM, Antoine Isaac <aisaac@few.vu.nl> wrote:
> Yann, Giovanni,
>
>> Which side effects are probable?
>
> Giovanni has made the same comment on data.europeana.eu a couple of weeks
> ago. The data we serve there is different from the RDFa mark-up on our web
> portal.
> We had some reasons to do this, including, well, that the RDFa data is
> mixing the info and non-info resources to make data consumption easier
> (not necessarily by search engines, btw), and working with URIs that
> pre-date our linked data service.
>
> The RDFa and the RDF obtained with LD-style conneg are also not about the
> same URIs, which should avoid any confusion.
> But I can understand that if Sindice tries to fetch both data sources, it
> may assume the data to be the same. And this assumption could bring a number
> of undesirable side effects if Sindice merges everything it gets...
>
> That being said, perhaps the solution lies in Sindice being less greedy ;-)
> and just working with the first data source it finds for a given URI.
> I do like the idea of having several (simple) channels for data publication
> over the web, which serve different goals.
> Maybe we need to better articulate the practices and expectations, though...
>
> Cheers,
>
> Antoine
>
>> Hi Giovanni,
>>
>> On 09/07/2011 23:10, Giovanni Tummarello wrote:
>>>
>>> Hi Nicolas,
>>>
>>> It's getting into Sindice indeed -
>>
>> Yes, I have noticed :)
>>
>>> quite politely, e.g. 1 every 5 secs -
>>> we'll monitor speed and completeness. If you think it's ok for us to
>>> crawl faster, please say so via a robots.txt directive or just say so
>>
>> May I suggest that you crawl twice as fast?
>>
>>>
>>> http://sindice.com/search?q=book&nq=&fq=domain%3Awww.sudoc.fr&sortbydate=1&interface=advanced
>>>
>>> At the same time I notice something funny in the markup, e.g. if you go
>>> with a browser you get redirected to something that has almost no data.
>>>
>>> For example the sitemap contains
>>>
>>> http://www.sudoc.fr/000000043
>>>
>>> If you go there you get redirected to
>>>
>>> http://www.sudoc.abes.fr/DB=2.1/SRCH?IKT=12&TRM=000000043
>>>
>>> which, if you put it in the inspector
>>>
>>> http://inspector.sindice.com/inspect?url=http%3A%2F%2Fwww.sudoc.abes.fr%2FDB%3D2.1%2FSRCH%3FIKT%3D12%26TRM%3D000000043#TRIPLES
>>>
>>> gives you very little data.
>>>
>>> However, of course, if I use the inspector on
>>> http://www.sudoc.fr/000000043 I get data
>>>
>>> http://inspector.sindice.com/inspect?url=http%3A%2F%2Fwww.sudoc.fr%2F000000043&content=&contentType=auto#TRIPLES
>>>
>>> which however is mostly schema.org data!
>>>
>>> But in Sindice I have lots of RDF data with all sorts of other ontologies
>>>
>>> http://sindice.com/search/page?url=http%3A%2F%2Fwww.sudoc.fr%2F000385123
>>>
>>> Is there any way you could try to normalize all into a single markup
>>> type? I think it would be easier to debug and ultimately better for
>>> all..
>>
>> I will try to explain our intention, our constraints and the mechanism
>> we've implemented.
>>
>> - Intention -
>>
>> We want to meet several needs:
>> . providing RDF/XML to semantic-oriented clients like Sindice
>> . providing HTML + schema.org microdata to traditional search engines like
>> Google
>> . providing an HTML UI to users
>>
>> - Constraints -
>>
>> . For some reasons, we can't add microdata to our traditional Sudoc UI.
>> Hence the necessity of special HTML+microdata pages for search engines. :(
>> . HTML+microdata pages and RDF pages can't support the same vocabularies,
>> schema.org /oblige/.
>>
>> - Mechanisms -
>>
>> Let's start from: http://www.sudoc.fr/132133520
>>
>> . If RDF/XML is called for by the request, we provide RDF/XML content (as if
>> you had requested http://www.sudoc.fr/132133520.rdf).
>> This is what the Sindice crawler is doing and getting: the 55,764 documents
>> that are found in your index are composed of triples extracted from this
>> RDF/XML page. It is what we expected. Fine :)
>>
>> . If our Apache server considers a user agent to be a robot and if this
>> agent does not ask for RDF/XML, we provide special HTML content (as if you
>> had requested http://www.sudoc.fr/132133520.html).
>> It seems to work, as the Google cache contains this kind of HTML + schema.org
>> microdata page:
>> http://webcache.googleusercontent.com/search?q=cache:www.sudoc.fr/000000043
>>
>> . In other cases, we redirect to our traditional and non-semantic UI:
>> http://www.sudoc.abes.fr/DB=2.1/SRCH?IKT=12&TRM=000000043
>> . NB: we have planned to add these <link> elements to this HTML page: <link
>> rel="alternate" type="application/rdf+xml"
>> href="http://www.sudoc.fr/000000043.rdf"/> and <link rel="canonical"
>> href="http://www.sudoc.fr/000000043"/> to alleviate the URL confusion.
>>
>> - - - - -
>>
>> . It is not simple, but it seems to work, i.e. Google, Sindice and users
>> seem to get what they should.
>> . Is there a better way to obtain the same results?
>> . Which side effects are probable?
>>
>> Thanks for your help and your attention!
>>
>> Yann
>>
>>>
>>> looking forward to support
>>> Giovanni
>>> Gio
>>>
>>>
>>> On Fri, Jul 8, 2011 at 1:27 PM, Kingsley Idehen <kidehen@openlinksw.com>
>>> wrote:
>>>>
>>>> On 7/8/11 8:31 AM, Yann NICOLAS wrote:
>>>>
>>>> On 08/07/2011 01:42, Kingsley Idehen wrote:
>>>>
>>>> On 7/7/11 10:17 PM, Yann NICOLAS wrote:
>>>>
>>>> Hello,
>>>>
>>>> Sudoc [1], the French academic union catalogue maintained by ABES [2],
>>>> has just been released as linked open data.
>>>>
>>>> 10 million bibliographic records are now available as RDF/XML.
>>>>
>>>> Examples for the Sudoc record whose internal id is 132133520:
>>>> . Resource URI: http://www.sudoc.fr/132133520/id
>>>> . Generic document: http://www.sudoc.fr/132133520 (content negotiation
>>>> is supported)
>>>>
>>>>
>>>> Great job!
>>>>
>>>> Is there an RDF dump anywhere?
>>>>
>>>>
>>>> Sorry, we don't provide any dump, as the 10 000 000 files are generated
>>>> on the fly from Oracle (stored as XML type + some more tables).
>>>> We provide a complete sitemap at
>>>> http://www.sudoc.fr/noticesbiblio/sitemap.txt , and we hope that
>>>> Sindice will crawl the whole stuff.
>>>> Would it help?
>>>>
>>>> Any advice welcome,
>>>>
>>>> Yann
>>>>
>>>> --
>>>> --
>>>> Yann NICOLAS
>>>> Etudes & Projets
>>>> ABES
>>>>
>>>> Okay, no problem with sitemaps as dump alternatives re. getting data
>>>> imported into Linked Data hubs such as our LOD cloud cache and Sindice
>>>> etc..
>>>>
>>>>
>>>> --
>>>>
>>>> Regards,
>>>>
>>>> Kingsley Idehen
>>>> President & CEO
>>>> OpenLink Software
>>>> Web: http://www.openlinksw.com
>>>> Weblog: http://www.openlinksw.com/blog/~kidehen
>>>> Twitter/Identi.ca: kidehen
>>>>
>>
>>
>> --
>> --
>> Yann NICOLAS
>> Etudes & Projets
>> ABES
>>
>
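
For readers following the thread, the three-way dispatch that Yann describes for a URL such as http://www.sudoc.fr/132133520 can be sketched roughly as below. This is only an illustration of the mechanism under discussion, not the actual Apache configuration, which the thread does not show. The robot-detection pattern, the function names and the 302 status code are assumptions made for the example; only the URL shapes come from the messages above.

```python
import re

# Assumption: the thread does not say how the server recognises robots,
# so this user-agent pattern is purely illustrative.
ROBOT_UA = re.compile(r"bot|crawler|spider", re.IGNORECASE)

# URL shape of the traditional UI, taken from the thread.
LEGACY_UI = "http://www.sudoc.abes.fr/DB=2.1/SRCH?IKT=12&TRM={id}"


def wants_rdf(accept: str) -> bool:
    """Return True if the Accept header lists application/rdf+xml with q > 0."""
    for item in accept.split(","):
        parts = [p.strip() for p in item.split(";")]
        if parts[0].lower() != "application/rdf+xml":
            continue
        q = 1.0  # default quality value when none is given
        for param in parts[1:]:
            if param.lower().startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0  # treat a malformed q value as "not acceptable"
        if q > 0:
            return True
    return False


def choose_response(record_id: str, accept: str = "", user_agent: str = ""):
    """Return (status, target) for a request to /<record_id>.

    Mirrors the three cases in Yann's message; the 302 is an assumption,
    since the thread does not name the redirect status.
    """
    if wants_rdf(accept):
        return 200, f"/{record_id}.rdf"    # RDF/XML for semantic clients
    if ROBOT_UA.search(user_agent):
        return 200, f"/{record_id}.html"   # HTML + schema.org microdata
    return 302, LEGACY_UI.format(id=record_id)  # traditional UI for browsers
```

Under these assumptions, a browser request for record 000000043 ends up at the traditional UI URL, which is the redirect Giovanni observed, while a request with `Accept: application/rdf+xml` gets the RDF/XML representation. Giovanni's proposal amounts to replacing all three branches with a single page carrying the markup inline.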
Received on Sunday, 10 July 2011 22:30:17 UTC