- From: Yann Nicolas <nicolas@abes.fr>
- Date: Sun, 10 Jul 2011 12:44:27 +0200 (CEST)
- To: Antoine Isaac <aisaac@few.vu.nl>
- Cc: public-lod@w3.org
Hi Antoine and all

----- Original message -----
> From: "Antoine Isaac" <aisaac@few.vu.nl>
> To: public-lod@w3.org
> Sent: Sunday, 10 July 2011 12:22:11
> Subject: Re: ANN: Sudoc bibliographic ans authority data
> Yann, Giovanni,
>
> > > Which side effects are probable ?
>
>
> Giovanni has made the same comment on data.europeana.eu a couple of
> weeks ago. The data we serve there is different from the RDFa mark-up
> on our web portal.
> We had some reasons to do this, including, well, that the RDFa data is
> mixing the info and non-info resources for making easier data
> consumption (not mandatorily by search engines, btw), and working with
> URIs that pre-date our linked data service.
>
> The RDFa and the RDF obtained with LD-style conneg is also not about
> the same URIs, which should avoid any confusion.

I can't see any "confusion" if you publish complementary data about the
same resource URI (in our case) through complementary technologies.
I can imagine a burden for crawlers and other data consumers, but where
is the confusion ?

> But I can understand that if Sindice tries to fetch both data sources,
> it may assume the data to be the same. And this assumption could bring
> a number of undesirable side effects if Sindice merges all what it
> gets...

In our case, I don't see the *risk* of merging, but my point of view is
maybe too narrow.

>
> That being said, perhaps the solution lies in Sindice being less
> greedy ;-) and just work with the first data source it finds, for a
> given URI.

Sindice is actually very temperate : it takes only our RDF/XML data :)

Yann

> I do like the idea of having several (simple) channels for data
> publication over the web, which serve different goals.
> Maybe we need to better articulate the practices and expectations,
> though...
>
> Cheers,
>
> Antoine
>
>
> > Hi Giovanni,
> >
> > On 09/07/2011 23:10, Giovanni Tummarello wrote:
> >> Hi Nicolas,
> >>
> >> Its getting in Sindice indeed -
> >
> > Yes, I have noticed :)
> >
> >> quite politely e.g. 1 every 5 secs-
> >> we'll monitor speed and completeness. iff you think its ok for us to
> >> crawl faster please say so via robot.txt directive or just say so
> >
> > May I suggest that you crawl twice as fast ?
> >
> >>
> >> http://sindice.com/search?q=book&nq=&fq=domain%3Awww.sudoc.fr&sortbydate=1&interface=advanced
> >>
> >> at the same time i notice something funny in the markup e.g. if you go
> >> with a browser you get redirected to something that has almost no data
> >>
> >> for example the sitemap contains
> >>
> >> http://www.sudoc.fr/000000043
> >>
> >> if you go there you get redirected to
> >>
> >> http://www.sudoc.abes.fr/DB=2.1/SRCH?IKT=12&TRM=000000043
> >>
> >> which if you put in the inspector
> >>
> >> http://inspector.sindice.com/inspect?url=http%3A%2F%2Fwww.sudoc.abes.fr%2FDB%3D2.1%2FSRCH%3FIKT%3D12%26TRM%3D000000043#TRIPLES
> >>
> >> you get very little data
> >>
> >> however of course if i use the inspector on
> >> http://www.sudoc.fr/000000043 i get data
> >>
> >> http://inspector.sindice.com/inspect?url=http%3A%2F%2Fwww.sudoc.fr%2F000000043&content=&contentType=auto#TRIPLES
> >>
> >> which however is mostly schema.org data!
> >>
> >> but in sindice i have lots of RDF data with all sort of other
> >> ontologies
> >>
> >> http://sindice.com/search/page?url=http%3A%2F%2Fwww.sudoc.fr%2F000385123
> >>
> >> is there any way you could try to normalize all into a single markup
> >> type? i think it would be easier to debug and ultimately better for
> >> all..
> >
> > I will try to explain our intention, our constraints and the
> > mechanism we've implemented.
> >
> > - Intention -
> >
> > We want to meet several needs :
> > . providing RDF/XML to semantic-oriented clients like Sindice
> > . providing HTML + schema.org microdata to traditional search
> > engines like Google
> > . providing an HTML UI to users
> >
> >
> > - Constraints -
> >
> > . For some reasons, we can't add microdata to our traditional Sudoc
> > UI. Hence the necessity of special HTML+microdata pages for search
> > engines. :(
> > . HTML+microdata pages and RDF pages can't support the same
> > vocabularies, schema.org /oblige/.
> >
> >
> > - Mechanisms -
> >
> > Let's start from : http://www.sudoc.fr/132133520
> >
> > . If RDF/XML is called by the request, we provide RDF/XML content
> > (as if you had requested http://www.sudoc.fr/132133520.rdf)
> > It is what Sindice Crawler is doing and getting : the 55,764
> > documents that are found in your index are composed of triples
> > extracted from this RDF/XML page. It is what we expected. Fine :)
> >
> > . If our Apache server considers a user agent to be a robot and if
> > this agent does not ask for RDF/XML, we provide special HTML content
> > (as if you had requested http://www.sudoc.fr/132133520.html)
> > It seems to work as Google cache contains this kind of HTML +
> > schema.org microdata pages :
> > http://webcache.googleusercontent.com/search?q=cache:www.sudoc.fr/000000043
> >
> > . In other cases, we redirect to our traditional and non semantic UI :
> > http://www.sudoc.abes.fr/DB=2.1/SRCH?IKT=12&TRM=000000043
> > . NB : we have planned to add this <link> in this HTML page :
> > <link rel="alternate" type="application/rdf+xml"
> > href="http://www.sudoc.fr/000000043.rdf"/> and
> > <link rel="canonical" href="http://www.sudoc.fr/000000043"/>
> > to alleviate the URL confusion.
> >
> >
> > - - - - -
> >
> > . It is not simple, but it seems to work, ie Google, Sindice and
> > users seem to get what they should.
> > . Is there a better way to obtain the same results ?
> > . Which side effects are probable ?
> >
> > Thanks for your help and your attention !
> >
> > Yann
> >
> >>
> >> looking forward to support
> >> Giovanni
> >> Gio
> >>
> >>
> >> On Fri, Jul 8, 2011 at 1:27 PM, Kingsley Idehen
> >> <kidehen@openlinksw.com> wrote:
> >>> On 7/8/11 8:31 AM, Yann NICOLAS wrote:
> >>>
> >>> On 08/07/2011 01:42, Kingsley Idehen wrote:
> >>>
> >>> On 7/7/11 10:17 PM, Yann NICOLAS wrote:
> >>>
> >>> Hello,
> >>>
> >>> Sudoc [1], the French academic union catalogue maintained by ABES
> >>> [2], has just been released as linked open data.
> >>>
> >>> 10 million bibliographic records are now available as RDF/XML.
> >>>
> >>> Examples for the Sudoc record whose internal id is 132133520 :
> >>> . Resource URI : http://www.sudoc.fr/132133520/id
> >>> . Generic document : http://www.sudoc.fr/132133520 (content
> >>> negotiation is supported)
> >>>
> >>>
> >>> Great job!
> >>>
> >>> Is there an RDF dump anywhere?
> >>>
> >>>
> >>> Sorry, we don't provide any dump, as the 10 000 000 files are
> >>> generated on the fly from Oracle (stored as XML type + some more
> >>> tables).
> >>> We provide a complete sitemap at
> >>> http://www.sudoc.fr/noticesbiblio/sitemap.txt , and we hope that
> >>> Sindice will crawl the whole stuff.
> >>> Would it help ?
> >>>
> >>> Any advice welcome,
> >>>
> >>> Yann
> >>>
> >>> --
> >>> --
> >>> Yann NICOLAS
> >>> Etudes & Projets
> >>> ABES
> >>>
> >>> Okay, no problem with sitemaps as dump alternatives re. getting data
> >>> imported into Linked Data hubs such our LOD cloud cache and
> >>> Sindice etc..
> >>>
> >>>
> >>> --
> >>>
> >>> Regards,
> >>>
> >>> Kingsley Idehen
> >>> President & CEO
> >>> OpenLink Software
> >>> Web: http://www.openlinksw.com
> >>> Weblog: http://www.openlinksw.com/blog/~kidehen
> >>> Twitter/Identi.ca: kidehen
> >>>
> >
> >
> > --
> > --
> > Yann NICOLAS
> > Etudes & Projets
> > ABES
> >
Received on Sunday, 10 July 2011 10:44:58 UTC
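
The three-way dispatch described under "Mechanisms" in the quoted message can be sketched roughly as follows. This is only an illustration in Python, not the actual Apache configuration used by Sudoc, which is not shown in the thread. The function name choose_response, the ROBOT_PATTERN regular expression and the plain substring test on the Accept header are all assumptions made for the example.

```python
import re

# Hypothetical robot detection: the thread only says that the Apache server
# "considers a user agent to be a robot", without giving the actual rule.
ROBOT_PATTERN = re.compile(r"bot|crawler|spider|slurp", re.IGNORECASE)


def choose_response(record_id, accept="", user_agent=""):
    """Return (action, url) for a request on http://www.sudoc.fr/<record_id>.

    Mirrors the three cases listed in the message:
    1. RDF/XML requested         -> serve the .rdf representation
    2. robot, no RDF/XML request -> serve the HTML + microdata page
    3. anything else             -> redirect to the traditional UI
    """
    # Simplified check: a real server would parse q-values in the Accept header.
    if "application/rdf+xml" in accept.lower():
        return ("serve", f"http://www.sudoc.fr/{record_id}.rdf")
    if ROBOT_PATTERN.search(user_agent):
        return ("serve", f"http://www.sudoc.fr/{record_id}.html")
    return (
        "redirect",
        f"http://www.sudoc.abes.fr/DB=2.1/SRCH?IKT=12&TRM={record_id}",
    )


if __name__ == "__main__":
    print(choose_response("132133520", accept="application/rdf+xml"))
    print(choose_response("000000043", user_agent="Googlebot/2.1"))
    print(choose_response("000000043", accept="text/html",
                          user_agent="Mozilla/5.0 Firefox"))
```

Checking the Accept header before the user agent reflects the order given in the message: a crawler that asks for RDF/XML, as Sindice does, gets RDF/XML even though it is a robot.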