Re: ANN: Sudoc bibliographic and authority data from Yann Nicolas on 2011-07-10 (public-lod@w3.org from July 2011)

From: Yann Nicolas <nicolas@abes.fr>
Date: Sun, 10 Jul 2011 12:44:27 +0200 (CEST)
To: Antoine Isaac <aisaac@few.vu.nl>
Cc: public-lod@w3.org
Message-ID: <3271041.95838.1310294667051.JavaMail.root@eole.abes.fr>
Hi Antoine and all

----- Original message -----
> From: "Antoine Isaac" <aisaac@few.vu.nl>
> To: public-lod@w3.org
> Sent: Sunday, 10 July 2011 12:22:11
> Subject: Re: ANN: Sudoc bibliographic and authority data
> Yann, Giovanni,
> 
> 
> > Which side effects are probable ?
> 
> 
> Giovanni has made the same comment on data.europeana.eu a couple of
> weeks ago. The data we serve there is different from the RDFa mark-up
> on our web portal.
> We had some reasons to do this, including, well, that the RDFa data is
> mixing the info and non-info resources for making easier data
> consumption (not mandatorily by search engines, btw), and working with
> URIs that pre-date our linked data service.
> 
> The RDFa and the RDF obtained with LD-style conneg is also not about
> the same URIs, which should avoid any confusion.


I can't see any "confusion" if you publish complementary data about the same resource URI (in our case) through complementary technologies.
I can imagine a burden for crawlers and other data consumers, but where is the confusion ?

> But I can understand that if Sindice tries to fetch both data sources,
> it may assume the data to be the same. And this assumption could bring
> a number of undesirable side effects if Sindice merges all what it
> gets...

In our case, I don't see the *risk* of merging, but my point of view may be too narrow.

> 
> That being said, perhaps the solution lies in Sindice being less
> greedy ;-) and just work with the first data source it finds, for a
> given URI.


Sindice is actually very temperate: it takes only our RDF/XML data :)

Yann

> I do like the idea of having several (simple) channels for data
> publication over the web, which serve different goals.
> Maybe we need to better articulate the practices and expectations,
> though...
> 
> Cheers,
> 
> Antoine
> 
> 
> >   Hi Giovanni,
> >
> > On 09/07/2011 23:10, Giovanni Tummarello wrote:
> >> Hi Nicolas,
> >>
> >> It's getting into Sindice indeed -
> >
> > Yes, I have noticed :)
> >
> >> quite politely, e.g. 1 every 5 secs -
> >> we'll monitor speed and completeness. If you think it's OK for us
> >> to
> >> crawl faster, please say so via a robots.txt directive or just say so
> > May I suggest that you crawl twice as fast?
> >
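
As a rough sanity check on the crawl rate discussed above, here is a back-of-the-envelope sketch in Python. It assumes one request per record, the 10 million records mentioned in the announcement, and a constant rate with no retries or pauses; the function name is only illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def crawl_days(records: int, seconds_per_request: float) -> float:
    """Days needed to fetch every record once at a constant rate."""
    return records * seconds_per_request / SECONDS_PER_DAY


RECORDS = 10_000_000  # figure quoted in the announcement

print(round(crawl_days(RECORDS, 5.0)))  # one request every 5 seconds
print(round(crawl_days(RECORDS, 2.5)))  # twice as fast
```

Under these assumptions a full pass takes about 579 days at one request every 5 seconds, and about 289 days at twice that rate.
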
> >>
> >> http://sindice.com/search?q=book&nq=&fq=domain%3Awww.sudoc.fr&sortbydate=1&interface=advanced
> >>
> >> at the same time i notice something funny in the markup e.g. if you
> >> go
> >> with a browser you get redirected to something that has almost no
> >> data
> >>
> >> for example the sitemap contains
> >>
> >> http://www.sudoc.fr/000000043
> >>
> >> if you go there you get redirected to
> >>
> >> http://www.sudoc.abes.fr/DB=2.1/SRCH?IKT=12&TRM=000000043
> >>
> >> which if you put in the inspector
> >>
> >> http://inspector.sindice.com/inspect?url=http%3A%2F%2Fwww.sudoc.abes.fr%2FDB%3D2.1%2FSRCH%3FIKT%3D12%26TRM%3D000000043#TRIPLES
> >>
> >> you get very little data
> >>
> >> however of course if i use the inspector on
> >> http://www.sudoc.fr/000000043 i get data
> >>
> >> http://inspector.sindice.com/inspect?url=http%3A%2F%2Fwww.sudoc.fr%2F000000043&content=&contentType=auto#TRIPLES
> >>
> >> which however is mostly schema.org data!
> >>
> >> but in sindice i have lots of RDF data with all sort of other
> >> ontologies
> >>
> >> http://sindice.com/search/page?url=http%3A%2F%2Fwww.sudoc.fr%2F000385123
> >>
> >> is there any way you could try to normalize all into a single
> >> markup
> >> type? i think it would be easier to debug and ultimately better for
> >> all..
> >
> > I will try to explain our intention, our constraints and the
> > mechanism we've implemented.
> >
> > - Intention -
> >
> > We want to meet several needs :
> > . providing RDF/XML to semantic-oriented clients like Sindice
> > . providing HTML + schema.org microdata to traditional search
> > engines like Google
> > . providing an HTML UI to users
> >
> >
> > - Constraints -
> >
> > . For some reasons, we can't add microdata to our traditional Sudoc
> > UI. Hence the necessity of special HTML+microdata pages for search
> > engines. :(
> > . HTML+microdata pages and RDF pages can't support the same
> > vocabularies, schema.org /oblige/.
> >
> >
> > - Mechanisms -
> >
> > Let's start from : http://www.sudoc.fr/132133520
> >
> > . If RDF/XML is called by the request, we provide RDF/XML content
> > (as if you had requested http://www.sudoc.fr/132133520.rdf)
> > It is what Sindice Crawler is doing and getting : the 55,764
> > documents that are found in your index are composed of triples
> > extracted from this RDF/XML page. It is what we expected. Fine :)
> >
> > . If our Apache server considers a user agent to be a robot and if
> > this agent does not ask for RDF/XML, we provide special HTML content
> > (as if you had requested http://www.sudoc.fr/132133520.html)
> > It seems to work as Google cache contains this kind of HTML +
> > schema.org microdata pages :
> > http://webcache.googleusercontent.com/search?q=cache:www.sudoc.fr/000000043
> >
> > . In other cases, we redirect to our traditional and non semantic UI
> > : http://www.sudoc.abes.fr/DB=2.1/SRCH?IKT=12&TRM=000000043
> > . NB : we have planned to add this <link> in this HTML page : <link
> > rel="alternate" type="application/rdf+xml"
> > href="http://www.sudoc.fr/000000043.rdf"/> and <link rel="canonical"
> > href="http://www.sudoc.fr/000000043"/> to alleviate the URL
> > confusion.
> >
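
The three cases above can be sketched as a small decision function in Python. This is only an illustration of the described behaviour, not the actual Apache configuration: the robot test in particular is a hypothetical substring check, since the real rule is not given in this thread, and q-values in the Accept header are ignored.

```python
# Hypothetical robot markers; the real server-side rule is not described here.
ROBOT_MARKERS = ("bot", "crawler", "spider")


def wants_rdf(accept: str) -> bool:
    """True if the Accept header lists RDF/XML (q-values are ignored)."""
    media_types = [part.split(";")[0].strip().lower() for part in accept.split(",")]
    return "application/rdf+xml" in media_types


def looks_like_robot(user_agent: str) -> bool:
    """Crude guess at whether the user agent is a robot."""
    ua = user_agent.lower()
    return any(marker in ua for marker in ROBOT_MARKERS)


def choose_representation(record_id: str, accept: str, user_agent: str) -> str:
    """Return the URL matching the response for http://www.sudoc.fr/<record_id>."""
    if wants_rdf(accept):
        # Case 1: RDF/XML requested, same content as the .rdf URL
        return f"http://www.sudoc.fr/{record_id}.rdf"
    if looks_like_robot(user_agent):
        # Case 2: robot not asking for RDF/XML, HTML + microdata page
        return f"http://www.sudoc.fr/{record_id}.html"
    # Case 3: everyone else is redirected to the traditional UI
    return f"http://www.sudoc.abes.fr/DB=2.1/SRCH?IKT=12&TRM={record_id}"
```

The order of the checks matters: a robot that explicitly asks for RDF/XML gets RDF/XML, which matches what the Sindice crawler receives.
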
> >
> > - - - - -
> >
> > . It is not simple, but it seems to work, ie Google, Sindice and
> > users seem to get what they should.
> > . Is there a better way to obtain the same results ?
> > . Which side effects are probable ?
> >
> > Thanks for your help and your attention !
> >
> > Yann
> >
> >>
> >> looking forward to support
> >> Giovanni
> >> Gio
> >>
> >>
> >> On Fri, Jul 8, 2011 at 1:27 PM, Kingsley
> >> Idehen <kidehen@openlinksw.com> wrote:
> >>> On 7/8/11 8:31 AM, Yann NICOLAS wrote:
> >>>
> >>> On 08/07/2011 01:42, Kingsley Idehen wrote:
> >>>
> >>> On 7/7/11 10:17 PM, Yann NICOLAS wrote:
> >>>
> >>> Hello,
> >>>
> >>> Sudoc [1], the French academic union catalogue maintained by ABES
> >>> [2], has
> >>> just been released as linked open data.
> >>>
> >>> 10 million bibliographic records are now available as RDF/XML.
> >>>
> >>> Examples for the Sudoc record whose internal id is 132133520 :
> >>> . Resource URI: http://www.sudoc.fr/132133520/id
> >>> . Generic document: http://www.sudoc.fr/132133520 (content
> >>> negotiation is
> >>> supported)
> >>>
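
For readers who want to try the content negotiation mentioned above, here is a minimal client-side sketch using only the Python standard library. The helper name is made up for illustration, and the request is built but not sent, so the snippet runs without network access.

```python
import urllib.request


def rdf_request(record_id: str) -> urllib.request.Request:
    """Build (but do not send) a request for the RDF/XML view of a record."""
    url = f"http://www.sudoc.fr/{record_id}"
    return urllib.request.Request(url, headers={"Accept": "application/rdf+xml"})


req = rdf_request("132133520")
print(req.full_url, req.get_header("Accept"))
# To actually fetch the data: urllib.request.urlopen(req).read()
```
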
> >>>
> >>> Great job!
> >>>
> >>> Is there an RDF dump anywhere?
> >>>
> >>>
> >>> Sorry, we don't provide any dump, as the 10 000 000 files are
> >>> generated on
> >>> the fly from Oracle (stored as XML type + some more tables).
> >>> We provide a complete sitemap at
> >>> http://www.sudoc.fr/noticesbiblio/sitemap.txt , and we hope that
> >>> Sindice
> >>> will crawl the whole stuff.
> >>> Would it help ?
> >>>
> >>> Any advice welcome,
> >>>
> >>> Yann
> >>>
> >>> --
> >>> --
> >>> Yann NICOLAS
> >>> Etudes & Projets
> >>> ABES
> >>>
> >>> Okay, no problem with sitemaps as dump alternatives re. getting
> >>> data
> >>> imported into Linked Data hubs such as our LOD cloud cache and
> >>> Sindice etc.
> >>>
> >>>
> >>> --
> >>>
> >>> Regards,
> >>>
> >>> Kingsley Idehen
> >>> President & CEO
> >>> OpenLink Software
> >>> Web: http://www.openlinksw.com
> >>> Weblog: http://www.openlinksw.com/blog/~kidehen
> >>> Twitter/Identi.ca: kidehen
> >>>
> >>>
> >>>
> >>>
> >>>
> >
> >
> > --
> > --
> > Yann NICOLAS
> > Etudes & Projets
> > ABES
> >
Received on Sunday, 10 July 2011 10:44:58 UTC