Re: Linked data sets for evaluating interlinking?

Hi Cristina,

(Sorry to be so late to this thread, but I have been on vacation.)

I agree with the points that Hugh has made.
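
To make Hugh's distinction between the two ways of publishing links concrete, here is a minimal sketch that separates explicit equivalence triples from "foreign" URIs used directly. It assumes triples are held as plain (subject, predicate, object) string tuples; the function name find_links, the example.org namespace and the sample data are all hypothetical, and treating every non-local URI as a link is a simplification.

```python
# Illustrative sketch only: the sample triples and the local namespace are made up.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EQUIVALENCE_PREDICATES = {OWL_SAME_AS, SKOS + "exactMatch", SKOS + "closeMatch"}


def is_uri(term):
    """Crude check that a term is an HTTP(S) URI rather than a literal."""
    return term.startswith(("http://", "https://"))


def find_links(triples, local_prefix):
    """Return (explicit equivalence triples, set of foreign URIs used directly)."""
    explicit = []
    foreign = set()
    for s, p, o in triples:
        if p in EQUIVALENCE_PREDICATES:
            # Type (1): the link is stated with an equivalence predicate.
            explicit.append((s, p, o))
            continue
        # Type (2): a non-local URI used as subject or object.
        # Objects of rdf:type are vocabulary classes, so they are skipped.
        candidates = [s] if p == RDF_TYPE else [s, o]
        for term in candidates:
            if is_uri(term) and not term.startswith(local_prefix):
                foreign.add(term)
    return explicit, foreign


SAMPLE = [
    ("http://example.org/data/person/1", OWL_SAME_AS,
     "http://dbpedia.org/resource/Augustus"),
    ("http://dbpedia.org/resource/Tiberius",
     "http://example.org/vocab/mintedAt", "http://example.org/data/mint/7"),
    ("http://example.org/data/person/1",
     "http://www.w3.org/2000/01/rdf-schema#label", '"Augustus"'),
]

explicit, foreign = find_links(SAMPLE, "http://example.org/")
```

With the sample data, the first triple is reported as an explicit link and the dbpedia URI in the second triple is reported as a foreign URI.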

I currently work on the problem of managing co-reference within the Open PHACTS project, which is developing a linked data platform for drug discovery [1]. Within the scope of this project we have developed, in Manchester and Maastricht, an Identity Mapping Service that uses linksets between datasets to provide details of operationally equivalent URIs. (Our goal is to enable scientists to switch between different notions of operational equivalence [2,3].) You can access the mapping service through the Open PHACTS API using the mapURL method [4].
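
As a rough illustration of what switching between notions of operational equivalence can mean, the sketch below only follows links whose justification is accepted by the caller. This is not the Open PHACTS API or the way the Identity Mapping Service is implemented; the function names, the justification labels and the example.org URIs are assumptions made up for the example.

```python
from collections import defaultdict

# Each link is (source URI, target URI, justification label). All hypothetical.
LINKS = [
    ("http://example.org/a/compound1", "http://example.org/b/mol9",
     "same-structure"),
    ("http://example.org/a/compound1", "http://example.org/c/drug4",
     "same-active-ingredient"),
]


def build_index(links, accepted):
    """Index only the links whose justification is in the accepted set."""
    index = defaultdict(set)
    for src, dst, why in links:
        if why in accepted:
            # Treat each accepted link as symmetric.
            index[src].add(dst)
            index[dst].add(src)
    return index


def map_uri(uri, index):
    """Return the URI plus everything directly linked to it (one hop only)."""
    return {uri} | index.get(uri, set())


strict = build_index(LINKS, {"same-structure"})
relaxed = build_index(LINKS, {"same-structure", "same-active-ingredient"})

strict_result = map_uri("http://example.org/a/compound1", strict)
relaxed_result = map_uri("http://example.org/a/compound1", relaxed)
```

The same URI maps to two equivalents under the strict setting and three under the relaxed one, so the choice of accepted justifications changes which URIs are treated as the same thing.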

What will be of interest to you is that we have linksets that relate pairs of pharmacology datasets [5]. Each linkset should come with a VoID description (some are currently missing and we are working on that), and in version 1.3 these descriptions include details of the interpretation of the operational equivalence (mostly in the CRS linksets). Details of the dataset description standard that we are following in Open PHACTS are available from [6].
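
For a sense of what a VoID linkset description carries, here is a small sketch that pulls out the link predicate and the two target datasets for each void:Linkset. It assumes the description has already been parsed into (subject, predicate, object) string tuples; the function names and the example.org data are hypothetical and do not reproduce any actual Open PHACTS linkset.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
VOID = "http://rdfs.org/ns/void#"

# VoID properties of interest, mapped to short keys for the summary.
WANTED = {
    VOID + "linkPredicate": "link_predicate",
    VOID + "subjectsTarget": "subjects_target",
    VOID + "objectsTarget": "objects_target",
}


def describe_linksets(triples):
    """Summarise every void:Linkset found in the given triples."""
    linksets = {s for s, p, o in triples
                if p == RDF_TYPE and o == VOID + "Linkset"}
    summary = {ls: {} for ls in linksets}
    for s, p, o in triples:
        if s in summary and p in WANTED:
            summary[s][WANTED[p]] = o
    return summary


def incomplete(summary):
    """Return the linksets that lack any of the wanted properties."""
    return {ls for ls, props in summary.items()
            if set(props) != set(WANTED.values())}


# Hypothetical description: one complete linkset, one missing its predicate.
SAMPLE = [
    ("http://example.org/ls/1", RDF_TYPE, VOID + "Linkset"),
    ("http://example.org/ls/1", VOID + "linkPredicate",
     "http://www.w3.org/2004/02/skos/core#exactMatch"),
    ("http://example.org/ls/1", VOID + "subjectsTarget", "http://example.org/ds/a"),
    ("http://example.org/ls/1", VOID + "objectsTarget", "http://example.org/ds/b"),
    ("http://example.org/ls/2", RDF_TYPE, VOID + "Linkset"),
    ("http://example.org/ls/2", VOID + "subjectsTarget", "http://example.org/ds/a"),
    ("http://example.org/ls/2", VOID + "objectsTarget", "http://example.org/ds/c"),
]

summary = describe_linksets(SAMPLE)
missing = incomplete(summary)
```

A check like incomplete() is one simple way to spot linksets whose descriptions still need filling in before they are used in an evaluation.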

Hope this all helps you. I'm happy to answer any questions you may have.

Alasdair

[1] http://www.openphacts.org
[2] http://ceur-ws.org/Vol-951/paper5.pdf
[3] http://www.mendeley.com/download/public/3386741/6069407874/3cb6576c86e8459836526ab4856264f4be251a53/dl.pdf
[4] https://dev.openphacts.org/docs
[5] http://openphacts.cs.man.ac.uk/ims/linkset/
[6] http://www.cs.man.ac.uk/~graya/ops/2012/ED-datadesc/

On 26 Aug 2013, at 20:08, Hugh Glaser <hg@ecs.soton.ac.uk> wrote:

> Hi Cristina,
> Some interesting issues you raise.
> One of them is how people publish links (which enables your analysis).
> There are two ways this happens.
> 1) People add triples to their dataset that have an equivalence predicate (owl:sameAs, skos:exactMatch, skos:closeMatch, etc.)
> 2) People use a "foreign" URI (very commonly a dbpedia URI), because when turning their data into RDF they have decided that the entity they are concerned with is the same as the dbpedia one. The second paragraph of Tom's message describes such a linkage, I think.
> I think these distinctions are behind the comments of Milorad, where he is assuming the type (2) way.
> Either of these methods should be fodder for you, and you may well find that the type (2) way is used by a dataset that is useful to you.
> It may be harder for you to process, as the linkage is not so explicit because there is no distinct URI for the resource in the database, different from the "foreign" one. But any "foreign" URI is in fact a link.
> You will find that people have tended towards type (2) linkage because they can shy away from having lots of equivalence predicates in their datasets, not least because there was a time when RDF stores did not comfortably do owl:sameAs inference, and so they do the linking at RDF conversion time, and use "foreign" URIs.
> 
> Another interesting issue is more fundamental to your work.
> You seem to think that there must be a "gold standard or reference interlinking" for equivalence.
> As long-time readers of this list will have seen discussed many times (!), it is not a simple matter.
> It is a complex matter to have such a thing, which is a necessity for you to do your precision/recall statistics.
> At its most basic, for example, am I as a private citizen the same as me as a member of my University or me as a member of my company?
> The answer is, of course, yes and no.
> Another field that has spent a lot of time on this is the FRBR world (http://en.wikipedia.org/wiki/Functional_Requirements_for_Bibliographic_Records).
> If I have a book of the Semantic Web, is it the same as your book of the same name?
> Perhaps. What if it is a different (corrected) edition? An electronic version?
> Certainly a library will usually consider each book a different thing, but if you are asking how many books the author has published, you want to treat all the books as the same resource.
> 
> So in asking for a "gold standard or reference interlinking", I think you are chasing a chimera.
> What you can do is choose datasets, and then you will need to find out what the policies of the equivalence creators are; and then you will need to build your system so that it implements the same policies.
> By the way, policies usually relate to the way in which the dataset will be used, rather than the wishes of the publisher of the data - there is no absolute truth in this. Some would argue there is never any equivalence: "One cannot step once (sic) into the same stream" (http://en.wikipedia.org/wiki/Cratylus)
> 
> It's great you have asked the question - convincing research in this field is very challenging!
> 
> Best
> Hugh
> 
> On 26 Aug 2013, at 14:16, Tom Elliott <tom.elliott@nyu.edu> wrote:
> 
>> Hi all:
>> 
>> Two humanities datasets of potential interest in this regard:
>> 
>> A number of datasets (around 20 different ones I think) related to the study of antiquity have aligned their geographic/toponymic fields with the Pleiades gazetteer (http://pleiades.stoa.org) and published RDF accordingly. Most of this work has been done under the auspices of something called the Pelagios Project, and the alignment processes used by many of the participants are documented in blog posts at http://pelagios-project.blogspot.com/ (most of them a combination of automated and manual). Pleiades itself is also a linked data resource, and has a growing number (still only a small percentage of its content) of outbound links to dbpedia, geonames, and OSM. All of those outbound links are hand-curated. Contributors to Pleiades, where possible, are aligned to VIAF (manually) and bibliography in Pleiades is also beginning to be aligned to the Open Library and Worldcat (again, manually).
>> 
>> On a much smaller scale, I offer the "About Roman Emperors" dataset, which rather than minting its own URIs for the Roman emperors, uses the dbpedia resource URIs for each: http://www.paregorios.org/resources/roman-emperors/. The primary purpose of the dataset is to provide a comprehensive list of these for easy access and reuse by third parties, and to associate the dbpedia URIs with corresponding Roman imperial mint and minting authority data in nomisma.org and finds.org.uk, and to a static, late-90s-vintage scholarly encyclopedia of Roman emperors: http://www.roman-emperors.org/
>> 
>> Tom
>> 
>> 
>> Tom Elliott, Ph.D.
>> Associate Director for Digital Programs and Senior Research Scholar
>> Institute for the Study of the Ancient World (NYU)
>> http://isaw.nyu.edu/people/staff/tom-elliott
>> 
>> 
>> 
>> On Aug 26, 2013, at 6:04 AM, Adrian Stevenson wrote:
>> 
>>> Hi All
>>> 
>>> As part of the LOCAH and Linking Lives projects, the latter in particular, we've been doing a lot of this auto and manual linking work, mainly to VIAF and DBPedia, with some links to things like LCSH and Geonames. We've been doing a lot of work just recently in fact, and we've published a blog post that's picked up quite a bit of interest on this - http://archiveshub.ac.uk/blog/2013/08/hub-viaf-namematching/. We haven't published our latest run of data yet, but we hope to finish this soon. It'll probably still be about a month or so as a few of us are on holiday soon.
>>> 
>>> We do have quite a few links done semi-automatically in our existing data set, accessible via http://data.archiveshub.ac.uk, but as I say we are updating this, so I'd suggest not taking the URIs and data available there as the final word.
>>> 
>>> A good example is http://data.archiveshub.ac.uk/page/person/nra/webbmarthabeatrice1858-1943socialreformer
>>> 
>>> Project URIs:
>>> http://archiveshub.ac.uk/locah/
>>> http://archiveshub.ac.uk/linkinglives/
>>> 
>>> Adrian
>>> _____________________________
>>> Adrian Stevenson
>>> Senior Technical Innovations Coordinator
>>> Mimas, The University of Manchester
>>> Devonshire House, Oxford Road
>>> Manchester M13 9QH
>>> 
>>> Email: adrian.stevenson@manchester.ac.uk
>>> Tel: +44 (0) 161 275 6065
>>> http://www.mimas.ac.uk
>>> http://www.twitter.com/adrianstevenson
>>> http://uk.linkedin.com/in/adrianstevenson/
>>> 
>>> On 22 Aug 2013, at 16:06, Cristina Sarasua wrote:
>>> 
>>>> Hi, 
>>>> 
>>>> I am looking for pairs of linked data sets that can be used as gold standard for evaluations. I would need pairs of data sets which have been manually linked, or data sets which have been (semi-)automatically linked with interlinking tools, and afterwards reviewed (to include the links which are not identified by tools). I have looked into the DataHub catalogue and queried VoID descriptions, but unfortunately the information about how the interlinking process was carried out is often missing.
>>>> 
>>>> Apart from the data sets which have been used in the OAEI-instance matching track, could anyone recommend (based on past experience) good data sets for evaluating data interlinking processes?
>>>> 
>>>> Thanks in advance.
>>>> 
>>>> Kind regards, 
>>>> 
>>>> Cristina
>>>> -- 
>>>> Cristina Sarasua
>>>> 
>>>> Institute for Web Science and Technologies (WeST)
>>>> 
>>>> Universität Koblenz-Landau
>>>> Universitätsstraße 1
>>>> 56070 Koblenz
>>>> Germany
>>>> 
>>>> e: csarasua@uni-koblenz.de
>>>> 
>>>> p: +49 261 287 2772
>>>> f: +49 261 287 100 2772
>>>> w: http://west.uni-koblenz.de
>>> 
>>> 
>> 
>> 
>> 
> 
> 

Dr Alasdair J G Gray
Research Associate
Alasdair.Gray@manchester.ac.uk
+44 161 275 0145

http://www.cs.man.ac.uk/~graya/


Received on Friday, 6 September 2013 10:17:12 UTC