- From: <Csarasua@uni-koblenz.de>
- Date: Tue, 27 Aug 2013 17:57:37 +0200
- To: "Hugh Glaser" <hg@ecs.soton.ac.uk>
- Cc: "Cristina Sarasua" <csarasua@uni-koblenz.de>, "Linked Data community" <public-lod@w3.org>
Hi Hugh,

> Hi Cristina,
> Some interesting issues you raise.
> One of them is how people publish links (which enables your analysis).
> There are two ways this happens.
> 1) People add triples to their dataset that have an equivalence predicate (owl:sameAs, skos:exactMatch, skos:closeMatch, etc.)
> 2) People use a "foreign" URI (very commonly a dbpedia URI), because when turning their data into RDF they have decided that the entity they are concerned with is the same as the dbpedia one. The second paragraph of Tom's message describes such a linkage, I think.
> I think these distinctions are behind the comments of Milorad, where he is assuming the type (2) way.
> Either of these methods should be fodder for you, and you may well find that the type (2) way is used by a dataset that is useful to you.
>

I agree, it is important to distinguish between different types of links. When I refer to interlinking I have in mind triples (s, p, o), where "s" and "o" are resources from different data sets, and "p" is either a property like owl:sameAs or a domain-specific property like foaf:knows. I think this corresponds to what you specified in 1) and 2). I would like to have both kinds of links in my evaluation (if possible).

> It may be harder for you to process, as the linkage is not so explicit because there is no distinct URI for the resource in the database, different from the "foreign" one. But any "foreign" URI is in fact a link.
> You will find that people have tended towards type (2) linkage because they can shy away from having lots of equivalence predicates in their datasets, not least because there was a time when RDF stores did not comfortably do owl:sameAs inference, and so they do the linking at RDF conversion time, and use "foreign" URIs.
>
> Another interesting issue is more fundamental to your work.
> You seem to think that there must be a "gold standard or reference interlinking" for equivalence.
> As long-time readers of this list will have seen discussed many times (!), it is not a simple matter.
>
> It is a complex matter to have such a thing, which is a necessity for you to do your precision/recall statistics.
> At its most basic, for example, am I as a private citizen the same as me as a member of my University or me as a member of my company?
> The answer is, of course yes and no.
> Another field that has spent a lot of time on this is the FRBR world (http://en.wikipedia.org/wiki/Functional_Requirements_for_Bibliographic_Records).
> If I have a book of the Semantic Web, is it the same as your book of the same name?
> Perhaps. What if it is a different (corrected) edition? An electronic version?
> Certainly a library will usually consider each book a different thing, but if you are asking how many books the author has published, you want to treat all the books as the same resource.
>

I understand the point, and I find it very interesting indeed. I guess that it might depend on the context where the data was created or will be used. This reminds me of the paper about the analysis of identity links (http://www.w3.org/2009/12/rdf-ws/papers/ws21). However, I think that it is possible to evaluate different interlinking techniques by establishing some gold standard (e.g. the links between the cities of a data set describing the population of European cities and a data set describing the cities as tourist attractions), so that the results can be analysed in terms of precision and recall, and one can say that one tool is able to do certain things while the other is not. Regarding the humans behind the manual definition or the reviewing process of a reference interlinking, I would expect them to be knowledgeable (i.e. domain experts should have been part of the process at least).

>
> So in asking for a "gold standard or reference interlinking", I think you are chasing a chimera.
> What you can do is choose datasets and then you will need to find out what the policies of the equivalence creators are; and then you will need to build your system so that it implements the same policies.
> By the way, policies usually relate to the way in which the dataset will be used, rather than the wishes of the publisher of the data - there is no absolute truth in this. Some would argue there is never any equivalence: "One cannot step once (sic) into the same stream" (http://en.wikipedia.org/wiki/Cratylus)
>

Thanks for the suggestion.

> It's great you have asked the question - convincing research in this field is very challenging!
>
> Best
> Hugh
>

Best,
Cristina
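
As a rough illustration of the evaluation discussed in this message, the sketch below collects triples (s, p, o) whose subject and object come from different data sets and scores them against a reference interlinking using precision and recall. Everything in it is an assumption made for the example: the `example.org` and `example.net` URIs, the reference links, and in particular the shortcut of treating "different host name" as "different data set". It does not describe how any existing interlinking tool works.

```python
from urllib.parse import urlparse

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def is_interlink(triple):
    """True if subject and object are URIs hosted on different sites.

    Assumption: the host name stands in for the data set. This covers both
    equivalence links and "foreign" URIs used with any other predicate.
    """
    s, _, o = triple
    if not o.startswith("http"):  # literal object, not a link
        return False
    return urlparse(s).netloc != urlparse(o).netloc


def precision_recall(found, reference):
    """Precision and recall of a set of links against a reference set."""
    found, reference = set(found), set(reference)
    correct = found & reference
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical output of an interlinking tool (all URIs are made up).
triples = [
    ("http://cities.example.org/Koblenz", OWL_SAME_AS,
     "http://tourism.example.net/koblenz"),
    ("http://cities.example.org/Bonn", OWL_SAME_AS,
     "http://tourism.example.net/cologne"),   # a wrong link
    ("http://cities.example.org/Koblenz",
     "http://cities.example.org/name", "Koblenz"),  # literal
    ("http://cities.example.org/Koblenz",
     "http://cities.example.org/neighbour",
     "http://cities.example.org/Bonn"),       # internal link
]

# Hypothetical reference interlinking agreed on by domain experts.
reference = {
    ("http://cities.example.org/Koblenz", OWL_SAME_AS,
     "http://tourism.example.net/koblenz"),
    ("http://cities.example.org/Bonn", OWL_SAME_AS,
     "http://tourism.example.net/bonn"),
}

found = [t for t in triples if is_interlink(t)]
precision, recall = precision_recall(found, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The scores are only as meaningful as the reference set, which is where the equivalence policy raised in the quoted text comes in: a different policy for what counts as "the same" yields a different reference and therefore different numbers.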
Received on Tuesday, 27 August 2013 15:58:02 UTC