Re: Linked data sets for evaluating interlinking?

Hi Hugh,


> Hi,
> Thanks.
> Just one comment, relating to the cities example you use.
> The paper you cite mentions cities and says: "For example, the city of
> Paris is referenced in a number of different Linked Data-sets: ranging
> from OpenCyc to the New York Times. In DBPedia, a Linked Data export of
> Wikipedia, these data-sets are connected by owl:sameAs. In particular,
> dbpedia:Paris is owl:sameAs as both the opencyc:CityOfParisFrance and
> opencyc:Paris DepartmentFrance, as OpenCyc distinguishes that “the
> department of Paris. Paris DepartmentFrance is a distinct geopolitical
> entity from CityOfParisFrance, despite the fact that both share the same
> territory, while Wikipedia does not make this distinction."
>

True. They mention such an example, and they also encourage the use of
something like skos:closeMatch for the department of Paris. I guess
different things could be analysed; for example, the use of properties and
the matching techniques. And you are right, depending on the reference I
choose I will observe more or less useful things.
There can be data sets which do not contain these delicate links, and they
are still worthwhile for comparing different ways of performing data
interlinking.
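
A minimal sketch of that idea, in Python with plain tuples rather than an RDF library: a reference link set can keep strict identity links apart from looser ones, so a comparison can be run with or without the delicate links. The grouping of predicates and the exact spelling of the OpenCyc department identifier are assumptions made for illustration.

```python
# Illustrative only: CURIEs are plain strings, and the spelling of the
# OpenCyc department identifier below is an assumption.
STRICT = {"owl:sameAs", "skos:exactMatch"}
LOOSE = {"skos:closeMatch"}

links = [
    ("dbpedia:Paris", "owl:sameAs", "opencyc:CityOfParisFrance"),
    ("dbpedia:Paris", "skos:closeMatch", "opencyc:Paris_DepartmentFrance"),
]


def split_by_strength(triples):
    """Separate strict identity links from looser 'close match' links."""
    strict = [t for t in triples if t[1] in STRICT]
    loose = [t for t in triples if t[1] in LOOSE]
    return strict, loose


strict_links, loose_links = split_by_strength(links)
```

Evaluating a tool against strict_links only, and then against both lists, would show how much the delicate cases affect the result.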

In any case, I really appreciate this kind of discussion.

Best,
Cristina

t there are  where the cities do not
> So even cities (actually especially cities and other geo things) have
> significant challenges here.
> Geo-political v. geographic v. the geo-extent v. the nounSynset etc.
> And we haven't even mentioned temporal aspects.
>
> So I do worry about all this.
> If the dataset is simple enough that you can ignore the problems, then the
> question is if the exercise tells you anything useful.
> If the dataset is more complicated, for example having both geo-political
> and geographic and wanting to keep them separate, then the question is
> also whether the exercise tells you anything useful!
>
> But if something is hard and challenging it is more reason to do it, I
> guess.
> Good luck.
> Hugh
>
> On 27 Aug 2013, at 16:57, <Csarasua@uni-koblenz.de>
>  wrote:
>
>> Hi Hugh,
>>
>>
>>> Hi Cristina,
>>> Some interesting issues you raise.
>>> One of them is how people publish links (which enables your analysis).
>>> There are two ways this happens.
>>> 1) People add triples to their dataset that have an equivalence
>>> predicate (owl:sameAs, skos:exactMatch, skos:closeMatch, etc.)
>>> 2) People use a "foreign" URI (very commonly a dbpedia URI), because
>>> when turning their data into RDF they have decided that the entity
>>> they are concerned with is the same as the dbpedia one. The second
>>> paragraph of Tom's message describes such a linkage, I think.
>>> I think these distinctions are behind the comments of Milorad, where he
>>> is assuming the type (2) way.
>>> Either of these methods should be fodder for you, and you may well find
>>> that the type (2) way is used by a dataset that is useful to you.
>>>
>>>
>> I agree, it is important to distinguish between different types of
>> links. When I refer to interlinking I have in mind triples (s, p, o),
>> where "s" and "o" are resources from different data sets, and "p" is
>> either a property like owl:sameAs or a domain-specific property like
>> foaf:knows. I think this corresponds to what you specified in 1) and 2).
>> I would like to have both kinds of links in my evaluation (if possible).
>>
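
The definition above (triples whose subject and object come from different data sets) can be sketched roughly as follows. This is only an illustration: it assumes that the host part of a URI identifies the data set, which is a simplification, and the example.org / example.net URIs are hypothetical. A real analysis would also need to exclude vocabulary terms (e.g. rdf:type objects), which this sketch does not do.

```python
from urllib.parse import urlparse


def is_uri(term):
    """Crude check: treat anything that is not an http(s) URI as a literal."""
    return term.startswith("http://") or term.startswith("https://")


def dataset_of(uri):
    """Assumption: the URI host stands in for the data set."""
    return urlparse(uri).netloc


def interlinks(triples):
    """Keep triples whose subject and object belong to different data sets."""
    return [
        (s, p, o)
        for s, p, o in triples
        if is_uri(s) and is_uri(o) and dataset_of(s) != dataset_of(o)
    ]


triples = [
    # explicit equivalence link
    ("http://example.org/city/paris",
     "http://www.w3.org/2002/07/owl#sameAs",
     "http://dbpedia.org/resource/Paris"),
    # domain-specific link across data sets
    ("http://example.org/person/alice",
     "http://xmlns.com/foaf/0.1/knows",
     "http://example.net/person/bob"),
    # same data set on both sides: not an interlink
    ("http://example.org/person/alice",
     "http://xmlns.com/foaf/0.1/knows",
     "http://example.org/person/carol"),
    # literal object: not an interlink
    ("http://example.org/city/paris",
     "http://www.w3.org/2000/01/rdf-schema#label",
     "Paris"),
]

cross_links = interlinks(triples)
```

Because the filter looks only at where the object URI lives, it also picks up direct use of a "foreign" URI, not just equivalence predicates.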
>>> It may be harder for you to process, as the linkage is not so explicit
>>> because there is no distinct URI for the resource in the database,
>>> different from the "foreign" one. But any "foreign" URI is in fact a
>>> link.
>>> You will find that people have tended towards type (2) linkage because
>>> they can shy away from having lots of equivalence predicates in their
>>> datasets, not least because there was a time when RDF stores did not
>>> comfortably do owl:sameAs inference, and so they do the linking at RDF
>>> conversion time, and use "foreign" URIs.
>>>
>>> Another interesting issue is more fundamental to your work.
>>> You seem to think that there must be a "gold standard or reference
>>> interlinking" for equivalence.
>>> As long-time readers of this list will have seen discussed many times
>>> (!), it is not a simple matter.
>>>
>>> It is a complex matter to have such a thing, which is a necessity for
>>> you to do your precision/recall statistics.
>>> At its most basic, for example, am I as a private citizen the same as
>>> me as a member of my University or me as a member of my company?
>>> The answer is, of course yes and no.
>>> Another field that has spent a lot of time on this is the FRBR world
>>> (http://en.wikipedia.org/wiki/Functional_Requirements_for_Bibliographic_Records).
>>> If I have a book of the Semantic Web, is it the same as your book of
>>> the same name?
>>> Perhaps. What if it is a different (corrected) edition? An electronic
>>> version?
>>> Certainly a library will usually consider each book a different thing,
>>> but if you are asking how many books the author has published, you want
>>> to treat all the books as the same resource.
>>>
>> I understand the point, and I find it very interesting, indeed. I guess
>> that it might depend on the context where the data was created / will be
>> used. This reminds me of the paper about the analysis of identity links
>> (http://www.w3.org/2009/12/rdf-ws/papers/ws21). However, I think that it
>> is possible to evaluate different interlinking techniques by establishing
>> some gold standard (e.g. the links between the cities of a data set
>> describing the population of European cities and a data set describing
>> the cities as tourist attractions), to be able to analyse the results in
>> terms of precision and recall, and say that one tool is able to do
>> certain things, while the other is not.
>>
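
As a rough sketch of the evaluation described above, precision and recall of a tool's output against a reference link set could be computed as below. The link identifiers (pop:..., tour:...) and the tool output are hypothetical, made up only to show the calculation.

```python
def precision_recall(found, reference):
    """Compare links produced by a tool against a reference link set."""
    found, reference = set(found), set(reference)
    true_positives = len(found & reference)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical reference links between a population data set ("pop:")
# and a tourism data set ("tour:").
reference = {
    ("pop:Paris", "tour:Paris"),
    ("pop:Berlin", "tour:Berlin"),
    ("pop:Madrid", "tour:Madrid"),
    ("pop:Rome", "tour:Rome"),
}

# Hypothetical output of one interlinking tool: two correct links, one wrong.
tool_a = {
    ("pop:Paris", "tour:Paris"),
    ("pop:Berlin", "tour:Berlin"),
    ("pop:Madrid", "tour:Rome"),
}

precision_a, recall_a = precision_recall(tool_a, reference)
```

Here tool_a gets two of its three links right (precision 2/3) and finds two of the four reference links (recall 1/2). The numbers are only as meaningful as the policy behind the reference set.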
>> Regarding the humans behind the manual definition or the reviewing
>> process of a reference interlinking, I would expect them to be
>> knowledgeable (i.e. domain experts should have been part of the process
>> at least).
>>
>>>
>>> So in asking for a "gold standard or reference interlinking", I think
>>> you are chasing a chimera.
>>> What you can do is choose datasets and then you will need to find out
>>> what the policies of the equivalence creators are; and then you will
>>> need to build your system so that it implements the same policies.
>>> By the way, policies usually relate to the way in which the dataset
>>> will be used, rather than the wishes of the publisher of the data -
>>> there is no absolute truth in this. Some would argue there is never any
>>> equivalence: "One cannot step once (sic) into the same stream"
>>> (http://en.wikipedia.org/wiki/Cratylus)
>>>
>> Thanks for the suggestion.
>>
>>> It's great you have asked the question - convincing research in this
>>> field is very challenging!
>>>
>>> Best
>>> Hugh
>>>
>> Best,
>> Cristina
>>
>
>

Received on Tuesday, 27 August 2013 17:34:35 UTC