Re: Linked data sets for evaluating interlinking?

Hi Hugh,


> Hi Cristina,
> Some interesting issues you raise.
> One of them is how people publish links (which enables your analysis).
> There are two ways this happens.
> 1) People add triples to their dataset that have an equivalence predicate
> (owl:sameAs, skos:exactMatch, skos:closeMatch, etc.)
> 2) People use a "foreign" URI (very commonly a dbpedia URI), because when
> turning their data into RDF they have decided that the entity they are
> concerned with is the same as the dbpedia one. The second paragraph of
> Tom's message describes such a linkage, I think.
> I think these distinctions are behind the comments of Milorad, where he is
> assuming the type (2) way.
> Either of these methods should be fodder for you, and you may well find
> that the type (2) way is used by a dataset that is useful to you.
>
>
I agree, it is important to distinguish between different types of links.
When I refer to interlinking I have in mind triples (s, p, o),
where "s" and "o" are resources from different data sets, and "p" is either a
property like owl:sameAs or a domain-specific property like foaf:knows.  I
think this corresponds to what you specified in 1) and 2). I would like to
have both kinds of links in my evaluation (if possible).
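
A minimal sketch of this definition, assuming triples are available as
plain (s, p, o) string tuples and that "belongs to a data set" can be
approximated by the host part of the URI. That approximation, the function
names and the example.org URIs are illustrative assumptions, not part of any
existing tool.

```python
from urllib.parse import urlparse

# Equivalence predicates named in the thread (full URIs).
EQUIVALENCE_PREDICATES = {
    "http://www.w3.org/2002/07/owl#sameAs",
    "http://www.w3.org/2004/02/skos/core#exactMatch",
    "http://www.w3.org/2004/02/skos/core#closeMatch",
}


def host(uri):
    """Host part of a URI, used here as a rough stand-in for 'data set'."""
    return urlparse(uri).netloc


def find_links(triples, local_host):
    """Split cross-data-set triples into equivalence links and other links.

    A triple counts as a link when its subject is local and its object is
    a URI on a different host. Literal objects are skipped.
    """
    equivalence, other = [], []
    for s, p, o in triples:
        if not o.startswith(("http://", "https://")):
            continue  # literal value, not a resource
        if host(s) == local_host and host(o) != local_host:
            if p in EQUIVALENCE_PREDICATES:
                equivalence.append((s, p, o))
            else:
                other.append((s, p, o))
    return equivalence, other


# Hypothetical sample data.
triples = [
    ("http://example.org/a/Paris", "http://www.w3.org/2002/07/owl#sameAs",
     "http://dbpedia.org/resource/Paris"),
    ("http://example.org/a/alice", "http://xmlns.com/foaf/0.1/knows",
     "http://other.example.net/bob"),
    ("http://example.org/a/Paris",
     "http://www.w3.org/2000/01/rdf-schema#label", "Paris"),
    ("http://example.org/a/Paris", "http://example.org/a/partOf",
     "http://example.org/a/France"),
]
equivalence_links, other_links = find_links(triples, "example.org")
```

With this sample, the owl:sameAs triple ends up in equivalence_links and
the foaf:knows triple in other_links; the label and the internal partOf
triple are ignored.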

> It may be harder for you to process, as the linkage is not so explicit
> because there is no distinct URI for the resource in the database,
> different from the "foreign" one. But any "foreign" URI is in fact a link.
> You will find that people have tended towards type (2) linkage because
> they can shy away from having lots of equivalence predicates in their
> datasets, not least because there was a time when RDF stores did not
> comfortably do owl:sameAs inference, and so they do the linking at RDF
> conversion time, and use "foreign" URIs.
>
> Another interesting issue is more fundamental to your work.
> You seem to think that there must be a "gold standard or reference
> interlinking" for equivalence.
> As long-time readers of this list will have seen discussed many times (!),
> it is not a simple matter.
>
> It is a complex matter to have such a thing, which is a necessity for you
> to do your precision/recall statistics.
> At its most basic, for example, am I as a private citizen the same as me
> as a member of my University or me as a member of my company?
> The answer is, of course yes and no.
> Another field that has spent a lot of time on this is the FRBR world
> (http://en.wikipedia.org/wiki/Functional_Requirements_for_Bibliographic_Records).
> If I have a book of the Semantic Web, is it the same as your book of the
> same name?
> Perhaps. What if it is a different (corrected) edition? An electronic
> version?
> Certainly a library will usually consider each book a different thing, but
> if you are asking how many books the author has published, you want to
> treat all the books as the same resource.
>
I understand the point, and I find it very interesting indeed. I guess
that it might depend on the context in which the data was created or will
be used. This reminds me of the paper about the analysis of identity links
(http://www.w3.org/2009/12/rdf-ws/papers/ws21). However, I think that it is
possible to evaluate different interlinking techniques by establishing some
gold standard (e.g. the links between the cities of a data set describing
the population of European cities and a data set describing the cities as
tourist attractions). That would let me analyse the results in terms of
precision and recall, and say that one tool is able to do certain things
while another is not.
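
As a rough sketch of that evaluation, assuming both the tool output and the
reference interlinking are given as sets of ordered (source URI, target URI)
pairs. The function name and the city URIs below are hypothetical, and a
real comparison might also need to treat symmetric predicates such as
owl:sameAs as unordered.

```python
def precision_recall(found_links, reference_links):
    """Compare the links a tool found against a reference interlinking.

    precision = correct links / links found by the tool
    recall    = correct links / links in the reference
    """
    found = set(found_links)
    reference = set(reference_links)
    correct = found & reference
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical reference: four cities linked between a population data set
# and a tourism data set.
reference = {
    ("http://pop.example.org/Paris", "http://tour.example.org/Paris"),
    ("http://pop.example.org/Rome", "http://tour.example.org/Rome"),
    ("http://pop.example.org/Vienna", "http://tour.example.org/Vienna"),
    ("http://pop.example.org/Madrid", "http://tour.example.org/Madrid"),
}

# Hypothetical tool output: two correct links and one wrong one.
found = {
    ("http://pop.example.org/Paris", "http://tour.example.org/Paris"),
    ("http://pop.example.org/Rome", "http://tour.example.org/Rome"),
    ("http://pop.example.org/Vienna", "http://tour.example.org/Madrid"),
}

precision, recall = precision_recall(found, reference)
```

Here two of the three links found are in the reference (precision 2/3), and
two of the four reference links were found (recall 1/2). The numbers are
only meaningful relative to the policy behind the chosen reference.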

Regarding the people who manually define or review a reference
interlinking, I would expect them to be knowledgeable (i.e. domain experts
should at least have been part of the process).

>
> So in asking for a "gold standard or reference interlinking", I think you
> are chasing a chimera.
> What you can do is choose datasets and then you will need to find out what
> the policies of the equivalence creators are; and then you will need to
> build your system so that it implements the same policies.
> By the way, policies usually relate to the way in which the dataset will
> be used, rather than the wishes of the publisher of the data - there is no
> absolute truth in this. Some would argue there is never any equivalence:
> "One cannot step once (sic) into the same stream"
> (http://en.wikipedia.org/wiki/Cratylus)
>
Thanks for the suggestion.

> It's great you have asked the question - convincing research in this field
> is very challenging!
>
> Best
> Hugh
>
Best,
Cristina

Received on Tuesday, 27 August 2013 15:58:02 UTC