
Re: two datasets for DBLP

From: Hugh Glaser <hg@ecs.soton.ac.uk>
Date: Thu, 28 Feb 2013 21:40:11 +0000
To: Kalpa Gunaratna <kalpagunaratna@gmail.com>
CC: "<public-lod@w3.org>" <public-lod@w3.org>
Message-ID: <387E72E216DF1247A2F8ED4819C93BA747843E28@UOS-MSG00041-SI.soton.ac.uk>
Hi Kalpa,
As the person responsible for the second site, here is an explanation.
It's quite long, but you did ask, and maybe some people will find it useful.
Firstly, DBLP is a stunning resource, and so for the rkbexplorer (and now other) services, we were keen to have their data.
Let me say that again - DBLP is a stunning resource.

So why do we take a copy of their data (which they helpfully provide) and publish it as Linked Data?
Well, we wanted it as Linked Data. But in fact there is another Linked Data site with the same data, and my best recollection was that it was already in existence when we brought up our site in what must have been about 2005.
We didn't really want to, but there was a problem with the data at source [1] .
DBLP is essentially for searching. So for their purpose, they prefer to have high recall when the name of an author is put in. That is, they are quite liberal (it seems) about whether two authors of the same name are the same person, because they don't want to miss out on any cases (false negatives).
NLP people will tell you that the price of high recall is low precision - there will be more cases where they incorrectly conflate two authors (false positives).
For the beginnings of this discussion, see http://eprints.soton.ac.uk/id/eprint/264361 .
In fact we did some analysis of the extent of the problem (http://eprints.soton.ac.uk/id/eprint/265181 ) and without too much trouble we found, in source [1], one author URI that was a conflation of 15 different people (as best we could tell).
I am not certain whether the problem came from their version of the DBLP data, or was introduced by the process of building source [1].
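
To make the recall/precision trade-off above concrete, here is a minimal sketch in Python. It is an illustration only, not the analysis from the papers linked above: the mention identifiers (m1 to m4) and the two example matchers are invented, and precision and recall are computed over pairs of author mentions that a matcher says are the same person.

```python
from itertools import combinations


def pairs_from_clusters(clusters):
    """Return every unordered pair of mentions that share a cluster."""
    pairs = set()
    for cluster in clusters:
        pairs |= {frozenset(p) for p in combinations(sorted(cluster), 2)}
    return pairs


def precision_recall(predicted_clusters, true_clusters):
    """Pairwise precision and recall of a predicted author clustering."""
    predicted = pairs_from_clusters(predicted_clusters)
    actual = pairs_from_clusters(true_clusters)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(actual) if actual else 1.0
    return precision, recall


# Hypothetical data: four mentions of the same name string, belonging
# to two different real people (m1, m2) and (m3, m4).
truth = [{"m1", "m2"}, {"m3", "m4"}]

# A liberal matcher merges everything with the same name.
liberal = [{"m1", "m2", "m3", "m4"}]

# A cautious matcher only merges what it is sure about.
cautious = [{"m1", "m2"}, {"m3"}, {"m4"}]

print(precision_recall(liberal, truth))
print(precision_recall(cautious, truth))
```

On this toy data the liberal matcher misses nothing (recall 1.0) but only a third of the pairs it asserts are correct, while the cautious matcher asserts nothing wrong (precision 1.0) but finds only half of the true pairs.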

Our purposes were more complex - we were using the information as part of a more involved knowledge processing system, which included inferring information based on the semantic relationships, and any false positives caused a knock-on effect.
For example (as best I recall, and in fact the thing that first raised the problem for us), there was a conflation of two Prof Tom Andersons - one at the University of Newcastle, UK, and another in California. So when you looked at the UK Tom Anderson, we inferred that he was funded to a large extent by the US government, and indeed we therefore inferred that the University of Newcastle was also funded by the US government to a much greater extent than it was. Further author problems then would have caused us to deduce that the University of Newcastle, UK was the same institution as the University of Newcastle, NSW, Australia.
You will therefore understand that the precision/recall needs of our application were very different from those of the DBLP site.

This situation was and is not unique to DBLP - it has been true of almost every source we have tried to use. Last time I looked, the ACM library had conflated the two Universities of Newcastle. And it is also a problem for other sites - Microsoft Academic Search has me as the same Glaser as someone who published before I was born. And last time I tried to check, I found that "Hugh Glaser" was Google unique.

So we now (periodically) download the DBLP dump, convert it to RDF and publish it as Linked Data, but with our own completely independent view of author disambiguation (we call it co-reference).
In fact, since we were doing it, we used the AKT ontology, which was more convenient to us (note to Kingsley - it isn't just another publication of the same RDF, it actually uses a completely different ontology).
So source [2] is DBLP data (which does not have URIs for authors at all, it just has strings), with our own URIs.
We generate a new, unique, URI for every author on every paper, and then do our own analysis to conflate them.
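
The two steps just described (one fresh URI per author string per paper, then a separate pass that decides which of them to merge) can be sketched roughly as follows. This is an assumption-laden illustration, not the rkbexplorer code: the base URI, the sample papers and the merge rule (`cautious_rule`: same name string plus a shared co-author) are all hypothetical stand-ins for whatever analysis is actually used.

```python
from itertools import combinations


def mint_mentions(papers, base="http://example.org/id/person-"):
    """Give every author string on every paper its own URI (hypothetical base)."""
    mentions = {}
    counter = 0
    for paper_id, author_names in papers.items():
        for name in author_names:
            counter += 1
            mentions[f"{base}{counter:05d}"] = {
                "paper": paper_id,
                "name": name,
                "coauthors": set(author_names) - {name},
            }
    return mentions


def cluster(mentions, same_person):
    """Group mention URIs with union-find, merging where same_person says so."""
    parent = {uri: uri for uri in mentions}

    def find(uri):
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path compression
            uri = parent[uri]
        return uri

    for a, b in combinations(mentions, 2):
        if same_person(mentions[a], mentions[b]):
            parent[find(a)] = find(b)

    groups = {}
    for uri in mentions:
        groups.setdefault(find(uri), set()).add(uri)
    return list(groups.values())


def cautious_rule(m1, m2):
    """Placeholder rule: identical name string and at least one shared co-author."""
    return m1["name"] == m2["name"] and bool(m1["coauthors"] & m2["coauthors"])


# Hypothetical input: author strings only, as in the DBLP dump.
papers = {
    "paper-1": ["Tom Anderson", "A. Smith"],
    "paper-2": ["Tom Anderson", "A. Smith"],
    "paper-3": ["Tom Anderson", "B. Jones"],
}

mentions = mint_mentions(papers)
clusters = cluster(mentions, cautious_rule)
```

Starting from one URI per mention means nothing is conflated by default; every merge is a deliberate decision that can be tightened or loosened by swapping the rule, which is the point of keeping the co-reference step separate from the data.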

Finally, the sameAs relations with source [1]: since the source [1] URIs for papers are safe, we establish sameAs with them. But for authors, we can't safely do that, as the follow-your-nose would suck in the incorrect information; so our system is explicitly fixed to reject such Linked Data from source [1]. And in fact, when I do http://sameas.org harvesting I avoid source [1].
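
A minimal sketch of that policy, assuming candidate links arrive tagged as "paper" or "author": sameAs links to source [1] are kept for papers and refused for authors. The host name comes from the URL quoted below; the URI paths and the function name are hypothetical, and the real system may enforce the rule quite differently.

```python
from urllib.parse import urlparse

# Host of source [1]; author URIs from here are not trusted.
UNTRUSTED_AUTHOR_HOSTS = {"dblp.l3s.de"}


def accept_sameas(kind, remote_uri):
    """Return True if a sameAs link to remote_uri is safe to assert and follow."""
    host = urlparse(remote_uri).hostname
    if kind == "author" and host in UNTRUSTED_AUTHOR_HOSTS:
        return False
    return True


# Hypothetical candidate links (the paths are made up for illustration).
candidates = [
    ("paper", "http://dblp.l3s.de/d2r/resource/publications/example-paper"),
    ("author", "http://dblp.l3s.de/d2r/resource/authors/Example_Author"),
    ("author", "http://example.org/id/person-00001"),
]

accepted = [uri for kind, uri in candidates if accept_sameas(kind, uri)]
```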

It may be that things are different now - I haven't done any checking for quite a few years.

As I say, I have gone on at some length here, but I think this is an instance of a very important issue for Linked Data applications - some would argue that much of the Linked Data cloud is derived from similar data that has been set to prefer recall over precision.

Thanks for reminding me to refresh source [2], it was very out of date!


On 27 Feb 2013, at 12:10, Kalpa Gunaratna <kalpagunaratna@gmail.com> wrote:

> Hi,
>    I am trying to do an alignment task between LOD datasets and came to see that DBLP has two different datasets hosted in two places possibly with different schemas. Following are the two URLs of them.
> http://dblp.l3s.de/d2r/ [1]
> http://dblp.rkbexplorer.com/ [2]
> Both these datasets have DBLP publications but use different schemas for presenting facts. Most of the time [1] has sameAs links to [2] and also [2] to [1]. Does anybody know why there are two datasets, or why two datasets are maintained for the same information? Is either of these complementary to the other?
> -- 
> Regards 
> Kalpa Gunaratna
Received on Thursday, 28 February 2013 21:41:26 UTC
