dbpedia + disambiguation pages

Hi all,

While browsing dbpedia.org I recognized your names from the sem-web mailing
list and wanted to send along a question.

Have you done any thinking about extracting disambiguation information from
disambiguation pages? I was working on a similar project to extract
structured info from wikipedia.org to be used as the basis for a sem-web
project (until I came across dbpedia.org), and this is one thing I was
targeting that I couldn't find any mention of on dbpedia.org.

I extract all the list items from a particular disambiguation page and
perform some basic processing to try and determine the disambiguated
article/concept. The Apple disambiguation page
<http://en.wikipedia.org/wiki/Apple_%28disambiguation%29> is a good example
of some of the different styles of information you get:

1. Apple Brook <http://en.wikipedia.org/wiki/Apple_Brook>, a British actress

It is simple to extract a mapping between the ambiguous "Apple" and Apple
Brook, along with a potentially useful single-sentence abstract.

2. *Apple* (album) <http://en.wikipedia.org/wiki/Apple_%28album%29>, an album
by Mother Love Bone <http://en.wikipedia.org/wiki/Mother_Love_Bone>

or

Ariane Passenger Payload Experiment
<http://en.wikipedia.org/wiki/Ariane_Passenger_Payload_Experiment>,
an Indian experimental communication satellite with a C-Band transponder
<http://en.wikipedia.org/wiki/Transponder> launched in 1981.

Multiple links, so it's not immediately obvious which one is the
disambiguated concept, but you can imagine heuristics to make connections
here.

3. any of the *computers* made by Apple Inc.
<http://en.wikipedia.org/wiki/Apple_Inc.> since 1976
<http://en.wikipedia.org/wiki/1976>, notably the Apple Macintosh
<http://en.wikipedia.org/wiki/Apple_Macintosh>

Somewhat unclear disambiguation, potentially difficult to extract the
correct relationship.
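
One possible heuristic for the multi-link items above, sketched in Python.
The function names and the rule itself (prefer the first link whose title
contains the ambiguous term, otherwise take the first link) are illustrative
assumptions, not a tested method:

```python
from urllib.parse import unquote


def title_from_url(url):
    """Return the article title encoded in a Wikipedia URL."""
    return unquote(url.rsplit("/wiki/", 1)[-1]).replace("_", " ")


def pick_target(term, links):
    """Guess which link in a list item is the disambiguated article.

    Assumed heuristic: prefer the first link whose title contains the
    ambiguous term; otherwise fall back to the first link in the item.
    """
    if not links:
        return None
    for url in links:
        if term.lower() in title_from_url(url).lower():
            return url
    return links[0]
```

For item 3 this rule would pick Apple Inc., which is arguably not the concept
being disambiguated, so a real extractor would need something stronger for
that style of entry.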

I haven't done a lot of thinking about the proper way to represent these
relationships in RDF; I was just writing back to a custom DB schema for now.
Still, I think the information is highly valuable.
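
As a rough illustration of what an RDF representation could look like, here
is a minimal Python sketch that emits N-Triples. The predicate
http://example.org/vocab#disambiguates is a hypothetical placeholder, not an
existing vocabulary term, and using rdfs:comment for the one-line gloss is
likewise just an assumption:

```python
# Hypothetical predicate, used for illustration only.
EX_DISAMBIGUATES = "http://example.org/vocab#disambiguates"
RDFS_COMMENT = "http://www.w3.org/2000/01/rdf-schema#comment"


def escape_literal(text):
    """Escape backslashes, double quotes and newlines for an N-Triples literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def to_ntriples(page_uri, target_uri, gloss=None):
    """Return N-Triples lines linking a disambiguation page to one target."""
    lines = [f"<{page_uri}> <{EX_DISAMBIGUATES}> <{target_uri}> ."]
    if gloss:
        lines.append(
            f'<{target_uri}> <{RDFS_COMMENT}> "{escape_literal(gloss)}"@en .'
        )
    return lines
```

Calling to_ntriples with the page URI, the Apple Brook URI and the gloss
"a British actress" would yield one linking triple plus one comment triple.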

Similar to this, but easier to extract, is the synonym information stored in
the redirect links. Are you currently extracting multiple rdfs:label values
based on these redirects?

If you have a minute let me know your thoughts on this.

Chris

Received on Saturday, 4 August 2007 18:18:46 UTC