Re: dbpedia + disambiguation pages from Richard Cyganiak on 2007-08-05 (semantic-web@w3.org from August 2007)

From: Richard Cyganiak <richard@cyganiak.de>
Date: Sun, 5 Aug 2007 13:25:30 +0200
To: Chris Richard <chris.richard@gmail.com>
Cc: Kingsley Idehen <kidehen@openlinksw.com>, Chris Bizer <chris@bizer.de>, Semantic Web <semantic-web@w3.org>, dbpedia-discussion@lists.sourceforge.net
Message-Id: <CF15405B-037C-4C48-B39B-1C1A2C4FE10B@cyganiak.de>

Chris,

Since your question is quite specific to DBpedia, let's continue the
discussion on the DBpedia mailing list (see
http://dbpedia.org/docs/#support and CC). Please consider removing
semantic-web@w3.org from the CC list for further replies.

On 4 Aug 2007, at 20:18, Chris Richard wrote:
> Have you done any thinking about extracting disambiguation  
> information from disambiguation pages?

No, we currently don't do any special processing for Wikipedia's  
disambiguation pages.

The main focus of DBpedia is extraction of information about the  
*things* described in Wikipedia articles, to enable domain queries  
over this information. Disambiguation information isn't really about  
those things, it's about the names we use to refer to those things.  
(Specifically, when a single name could refer to more than one of  
those things.) So it's more linguistic in nature, and hasn't  
registered prominently on our priority list.

> I was working on a similar project to extract structured info from  
> wikipedia.org to be used as the basis for a sem-web project (until  
> I came across dbpedia.org), and this is one thing I was targeting  
> that I couldn't find any mention of on dbpedia.org.
>
> I extract all the list items from a particular disambiguation page  
> and perform some basic processing to try and determine the  
> disambiguated article/concept. The Apple disambiguation page is a  
> good example of some of the different styles of information you get:
>
> 1. Apple Brook, a British actress
>
> Simple to extract a mapping between the ambiguous "Apple" and Apple  
> Brook, along with a potentially useful single sentence abstract.
>
> 2.
> Apple (album), an album by Mother Love Bone
>
> or
>
> Ariane Passenger Payload Experiment, an Indian experimental  
> communication satellite with a C-Band transponder launched in 1981.
>
> Multiple links, so it's not immediately obvious which one is the  
> disambiguated concept, but you can imagine heuristics to make  
> connections here.

I think that a large part of the disambiguation information could be
captured using relatively simple heuristics. There's no need to
capture everything; 80% might be “good enough”.

The DBpedia codebase has pluggable “Extractors” that produce RDF  
triples from an article's source code; this would be yet another  
extractor.
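
A minimal sketch of one such heuristic, in Python rather than in the
DBpedia codebase: it keeps only the list items that begin with a wiki
link and treats that first link as the disambiguated article. The
function names, the sample wikitext, and the output tuples are
assumptions made for illustration; they are not part of the DBpedia
extractor interface.

```python
import re

# Matches [[Target]], [[Target#Section]] or [[Target|label]] and
# captures the target article title.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def strip_links(text):
    """Replace [[Target]] and [[Target|label]] with their visible text."""
    return re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)


def extract_disambiguations(term, wikitext):
    """Return (term, target, description) for list items that lead with a link.

    Items that do not start with a link (case 3 in the quoted message)
    are skipped instead of guessed at.
    """
    results = []
    for line in wikitext.splitlines():
        if not line.startswith("*"):
            continue  # not a list item
        item = line.lstrip("*").strip()
        match = WIKILINK.match(item)
        if match is None:
            continue  # unclear item, leave it out
        target = match.group(1).strip()
        description = strip_links(item[match.end():].lstrip(" ,")).strip()
        results.append((term, target, description))
    return results


# Hypothetical wikitext for part of the "Apple" disambiguation page.
SAMPLE = "\n".join([
    "'''Apple''' may refer to:",
    "* [[Apple Brook]], a British actress",
    "* [[Apple (album)]], an album by [[Mother Love Bone]]",
    "* any of the computers made by [[Apple Inc.]] since 1976, "
    "notably the [[Apple Macintosh]]",
])

for row in extract_disambiguations("Apple", SAMPLE):
    print(row)
```

Skipping the items that do not lead with a link trades recall for
precision, in line with the "80%" target above.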

> 3. any of the computers made by Apple Inc. since 1976, notably the  
> Apple Macintosh
>
> Somewhat unclear disambiguation, potentially difficult to extract  
> the correct relationship.
>
> I haven't done a lot of thinking about the proper way to represent  
> these relationships in RDF, I was just writing back to a custom DB  
> schema for now,

I don't know how to represent this in RDF. DBpedia defines one
resource for each Wikipedia article, assuming that the topic of each
article is some meaningful entity in the real world. This certainly  
doesn't hold for disambiguation pages, whose topic is not a single  
thing, but a multitude of things that happen to be related to some  
name, word, or term.

> but I think the information is highly valuable.

Can you give us some examples where you think this information could  
be used?

> Also, similar to this, but easier to extract, is the synonym
> information stored in the redirect links; are you currently  
> extracting multiple rdfs:label-s based on these redirects?

The next update will include dbpedia:redirectsTo triples for  
redirected articles.

Note that redirects are often not synonyms, but artifacts of  
Wikipedia's evolution. Redirects contain things like misspelled  
names, names that adhere to older naming conventions (e.g. the  
original WikiWords CamelCase naming convention), instances where
multiple articles were folded into one, etc. Thus they make poor labels.
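
A minimal sketch of keeping a redirect as a separate triple instead of
a label, assuming the redirect page's source starts with the usual
"#REDIRECT [[Target]]" line. The property URI below is an assumed
expansion of dbpedia:redirectsTo, the resource URI scheme is likewise
an assumption, and the function names are hypothetical.

```python
import re
from urllib.parse import quote

# Assumed namespaces; the message only gives the prefixed name
# dbpedia:redirectsTo.
RESOURCE_NS = "http://dbpedia.org/resource/"
REDIRECTS_TO = "http://dbpedia.org/property/redirectsTo"

# Captures the target title of a "#REDIRECT [[Target]]" line.
REDIRECT = re.compile(r"^\s*#REDIRECT\s*:?\s*\[\[([^\]|#]+)", re.IGNORECASE)


def resource_uri(title):
    """Build a resource URI from an article title (spaces become underscores)."""
    return RESOURCE_NS + quote(title.strip().replace(" ", "_"), safe="_(),'")


def redirect_triple(title, wikitext):
    """Return one N-Triples line for a redirect page, or None otherwise."""
    match = REDIRECT.match(wikitext)
    if match is None:
        return None
    return "<%s> <%s> <%s> ." % (
        resource_uri(title),
        REDIRECTS_TO,
        resource_uri(match.group(1)),
    )


# An old CamelCase title redirecting to the current article name.
print(redirect_triple("AppleComputer", "#REDIRECT [[Apple Inc.]]"))
```

Because the redirect stays a separate statement, a consumer can still
decide case by case whether a given source title is usable as a label.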

Cheers,
Richard

>
> If you have a minute let me know your thoughts on this.
>
> Chris
>
>

Received on Sunday, 5 August 2007 11:25:34 UTC