Re: dbpedia + disambiguation pages from Martin Hepp (UIBK) on 2007-11-11 (semantic-web@w3.org from November 2007)

From: Martin Hepp (UIBK) <martin.hepp@uibk.ac.at>
Date: Sun, 11 Nov 2007 10:16:05 +0100
To: Richard Cyganiak <richard@cyganiak.de>
CC: Chris Richard <chris.richard@gmail.com>, Kingsley Idehen <kidehen@openlinksw.com>, Chris Bizer <chris@bizer.de>, Semantic Web <semantic-web@w3.org>, dbpedia-discussion@lists.sourceforge.net
Message-ID: <4736C855.5080503@uibk.ac.at>
Hi all:

 > Note that redirects are often not synonyms, but artifacts of Wikipedia's
 > evolution. Redirects contain things like misspelled names, names that
 > adhere to older naming conventions (e.g. the original WikiWords
 > CamelCase naming convention), instances where multiple articles were
 > folded into one etc. Thus they make poor labels.

A bit late, but here is some related input: in the course of our paper [1], we did a 
quantitative analysis of redirects in the English Wikipedia. Here are the 
results in a nutshell:

"Redirection Pages

•    78% of the redirection pages are obvious synonyms (in particular 
spelling variants or changes in word order of composite words),
•    12% reflect pages for which the content was integrated into other 
pages,
•    for 10%, we could not quickly identify the semantic relationship 
(we also did not try very hard ;-)).

With regard to the impact on our analysis, we can observe the following: 
First, for the vast majority (78%) of all URIs that represent 
redirects, there is no semantic difference, since they are synonyms. For 
22% (10% + 12%) of the redirects, semantic differences between the 
original URI and the target of the redirect cannot be excluded. In 12% 
of the cases, the redirect points to a page that incorporates the 
original content in a larger article."

See http://www.heppnetz.de/harvesting-wikipedia/ for more information 
and [1] for the full paper (also available for download on that page).

Best
Martin

[1] Martin Hepp, Katharina Siorpaes, Daniel Bachlechner: Harvesting Wiki 
Consensus: Using Wikipedia Entries as Vocabulary for Knowledge 
Management, IEEE Internet Computing, Vol. 11, No. 5, pp. 54-65, Sept-Oct 
2007. Available at http://www.heppnetz.de/harvesting-wikipedia/
----------------------------------
martin hepp, http://www.heppnetz.de
mhepp@computer.org



Richard Cyganiak wrote:
> 
> Chris,
> 
> Since your question is quite specific to DBpedia, let's continue the 
> discussion at the DBpedia mailing list (see 
> http://dbpedia.org/docs/#support and CC). Please consider removing 
> semantic-web@w3.org from the CC list for further replies.
> 
> On 4 Aug 2007, at 20:18, Chris Richard wrote:
>> Have you done any thinking about extracting disambiguation information 
>> from disambiguation pages?
> 
> No, we currently don't do any special processing for Wikipedia's 
> disambiguation pages.
> 
> The main focus of DBpedia is extraction of information about the 
> *things* described in Wikipedia articles, to enable domain queries over 
> this information. Disambiguation information isn't really about those 
> things, it's about the names we use to refer to those things. 
> (Specifically, when a single name could refer to more than one of those 
> things.) So it's more linguistic in nature, and hasn't registered 
> prominently on our priority list.
> 
>> I was working on a similar project to extract structured info from 
>> wikipedia.org to be used as the basis for a sem-web project (until I 
>> came across dbpedia.org), and this is one thing I was targeting that I 
>> couldn't find any mention of on dbpedia.org.
>>
>> I extract all the list items from a particular disambiguation page and 
>> perform some basic processing to try and determine the disambiguated 
>> article/concept. The Apple disambiguation page is a good example of 
>> some of the different styles of information you get:
>>
>> 1. Apple Brook, a British actress
>>
>> Simple to extract a mapping between the ambiguous "Apple" and Apple 
>> Brook, along with a potentially useful single sentence abstract.
>>
>> 2.
>> Apple (album), an album by Mother Love Bone
>>
>> or
>>
>> Ariane Passenger Payload Experiment, an Indian experimental 
>> communication satellite with a C-Band transponder launched in 1981.
>>
>> Multiple links, so it's not immediately obvious which one is the 
>> disambiguated concept, but you can imagine heuristics to make 
>> connections here.
> 
> I think that a large part of the disambiguation information could be 
> captured using relatively simple heuristics. There's no need to capture 
> everything, 80% might be “good enough”.
> 
> The DBpedia codebase has pluggable “Extractors” that produce RDF triples 
> from an article's source code; this would be yet another extractor.
> 
>> 3. any of the computers made by Apple Inc. since 1976, notably the 
>> Apple Macintosh
>>
>> Somewhat unclear disambiguation, potentially difficult to extract the 
>> correct relationship.
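The three styles above suggest a simple first-pass heuristic, sketched here in Python. This is only an illustration under stated assumptions, not DBpedia's extractor code: the function names, the regular expressions, and the sample wikitext (hand-written from the examples in this thread, not copied from the actual page source) are all hypothetical. The rule is: if a list item begins with a link, treat that link as the disambiguated article and the rest of the item as a one-line gloss; otherwise skip the item.

```python
import re

# A wiki link at the very start of a list item: [[Target]] or [[Target|text]].
LEADING_LINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")
# Any wiki link, used to reduce link markup to its display text.
ANY_LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")

# Hand-written sample input (an assumption, not the real page source).
SAMPLE = """\
'''Apple''' may refer to:
* [[Apple Brook]], a British actress
* [[Apple (album)]], an album by [[Mother Love Bone]]
* any of the computers made by [[Apple Inc.]] since 1976, notably the [[Apple Macintosh]]
"""


def extract_candidates(wikitext):
    """Return (target, gloss) pairs for list items that start with a link."""
    pairs = []
    for line in wikitext.splitlines():
        line = line.strip()
        if not line.startswith("*"):
            continue  # only list items are considered
        item = line.lstrip("*").strip()
        match = LEADING_LINK.match(item)
        if match is None:
            continue  # item does not begin with a link (style 3): skip it
        target = match.group(1).strip()
        rest = item[match.end():]
        # Replace remaining links by their display text, trim punctuation.
        gloss = ANY_LINK.sub(r"\1", rest).lstrip(" ,").strip()
        pairs.append((target, gloss))
    return pairs


if __name__ == "__main__":
    for target, gloss in extract_candidates(SAMPLE):
        print(target, "|", gloss)
```

With the sample input, the first two items yield a target plus gloss, and the third item (style 3) is skipped rather than guessed at, which fits the "80% might be good enough" approach.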
>>
>> I haven't done a lot of thinking about the proper way to represent 
>> these relationships in RDF, I was just writing back to a custom DB 
>> schema for now,
> 
> I don't know how to represent this in RDF. DBpedia defines one resource 
> from each Wikipedia article, assuming that the topic of each article is 
> some meaningful entity in the real world. This certainly doesn't hold 
> for disambiguation pages, whose topic is not a single thing, but a 
> multitude of things that happen to be related to some name, word, or term.
> 
>> but I think the information is highly valuable.
> 
> Can you give us some examples where you think this information could be 
> used?
> 
>>  Also, similar to this, but easier to extract, is the synonym 
>> information stored in the redirect links; are you currently extracting 
>> multiple rdfs:label-s based on these redirects?
> 
> The next update will include dbpedia:redirectsTo triples for redirected 
> articles.
> 
> Note that redirects are often not synonyms, but artifacts of Wikipedia's 
> evolution. Redirects contain things like misspelled names, names that 
> adhere to older naming conventions (e.g. the original WikiWords 
> CamelCase naming convention), instances where multiple articles were 
> folded into one etc. Thus they make poor labels.
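To make the difference between a redirect triple and a label concrete, here is a minimal Python sketch that turns a redirect page into one N-Triples line instead of an rdfs:label. It is an illustration only: the function names are hypothetical, the full URI behind dbpedia:redirectsTo is an assumed placeholder (the thread only gives the prefixed name), and the example input is made up for the demonstration.

```python
import re
from urllib.parse import quote

# Redirect pages start with "#REDIRECT [[Target]]" (case-insensitive).
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*:?\s*\[\[([^\]|#]+)", re.IGNORECASE)

RESOURCE_NS = "http://dbpedia.org/resource/"
# Assumption: placeholder expansion of dbpedia:redirectsTo, not confirmed here.
REDIRECTS_TO = "http://dbpedia.org/property/redirectsTo"


def resource_uri(title):
    """Build a resource URI from an article title (spaces become underscores)."""
    return RESOURCE_NS + quote(title.strip().replace(" ", "_"))


def redirect_triple(title, wikitext):
    """Return an N-Triples line for a redirect page, or None for other pages."""
    match = REDIRECT_RE.match(wikitext)
    if match is None:
        return None
    return "<%s> <%s> <%s> ." % (
        resource_uri(title),
        REDIRECTS_TO,
        resource_uri(match.group(1)),
    )


if __name__ == "__main__":
    # Illustrative input only.
    print(redirect_triple("Apple Computer", "#REDIRECT [[Apple Inc.]]"))
```

Keeping the redirect as its own relation leaves it to the consumer to decide whether a given redirect is a usable synonym or just an artifact.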
> 
> Cheers,
> Richard
> 
> 
> 
>>
>> If you have a minute let me know your thoughts on this.
>>
>> Chris
>>
>>
> 
> 
>
Received on Monday, 12 November 2007 05:27:40 UTC