- From: Martin Hepp (UIBK) <martin.hepp@uibk.ac.at>
- Date: Sun, 11 Nov 2007 10:16:05 +0100
- To: Richard Cyganiak <richard@cyganiak.de>
- CC: Chris Richard <chris.richard@gmail.com>, Kingsley Idehen <kidehen@openlinksw.com>, Chris Bizer <chris@bizer.de>, Semantic Web <semantic-web@w3.org>, dbpedia-discussion@lists.sourceforge.net
Hi all:

> Note that redirects are often not synonyms, but artifacts of Wikipedia's
> evolution. Redirects contain things like misspelled names, names that
> adhere to older naming conventions (e.g. the original WikiWords
> CamelCase naming convention), instances where multiple articles were
> folded into one etc. Thus they make poor labels.

A bit late, but some related input: In the course of our paper [1], we did a quantitative analysis of redirects in Wikipedia (English). Here are the results in a nutshell:

"Redirection Pages
• 78% of the redirection pages are obvious synonyms (in particular spelling variants or changes in word order of composite words),
• 12 % reflect pages for which the content was integrated into other pages,
• for 10%, we could not quickly identify the semantic relationship (we also did not try very hard ;-)).

With regard to the impact on our analysis, we can observe the following: First, for the vast majority (78%) of all URI’s that represent redirects, there is no semantic difference, since they are synonyms. For 22% (10 + 12 %) of the redirects, semantic differences between the original URI and the target of the redirect cannot be excluded. In 12 % of the cases, the redirect points to a page that incorporates the original content in a larger article."

See http://www.heppnetz.de/harvesting-wikipedia/ for more information and [1] for the full paper (also available for download on that page).

Best
Martin

[1] Martin Hepp, Katharina Siorpaes, Daniel Bachlechner: Harvesting Wiki Consensus: Using Wikipedia Entries as Vocabulary for Knowledge Management, IEEE Internet Computing, Vol. 11, No. 5, pp. 54-65, Sept-Oct 2007.
Available at http://www.heppnetz.de/harvesting-wikipedia/

----------------------------------
martin hepp, http://www.heppnetz.de
mhepp@computer.org

Richard Cyganiak wrote:
>
> Chris,
>
> Since your question is quite specific to DBpedia, let's continue the
> discussion at the DBpedia mailing list (see
> http://dbpedia.org/docs/#support and CC). Please consider removing
> semantic-web@w3.org from the CC list for further replies.
>
> On 4 Aug 2007, at 20:18, Chris Richard wrote:
>> Have you done any thinking about extracting disambiguation information
>> from disambiguation pages?
>
> No, we currently don't do any special processing for Wikipedia's
> disambiguation pages.
>
> The main focus of DBpedia is extraction of information about the
> *things* described in Wikipedia articles, to enable domain queries over
> this information. Disambiguation information isn't really about those
> things, it's about the names we use to refer to those things.
> (Specifically, when a single name could refer to more than one of those
> things.) So it's more linguistic in nature, and hasn't registered
> prominently on our priority list.
>
>> I was working on a similar project to extract structured info from
>> wikipedia.org to be used as the basis for a sem-web project (until I
>> came across dbpedia.org), and this is one thing I was targeting that I
>> couldn't find any mention of on dbpedia.org.
>>
>> I extract all the list items from a particular disambiguation page and
>> perform some basic processing to try and determine the disambiguated
>> article/concept. The Apple disambiguation page is a good example of
>> some of the different styles of information you get:
>>
>> 1. Apple Brook, a British actress
>>
>> Simple to extract a mapping between the ambiguous "Apple" and Apple
>> Brook, along with a potentially useful single sentence abstract.
>>
>> 2.
>> Apple (album), an album by Mother Love Bone
>>
>> or
>>
>> Ariane Passenger Payload Experiment, an Indian experimental
>> communication satellite with a C-Band transponder launched in 1981.
>>
>> Multiple links, so it's not immediately obvious which one is the
>> disambiguated concept, but you can imagine heuristics to make
>> connections here.
>
> I think that a large part of the disambiguation information could be
> captured using relatively simple heuristics. There's no need to capture
> everything, 80% might be “good enough”.
>
> The DBpedia codebase has pluggable “Extractors” that produce RDF triples
> from an article's source code; this would be yet another extractor.
>
>> 3. any of the computers made by Apple Inc. since 1976, notably the
>> Apple Macintosh
>>
>> Somewhat unclear disambiguation, potentially difficult to extract the
>> correct relationship.
>>
>> I haven't done a lot of thinking about the proper way to represent
>> these relationships in RDF, I was just writing back to a custom DB
>> schema for now,
>
> I don't know how to represent this in RDF. DBpedia defines one resource
> from each Wikipedia article, assuming that the topic of each article is
> some meaningful entity in the real world. This certainly doesn't hold
> for disambiguation pages, whose topic is not a single thing, but a
> multitude of things that happen to be related to some name, word, or term.
>
>> but I think the information is highly valuable.
>
> Can you give us some examples where you think this information could be
> used?
>
>> Also, similar to this, but easier to extract, is the synonym
>> information stored in the redirect links; are you currently extracting
>> multiple rdfs:label-s based on these redirects?
>
> The next update will include dbpedia:redirectsTo triples for redirected
> articles.
>
> Note that redirects are often not synonyms, but artifacts of Wikipedia's
> evolution.
> Redirects contain things like misspelled names, names that
> adhere to older naming conventions (e.g. the original WikiWords
> CamelCase naming convention), instances where multiple articles were
> folded into one etc. Thus they make poor labels.
>
> Cheers,
> Richard
>
>>
>> If you have a minute let me know your thoughts on this.
>>
>> Chris
Received on Monday, 12 November 2007 05:27:40 UTC
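
For readers following the thread, the "relatively simple heuristics" for disambiguation list items could be sketched roughly as below. This is not DBpedia's extractor code and not Chris's implementation; it is a minimal illustration that assumes each list item is available as raw wikitext with `[[Target]]` or `[[Target|label]]` links. The names `guess_target`, `strip_markup`, and `Disambiguation` are hypothetical. The rule it applies is deliberately conservative: an item is trusted only when it starts with a link, which covers the "Apple Brook" and "Apple (album)" styles and skips the unclear "any of the computers made by Apple Inc." style.

```python
import re
from typing import NamedTuple, Optional

# Matches [[Target]] or [[Target|label]]. Wikitext input is an assumption;
# the thread does not say which representation was actually parsed.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


class Disambiguation(NamedTuple):
    term: str      # the ambiguous name, e.g. "Apple"
    target: str    # article title the list item most likely refers to
    abstract: str  # the list item as plain text, usable as a short abstract


def strip_markup(item: str) -> str:
    """Replace each wiki link with its visible text."""
    return WIKILINK.sub(lambda m: m.group(2) or m.group(1), item).strip()


def guess_target(term: str, item: str) -> Optional[Disambiguation]:
    """Hypothetical heuristic: trust an item only if it starts with a link."""
    text = item.lstrip("* ").strip()
    first = WIKILINK.match(text)
    if first is None:
        # The item opens with prose, so the intended target is unclear.
        return None
    return Disambiguation(term, first.group(1).strip(), strip_markup(text))


if __name__ == "__main__":
    items = [
        "* [[Apple Brook]], a British actress",
        "* [[Apple (album)]], an album by [[Mother Love Bone]]",
        "* any of the computers made by [[Apple Inc.]] since 1976, "
        "notably the [[Apple Macintosh]]",
    ]
    for item in items:
        print(guess_target("Apple", item))
```

Returning `None` for items that open with prose trades coverage for precision, in the spirit of the "80% might be good enough" remark; a fuller heuristic might instead prefer the link whose title contains the ambiguous term.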