- From: Richard Cyganiak <richard@cyganiak.de>
- Date: Sun, 5 Aug 2007 13:25:30 +0200
- To: Chris Richard <chris.richard@gmail.com>
- Cc: Kingsley Idehen <kidehen@openlinksw.com>, Chris Bizer <chris@bizer.de>, Semantic Web <semantic-web@w3.org>, dbpedia-discussion@lists.sourceforge.net
Chris,

Since your question is quite specific to DBpedia, let's continue the discussion at the DBpedia mailing list (see http://dbpedia.org/docs/#support and CC). Please consider removing semantic-web@w3.org from the CC list for further replies.

On 4 Aug 2007, at 20:18, Chris Richard wrote:
> Have you done any thinking about extracting disambiguation
> information from disambiguation pages?

No, we currently don't do any special processing for Wikipedia's disambiguation pages. The main focus of DBpedia is extraction of information about the *things* described in Wikipedia articles, to enable domain queries over this information. Disambiguation information isn't really about those things; it's about the names we use to refer to those things. (Specifically, when a single name could refer to more than one of those things.) So it's more linguistic in nature, and hasn't registered prominently on our priority list.

> I was working on a similar project to extract structured info from
> wikipedia.org to be used as the basis for a sem-web project (until
> I came across dbpedia.org), and this is one thing I was targeting
> that I couldn't find any mention of on dbpedia.org.
>
> I extract all the list items from a particular disambiguation page
> and perform some basic processing to try and determine the
> disambiguated article/concept. The Apple disambiguation page is a
> good example of some of the different styles of information you get:
>
> 1. Apple Brook, a British actress
>
> Simple to extract a mapping between the ambiguous "Apple" and Apple
> Brook, along with a potentially useful single sentence abstract.
>
> 2.
> Apple (album), an album by Mother Love Bone
>
> or
>
> Ariane Passenger Payload Experiment, an Indian experimental
> communication satellite with a C-Band transponder launched in 1981.
>
> Multiple links, so it's not immediately obvious which one is the
> disambiguated concept, but you can imagine heuristics to make
> connections here.
I think that a large part of the disambiguation information could be captured using relatively simple heuristics. There's no need to capture everything; 80% might be “good enough”. The DBpedia codebase has pluggable “Extractors” that produce RDF triples from an article's source code; this would be yet another extractor.

> 3. any of the computers made by Apple Inc. since 1976, notably the
> Apple Macintosh
>
> Somewhat unclear disambiguation, potentially difficult to extract
> the correct relationship.
>
> I haven't done a lot of thinking about the proper way to represent
> these relationships in RDF, I was just writing back to a custom DB
> schema for now,

I don't know how to represent this in RDF. DBpedia defines one resource from each Wikipedia article, assuming that the topic of each article is some meaningful entity in the real world. This certainly doesn't hold for disambiguation pages, whose topic is not a single thing, but a multitude of things that happen to be related to some name, word, or term.

> but I think the information is highly valuable.

Can you give us some examples where you think this information could be used?

> Also, similar to this, but easier to extract, is the synonym
> information stored in the redirect links; are you currently
> extracting multiple rdfs:label-s based on these redirects?

The next update will include dbpedia:redirectsTo triples for redirected articles. Note that redirects are often not synonyms, but artifacts of Wikipedia's evolution. Redirects contain things like misspelled names, names that adhere to older naming conventions (e.g. the original WikiWords CamelCase naming convention), instances where multiple articles were folded into one, etc. Thus they make poor labels.

Cheers,
Richard

>
> If you have a minute let me know your thoughts on this.
>
> Chris
>
>
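For illustration only, the kind of "relatively simple heuristic" discussed above could be sketched roughly as follows. This is not DBpedia code and does not use the DBpedia Extractor interface; the function names, the sample wikitext, and the "first link wins" rule are all assumptions made for this sketch. It takes the first wiki link in each list item as the candidate target and records whether that link opens the item, since a leading link (as in the Apple Brook case) is more likely to be the disambiguated concept than a link buried mid-sentence (as in the Apple Inc. case).

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#section|label]];
# group 1 is the link target without section or label.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def strip_markup(text):
    """Replace wiki links with their visible label and drop bold/italic quotes."""
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    return text.replace("'''", "").replace("''", "").strip()


def extract_candidates(term, wikitext):
    """Return (term, target, description, leading) tuples, one per list item.

    Hypothetical heuristic: the first link in a list item is taken as the
    candidate target; `leading` is True when that link starts the item.
    """
    results = []
    for line in wikitext.splitlines():
        if not line.startswith("*"):
            continue  # only bullet items are considered
        item = line.lstrip("*").strip()
        match = WIKILINK.search(item)
        if match is None:
            continue  # no link, nothing to map the term to
        target = match.group(1).strip()
        leading = match.start() == 0
        results.append((term, target, strip_markup(item), leading))
    return results


# Made-up wikitext modelled on the three styles quoted in this thread.
SAMPLE = """'''Apple''' may refer to:
* [[Apple Brook]], a British actress
* [[Apple (album)]], an album by [[Mother Love Bone]]
* any of the computers made by [[Apple Inc.]] since 1976, notably the [[Apple Macintosh]]
"""

if __name__ == "__main__":
    for row in extract_candidates("Apple", SAMPLE):
        print(row)
```

Items where `leading` is False could be dropped or given lower confidence, which fits the "80% might be good enough" idea; how the resulting pairs would be expressed in RDF is left open here, as it is in the message.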
Received on Sunday, 5 August 2007 11:25:34 UTC