Fwd: dbpedia + disambiguation pages

I thought this might be interesting to folks here, and as an input to  
the RI note.

I haven't had a chance to read the paper yet, just this email. There
are some good analogies and disanalogies to be found. OK, it's "only"
one site, but Wikipedia behaves, in many ways, like the open web:
minting and evolution are decentralized. "Authority", however, is
weaker: the content of each page can have arbitrary authors, and
while editors have some powers, there are strong rules against their
using those powers to overconstrain content.

One big difference is that disagreement and variance can easily be  
accumulated on the canonical page. (So, even for terms that are *not*  
strictly synonyms, redirecting is ok; I worry a little about the  
blithe comfort drawn from synonymy, since, e.g., connotation may be  
obliterated.) There is heavy use of disambiguation pages. That
would be an interesting tactic in building big ontologies: some
terms have "disambiguation axioms"...i.e., if there were controversy
or independent evolution, you could pop in some sense and conflict
disambiguation axioms (this really is just the same term; that really
is a different one; for Y's POV on term X see...). With a few simple
tools (SKOSiness; alignment axioms; good change representation...we
had this in Swoop in prototype form, wherein you could create a
"virtual version" by applying diffs to the canonical source) this
could work well.
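
A minimal sketch of the "disambiguation axioms" idea, purely for
illustration. The three axiom kinds, the Axiom record and the ex: term
names are all hypothetical; they only mirror the three cases listed
above (same term; different term; Y's POV on term X) and are not taken
from SKOS, OWL or Swoop.

```python
from dataclasses import dataclass

# Hypothetical axiom kinds, one per case named in the text.
SAME = "same-term"            # "this really is just the same term"
DIFFERENT = "different-term"  # "that really is a different one"
POV = "point-of-view"         # "for Y's POV on term X see..."


@dataclass(frozen=True)
class Axiom:
    kind: str
    term: str
    target: str
    note: str = ""  # e.g. whose point of view a POV axiom records


def canonical(term, axioms):
    """Follow same-term axioms until none applies (stops on a cycle)."""
    same = {a.term: a.target for a in axioms if a.kind == SAME}
    seen = set()
    while term in same and term not in seen:
        seen.add(term)
        term = same[term]
    return term


def disambiguate(term, axioms):
    """Return the canonical term plus the senses and POVs recorded for it."""
    root = canonical(term, axioms)
    return {
        "canonical": root,
        "distinct_from": sorted(
            a.target for a in axioms if a.kind == DIFFERENT and a.term == root
        ),
        "points_of_view": sorted(
            (a.note, a.target) for a in axioms if a.kind == POV and a.term == root
        ),
    }


# Made-up example terms; nothing here comes from a real ontology.
axioms = [
    Axiom(SAME, "ex:Colour", "ex:Color"),
    Axiom(DIFFERENT, "ex:Color", "ex:ColorCharge"),
    Axiom(POV, "ex:Color", "ex:PaintersColor", note="painters"),
]
result = disambiguate("ex:Colour", axioms)
print(result)
```

Keeping the "different" and "POV" axioms as separate records, rather
than folding everything into one equivalence, is what lets the variance
sit next to the canonical term without being erased.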

(Btw, sameAs is almost always the wrong tool, imho ;) Not that we've  
had much better....)

Cheers,
Bijan.

Begin forwarded message:

> Resent-From: semantic-web@w3.org
> From: "Martin Hepp (UIBK)" <martin.hepp@uibk.ac.at>
> Date: November 11, 2007 9:16:05 AM BST
> To: Richard Cyganiak <richard@cyganiak.de>
> Cc: Chris Richard <chris.richard@gmail.com>, Kingsley Idehen  
> <kidehen@openlinksw.com>, Chris Bizer <chris@bizer.de>, Semantic  
> Web <semantic-web@w3.org>, dbpedia-discussion@lists.sourceforge.net
> Subject: Re: dbpedia + disambiguation pages
> Reply-To: martin.hepp@uibk.ac.at
> Archived-At: <http://www.w3.org/mid/4736C855.5080503@uibk.ac.at>
>
>
> Hi all:
>
> > Note that redirects are often not synonyms, but artifacts of
> > Wikipedia's evolution. Redirects contain things like misspelled
> > names, names that adhere to older naming conventions (e.g. the
> > original WikiWords CamelCase naming convention), instances where
> > multiple articles were folded into one etc. Thus they make poor
> > labels.
>
> A bit late, but some related input: in the course of our paper [1],
> we did a quantitative analysis of redirects in Wikipedia (English).
> Here are the results in a nutshell:
>
> "Redirection Pages
>
> • 78% of the redirection pages are obvious synonyms (in
>   particular spelling variants or changes in word order of composite
>   words),
> • 12% reflect pages for which the content was integrated into
>   other pages,
> • for 10%, we could not quickly identify the semantic
>   relationship (we also did not try very hard ;-)).
>
> With regard to the impact on our analysis, we can observe the
> following: First, for the vast majority (78%) of all URIs that
> represent redirects, there is no semantic difference, since they
> are synonyms. For 22% (10% + 12%) of the redirects, semantic
> differences between the original URI and the target of the redirect
> cannot be excluded. In 12% of the cases, the redirect points to a
> page that incorporates the original content in a larger article."
>
> See http://www.heppnetz.de/harvesting-wikipedia/ for more  
> information and [1] for the full paper (also available for download  
> on that page).
>
> Best
> Martin
>
> [1] Martin Hepp, Katharina Siorpaes, Daniel Bachlechner: Harvesting
> Wiki Consensus: Using Wikipedia Entries as Vocabulary for
> Knowledge Management, IEEE Internet Computing, Vol. 11, No. 5, pp.
> 54-65, Sept-Oct 2007. Available at
> http://www.heppnetz.de/harvesting-wikipedia/
> ----------------------------------
> martin hepp, http://www.heppnetz.de
> mhepp@computer.org
>
>
>
> Richard Cyganiak wrote:
>> Chris,
>> Since your question is quite specific to DBpedia, let's continue
>> the discussion at the DBpedia mailing list (see
>> http://dbpedia.org/docs/#support and CC). Please consider removing
>> semantic-web@w3.org from the CC list for further replies.
>> On 4 Aug 2007, at 20:18, Chris Richard wrote:
>>> Have you done any thinking about extracting disambiguation  
>>> information from disambiguation pages?
>> No, we currently don't do any special processing for Wikipedia's  
>> disambiguation pages.
>> The main focus of DBpedia is extraction of information about the  
>> *things* described in Wikipedia articles, to enable domain queries  
>> over this information. Disambiguation information isn't really  
>> about those things, it's about the names we use to refer to those  
>> things. (Specifically, when a single name could refer to more than  
>> one of those things.) So it's more linguistic in nature, and  
>> hasn't registered prominently on our priority list.
>>> I was working on a similar project to extract structured info  
>>> from wikipedia.org to be used as the basis for a sem-web project  
>>> (until I came across dbpedia.org), and this is one thing I was  
>>> targeting that I couldn't find any mention of on dbpedia.org.
>>>
>>> I extract all the list items from a particular disambiguation  
>>> page and perform some basic processing to try and determine the  
>>> disambiguated article/concept. The Apple disambiguation page is a  
>>> good example of some of the different styles of information you get:
>>>
>>> 1. Apple Brook, a British actress
>>>
>>> Simple to extract a mapping between the ambiguous "Apple" and  
>>> Apple Brook, along with a potentially useful single sentence  
>>> abstract.
>>>
>>> 2. Apple (album), an album by Mother Love Bone
>>>
>>> or
>>>
>>> Ariane Passenger Payload Experiment, an Indian experimental  
>>> communication satellite with a C-Band transponder launched in 1981.
>>>
>>> Multiple links, so it's not immediately obvious which one is the  
>>> disambiguated concept, but you can imagine heuristics to make  
>>> connections here.
>> I think that a large part of the disambiguation information could  
>> be captured using relatively simple heuristics. There's no need to  
>> capture everything, 80% might be “good enough”.
>> The DBpedia codebase has pluggable “Extractors” that produce RDF  
>> triples from an article's source code; this would be yet another  
>> extractor.
>>> 3. any of the computers made by Apple Inc. since 1976, notably  
>>> the Apple Macintosh
>>>
>>> Somewhat unclear disambiguation, potentially difficult to extract  
>>> the correct relationship.
>>>
>>> I haven't done a lot of thinking about the proper way to  
>>> represent these relationships in RDF, I was just writing back to  
>>> a custom DB schema for now,
>> I don't know how to represent this in RDF. DBpedia defines one  
>> resource from each Wikipedia article, assuming that the topic of  
>> each article is some meaningful entity in the real world. This  
>> certainly doesn't hold for disambiguation pages, whose topic is  
>> not a single thing, but a multitude of things that happen to be  
>> related to some name, word, or term.
>>> but I think the information is highly valuable.
>> Can you give us some examples where you think this information  
>> could be used?
>>>  Also, similar to this, but easier to extract, is the synonym  
>>> information stored in the redirect links; are you currently  
>>> extracting multiple rdfs:label-s based on these redirects?
>> The next update will include dbpedia:redirectsTo triples for  
>> redirected articles.
>> Note that redirects are often not synonyms, but artifacts of  
>> Wikipedia's evolution. Redirects contain things like misspelled  
>> names, names that adhere to older naming conventions (e.g. the  
>> original WikiWords CamelCase naming convention), instances where  
>> multiple articles were folded into one etc. Thus they make poor  
>> labels.
>> Cheers,
>> Richard
>>>
>>> If you have a minute let me know your thoughts on this.
>>>
>>> Chris
>>>
>>>
>
>
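
The list-item heuristic Chris describes in the quoted thread can be
sketched roughly as follows. This is only an illustration under stated
assumptions: it assumes the items are available as raw wikitext with
[[...]] links (the thread shows only the rendered text), and the
parse_item and strip_links helpers are hypothetical names, not part of
the DBpedia extractors. The rule is deliberately crude: trust a link
only when it opens the item, and report the target as unclear
otherwise, as in Chris's third example.

```python
import re

# Matches [[Target]] and [[Target|label]] wiki links; group 1 is the target.
LINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def strip_links(text):
    """Replace each wiki link with its visible label (or its target)."""
    return re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]+)\]\]", r"\1", text)


def parse_item(line):
    """Return (target, gloss) for one list item; target is None if unclear."""
    item = line.lstrip("*# ").strip()
    match = LINK.match(item)  # heuristic: only trust a link that opens the item
    if match is None:
        return None, strip_links(item)
    gloss = strip_links(item[match.end():]).lstrip(" ,:-").strip()
    return match.group(1).strip(), gloss


# Assumed wikitext for the three styles from the Apple disambiguation page.
items = [
    "* [[Apple Brook]], a British actress",
    "* [[Apple (album)]], an album by [[Mother Love Bone]]",
    "* any of the computers made by [[Apple Inc.]] since 1976, "
    "notably the [[Apple Macintosh]]",
]
parsed = [parse_item(line) for line in items]
for target, gloss in parsed:
    print(target, "|", gloss)
```

Returning None for the unclear case, instead of guessing among several
links, fits Richard's point that there is no need to capture
everything.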

Received on Monday, 12 November 2007 10:56:58 UTC