Re: DBpedia-based entity recognition service / tool? from Juan Sequeda on 2010-02-04 (public-lod@w3.org from February 2010)

From: Juan Sequeda <juanfederico@gmail.com>
Date: Thu, 4 Feb 2010 03:18:00 -0600
To: Matthias Samwald <samwald@gmx.at>
Cc: public-lod@w3.org
Message-ID: <f914914c1002040118x5efad120n15fa27eebc524bc8@mail.gmail.com>

Hi Matthias,

We worked on something similar: entity type discovery using linked open
data.

Our task was the following: given a corpus of documents in the same domain,
identify specific entity types in the documents. Our objective was to search
for documents in the corpus by specific entities. For example: "find articles
that are about RDBMSs".

Standard NER tools identify high-level types such as persons, organizations,
and places because they have been previously trained on general corpora. I
assume tools like OpenCalais have been trained on news-like documents and
Zemanta has been trained on blog-like documents.

We were interested in identifying specific types such as "RDBMS" when the
word "Oracle" would show up in the text. In order to do that, we followed
several domain term extraction techniques. We used LOD, specifically
DBpedia, Freebase and OpenCyc, to disambiguate terms and also retrieve the
entities. Honestly, evaluation is pretty hard to do, but our current
implementation was not that bad (75% precision and 55% recall).
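
As a rough illustration of the idea (this is not the project code), the
sketch below maps an ambiguous term such as "Oracle" to a specific type by
comparing the words of the document with context words attached to each
candidate entity. Everything here is an assumption made for the example: the
Candidate class, the CANDIDATES table, the type labels and the context words
are hypothetical stand-ins for what would really be retrieved from DBpedia,
Freebase or OpenCyc, and the overlap count is just one simple way to
disambiguate.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    """One possible meaning of a term (hypothetical structure)."""
    uri: str
    entity_type: str
    context_terms: frozenset


# Hypothetical in-memory stand-in for candidates retrieved from LOD sources.
# The type labels and context words are made up for this example.
CANDIDATES = {
    "oracle": [
        Candidate(
            "http://dbpedia.org/resource/Oracle_Database",
            "RDBMS",
            frozenset({"database", "sql", "table", "query"}),
        ),
        Candidate(
            "http://dbpedia.org/resource/Oracle",
            "Divination",
            frozenset({"delphi", "prophecy", "priestess", "temple"}),
        ),
    ],
}


def tokenize(text: str) -> set:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def disambiguate(term: str, document: str) -> Optional[Candidate]:
    """Return the candidate whose context words overlap most with the
    document, or None if the term is unknown or nothing overlaps."""
    candidates = CANDIDATES.get(term.lower(), [])
    if not candidates:
        return None
    tokens = tokenize(document)
    best = max(candidates, key=lambda c: len(c.context_terms & tokens))
    # Refuse to guess when the document gives no supporting evidence.
    if not best.context_terms & tokens:
        return None
    return best


if __name__ == "__main__":
    doc = "We migrated the Oracle tables and tuned every SQL query."
    match = disambiguate("Oracle", doc)
    print(match.entity_type if match else "no match")
```

With a step like this, a query such as "find articles that are about RDBMSs"
reduces to selecting the documents in which some term was resolved to the
type "RDBMS".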

We built upon some work by IBM where they create a vocabulary from text
using LOD [1].

Let me see if I can clean up the code and publish it as a service.

[1] http://data.semanticweb.org/conference/iswc/2009/paper/inuse/143/html

Juan Sequeda
(575) SEQ-UEDA
www.juansequeda.com

On Tue, Feb 2, 2010 at 6:26 AM, Matthias Samwald <samwald@gmx.at> wrote:

> Dear LOD community,
>
> I would be glad to hear your advice on how to best accomplish a simple
> task: extracting DBpedia entities (identified with DBpedia URIs) from a
> string of text. With good accuracy and recall, possibly with some options to
> constrain the recognized entities to some subset of DBpedia, based on
> categories. The tool or service should be performant enough to process large
> numbers of strings in a reasonable amount of time.
> Given the prolific creation of tiny tools and services in this community I
> am puzzled about my inability to find anything that accomplishes this task.
> Could you point me to something like that? Are there tools/services for
> Wikipedia that I could use?
> Zemanta seems to be too much geared towards 'enhanced blogging', while
> OpenCalais does not return Wikipedia/DBpedia identifiers. Please correct me
> if I am wrong.
>
> Cheers,
> Matthias
>
>

Received on Thursday, 4 February 2010 09:18:34 UTC