Re: DBpedia-based entity recognition service / tool? from Nathan on 2010-02-02 (public-lod@w3.org from February 2010)

From: Nathan <nathan@webr3.org>
Date: Tue, 02 Feb 2010 15:21:55 +0000
To: Ivan Herman <ivan@w3.org>
CC: Matthias Samwald <samwald@gmx.at>, public-lod@w3.org
Message-ID: <4B684313.6020605@webr3.org>
I should probably reply here, as I've been working on exactly this for
the past few months.

I've found from experience that the only viable way to address this need
is to do as follows:
1: Pass content through to both OpenCalais and Zemanta
2: Combine the results to provide a list of "string" terms to be
associated with dbpedia resources (where zemanta hasn't already done it)
3: Look up each string and try to associate it with a dbpedia resource
4: Return all matches to the end user so that they can manually confirm
the results.
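
As a rough illustration of steps 2 to 4 (not the code I actually run),
here is a minimal Python sketch. It assumes the extractor output has
already been fetched: `calais_terms` is a plain list of strings,
`zemanta_links` maps each string to a dbpedia URI or None, and `lookup`
is whatever string lookup you plug in. All of these names are
hypothetical.

```python
from typing import Callable, Dict, List, Optional


def merge_terms(calais_terms: List[str],
                zemanta_links: Dict[str, Optional[str]]) -> Dict[str, Optional[str]]:
    """Step 2: union of extracted terms, keeping any URI Zemanta already gave."""
    merged: Dict[str, Optional[str]] = {term: None for term in calais_terms}
    for term, uri in zemanta_links.items():
        # A Zemanta URI wins; a URI-less term is only added if it is new.
        if uri is not None or term not in merged:
            merged[term] = uri
    return merged


def propose_matches(merged: Dict[str, Optional[str]],
                    lookup: Callable[[str], List[str]]) -> Dict[str, List[str]]:
    """Steps 3 and 4: candidate URIs per term, for the user to confirm."""
    proposals: Dict[str, List[str]] = {}
    for term, uri in merged.items():
        # Only terms without a URI go through the (hypothetical) lookup.
        proposals[term] = [uri] if uri is not None else lookup(term)
    return proposals
```

A term may end up with zero, one or many candidates; nothing here
decides on the user's behalf.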

Steps 3 and 4 are the killers here, because no matter how good the
service you can't always match to exact URIs (sometimes you can only
determine that you may mean one of X many ambiguous URIs); and in other
cases (approx 10% of the time) the wrong term is extracted by zemanta /
opencalais which skews results. For instance "Mrs London" may come back
as simply "London" which you'd take to mean
http://dbpedia.org/resource/London. Similarly, ambiguous links often
mean different things in different contexts, which you can to some
extent infer by correlating the other extracted terms, but again you can
never get it perfect.
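
To make "correlating the other extracted terms" concrete, one simple
option is to score each candidate URI by how many of the other extracted
terms overlap with words known to be related to that candidate. The
sketch below is only an assumption about how this could be done; the
function name, the related-word sets and the tie rule are all made up
for illustration.

```python
from typing import Dict, Iterable, Optional, Set


def pick_by_context(candidates: Dict[str, Set[str]],
                    context_terms: Iterable[str]) -> Optional[str]:
    """Return the candidate URI best supported by the context, or None.

    `candidates` maps each candidate URI to words related to it
    (hypothetical data, e.g. taken from labels or categories).
    """
    context = {term.lower() for term in context_terms}
    scored = sorted(
        ((len(context & {w.lower() for w in words}), uri)
         for uri, words in candidates.items()),
        reverse=True,
    )
    if not scored or scored[0][0] == 0:
        return None  # no supporting evidence at all
    if len(scored) > 1 and scored[0][0] == scored[1][0]:
        return None  # tie: still ambiguous, leave it to the user
    return scored[0][1]


# Illustrative related-word sets, not real dbpedia data.
london = {
    "http://dbpedia.org/resource/London": {"england", "thames", "uk"},
    "http://dbpedia.org/resource/London,_Ontario": {"ontario", "canada"},
}
best = pick_by_context(london, ["Canada", "Toronto"])
```

Returning None on a tie or on zero overlap keeps the ambiguous cases
flowing to manual confirmation rather than guessing.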

Thus as far as I can see, even when cutting out any ambiguous lookups,
this is always going to be a process that requires user confirmation.

On the lookup side of things I'd been using the API of
lookup.dbpedia.org written by Georgi; however I've recently found that
it delivers multiple results per lookup term more often than not. Hence
I've since been on a drive to create an alternative string-based lookup
which can return a single unambiguous link more often than not.
Something I finally achieved over the weekend :-)

In reality (regardless of client restrictions on some of the code) I
don't think this is an API I could ever release, or that anybody could
release yet(?); what I may well be able to do though is open source the
lookup classes I've made & sparql queries behind them, allowing people
to run their own dbpedia URI lookup service.

To clarify why this per-application lookup is probably the best
approach: string to resource matching is very much domain specific. In
one case when we say FOAF we all mean the ontology
<http://dbpedia.org/resource/FOAF_(software)> whereas in many places
they mean <http://dbpedia.org/resource/Friend_of_a_friend> which means
that we end up with the scenario:
   sitea:FOAF owl:sameAs dbp:FOAF_(software) ; rdfs:label "FOAF"@en .
   siteb:FOAF owl:sameAs dbp:Friend_of_a_friend ; rdfs:label "FOAF"@en .

Combine this with the fact that, to provide anything near a usable
service, you'll need to cache look-ups and hit your own RDF store first
before querying dbpedia, and we have a situation where the right answer
depends on who is asking. On a world-open API level it makes sense to
provide an ambiguous reply saying that FOAF could mean either resource,
but on a domain level it makes sense to say we generally mean x and not y.
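
The lookup order described above (domain preference first, then your own
cache, then the remote service) could be sketched roughly as follows. A
plain dict stands in for the local RDF store, and `remote` is a
hypothetical callable wrapping whatever dbpedia lookup you use; neither
reflects my actual classes.

```python
from typing import Callable, Dict, List


class DomainLookup:
    """Resolve a string: domain preference, then cache, then remote lookup."""

    def __init__(self, preferred: Dict[str, str],
                 remote: Callable[[str], List[str]]) -> None:
        self.preferred = preferred            # "we generally mean x", lowercase keys
        self.cache: Dict[str, List[str]] = {}  # stand-in for a local RDF store
        self.remote = remote                  # hypothetical remote lookup

    def resolve(self, term: str) -> List[str]:
        key = term.strip().lower()
        if key in self.preferred:
            # Domain-level answer: a single unambiguous URI.
            return [self.preferred[key]]
        if key not in self.cache:
            # Only hit the remote service once per distinct string.
            self.cache[key] = self.remote(term)
        return self.cache[key]
```

Each site supplies its own `preferred` table, so sitea and siteb can give
different single answers for "FOAF" while a world-open service would
still return both candidates.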

If you need further convincing I can supply literally hundreds of
use-cases where the resource implied by a string seems obvious but is in
fact ambiguous - even RDF has 15+ meanings, but when I say RDF I
always mean Resource Description Framework.

Further, the per-application approach also allows for domain-specific
string-to-resource triples, such as "Linked Open Data, Linking Open
Data, LOD, Linked Data" (and case variations) all meaning the same
thing. It also allows for terminology not yet in dbpedia; for instance
"iPad" is used daily, but isn't known over on dbpedia (nor recognised by
zemanta yet; only open calais can return it, since they assign no
meaning / resource association to the strings, it's just a string).
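
A domain-specific alias table of this kind can be as simple as a
normalised string index. The sketch below is illustrative only: the
Linked_Data URI is assumed to be the intended target, and the
example.org URI is a made-up local identifier for a term dbpedia does
not cover.

```python
import re
from typing import Dict, List


def normalise(term: str) -> str:
    """Collapse whitespace and case so variations share one key."""
    return re.sub(r"\s+", " ", term).strip().lower()


def build_alias_index(groups: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every normalised variant to the URI of its group."""
    index: Dict[str, str] = {}
    for uri, variants in groups.items():
        for variant in variants:
            index[normalise(variant)] = uri
    return index


aliases = build_alias_index({
    # Assumed target URI for the Linked Data variants.
    "http://dbpedia.org/resource/Linked_Data": [
        "Linked Open Data", "Linking Open Data", "LOD", "Linked Data",
    ],
    # Hypothetical local URI for a term not yet in dbpedia.
    "http://example.org/terms/iPad": ["iPad"],
})
```

Looking up `aliases.get(normalise(some_string))` before any remote call
gives the domain its own vocabulary without waiting for dbpedia to
catch up.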

Hope that helped a bit and if you have any questions or would like the
resource lookup code / sparql queries do let me know.

Regards,

Nathan

Ivan Herman wrote:
> Not providing an answer, but... if such tools are around, I would love
> to see them added to the SWSWiki[1]. At the moment, there is a generic
> category 'Tagging', with the following input:
> 
> http://www.w3.org/2001/sw/wiki/Category:Tagging
> 
> More would be good...
> 
> Ivan
> 
> [1] http://www.w3.org/2001/sw/wiki/
> 
> On 2010-2-2 13:26 , Matthias Samwald wrote:
>> Dear LOD community,
>>
>> I would be glad to hear your advice on how to best accomplish a simple
>> task: extracting DBpedia entities (identified with DBpedia URIs) from a
>> string of text. With good accuracy and recall, possibly with some
>> options to constrain the recognized entities to some subset of DBpedia,
>> based on categories. The tool or service should be performant enough to
>> process large numbers of strings in a reasonable amount of time.
>> Given the prolific creation of tiny tools and services in this community
>> I am puzzled about my inability to find anything that accomplishes this
>> task.
>> Could you point me to something like that? Are there tools/services for
>> Wikipedia that I could use?
>> Zemanta seems to be too much geared towards 'enhanced blogging', while
>> OpenCalais does not return Wikipedia/DBpedia identifiers. Please correct
>> me if I am wrong.
>>
>> Cheers,
>> Matthias
>>
>
Received on Tuesday, 2 February 2010 15:22:37 UTC