Re: General tuning for Dbpedia Spotlight from Pablo N. Mendes on 2014-01-18 (public-lod@w3.org from January 2014)

From: Pablo N. Mendes <pablomendes@gmail.com>
Date: Fri, 17 Jan 2014 17:49:55 -0800
To: Hugh Glaser <hugh@glasers.org>
Cc: public-lod community <public-lod@w3.org>, DBpediaSpotlight Users <dbp-spotlight-users@lists.sourceforge.net>
Message-ID: <CA+3KvkMj7R3csGXEpLNmPEViqV-Tedzto3wWKqtaZfQ+LLBMuA@mail.gmail.com>

Hi Hugh,
If you set the &confidence parameter higher, you should see fewer spurious
annotations at the cost of some recall. Nevertheless, there will always be
an error here and there (trust me, this is often hard even for humans to
do).

We have two approaches that are deployed and publicly available for testing
purposes:
http://spotlight.dbpedia.org/rest/annotate
http://spotlight.sztaki.hu:2222/rest/annotate

The former is from 2010 and the latter from 2012. Their confidence scores
are computed differently and have different ranges. You can use content
negotiation to ask for HTML, JSON, XML or Turtle. Example calls with cURL
below.

curl http://spotlight.dbpedia.org/rest/annotate \
  -d "confidence=0.4" \
  -d "text=A proposed monument to Union soldiers at Olustee Battlefield Historic State Park has enraged the area’s Confederate descendants"

curl http://spotlight.sztaki.hu:2222/rest/annotate \
  -d "confidence=0.9" \
  -d "text=A proposed monument to Union soldiers at Olustee Battlefield Historic State Park has enraged the area’s Confederate descendants"

Here "confidence" and "text" are just HTTP parameters. Just let us know if
you need further help. I am copying our discussion list here.
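
If you would rather script this than call cURL by hand, here is a minimal Python sketch of the same request, plus a helper that turns the result into rows you could build triples from. Treat it as an illustration under assumptions, not as an official client: the function names are made up for this example, "application/json" is assumed to be the media type the service accepts for JSON, and the response is assumed to carry a "Resources" list whose items have "@URI", "@surfaceForm" and "@offset" keys. Please check an actual response before relying on that shape.

```python
import json
import urllib.parse
import urllib.request

# Endpoints quoted in this message; they may not stay reachable forever.
ENDPOINT_2010 = "http://spotlight.dbpedia.org/rest/annotate"
ENDPOINT_2012 = "http://spotlight.sztaki.hu:2222/rest/annotate"


def build_annotate_request(endpoint, text, confidence, accept="application/json"):
    """Build (but do not send) a POST request equivalent to the cURL calls."""
    # "confidence" and "text" travel as form-encoded HTTP parameters,
    # which is what `curl -d` does as well.
    body = urllib.parse.urlencode({"confidence": confidence, "text": text})
    return urllib.request.Request(
        endpoint,
        data=body.encode("utf-8"),
        # Content negotiation: the Accept header selects the output format.
        # ASSUMPTION: "application/json" is the media type used for JSON.
        headers={"Accept": accept},
        method="POST",
    )


def annotate(endpoint, text, confidence):
    """Send the request and return the parsed JSON response."""
    request = build_annotate_request(endpoint, text, confidence)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def annotations_to_rows(doc_id, response, predicate=":some-pred"):
    """Turn a parsed response into (doc, predicate, uri, surface form, offset) rows.

    ASSUMPTION: the response has a "Resources" list whose items carry
    "@URI", "@surfaceForm" and "@offset". The predicate is a placeholder.
    """
    rows = []
    for resource in response.get("Resources", []):
        rows.append(
            (
                doc_id,
                predicate,
                resource["@URI"],
                resource.get("@surfaceForm"),  # words that triggered the link
                resource.get("@offset"),       # where they start in the text
            )
        )
    return rows
```

Something like annotations_to_rows(":doc1", annotate(ENDPOINT_2012, your_text, 0.9)) would then give you one row per annotation, including the words in the text that triggered it. Remember that the two endpoints score confidence differently, so a threshold chosen for one should not be reused blindly on the other.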

Cheers,
Pablo



On Fri, Jan 17, 2014 at 8:26 AM, Hugh Glaser <hugh@glasers.org> wrote:

> Thank you for the responses, both on- and off-list.
>
> So I see perhaps I should recast my question, with maybe wider scope.
>
> I have a load of abstract-style text fragments - that is perhaps 100 words
> each, on a wide variety of topics, although there is a bit of a technical
> bent.
>
> I want to be able to do linkage between them and to other things, based
> around our lovely Linked Data world.
> That is, have lots of triples, something like
> :docIDn :some-pred :conceptURI
> It would be a bonus to know which words in the text triggered the
> generation of the triple.
> Of course, the system doesn’t actually have to generate the triples - I
> can build them if I get sufficiently sensible output, including the sort of
> html output that Spotlight does.
> And because it goes automatically to users, I need quite high precision,
> even if recall suffers (I think is the terminology).
> Oh, and ideally free, although not necessarily.
> My current preference is for dbpedia or freebase URIs, but wordnet is
> probably OK too.
>
> I think there must be people who have done this (a lot). Or at least
> there should be.
> There are certainly quite a lot of systems that can do it, some more or
> less playing well with Linked Data URIs.
>
> I think my problem (apart from laziness) is that the systems I look at
> seem to want me to care about what they do, or at least engage with tuning
> and things, which means I need some understanding of what they do, which I
> don’t have (and I probably don’t care either :-) ).
>
> So, does anyone (else) feel they can point me at a system for doing this
> that I can just use out of the box (possibly having been told some
> parameters to use)?
>
> Of course, maybe I am just asking too much of the technology at the
> moment, but I can hope!
> Best
> Hugh
> --
> Hugh Glaser
>    20 Portchester Rise
>    Eastleigh
>    SO50 4QS
> Mobile: +44 75 9533 4155, Home: +44 23 8061 5652
>
>
>
>


-- 

Pablo N. Mendes
http://pablomendes.com

Received on Saturday, 18 January 2014 01:50:23 UTC