Re: [ANN] Uberblic Search API from Tom Morris on 2010-07-21 (public-lod@w3.org from July 2010)

From: Tom Morris <tfmorris@gmail.com>
Date: Wed, 21 Jul 2010 13:46:40 -0400
To: Georgi Kobilarov <georgi.kobilarov@gmx.de>
Cc: Dave Reynolds <dave.e.reynolds@googlemail.com>, public-lod@w3.org
Message-ID: <AANLkTik-DxLiEIKVe9s_n8qZ2i5OFfNw4yNEGol-oQrA@mail.gmail.com>
On Wed, Jul 21, 2010 at 7:24 AM, Georgi Kobilarov
<georgi.kobilarov@gmx.de> wrote:

>> A suggestion would be to add some form of disambiguating description to
>> the keyword completion. If I type "Scarlet" then the completion options look
>> something like:
>>
>>    Scarlett
>>    Scarlett Johansson
>>    Scarlett Johansson
>>    Scarlett
>>    Scarlett
>>    Scarlett Johansson
>>    ...
>>    Scarlett Johansson
>>
>> Each of the four Johansson entries seems to bring up a different display but
>> it's hard to work out which is the right one to use and the order is
>> unpredictable.
>
> Thanks for the feedback. The actual API is more talkative than our little javascript search box, see
> http://platform.uberblic.org/api/search?query=Scarlett
>
> that response contains metadata about the entities' type, source, often an abstract, and will later also contain an image link and a search score. The search results html page http://platform.uberblic.org/?search=scarlett displays that metadata too. So that helps at least to disambiguate Scarlett the actress from Scarlett the book, the song or the pub in Texas.
>

You might want to look at Freebase Suggest as an example of a useful
display for this type of task.  I'm not sure whether recent versions
are open source or not, but you can find some earlier BSD-licensed
versions at http://code.google.com/p/freebase-suggest/source/browse/ .
In any case, the content and features are worth emulating.

> But you're touching on a much more important question: why are there 4 Scarlett Johansson entities in Uberblic?
> The answer is easy: they haven't been consolidated yet.

Did someone merge them after this discussion started?  I'm only seeing one now.

When I look at the list of merged resources, I see:

  Scarlett Johansson (Wikipedia, Freebase, The Movie DB, MusicBrainz,
Wikipedia DE)
  Scarlett Johansson (The Movie DB)
  Scarlett Johansson (Wikipedia)
  Scarlett Johansson (MusicBrainz)
  스칼릿 조핸슨 (Freebase)

Can you explain what's in the parentheses?  What's the significance of
multiple entries from the same sources (Wikipedia, MusicBrainz,
Freebase, The Movie DB)?  Does it mean that each of those had
duplicates originally as well?

> For developers that means: pick any URI that refers to the entity you mean (any of Scarlett Johanssons above) and you'll be fine.
> In practice, that is: if you're building a movie applications, always pick the uberblic entity from The Movie DB.

How does one determine which collection of URIs is "best" for which
type?  Is this information encoded in machine-readable format
someplace or is it just that humans know that a database called
"movie" has got to be best for a type called "actor"?

> That uberblic URI will always be valid and in the case of a merge that URI will be redirected.
> And there's an API to track consolidation events as well http://uberblic.org/developers/apis/uri-consolidation-feed/

What about splits?  It's not uncommon for a Wikipedia article to be
about two (or several) things and this problem propagates to
downstream databases like Freebase and DBpedia (and presumably
Uberblic).  What happens when someone splits an article/topic/concept?
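
To make the merge/split asymmetry concrete, here is a minimal Python
sketch of a redirect table.  It is an illustration only: the class
name, method names and "ex:" URIs are hypothetical and are not taken
from Uberblic, Freebase or any other system mentioned in this thread.

```python
class RedirectTable:
    """Toy model of URI consolidation: every merged-away URI redirects to one survivor."""

    def __init__(self):
        # merged-away URI -> URI it now redirects to
        self._target = {}

    def resolve(self, uri):
        """Follow redirects until reaching a URI that has not been merged away."""
        while uri in self._target:
            uri = self._target[uri]
        return uri

    def merge(self, loser, winner):
        """Redirect `loser` (and everything already pointing at it) to `winner`."""
        winner_root = self.resolve(winner)
        loser_root = self.resolve(loser)
        # Skip the update when both already resolve to the same entity,
        # which also prevents redirect cycles.
        if loser_root != winner_root:
            self._target[loser_root] = winner_root


table = RedirectTable()
table.merge("ex:scarlett-johansson-moviedb", "ex:scarlett-johansson")
table.merge("ex:scarlett-johansson-wikipedia", "ex:scarlett-johansson-moviedb")
print(table.resolve("ex:scarlett-johansson-wikipedia"))
```

A merge always has a single answer, so one redirect per old URI is
enough.  A split does not: the old URI would have to point at two or
more targets, and a plain redirect cannot say which one a given
client meant.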

> But of course we are doing our best to consolidate as many entities and as fast as we can.
>
> At the bottom of http://platform.uberblic.org/ (and in the consolidation API feed)
> you'll see the results of our duplicate detection engine Doppelganger, as well as
> consolidations initiated by users.

Presumably it only does this for merges that it's supremely confident
in.  Bad merges are the bane of any system like this because it tends
to be much harder to unmerge than merge.  What happens to things that
are pretty likely merge candidates, say 95%, but not likely enough to
merge automatically?  Is there a human-supervised queue somewhere for
these?
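
The kind of triage being asked about could look something like the
Python sketch below.  It is not a description of how Doppelganger
works: the two thresholds, the MergeCandidate type and the triage
function are all hypothetical, chosen only to show the three outcomes
(merge automatically, queue for a human, ignore).

```python
from dataclasses import dataclass

# Hypothetical cut-offs, for illustration only.
AUTO_MERGE_THRESHOLD = 0.99   # at or above: merge without asking anyone
REVIEW_THRESHOLD = 0.90       # at or above (but below auto): ask a human


@dataclass(frozen=True)
class MergeCandidate:
    left_uri: str
    right_uri: str
    confidence: float  # estimated probability that both URIs name the same entity


def triage(candidates):
    """Sort candidates into auto-merges, a human review queue, and rejects."""
    auto, review, rejected = [], [], []
    for candidate in candidates:
        if candidate.confidence >= AUTO_MERGE_THRESHOLD:
            auto.append(candidate)
        elif candidate.confidence >= REVIEW_THRESHOLD:
            review.append(candidate)
        else:
            rejected.append(candidate)
    # Show reviewers the most likely duplicates first.
    review.sort(key=lambda c: c.confidence, reverse=True)
    return auto, review, rejected
```

With these cut-offs, a 95% candidate is neither merged nor dropped; it
lands in the review queue, where a wrong guess costs a reviewer a few
seconds instead of requiring an unmerge later.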

> Doppelganger is continuously traversing the uberblic graph to identify and merge equivalent resources.
> Last week it was mainly places in Geonames, Foursquare and Freebase, at the moment it's people.

Speaking of Freebase, does Uberblic automatically pick up the merges
and splits generated by the Freebase community?  It doesn't make any
sense to do the work twice.

Looking at the HTML display, shouldn't the references to CC-BY
resources be hot links?  Even though it's not required for
MusicBrainz, it would seem like the polite thing to do for them too.

Tom
Received on Wednesday, 21 July 2010 17:47:15 UTC