RE: [ANN] Uberblic Search API

Hi Tom,

> You might want to look at Freebase Suggest as an example of a useful
> display for this type of task.  I'm not sure whether recent versions
> are open source or not, but you can find some earlier BSD licensed
> versions at http://code.google.com/p/freebase-suggest/source/browse/
> In any case, the content and features are worth emulating.

Absolutely, providing something similar to Freebase Suggest is what we
have in mind. We just started by providing the Search API first.
Autocomplete search widgets are very important IMHO.

 
> > But you're touching on a much more important question: why are there
> > 4 Scarlett Johansson entities in Uberblic?
> > The answer is easy: they haven't been consolidated yet.
> 
> Did someone merge them after this discussion started?  I'm only seeing
> one now.

Yes, they have been merged. 

 
> When I look at the list of merged resources, I see:
> 
>   Scarlett Johansson (Wikipedia, Freebase, The Movie DB, MusicBrainz,
> Wikipedia DE)
>   Scarlett Johansson (The Movie DB)
>   Scarlett Johansson (Wikipedia)
>   Scarlett Johansson (MusicBrainz)
>   스칼릿 조핸슨 (Freebase)
> 
> Can you explain what's in the parentheses?  What's the significance of
> multiple entries from the same sources (Wikipedia, MusicBrainz,
> Freebase, The Movie DB)?  Does it mean that each of those had
> duplicates originally as well?

The entity with the many sources now holds the global URI for that entity,
while the other entities are the merge losers and their URIs are now
redirects. In this example you can see that the merged entity has the URI
that was originally assigned to the entity from Wikipedia.
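
To make the redirect behaviour concrete, here is a rough sketch in Python
of how a client could map any merged URI to the surviving global one. The
redirect table, the example.org URIs and the function name are made up for
illustration; this is not the actual Uberblic API, just the idea of
following merge redirects until a URI that is no longer redirected.

```python
def resolve_canonical(uri, redirects, max_hops=10):
    """Follow merge redirects until reaching a URI that is not redirected.

    `redirects` maps the URI of a merge loser to the URI it now points to.
    """
    seen = set()
    current = uri
    while current in redirects:
        # Guard against loops and unreasonably long redirect chains.
        if current in seen or len(seen) >= max_hops:
            raise ValueError(f"redirect loop or chain too long at {current}")
        seen.add(current)
        current = redirects[current]
    return current


# Hypothetical redirect table: two merge losers point at the winner,
# which keeps the URI originally assigned to the Wikipedia entity.
redirects = {
    "http://example.org/resource/sj-moviedb": "http://example.org/resource/sj-wikipedia",
    "http://example.org/resource/sj-musicbrainz": "http://example.org/resource/sj-wikipedia",
}

canonical = resolve_canonical("http://example.org/resource/sj-moviedb", redirects)
```

A URI that was never merged simply resolves to itself, so an application
can run every stored URI through the same function.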

The frontend presentation isn't very helpful to new users though, I admit.

> > For developers that means: pick any URI that refers to the entity you
> > mean (any of the Scarlett Johanssons above) and you'll be fine.
> > In practice, that is: if you're building a movie application, always
> > pick the uberblic entity from The Movie DB.
> 
> How does one determine which types correlate with which collections of
> URIs as "best"?  Is this information encoded in machine-readable
> format someplace or is it just that humans know that a database called
> "movie" has got to be best for a type called "actor"?

Good point. Currently the user has to know. The dashboard
(http://platform.uberblic.org) does provide a little help in figuring
that out. The metaphor for the two stats boxes at the top is "input -
output": if you click e.g. on Geonames in the left box, you'll see that
Geonames only contributes entities of type Place to the repository, while
MusicBrainz contributes mostly songs, albums, persons, bands, etc.

 
> > That uberblic URI will always be valid and in the case of a merge
> > that URI will be redirected.
> > And there's an API to track consolidation events as well
> > http://uberblic.org/developers/apis/uri-consolidation-feed/
> 
> What about splits?  It's not uncommon for a Wikipedia article to be
> about two (or several) things and this problem propagates to
> downstream databases like Freebase and DBpedia (and presumably
> Uberblic).  What happens when someone splits an article/topic/concept?

I didn't find splits to be very common in data sources other than
Wikipedia. Which doesn't mean that we should ignore them...

> > But of course we are doing our best to consolidate as many entities
> > and as fast as we can.
> >
> > At the bottom of http://platform.uberblic.org/ (and in the
> > consolidation API feed) you'll see the results of our duplicate
> > detection engine Doppelganger, as well as consolidations initiated
> > by users.
> 
> Presumably it only does this for merges that it's supremely confident
> in.  

Yes! There probably is an optimal precision/recall point somewhere, but at
the moment we are very far into the precision range of that spectrum.

> Bad merges are the bane of any system like this because it tends
> to be much harder to unmerge than merge.  What happens to things that
> are pretty likely merge candidates, say 95%, but not likely enough to
> merge automatically?  Is there a human supervised queue somewhere for
> these?

That is the approach, yes. The review queue isn't public yet because we
didn't get the UI right (ended up with false merges by test users). But
we'll get there. 
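
Roughly, the triage could look like the Python sketch below. The
thresholds (0.99 for automatic merges, 0.90 for the review queue) and all
names are assumptions picked for illustration, not the values or code
Doppelganger actually uses. The point is only that a high automatic
threshold favours precision, while the band below it goes to humans.

```python
from enum import Enum


class Decision(Enum):
    MERGE = "merge"    # confident enough to merge automatically
    REVIEW = "review"  # likely duplicate, send to the human review queue
    SKIP = "skip"      # too uncertain, leave the entities separate


def triage(score, auto_threshold=0.99, review_threshold=0.90):
    """Decide what to do with a merge candidate given a match score in [0, 1].

    Both thresholds are hypothetical; raising auto_threshold trades
    recall for precision.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= auto_threshold:
        return Decision.MERGE
    if score >= review_threshold:
        return Decision.REVIEW
    return Decision.SKIP


# The 95% candidate from the question ends up in the review queue.
decision = triage(0.95)
```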
 
> > Doppelganger is continuously traversing the uberblic graph to
> > identify and merge equivalent resources.
> > Last week it was mainly places in Geonames, Foursquare and Freebase,
> > at the moment it's people.
> 
> Speaking of Freebase, does it automatically pick up the merges and
> splits generated by the Freebase community?  It doesn't make any sense
> to do the work twice.

We are treating the merges and sameAs links in data sources as hints for
our reconciliation. I don't know whether we want to optimize specifically
for Freebase though.

> Looking at the HTML display, shouldn't the references to CC-BY
> resources be hot links?  Even though it's not required for
> MusicBrainz, it would seem like the polite thing to do for them too.

Oh yes, you're right. The links to the original resources / documents are
hidden behind the 'sources details' link at the moment. We will make that
more prominent.

Cheers,
Georgi

 --
Georgi Kobilarov
Uberblic Labs Berlin
http://kobilarov.com

Received on Wednesday, 28 July 2010 20:22:12 UTC