Re: Survey of OpenRefine reconciliation services

Hi Vladimir,

Thanks a lot for the feedback!

On 6/25/19 3:44 PM, Vladimir Alexiev wrote:

> - services can and should score even on the absence of fields, based on
> the importance and popularity of entities. Eg a Paris or Rembrandt
> should return the respective famous entities, not one of the hundred
> other possible matches; inactive companies should (and are) downgraded, etc

Absolutely! I think it would be useful for reconciliation services to be
able to expose such popularity features individually, rather than
aggregating them in a single opaque score.
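
To make the idea concrete, here is a minimal sketch in Python of what
exposing features individually could look like. Everything in it is an
assumption made for illustration: the Candidate class, the feature names
(name_similarity, popularity), the identifiers, the values and the
weights are all hypothetical and are not taken from any existing service.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Candidate:
    """A reconciliation candidate that exposes its features individually."""
    id: str
    name: str
    # Hypothetical feature names and values, for illustration only.
    features: Dict[str, float] = field(default_factory=dict)


def aggregate(candidate: Candidate, weights: Dict[str, float]) -> float:
    """Combine the exposed features into one score with client-chosen weights."""
    return sum(weights.get(key, 0.0) * value
               for key, value in candidate.features.items())


def rank(candidates: List[Candidate], weights: Dict[str, float]) -> List[Candidate]:
    """Order candidates by aggregated score, best first (stable for ties)."""
    return sorted(candidates, key=lambda c: aggregate(c, weights), reverse=True)


# Two made-up candidates for the query "Paris" with identical name similarity.
candidates = [
    Candidate("ex:paris-small-town", "Paris",
              {"name_similarity": 1.0, "popularity": 0.1}),
    Candidate("ex:paris-capital", "Paris",
              {"name_similarity": 1.0, "popularity": 0.9}),
]

# Because popularity is exposed separately, the client decides how much it counts.
by_popularity = rank(candidates, {"name_similarity": 0.5, "popularity": 0.5})
by_name_only = rank(candidates, {"name_similarity": 1.0})
```

With a single opaque score the client could not tell whether a candidate
ranked first because of its name or because of its popularity; with
separate features it can reweight or ignore either one.
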
> - if a dataset has multiple labels per entity (e.g VIAF has over 100
> for Cranach), it should take label nature into account
> - VIAF has life years (and if I'm not mistaken gender) , which are
> important to use.
> - It even has Occupation and Nationality, which however are part of
> the label and not normalized. (Our service extracts them to separate
> fields)

Yes, I did not want to go too much into the details of each specific
field for each specific database, of course. Is your VIAF service
available publicly? It would be great to add it to
https://reconciliation-api.github.io/testbench/.

> - that H2020 project is not related to ERC

Good catch, I will fix that.

Antonin


>
> On Thu, Jun 20, 2019, 18:14 Antonin Delpeuch <antonin@delpeuch.eu
> <mailto:antonin@delpeuch.eu>> wrote:
>
>     Hi all,
>
>     I have just made public a short survey of the existing
>     reconciliation services and how their implementations relate to
>     the techniques used in record linkage (the corresponding academic
>     field):
>
>     https://arxiv.org/abs/1906.08092
>
>     This is the outcome of a project done with OpenCorporates to plan
>     improvements to their own reconciliation service. The survey is
>     therefore focused on the perspective of the service provider,
>     rather than the end user, although the goal is of course to make
>     the service more useful to data consumers. I hope this can help
>     start the discussions (and give some definitions, as Ricardo
>     suggested).
>
>     I have outlined a few suggestions, mostly around the issue of
>     scoring, which reflect my general feeling about this API: at the
>     moment service providers are a bit clueless about how to score
>     reconciliation candidates, and as a consequence users cannot
>     really rely on them in general.
>
>     It would be great to also study what users actually do with these
>     services, to better understand their workflows. I am not sure how to
>     approach that, given that OpenRefine workflows are generally not
>     published. One possibility would be to analyze the logs of existing
>     reconciliation services (such as the one for Wikidata). Let me
>     know if you are interested in that sort of project.
>
>     Any feedback is welcome of course!
>
>     Cheers,
>
>     Antonin
>
>

Received on Saturday, 29 June 2019 07:57:12 UTC