Producing an explainable reconciliation score

Hi all, I have a question about the approaches services are using to produce
a reconciliation score that is meaningful to end users.

Crucially, we want users to know why the score is what it is and how they
can make it better. As I understand it, most reconciliation services produce
a somewhat abstract score from 0 to 100 that roughly translates as the
"confidence" or "probability" that the result is the one a user is looking
for. It would be great to hear what strategies people are using to produce
the score. Here are a couple of examples from Wikidata:

> If the values are coordinates (specified in the "lat,lng" format on
> OpenRefine's side), then the matching score is 100 when they are equal and
> decreases as their distance increases. Currently a score of 0 is reached
> when the points are 1km away from each other.


> If the values are integers, exact equality between integers is used.
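
For comparison, a rule like the coordinates one is simple to sketch. The
snippet below reproduces the behaviour described in the first example; the
haversine distance and the linear falloff are my guesses at how it might
work, not how Wikidata actually implements it:

    import math

    def coordinate_score(lat1, lng1, lat2, lng2, cutoff_km=1.0):
        # Great-circle distance via the haversine formula.
        earth_radius_km = 6371.0
        phi1, phi2 = math.radians(lat1), math.radians(lat2)
        dphi = math.radians(lat2 - lat1)
        dlmb = math.radians(lng2 - lng1)
        a = (math.sin(dphi / 2) ** 2
             + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
        distance_km = 2 * earth_radius_km * math.asin(math.sqrt(a))
        # 100 for identical points, dropping linearly to 0 at the cutoff.
        return max(0.0, 100.0 * (1.0 - distance_km / cutoff_km))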


In our case, we are doing company entity reconciliation. We are
experimenting with parameters that include company name (the score varies
depending on how closely the query string matches the candidate), address,
active/inactive status, whether a company is a branch, and so on. Each
parameter has a weighting, and the final score is more or less a weighted
sum of those sub-scores; a rough sketch is below.
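
To make that concrete, here is a rough sketch of the shape of the
calculation; the parameter names, weights and the name-similarity measure
are placeholders, not what we actually use in production:

    from difflib import SequenceMatcher

    # Illustrative weights only.
    WEIGHTS = {"name": 0.6, "address": 0.25, "status": 0.1, "branch": 0.05}

    def name_similarity(query_name, candidate_name):
        # Placeholder similarity; a real service might use a token- or
        # trigram-based comparison instead.
        return SequenceMatcher(
            None, query_name.lower(), candidate_name.lower()).ratio()

    def company_score(query, candidate):
        # Each parameter contributes a sub-score in [0, 1]; the final
        # score is the weighted sum, scaled to 0-100.
        subscores = {
            "name": name_similarity(query["name"], candidate["name"]),
            "address": float(query.get("address") == candidate.get("address")),
            "status": 1.0 if candidate.get("active") else 0.0,
            "branch": 0.0 if candidate.get("is_branch") else 1.0,
        }
        total = sum(WEIGHTS[k] * v for k, v in subscores.items())
        return round(100 * total, 1), subscores

Keeping the sub-scores around, rather than only the weighted total, is also
what would let us explain the final number back to users.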

It would be really interesting to see what others are doing, even though I
understand the approaches might be very different depending on the case.
What does the score you produce mean exactly? Is it the confidence that a
particular entity is the one a user is looking for? Is it simply a relative
score indicating the "goodness" of the match compared to other candidates?
Or is it based on some very specific rules, like the Wikidata examples above?

Finally, as far as I can see there is nothing in the Reconciliation API that
offers score explainability. Of course, documentation for each particular
reconciliation service would likely be the primary mechanism for explaining
how the score is produced. But I'm wondering if there is value in baking
something like that directly into the Reconciliation API. Has this been
discussed? I am taking inspiration from Elasticsearch's `_explain` endpoint,
which produces a breakdown of how a document's score was computed:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-explain.html
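
To illustrate what I have in mind, a candidate in the reconciliation
response could carry a per-feature breakdown next to the overall score. In
the sketch below the usual id/name/score/match fields follow the existing
response format, while the `score_breakdown` field and all of its values
are entirely invented for the sake of the example:

    # Hypothetical only: nothing like "score_breakdown" exists in the
    # current Reconciliation API.
    candidate = {
        "id": "Q12345",
        "name": "Acme Holdings Ltd",
        "score": 87.5,
        "match": False,
        "score_breakdown": [
            {"feature": "name", "weight": 0.60, "value": 0.90, "contribution": 54.0},
            {"feature": "address", "weight": 0.25, "value": 1.00, "contribution": 25.0},
            {"feature": "status", "weight": 0.10, "value": 0.85, "contribution": 8.5},
            {"feature": "branch", "weight": 0.05, "value": 0.00, "contribution": 0.0},
        ],
    }

A client like OpenRefine could then render that breakdown next to each
candidate, which would answer both "why is the score what it is" and "what
would make it better".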

Thanks in advance and I hope you all are well and safe,

Ivan

Received on Sunday, 12 July 2020 14:00:21 UTC