- From: Tom Morris <tfmorris@gmail.com>
- Date: Sun, 12 Jul 2020 21:51:05 -0400
- To: Ivan Bashkirov <ivan.bashkirov@opencorporates.com>
- Cc: public-reconciliation@w3.org
- Message-ID: <CAE9vqEEG_GD+-cPNXVku_SG-=V_mRcjigVoC7BAMVAud7bxbLg@mail.gmail.com>
Hi Ivan,

On Sun, Jul 12, 2020 at 10:00 AM Ivan Bashkirov <ivan.bashkirov@opencorporates.com> wrote:

> Hi all, I have a question about approaches services are using to produce a
> reconciliation score that is meaningful to the end users.
>
> Crucially, we want the users to know why the score is what it is, and how
> they can make it better. As I understand, most reconciliation services
> produce a somewhat abstract score from 0 to 100 that roughly translates as
> "confidence", or "probability" that the result is the one a user is looking
> for. It would be great to hear what strategies people are using to produce
> the score.
> ...
> In our case, we are doing company entity reconciliation. We are
> experimenting with parameters that include company name (score varies
> depending on how closely the query string is matching the candidate),
> address, active/inactive status, whether a company is a branch or not and
> so on. Each parameter has a weighting and the final score is more or less a
> weighted sum of those.

A weighted/scaled distance metric is pretty typical. Obviously the weights are of critical importance. I think there are a few different things that are valuable to convey to the user, if possible:

- Ranking of the returned choices - this depends only on relative scores, not their absolute values
- Confusable candidates - it's valuable if the relative scores help distinguish cases that might require more careful checking from those that can be automatically trusted
- Low quality candidates - it's valuable to have some type of threshold, whether it be fixed or something that the users learn based on their experience

> Finally, as far as I can see there is nothing in Reconciliation API that
> offers score explainability. Of course documentation for each particular
> reconciliation service would likely be the primary mechanism of explaining
> how the score is produced. But I'm wondering if there is value of baking
> something like that directly into Reconciliation API. Has this been
> discussed? I am getting inspiration from Elasticsearch `_analyze` endpoint
> which produces a breakdown of the score.
> https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html

I read that as more an explanation of the transformation/normalization pipeline than of the scoring mechanism. This makes sense for Elasticsearch, because you can construct chains of transformations which are hidden in the background. For scoring, however, they take a different approach and put the power in the users' hands by allowing them to construct complex queries with their own weighting algorithms embedded. I suspect that's too sophisticated for most users of reconciliation services, but perhaps there are simple controls, like choosing among exact, prefix, and approximate string matches.

I'll be interested to hear the kinds of scoring metrics people have implemented. My gut feeling is that most of them are pretty basic.

Tom
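To make the weighted-sum idea in the thread concrete, here is a minimal sketch of a scorer that also returns a per-feature breakdown, which is one possible way to explain a score to a user. Everything in it is an assumption for illustration: the feature names, the weights, the record layout, and the choice of `difflib.SequenceMatcher` as the string similarity. It is not how OpenCorporates or any particular reconciliation service computes its scores.

```python
from difflib import SequenceMatcher

# Hypothetical weights (they sum to 1.0); a real service would tune its own.
WEIGHTS = {"name": 0.6, "address": 0.25, "active": 0.1, "not_branch": 0.05}


def similarity(a, b):
    """Approximate string similarity in [0, 1]; 0.0 if either side is missing."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score_candidate(query, candidate):
    """Return (score on a 0-100 scale, per-feature contribution to that score)."""
    features = {
        "name": similarity(query.get("name"), candidate.get("name")),
        "address": similarity(query.get("address"), candidate.get("address")),
        "active": 1.0 if candidate.get("active") else 0.0,
        "not_branch": 0.0 if candidate.get("branch") else 1.0,
    }
    # Each feature contributes weight * feature value, scaled to 0-100.
    breakdown = {k: WEIGHTS[k] * v * 100 for k, v in features.items()}
    return sum(breakdown.values()), breakdown


def rank(query, candidates):
    """Sort candidates by descending score, keeping the breakdown alongside."""
    scored = [(c, *score_candidate(query, c)) for c in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Because the breakdown sums to the final score, a client could show which features pulled a candidate up or down. The gap between the top two scores returned by `rank` gives a rough signal for confusable candidates, and a cutoff on the absolute score could serve as the low-quality threshold.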
Received on Monday, 13 July 2020 01:51:29 UTC