- From: Thad Guidry <thadguidry@gmail.com>
- Date: Sun, 12 Jul 2020 13:47:15 -0500
- To: Ivan Bashkirov <ivan.bashkirov@opencorporates.com>
- Cc: public-reconciliation@w3.org
- Message-ID: <CAChbWaPf0BEZKLBcFxh17wXoDk0NSgEcmLvGQ0jL4h+0Cz7Tfg@mail.gmail.com>
As we have mentioned before, Names are ambiguous <https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation#reconciling-with-additional-columns>, so giving users (in clients) more control for boosting scores with rules or scripts will be useful.

I will let others comment on various strategies, but one effective one is boosting scores where Numbers play a part <https://github.com/wetneb/openrefine-wikibase/blob/master/wdreconcile/engine.py>, since they are often not as ambiguous. So things like Dates, GUIDs, Keys, Apartment #s, etc. Speaking of Apartment #s, often a String can be broken down into its many parts, and sometimes some of those more granular parts should affect the score.

In general, what we saw in Freebase was that the more granular the data or sub-data, the more effective it was at disambiguation, since uniqueness oftentimes increased with granularity. For example, it is useful to know which parts of a String are the most granular:

1001 Mill Lane Apt 120, Chicago, Illinois, USA
1001 Mill Lane Apt 102, Chicago, Illinois, USA
Apple iPhone 10s
Apple iPhone 10

The best systems understand this granularity automatically and often go way beyond simple tokenization to determine the parts: the tokens themselves are inspected to determine the highest granularity in a String pattern, and the score is automatically adjusted based on submatching within. In the examples above, the most granular disambiguating tokens are the Apt number and the individual model #.

Oftentimes, however, the Strings to reconcile are not granular enough. The best services will know this and adjust. The best clients will know this and offer hints to the user to partition their data into the most granular fields.
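To make that concrete, here is a rough sketch of what granularity-aware boosting could look like. This is illustrative only; the digit-based heuristic for spotting granular tokens and the boost/penalty values are assumptions, not how any particular service implements it.

```python
# Sketch: adjust a fuzzy match score based on the most granular tokens.
# The "granular = contains a digit" heuristic and the +10/-40 adjustments
# are assumptions for illustration only.
import re
from difflib import SequenceMatcher


def tokenize(s: str) -> list[str]:
    """Split on whitespace and commas, lowercased."""
    return [t for t in re.split(r"[\s,]+", s.lower()) if t]


def granular_tokens(tokens: list[str]) -> set[str]:
    """Treat tokens containing digits (Apt numbers, model numbers, keys)
    as the most granular, least ambiguous parts of the string."""
    return {t for t in tokens if any(c.isdigit() for c in t)}


def score(query: str, candidate: str) -> float:
    """Base fuzzy score (0-100), boosted when the granular tokens agree
    and penalized in proportion to how many of them disagree."""
    base = 100 * SequenceMatcher(None, query.lower(), candidate.lower()).ratio()
    q_gran = granular_tokens(tokenize(query))
    c_gran = granular_tokens(tokenize(candidate))
    if q_gran or c_gran:
        if q_gran == c_gran:
            base = min(100.0, base + 10)  # granular parts match exactly: small boost
        else:
            mismatch = len(q_gran ^ c_gran) / max(len(q_gran | c_gran), 1)
            base = max(0.0, base - 40 * mismatch)  # granular parts conflict: penalty
    return base


# "Apt 120" vs "Apt 102" are nearly identical as plain strings, but the
# conflicting apartment numbers pull the score down here; same for 10s vs 10.
print(score("1001 Mill Lane Apt 120, Chicago, Illinois, USA",
            "1001 Mill Lane Apt 102, Chicago, Illinois, USA"))
print(score("Apple iPhone 10s", "Apple iPhone 10"))
```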
I hope this helps developers both in clients and services.

Thad
https://www.linkedin.com/in/thadguidry/

On Sun, Jul 12, 2020 at 9:00 AM Ivan Bashkirov <ivan.bashkirov@opencorporates.com> wrote:

> Hi all, I have a question about the approaches services are using to produce a
> reconciliation score that is meaningful to the end users.
>
> Crucially, we want the users to know why the score is what it is, and how
> they can make it better. As I understand it, most reconciliation services
> produce a somewhat abstract score from 0 to 100 that roughly translates as
> "confidence", or the "probability", that the result is the one a user is
> looking for. It would be great to hear what strategies people are using to
> produce the score. Here are a couple of examples from Wikidata:
>
>> If the values are coordinates (specified in the "lat,lng" format on
>> OpenRefine's side), then the matching score is 100 when they are equal and
>> decreases as their distance increases. Currently a score of 0 is reached
>> when the points are 1 km away from each other.
>
>> If the values are integers, exact equality between integers is used.
>
> In our case, we are doing company entity reconciliation. We are
> experimenting with parameters that include company name (the score varies
> depending on how closely the query string matches the candidate), address,
> active/inactive status, whether a company is a branch or not, and so on.
> Each parameter has a weighting and the final score is more or less a
> weighted sum of those.
>
> It would be really interesting to see what others are doing, even though I
> understand the approaches might be very different depending on the case.
> What does the score you produce mean exactly? Is it the confidence that a
> particular entity is the one a user is looking for? Is it simply a relative
> score showing the "goodness" of the match relative to other candidates? Or
> is it based on some very specific rules like the Wikidata examples above?
>
> Finally, as far as I can see there is nothing in the Reconciliation API that
> offers score explainability. Of course, the documentation for each particular
> reconciliation service would likely be the primary mechanism for explaining
> how the score is produced. But I'm wondering if there is value in baking
> something like that directly into the Reconciliation API. Has this been
> discussed? I am getting inspiration from the Elasticsearch `_analyze`
> endpoint, which produces a breakdown of the score.
> https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html
>
> Thanks in advance, and I hope you all are well and safe,
>
> Ivan
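For a concrete picture of the weighted-sum approach Ivan describes, here is a minimal sketch that combines a fuzzy name score, the coordinate rule from the Wikidata example (100 when the points coincide, falling to 0 at 1 km), and a per-feature breakdown of the kind a score-explainability response could expose. The feature names, weights, and the shape of the breakdown are assumptions for illustration only; nothing like this is defined in the Reconciliation API today.

```python
# Sketch: weighted-sum candidate scoring with a per-feature breakdown.
# Feature names, weights, and the output shape are illustrative assumptions.
import math
from difflib import SequenceMatcher

WEIGHTS = {"name": 0.6, "location": 0.3, "status": 0.1}  # assumed weights


def name_score(query_name: str, candidate_name: str) -> float:
    """Fuzzy name similarity on a 0-100 scale."""
    return 100 * SequenceMatcher(None, query_name.lower(), candidate_name.lower()).ratio()


def coordinate_score(a: tuple[float, float], b: tuple[float, float]) -> float:
    """100 when the points coincide, falling linearly to 0 at 1 km apart
    (the rule described for the Wikidata service above)."""
    lat1, lng1 = map(math.radians, a)
    lat2, lng2 = map(math.radians, b)
    # Haversine distance in km.
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    dist_km = 2 * 6371 * math.asin(math.sqrt(h))
    return max(0.0, 100 * (1 - dist_km / 1.0))


def score_candidate(query: dict, candidate: dict) -> dict:
    """Return a weighted total plus a per-feature breakdown, so a client
    could show the user why the score is what it is."""
    features = {
        "name": name_score(query["name"], candidate["name"]),
        "location": coordinate_score(query["latlng"], candidate["latlng"]),
        "status": 100.0 if query["active"] == candidate["active"] else 0.0,
    }
    total = sum(WEIGHTS[f] * s for f, s in features.items())
    return {
        "score": round(total, 1),
        "features": [{"name": f, "score": round(s, 1), "weight": WEIGHTS[f]}
                     for f, s in features.items()],
    }


print(score_candidate(
    {"name": "Acme Widgets Ltd", "latlng": (51.5074, -0.1278), "active": True},
    {"name": "ACME WIDGETS LIMITED", "latlng": (51.5101, -0.1340), "active": True}))
```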
Received on Sunday, 12 July 2020 18:47:38 UTC