Re: Survey of OpenRefine reconciliation services

I don't know all the details, since the scoring algorithms were never
fully disclosed publicly.
However, yes: there were different scores computed for popularity and
relevancy, and many separate indexes (with/without stemming, stopwords,
etc.) were built and maintained in Lucene and elsewhere by Andi Vajda to
support this.
Which relevancy scoring was applied was chosen by the user: "entity",
"schema", or "freebase".
https://developers.google.com/freebase/v1/search-cookbook#scoring-and-ranking


We could constrain queries by type and domain parameters.
https://developers.google.com/freebase/v1/search-cookbook
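As a rough illustration of those type and domain constraints: the Freebase
Search API was retired in 2016, so none of this can be run against the real
service any more, but the parameter names below ("query", "filter",
"scoring") follow the cookbook linked above. The helper function itself is
hypothetical, just a sketch of how such a request was assembled:

```python
from urllib.parse import urlencode

# The old Search API endpoint (no longer live; kept for illustration).
BASE = "https://www.googleapis.com/freebase/v1/search"

def build_search_url(query, type_id=None, domain=None, scoring="entity"):
    """Build a Freebase-style search URL constrained by type and/or domain.

    `scoring` took one of "entity", "schema", or "freebase" per the
    search cookbook; constraints were expressed in a small s-expression
    filter language, e.g. (all (any domain:/automotive)).
    """
    filters = []
    if type_id:
        filters.append("(any type:%s)" % type_id)
    if domain:
        filters.append("(any domain:%s)" % domain)
    params = {"query": query, "scoring": scoring}
    if filters:
        # Multiple constraints were combined with the (all ...) operator.
        params["filter"] = "(all %s)" % " ".join(filters)
    return BASE + "?" + urlencode(params)

print(build_search_url("car", domain="/automotive", scoring="schema"))
```

The same filter language also supported operands like name: and alias:,
per that cookbook.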

https://www.freebase.com/fictional_universe/fictional_organization?props=&lang=en&filter=%2Fcommon%2Ftopic%2Falias&all=true#/common/topic

Here are some of the many historical email threads (searches filtered on "andi"):
https://groups.google.com/forum/#!searchin/freebase-discuss/freebase$20andi$20scoring%7Csort:date

https://groups.google.com/forum/#!searchin/freebase-discuss/freebase$20andi$20schema%7Csort:date
https://groups.google.com/forum/#!searchin/freebase-discuss/freebase$20andi$20namespace%7Csort:date


I miss it so so much :( :(
Thad
https://www.linkedin.com/in/thadguidry/


On Fri, Jun 21, 2019 at 5:48 AM Antonin Delpeuch <antonin@delpeuch.eu>
wrote:

> Thanks! So if I understand correctly these domains made it possible to
> tweak the scoring mechanism in various areas of Freebase?
>
> How does this notion of "domains" relate to types?
>
> Antonin
> On 6/21/19 2:30 AM, Thad Guidry wrote:
>
> Wonderful paper Antonin!  Read it thoroughly.
>
> So, my thoughts... and a bit of historical context from Freebase (which
> never got implemented, but which you hinted at in your paper).
>
> Freebase knowledge was additionally classified into Domains.
> Domains were first-class citizens, and you could even limit MQL search
> queries to particular domain(s).
> For instance, given the term "car", you could see that the name appeared
> in multiple Domains, such as Automobile, Transportation, Trains, and
> Insects.
> Domain areas were not shown or exposed for candidates, and scoring or
> subscoring within a Domain context was never implemented either.
> It was something David and I considered adding to OpenRefine, but we
> never did.
>
> Domains tend to have their own stopwords, abbreviation styles, and
> syntaxes.  Many published datasets actually acknowledge this, such as
> NCBI's PubMed [1], Eurovoc subs [2], etc.
> So I also think that optional scores for abbreviations, for including or
> excluding stopwords, etc. could be useful for Facets to expose.  Maybe
> even custom q-grams against abbreviations are something to consider per
> Domain, but I have no opinion or evidence on that.
>
> Essentially, variances in names, abbreviations, etc. are highly dependent
> on the database, where Domain trickery often plays an important role, and
> only the user will know where to place that importance.
> Giving the user all the options on the table when it comes to scoring
> against a Domain's subdata becomes highly important.
>
> From an API perspective, it would make sense to have optional fields for
> many things.
> I think the Wikidata API has "notable"?  But regardless of the hierarchy,
> the additional info could look something like:
>
> [{
>   "id": "m456",
>   "name": "AA",
>   "full_name": "Analogs and Derivatives",
>   "domain": "medicine",
>   "abbrev_match_score": 100
> },
> {
>   "id": "o123",
>   "name": "AA",
>   "full_name": "Alcoholics Anonymous",
>   "domain": "organizations",
>   "abbrev_match_score": 95
> }]
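> To make the q-gram idea concrete, here is a minimal sketch (hypothetical
> helper names, not part of any existing reconciliation API).  Note how a
> plain q-gram overlap scores an abbreviation such as "AA" very low against
> its expansion, which is exactly why Domain-specific abbreviation handling
> would be needed on top of it:

```python
# Hypothetical sketch: Dice-style q-gram overlap, scaled to 0..100 to
# match the "abbrev_match_score" field in the JSON example above.

def qgrams(s, q=2):
    """Multiset of q-grams of s, with '#' padding at the boundaries."""
    padded = "#" * (q - 1) + s.lower() + "#" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

def qgram_score(a, b, q=2):
    """Dice coefficient over q-gram multisets, as an integer 0..100."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    shared, rest = 0, list(gb)
    for g in ga:
        if g in rest:          # multiset intersection
            rest.remove(g)
            shared += 1
    return round(200 * shared / (len(ga) + len(gb)))

print(qgram_score("AA", "AA"))                    # identical strings score 100
print(qgram_score("AA", "Alcoholics Anonymous"))  # expansion scores near zero
```

> (A real abbrev_match_score would presumably expand or generate
> abbreviations per Domain before comparing q-grams.)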
>
> [1] - PubMed stopwords:
> https://www.ncbi.nlm.nih.gov/books/NBK3827/table/pubmedhelp.T.stopwords/?report=objectonly
> PubMed MeSH subheadings:
> https://www.ncbi.nlm.nih.gov/books/NBK3827/table/pubmedhelp.T.mesh_subheadings/?report=objectonly
> PHS 2-character grant abbreviations, e.g., LM
> Institute acronyms, e.g., NLM NIH HHS
> [2] - Sorry, lost my old links for disambiguating Eurovoc vocabulary and
> subdata
>
> Thad
> https://www.linkedin.com/in/thadguidry/
>
>
> On Thu, Jun 20, 2019 at 11:14 AM Antonin Delpeuch <antonin@delpeuch.eu>
> wrote:
>
>> Hi all,
>>
>> I have just made public a short survey of the existing reconciliation
>> services and how their implementations relate to the techniques used in
>> record linkage (the corresponding academic field):
>>
>> https://arxiv.org/abs/1906.08092
>>
>> This is the outcome of a project done with OpenCorporates to plan
>> improvements to their own reconciliation service. The survey is therefore
>> focused on the perspective of the service provider, rather than the end
>> user, although the goal is of course to make the service more useful to
>> data consumers. I hope this can help start the discussions (and give some
>> definitions, as Ricardo suggested).
>>
>> I have outlined a few suggestions, mostly around the issue of scoring,
>> which reflect my general feeling about this API: at the moment service
>> providers are a bit clueless about how to score reconciliation candidates,
>> and as a consequence users cannot really rely on them in general.
>>
>> It would be great to also study what users actually do with these
>> services, to better understand their workflows. I am not sure how to
>> approach that, given that OpenRefine workflows are generally not
>> published. One possibility would be to analyze the logs of existing
>> reconciliation services (such as the one for Wikidata). Let me know if
>> you are interested in that sort of project.
>>
>> Any feedback is welcome of course!
>>
>> Cheers,
>>
>> Antonin
>>
>>
>>

Received on Friday, 21 June 2019 12:36:17 UTC