Re: Survey of OpenRefine reconciliation services

Wonderful paper, Antonin!  I read it thoroughly.

So, my thoughts... and a bit of historical context from Freebase (ideas
that never got implemented, but that you hinted at in your paper).

Freebase additionally classified its knowledge into Domains.
Domains were first-class citizens, and you could even limit MQL search
queries to a particular domain (or set of domains).
For instance, given the term "car", you could see that the name "car"
appeared in multiple Domains, such as Automobile, Transportation,
Trains, and Insects.
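For historical flavor: restricting a lookup by domain could be done either
through the Search API's filter language or by constraining types in MQL
(types lived under domain paths).  This is from memory, so treat the exact
syntax as illustrative only:

```
# Search API: restrict matches for "car" to one domain (filter syntax from memory)
/search?query=car&filter=(any domain:/automotive)

# MQL: a type constraint plays a similar role, since types live under domains
[{ "id": null, "name": "car", "type": "/automotive/model" }]
```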
Domains were not shown or exposed for reconciliation candidates, though.
Neither was scoring or subscoring within a Domain context ever implemented.
It was something David and I considered adding to OpenRefine, but we never
did.

Domains tend to have their own stopwords, abbreviation styles, and
syntaxes.  Many published datasets actually acknowledge this, such as
NCBI's PubMed [1], Eurovoc's subdata [2], etc.
So I also think that optional scores for abbreviations, options to
include/exclude stopwords, etc. could be useful things for facets to
expose.  Maybe even custom q-grams against abbreviations, per Domain, are
worth considering, but I have no opinion or evidence on that.
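To make the q-gram idea concrete, here is a minimal Python sketch of
per-domain abbreviation scoring.  The `ABBREVS` table, the Dice-style
similarity, and the 0-100 scoring formula are all invented for
illustration; this is not anything Freebase or OpenRefine actually did:

```python
def qgrams(s, q=2):
    """Return the multiset of character q-grams of a string."""
    s = s.lower()
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def qgram_similarity(a, b, q=2):
    """Dice-style similarity over q-gram multisets, in [0, 1]."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga or not gb:
        return 0.0
    shared, rest = 0, list(gb)
    for g in ga:
        if g in rest:       # multiset intersection
            rest.remove(g)
            shared += 1
    return 2 * shared / (len(ga) + len(gb))

# Hypothetical per-domain abbreviation tables
ABBREVS = {
    "medicine": {"AA": "Analogs and Derivatives"},
    "organizations": {"AA": "Alcoholics Anonymous"},
}

def abbrev_match_score(term, candidate_name, domain):
    """Expand the candidate via its domain's abbreviation table,
    then score the better of the raw and expanded comparisons (0-100)."""
    expanded = ABBREVS.get(domain, {}).get(candidate_name, candidate_name)
    return round(100 * max(qgram_similarity(term, candidate_name),
                           qgram_similarity(term, expanded)))
```

The point is that the same surface form ("AA") scores differently
depending on which Domain's abbreviation table is consulted.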

Essentially, variances in names, abbreviations, etc. are highly dependent
on the database, where Domain-specific trickery often plays an important
role, and only the user will know where to place that importance.
Putting all the options on the table for the user when it comes to scoring
against a Domain's subdata therefore becomes highly important.

From an API perspective, it would make sense to make many of these fields
optional.
I think the Wikidata API has something like "notable"?  But regardless of
the hierarchy, the additional info could look something like:

[{
  "id": "m456",
  "name": "AA",
  "full_name": "Analogs and Derivatives",
  "domain": "medicine",
  "abbrev_match_score": 100
},
{
  "id": "o123",
  "name": "AA",
  "full_name": "Alcoholics Anonymous",
  "domain": "organizations",
  "abbrev_match_score": 95
}]
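If a service returned such optional fields, a client (or an OpenRefine
facet) could let the user decide how to weight them.  A hypothetical
re-ranking sketch in Python; the `rerank` function, its parameters, and
the domain boost value are all made up for illustration:

```python
# Candidates shaped like the example response above
candidates = [
    {"id": "m456", "name": "AA", "full_name": "Analogs and Derivatives",
     "domain": "medicine", "abbrev_match_score": 100},
    {"id": "o123", "name": "AA", "full_name": "Alcoholics Anonymous",
     "domain": "organizations", "abbrev_match_score": 95},
]

def rerank(cands, preferred_domain=None, abbrev_weight=1.0):
    """Order candidates by weighted abbreviation score,
    boosting a user-chosen domain."""
    def score(c):
        s = abbrev_weight * c.get("abbrev_match_score", 0)
        if preferred_domain and c.get("domain") == preferred_domain:
            s += 50  # arbitrary boost; only the user knows its importance
        return s
    return sorted(cands, key=score, reverse=True)

best = rerank(candidates, preferred_domain="organizations")[0]
```

With no domain preference the medicine entry wins on raw score; once the
user expresses a domain preference, the organizations entry comes out on
top, which is exactly the kind of control a facet could expose.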

[1] - PubMed stopwords:
https://www.ncbi.nlm.nih.gov/books/NBK3827/table/pubmedhelp.T.stopwords/?report=objectonly
and MeSH subheadings:
https://www.ncbi.nlm.nih.gov/books/NBK3827/table/pubmedhelp.T.mesh_subheadings/?report=objectonly
(see also the PHS 2-character grant abbreviations, e.g., LM, and institute
acronyms, e.g., NLM NIH HHS)
[2] - Sorry, I lost my old links for disambiguating the Eurovoc vocabulary
and its subdata.

Thad
https://www.linkedin.com/in/thadguidry/

On Thu, Jun 20, 2019 at 11:14 AM Antonin Delpeuch <antonin@delpeuch.eu>
wrote:

> Hi all,
>
> I have just made public a short survey of the existing reconciliation
> services and how their implementations relate to the techniques used in
> record linkage (the corresponding academic field):
>
> https://arxiv.org/abs/1906.08092
>
> This is the outcome of a project done with OpenCorporates to plan
> improvements to their own reconciliation service. The survey is therefore
> focused on the perspective of the service provider, rather than the end
> user, although the goal is of course to make the service more useful to
> data consumers. I hope this can help start the discussions (and give some
> definitions, as Ricardo suggested).
>
> I have outlined a few suggestions, mostly around the issue of scoring,
> which reflect my general feeling about this API: at the moment service
> providers are a bit clueless about how to score reconciliation candidates,
> and as a consequence users cannot really rely on them in general.
>
> It would be great to also study what users actually do with these
> services, to better understand their workflows. I am not sure how to
> approach that, given that OpenRefine workflows are generally not
> published. One possibility would be to analyze the logs of existing
> reconciliation services (such as the one for Wikidata). Let me know if you
> are interested in that sort of project.
>
> Any feedback is welcome of course!
>
> Cheers,
>
> Antonin
>
>
>

Received on Friday, 21 June 2019 01:30:50 UTC