Re: Survey of OpenRefine reconciliation services

Thanks! So, if I understand correctly, these domains made it possible to
tweak the scoring mechanism in various areas of Freebase?

How does this notion of "domains" relate to types?

Antonin

On 6/21/19 2:30 AM, Thad Guidry wrote:
> Wonderful paper Antonin!  Read it thoroughly.
>
> So my thoughts... and a bit of historical context from Freebase (that
> never got implemented, but that you hinted on in your paper).
>
> Freebase had knowledge that was additionally classified into Domains.
> Domains were first-class citizens, and you could even limit MQL search
> queries to a particular domain (or set of domains); see the rough
> sketch after this paragraph.
> For instance, you might have the term "car" and could see that the
> name "car" appeared in multiple Domains, such as Automobile,
> Transportation, Trains, and Insects.
> Domain areas were not shown or exposed for candidates, and scoring or
> subscoring within a Domain context was never implemented.
> However, it was something David and I considered adding to OpenRefine,
> but we never did.
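> 
> As a rough sketch from memory (the exact MQL syntax may well be off,
> and the service is long gone), a query asking which types -- and hence
> which Domains -- a topic named "Car" belongs to might have looked
> roughly like the Python dict below; the type ids are illustrative:
> 
> # Rough, hypothetical MQL-style query: ask for all types of topics
> # named "Car". The Domain is the leading path segment of each type id,
> # e.g. "/automotive/model" lives under the "/automotive" Domain.
> mql_query = [{
>     "name": "Car",
>     "type": [],      # return every type the matching topics carry
>     "id": None,
> }]
> 
> def domains_of(type_ids):
>     # Derive the Domain from each type id's leading path segment.
>     return sorted({"/" + t.strip("/").split("/")[0] for t in type_ids})
> 
> # e.g. domains_of(["/automotive/model", "/rail/rolling_stock_class"])
> #      -> ["/automotive", "/rail"]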
>
> Domains tend to have their own stopwords, abbreviation styles, and
> syntaxes.  Many published datasets actually acknowledge this, such as
> NCBI's PubMed [1], Eurovoc subs [2], etc.
> So I also think that optional scores for abbreviations,
> included/excluded stopwords, etc. could be useful for facets to expose.
> Maybe even custom q-grams against abbreviations are something to
> consider per Domain, but I have no opinion or evidence on that.
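> 
> Just to make that concrete, here is a minimal Python sketch of a
> per-Domain q-gram score against known abbreviations, after dropping
> Domain stopwords; the stopword list, abbreviation table, and q=2 are
> made up for illustration, not taken from PubMed or Eurovoc:
> 
> def qgrams(s, q=2):
>     # Character q-grams of a lowercased string.
>     s = s.lower()
>     return {s[i:i + q] for i in range(max(len(s) - q + 1, 1))}
> 
> def qgram_similarity(a, b, q=2):
>     # Jaccard overlap of the two q-gram sets, in [0, 1].
>     ga, gb = qgrams(a, q), qgrams(b, q)
>     return len(ga & gb) / len(ga | gb)
> 
> MEDICINE = {
>     "stopwords": {"and", "of", "the"},
>     "abbreviations": {"AA": "Analogs and Derivatives"},
> }
> 
> def strip_stopwords(name, domain):
>     return " ".join(w for w in name.split()
>                     if w.lower() not in domain["stopwords"])
> 
> def abbrev_match_score(cell, candidate_name, domain):
>     # Expand the cell value if it is a known Domain abbreviation,
>     # then compare q-grams of the stopword-stripped names.
>     expanded = domain["abbreviations"].get(cell, cell)
>     return round(100 * qgram_similarity(
>         strip_stopwords(expanded, domain),
>         strip_stopwords(candidate_name, domain)))
> 
> # abbrev_match_score("AA", "Analogs and Derivatives", MEDICINE) -> 100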
>
> Essentially, variances in names, abbreviations, etc. are highly
> dependent on the database, where Domain-specific trickery often plays
> an important role, and only the user will know where to place that
> importance.
> Putting all the scoring options for a Domain's subdata on the table
> for the user therefore becomes highly important.
>
> From an API perspective, it would make sense to have optional fields
> for many things.
> I think the Wikidata API has "notable"?  But regardless of the
> hierarchy, the additional info could look something like:
>
> [{
>   "id": "m456",
>   "name": "AA",
>   "full_name": "Analogs and Derivatives",
>   "domain": "medicine",
>   "abbrev_match_score": 100
> },
> {
>   "id": "o123",
>   "name": "AA",
>   "full_name": "Alcoholics Anonymous",
>   "domain": "organizations",
>   "abbrev_match_score": 95
> }]
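> 
> And, purely as a sketch of how a client like OpenRefine could consume
> those two hypothetical fields ("domain" and "abbrev_match_score" are
> the made-up fields from the JSON above, not part of today's
> reconciliation API), something like:
> 
> # Sketch only: rank candidates by a user-chosen Domain (e.g. picked
> # through a facet), falling back to the abbreviation match score.
> candidates = [
>     {"id": "m456", "name": "AA", "full_name": "Analogs and Derivatives",
>      "domain": "medicine", "abbrev_match_score": 100},
>     {"id": "o123", "name": "AA", "full_name": "Alcoholics Anonymous",
>      "domain": "organizations", "abbrev_match_score": 95},
> ]
> 
> def best_candidate(candidates, preferred_domain=None):
>     return max(candidates,
>                key=lambda c: (c["domain"] == preferred_domain,
>                               c["abbrev_match_score"]))
> 
> # best_candidate(candidates, preferred_domain="organizations")["id"]
> # -> "o123"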
>
> [1] PubMed stopwords:
>     https://www.ncbi.nlm.nih.gov/books/NBK3827/table/pubmedhelp.T.stopwords/?report=objectonly
>     MeSH subheadings:
>     https://www.ncbi.nlm.nih.gov/books/NBK3827/table/pubmedhelp.T.mesh_subheadings/?report=objectonly
>     PHS 2-character grant abbreviation, e.g., LM
>     Institute acronym, e.g., NLM NIH HHS
> [2] Sorry, lost my old links for disambiguating the Eurovoc vocabulary
>     and its subdata.
>
> Thad
> https://www.linkedin.com/in/thadguidry/
>
>
>
> On Thu, Jun 20, 2019 at 11:14 AM Antonin Delpeuch <antonin@delpeuch.eu> wrote:
>
>     Hi all,
>
>     I have just made public a short survey of the existing
>     reconciliation services and how their implementations relate to
>     the techniques used in record linkage (the corresponding academic
>     field):
>
>     https://arxiv.org/abs/1906.08092
>
>     This is the outcome of a project done with OpenCorporates to plan
>     improvements to their own reconciliation service. The survey is
>     therefore focused on the perspective of the service provider,
>     rather than the end user, although the goal is of course to make
>     the service more useful to data consumers. I hope this can help
>     start the discussions (and give some definitions, as Ricardo
>     suggested).
>
>     I have outlined a few suggestions, mostly around the issue of
>     scoring, which reflect my general feeling about this API: at the
>     moment service providers are a bit clueless about how to score
>     reconciliation candidates, and as a consequence users cannot
>     really rely on these scores in general.
>
>     It would be great to also study what users actually do with these
>     services, to better understand their workflows. I am not sure how to
>     approach that, given that OpenRefine workflows are generally not
>     published. One possibility would be to analyze the logs of existing
>     reconciliation services (such as the one for Wikidata). Let me
>     know if you are interested in that sort of project.
>
>     Any feedback is welcome of course!
>
>     Cheers,
>
>     Antonin
>
>

Received on Friday, 21 June 2019 10:48:43 UTC