- From: Antonin Delpeuch <antonin@delpeuch.eu>
- Date: Fri, 21 Jun 2019 11:48:18 +0100
- To: public-reconciliation@w3.org
- Message-ID: <ed82c47d-89c6-f1fb-6f85-21e1d882d49c@delpeuch.eu>
Thanks! So if I understand correctly, these domains made it possible to
tweak the scoring mechanism in various areas of Freebase?
How does this notion of "domains" relate to types?
Antonin
On 6/21/19 2:30 AM, Thad Guidry wrote:
> Wonderful paper, Antonin! Read it thoroughly.
>
> So my thoughts... and a bit of historical context from Freebase
> (something that never got implemented, but that you hinted at in your paper).
>
> Freebase knowledge was additionally classified into Domains.
> Domains were first-class citizens, and you could even limit MQL search
> queries to one or more particular domains.
> For instance, for a term like "car" you could see that the name
> appeared in multiple Domains, such as Automobile, Transportation,
> Trains, and Insects.
> Domain areas were not shown or exposed for candidates, and neither
> scoring nor subscoring within a Domain context was ever implemented.
> It was, however, something David and I considered adding to OpenRefine
> but never did.
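>
> To illustrate those domain-limited queries, here is a rough Python
> sketch (the Freebase API was retired in 2016, so the endpoint and the
> exact filter syntax are from memory and may be off):
>
>     import requests
>
>     # Search for "car", restricted to one domain via the Search API's
>     # filter expression language (as I remember it).
>     resp = requests.get(
>         "https://www.googleapis.com/freebase/v1/search",
>         params={"query": "car", "filter": "(any domain:/automotive)"},
>     )
>     for result in resp.json().get("result", []):
>         print(result["mid"], result["name"], result.get("score"))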
>
> Domains tend to have their own stopwords, abbreviation styles, and
> syntaxes. Many published datasets actually acknowledge this, such as
> NCBI's PubMed[1], Eurovoc subdata[2], etc.
> So I also think that optional scores for abbreviations,
> include/exclude stopwords, etc. could be useful for Facets to expose.
> Maybe even custom Q-grams against abbreviations are something to
> consider per Domain, but I have no opinion or evidence on that.
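>
> To make the Q-gram idea concrete, a minimal sketch (plain Python,
> nothing Freebase-specific; the padding and scoring choices are just
> one possible design) could look like:
>
>     # Dice coefficient over character q-grams: one way a per-domain
>     # abbreviation subscore could be computed.
>     def qgrams(s, q=2):
>         s = "#" + s.lower() + "#"  # pad so word edges count
>         return {s[i:i + q] for i in range(len(s) - q + 1)}
>
>     def qgram_score(a, b, q=2):
>         ga, gb = qgrams(a, q), qgrams(b, q)
>         return 2 * len(ga & gb) / (len(ga) + len(gb))
>
>     print(qgram_score("AA", "AA"))                  # 1.0
>     print(qgram_score("NLM", "natl lib medicine"))  # low overlap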
>
> Essentially, variance in names, abbreviations, etc. is highly
> database-dependent; domain-specific quirks often play an important
> role, and only the user will know where to place that importance.
> Putting every scoring option against the Domain subdata on the table
> for the user therefore becomes highly important.
>
> From an API perspective, it would make sense to make many of these
> things optional.
> I think the Wikidata API has "notable"? But regardless of the
> hierarchy, the additional info could look something like:
>
> [{
>     "id": "m456",
>     "name": "AA",
>     "full_name": "Analogs and Derivatives",
>     "domain": "medicine",
>     "abbrev_match_score": 100
> },
> {
>     "id": "o123",
>     "name": "AA",
>     "full_name": "Alcoholics Anonymous",
>     "domain": "organizations",
>     "abbrev_match_score": 95
> }]
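>
> On the consuming side, a client (OpenRefine or otherwise) could then
> facet or filter on those extra fields. A hypothetical sketch, using
> the field names from the example above:
>
>     # Pick the best candidate within a given domain; field names are
>     # the hypothetical ones sketched above.
>     def best_in_domain(candidates, domain, min_score=90):
>         in_domain = [c for c in candidates
>                      if c.get("domain") == domain
>                      and c.get("abbrev_match_score", 0) >= min_score]
>         return max(in_domain,
>                    key=lambda c: c["abbrev_match_score"],
>                    default=None)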
>
> [1] PubMed domain-specific subdata, e.g.:
>     - stopwords: https://www.ncbi.nlm.nih.gov/books/NBK3827/table/pubmedhelp.T.stopwords/?report=objectonly
>     - MeSH subheadings: https://www.ncbi.nlm.nih.gov/books/NBK3827/table/pubmedhelp.T.mesh_subheadings/?report=objectonly
>     - PHS 2-character grant abbreviation, e.g., LM
>     - institute acronym, e.g., NLM NIH HHS
> [2] Sorry, lost my old links for disambiguating Eurovoc vocabulary
> and subdata
>
> Thad
> https://www.linkedin.com/in/thadguidry/
>
>
>
> On Thu, Jun 20, 2019 at 11:14 AM Antonin Delpeuch
> <antonin@delpeuch.eu> wrote:
>
> Hi all,
>
> I have just made public a short survey of the existing
> reconciliation services and how their implementations relate to
> the techniques used in record linkage (the corresponding academic
> field):
>
> https://arxiv.org/abs/1906.08092
>
> This is the outcome of a project done with OpenCorporates to plan
> improvements to their own reconciliation service. The survey is
> therefore focused on the perspective of the service provider,
> rather than the end user, although the goal is of course to make
> the service more useful to data consumers. I hope this can help
> start the discussions (and give some definitions, as Ricardo
> suggested).
>
> I have outlined a few suggestions, mostly around the issue of
> scoring, which reflect my general feeling about this API: at the
> moment service providers are a bit clueless about how to score
> reconciliation candidates, and as a consequence users cannot
> really rely on those scores in general.
>
> It would be great to also study what users actually do with these
> services, to better understand their workflows. I am not sure how to
> approach that, given that OpenRefine workflows are generally not
> published. One possibility would be to analyze the logs of existing
> reconciliation services (such as the one for Wikidata). Let me
> know if you are interested in that sort of project.
>
> Any feedback is welcome, of course!
>
> Cheers,
>
> Antonin
>
>
Received on Friday, 21 June 2019 10:48:43 UTC