- From: Antonin Delpeuch <antonin@delpeuch.eu>
- Date: Fri, 21 Jun 2019 11:48:18 +0100
- To: public-reconciliation@w3.org
- Message-ID: <ed82c47d-89c6-f1fb-6f85-21e1d882d49c@delpeuch.eu>
Thanks! So if I understand correctly, these domains made it possible to tweak the scoring mechanism in various areas of Freebase? How does this notion of "domains" relate to types?

Antonin

On 6/21/19 2:30 AM, Thad Guidry wrote:
> Wonderful paper Antonin! Read it thoroughly.
>
> So, my thoughts... and a bit of historical context from Freebase (that never got implemented, but that you hinted at in your paper).
>
> Freebase had knowledge that was additionally classified in Domains. Domains were a first-class citizen, and you could even limit MQL search queries to particular domain(s).
> For instance, you had a term "car" and could see that there were multiple Domains where the name "car" appeared, like Automobile, Transportation, Trains, and Insects.
> Domain areas were not shown or exposed for candidates. Neither was scoring or subscoring within a Domain context ever implemented. However, it was something David and I considered adding to OpenRefine but never did.
>
> Domains tend to have their own stopwords, abbreviation styles, and syntaxes. Many published datasets actually acknowledge this, such as NCBI's PubMed [1], Eurovoc subs [2], etc.
> So I also think that optional scores for abbreviations, include/exclude stopwords, etc. could be useful for Facets to expose. Maybe even custom Q-grams against abbreviations are something to consider per Domain, but I have no opinion or evidence on that.
>
> Essentially, variances in names, abbreviations, etc. are highly dependent on the database, where oftentimes Domain trickery plays an important role, and only the user will know where to place that importance.
> Giving the user all the options on the table when it comes to scoring against the Domain's subdata becomes highly important.
>
> From an API perspective, it would make sense to have optional fields for many things.
> I think the Wikidata API has "notable"? But regardless of the hierarchy, the additional info could look something like:
>
> [{
>    "id": "m456",
>    "name": "AA",
>    "full_name": "Analogs and Derivatives",
>    "domain": "medicine",
>    "abbrev_match_score": 100
>  },
>  {
>    "id": "o123",
>    "name": "AA",
>    "full_name": "Alcoholics Anonymous",
>    "domain": "organizations",
>    "abbrev_match_score": 95
>  }]
>
> [1] https://www.ncbi.nlm.nih.gov/books/NBK3827/table/pubmedhelp.T.stopwords/?report=objectonly (PubMed stopwords)
>     https://www.ncbi.nlm.nih.gov/books/NBK3827/table/pubmedhelp.T.mesh_subheadings/?report=objectonly (MeSH subheadings)
>     PHS 2-character grant abbreviation, e.g., LM
>     institute acronym, e.g., NLM NIH HHS
> [2] Sorry, I lost my old links for disambiguating the Eurovoc vocabulary and subdata.
>
> Thad
> https://www.linkedin.com/in/thadguidry/
>
> On Thu, Jun 20, 2019 at 11:14 AM Antonin Delpeuch <antonin@delpeuch.eu> wrote:
>
> > Hi all,
> >
> > I have just made public a short survey of the existing reconciliation services and how their implementations relate to the techniques used in record linkage (the corresponding academic field):
> >
> > https://arxiv.org/abs/1906.08092
> >
> > This is the outcome of a project done with OpenCorporates to plan improvements to their own reconciliation service. The survey is therefore focused on the perspective of the service provider, rather than the end user, although the goal is of course to make the service more useful to data consumers. I hope this can help start the discussions (and give some definitions, as Ricardo suggested).
> > I have outlined a few suggestions, mostly around the issue of scoring, which reflect my general feeling about this API: at the moment, service providers are a bit clueless about how to score reconciliation candidates, and as a consequence users cannot really rely on them in general.
> >
> > It would be great to also study what users actually do with these services, to better understand their workflows. I am not sure how to approach that, given that OpenRefine workflows are generally not published. One possibility would be to analyze the logs of existing reconciliation services (such as the one for Wikidata). Let me know if you are interested in that sort of project.
> >
> > Any feedback is welcome, of course!
> >
> > Cheers,
> >
> > Antonin
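A minimal sketch, in Python, of the per-Domain q-gram abbreviation scoring floated in the thread above. Everything here is hypothetical: the function names, the 0-100 scale (chosen only to mirror the "abbrev_match_score" values in Thad's example payload), and the use of Jaccard similarity over q-gram sets are illustrative choices, not part of any existing Freebase or Reconciliation API behaviour.

def qgrams(s, q=2):
    """Return the set of q-grams of s, padded so that very short
    strings (typical for abbreviations) still produce grams."""
    if not s:
        return set()
    pad = "#" * (q - 1)
    padded = pad + s.lower() + pad
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def qgram_score(query, candidate, q=2):
    """Jaccard similarity of the two q-gram sets, scaled to 0-100."""
    a, b = qgrams(query, q), qgrams(candidate, q)
    if not (a or b):
        return 0
    return round(100 * len(a & b) / len(a | b))

def score_candidates(query, candidates, domain=None):
    """Attach an abbrev_match_score to each candidate, optionally
    restricting the match to a single Domain (analogous to limiting
    MQL queries to a domain), and return them best match first."""
    scored = [
        {**c, "abbrev_match_score": qgram_score(query, c["name"])}
        for c in candidates
        if domain is None or c["domain"] == domain
    ]
    return sorted(scored, key=lambda c: c["abbrev_match_score"], reverse=True)

# Example, reusing the candidates from the payload above:
candidates = [
    {"id": "m456", "name": "AA", "full_name": "Analogs and Derivatives",
     "domain": "medicine"},
    {"id": "o123", "name": "AA", "full_name": "Alcoholics Anonymous",
     "domain": "organizations"},
]
print(score_candidates("AA", candidates, domain="organizations"))

A real service would presumably blend this q-gram signal with other optional subscores (per-Domain stopword handling, full-name matching, and so on), which is exactly the kind of user-tunable, Domain-scoped weighting the message above argues for.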
Received on Friday, 21 June 2019 10:48:43 UTC