- From: Thad Guidry <thadguidry@gmail.com>
- Date: Fri, 21 Jun 2019 07:35:39 -0500
- To: Antonin Delpeuch <antonin@delpeuch.eu>
- Cc: public-reconciliation@w3.org
- Message-ID: <CAChbWaMZ37ijGDw-n1Oc-7O4O1ouJjoeeTg3=ZY7-JXx0VOEPQ@mail.gmail.com>
I don't know all the details, since the scoring algorithms were never
publicly disclosed completely. However, yes, there were different scores
computed against popularity and relevancy, and many separate indexes
(with/without stemming, stopwords, etc.) were built and maintained in
Lucene and elsewhere by Andi Vajda to help with this. The relevancy
scoring was determined by the user, depending on whether they used
"entity", "schema", or "freebase":
https://developers.google.com/freebase/v1/search-cookbook#scoring-and-ranking

We had constraints by type and domain parameters:
https://developers.google.com/freebase/v1/search-cookbook
https://www.freebase.com/fictional_universe/fictional_organization?props=&lang=en&filter=%2Fcommon%2Ftopic%2Falias&all=true#/common/topic
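For example, if memory serves (the API is retired, so I can no longer
verify this), a search request could select the relevancy scoring and
constrain the results by type or domain with something like:

https://www.googleapis.com/freebase/v1/search?query=car&scoring=schema&filter=(all domain:/transportation)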
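And to make the custom Q-grams idea from my earlier email (quoted below)
a bit more concrete, here is a rough sketch of the kind of abbreviation
scoring I mean. It is plain Python, not tied to any particular service,
and the field names are simply borrowed from my JSON example below:

def qgrams(s, q=2):
    """Break a string into overlapping q-grams, padded at the edges."""
    s = "#" * (q - 1) + s.lower() + "#" * (q - 1)
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def qgram_score(query, candidate, q=2):
    """Dice coefficient over the two q-gram sets, scaled to 0-100."""
    a, b = qgrams(query, q), qgrams(candidate, q)
    if not a or not b:
        return 0
    return round(200 * len(a & b) / (len(a) + len(b)))

# Scoring a query abbreviation against candidate names, per Domain:
candidates = [
    {"name": "AA", "full_name": "Analogs and Derivatives", "domain": "medicine"},
    {"name": "AA", "full_name": "Alcoholics Anonymous", "domain": "organizations"},
]
for c in candidates:
    c["abbrev_match_score"] = qgram_score("AA", c["name"])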
Here are some of the many historical email threads (filtered against "andi"):
https://groups.google.com/forum/#!searchin/freebase-discuss/freebase$20andi$20scoring%7Csort:date
https://groups.google.com/forum/#!searchin/freebase-discuss/freebase$20andi$20schema%7Csort:date
https://groups.google.com/forum/#!searchin/freebase-discuss/freebase$20andi$20namespace%7Csort:date

I miss it so so much :( :(

Thad
https://www.linkedin.com/in/thadguidry/


On Fri, Jun 21, 2019 at 5:48 AM Antonin Delpeuch <antonin@delpeuch.eu> wrote:

> Thanks! So if I understand correctly, these domains made it possible to
> tweak the scoring mechanism in various areas of Freebase?
>
> How does this notion of "domains" relate to types?
>
> Antonin
>
> On 6/21/19 2:30 AM, Thad Guidry wrote:
>
> Wonderful paper Antonin! Read it thoroughly.
>
> So my thoughts... and a bit of historical context from Freebase (that
> never got implemented, but that you hinted at in your paper).
>
> Freebase had knowledge that was additionally classified in Domains.
> Domains were a first-class citizen, and you could even limit MQL search
> queries to particular domain(s).
> For instance, you had a term "car" and could see that there were multiple
> Domains where the name "car" appeared, such as Automobile, Transportation,
> Trains, and Insects.
> Domain areas were not shown or exposed for candidates. Neither was
> scoring or subscoring within a Domain context ever implemented.
> However, it was something David and I considered adding to OpenRefine but
> never did.
>
> Domains tend to have their own stopwords, abbreviation styles, and
> syntaxes. Many published datasets actually acknowledge this, such as
> NCBI's PubMed [1], Eurovoc subdata [2], etc.
> So I also think that optional scores for abbreviations, include/exclude
> stopwords, etc. could be useful for Facets to expose. Maybe even custom
> Q-grams against Abbreviations is something to consider per Domain, but I
> have no opinion or evidence on that.
>
> Essentially, variances in names, abbreviations, etc. are highly dependent
> on the database, where Domain trickery often plays an important role, and
> only the user will know where to place that importance.
> Putting all the options for scoring against a Domain's subdata on the
> table for the user becomes highly important.
>
> From an API perspective, it would make sense to have optional parameters
> for many things.
> I think the Wikidata API has "notable"? But regardless of the hierarchy,
> the additional info could look something like:
>
> [{
>   "id": "m456",
>   "name": "AA",
>   "full_name": "Analogs and Derivatives",
>   "domain": "medicine",
>   "abbrev_match_score": 100
> },
> {
>   "id": "o123",
>   "name": "AA",
>   "full_name": "Alcoholics Anonymous",
>   "domain": "organizations",
>   "abbrev_match_score": 95
> }]
>
> [1] - https://www.ncbi.nlm.nih.gov/books/NBK3827/table/pubmedhelp.T.stopwords/?report=objectonly
> [1] - https://www.ncbi.nlm.nih.gov/books/NBK3827/table/pubmedhelp.T.mesh_subheadings/?report=objectonly
> [1] - PHS 2-character grant abbreviation, e.g., LM
> [1] - institute acronym, e.g., NLM NIH HHS
> [2] - Sorry, lost my old links for disambiguating the Eurovoc vocabulary
> and subdata
>
> Thad
> https://www.linkedin.com/in/thadguidry/
>
>
> On Thu, Jun 20, 2019 at 11:14 AM Antonin Delpeuch <antonin@delpeuch.eu>
> wrote:
>
>> Hi all,
>>
>> I have just made public a short survey of the existing reconciliation
>> services and how their implementations relate to the techniques used in
>> record linkage (the corresponding academic field):
>>
>> https://arxiv.org/abs/1906.08092
>>
>> This is the outcome of a project done with OpenCorporates to plan
>> improvements to their own reconciliation service. The survey is therefore
>> focused on the perspective of the service provider, rather than the end
>> user, although the goal is of course to make the service more useful to
>> data consumers. I hope this can help start the discussions (and give some
>> definitions, as Ricardo suggested).
>>
>> I have outlined a few suggestions, mostly around the issue of scoring,
>> which reflect my general feeling about this API: at the moment, service
>> providers are a bit clueless about how to score reconciliation candidates,
>> and as a consequence users cannot really rely on them in general.
>>
>> It would be great to also study what users actually do with these
>> services, to better understand their workflows. I am not sure how to
>> approach that, given that OpenRefine workflows are generally not
>> published. One possibility would be to analyze the logs of existing
>> reconciliation services (such as the one for Wikidata). Let me know if
>> you are interested in that sort of project.
>>
>> Any feedback is welcome, of course!
>>
>> Cheers,
>>
>> Antonin
Received on Friday, 21 June 2019 12:36:17 UTC