Re: Survey of OpenRefine reconciliation services

In my experience, a user often needs to provide some "context" or global
parameter that applies to the whole reconciliation run, e.g. the local
country (for place reconciliation) or the jurisdiction (for company
reconciliation).

The current protocol doesn't support that, so e.g. the OpenCorporates
reconciliation service takes the jurisdiction as part of the base service
URL.
"Global parameters" would be a promising extension of the protocol.
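
To make the difference concrete, here is a rough sketch of the two request
shapes. It only builds the requests and sends nothing. The base URL pattern is
my assumption of how the jurisdiction is placed in the path, and the
"global_params" field and both function names are hypothetical: nothing like
them exists in the current protocol.

```python
import json

# Assumed base URL; the jurisdiction code is appended as a path segment.
BASE = "https://opencorporates.com/reconcile"


def request_with_url_context(name, jurisdiction):
    """Current workaround: the context is baked into the service URL."""
    queries = {"q0": {"query": name}}
    return f"{BASE}/{jurisdiction}", {"queries": json.dumps(queries)}


def request_with_global_params(name, global_params):
    """Hypothetical extension: one service URL, context sent with the batch."""
    queries = {"q0": {"query": name}}
    fields = {
        "queries": json.dumps(queries),
        # Hypothetical field, NOT part of the current protocol.
        "global_params": json.dumps(global_params),
    }
    return BASE, fields


if __name__ == "__main__":
    print(request_with_url_context("Acme Ltd", "gb"))
    print(request_with_global_params("Acme Ltd", {"jurisdiction": "gb"}))
```

With the first shape, the user has to register a separate service per
jurisdiction; with the second, a single registered service could accept the
context alongside the queries.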

On Tue, Jun 25, 2019 at 5:30 PM Thad Guidry <thadguidry@gmail.com> wrote:

> Vladimir,
>
> Do you think that "importance" and "what is popular" should be
> controllable by the requesting user?
>
> What if the domain I'm working in is "Criminal activity" ... do you agree
> that the importance or popularity scoring can now be skewed, and properly
> should be?
>
> *Thad Query*
> Subject (query item):  Rembrandt
> Domain (proposed enhancement): Crime
>
> Results:
> 90%  https://www.wikidata.org/wiki/Q2246489  -  The Storm on the Sea of
> Galilee
> 89%  https://www.wikidata.org/wiki/Q661378  - The Anatomy Lesson of Dr.
> Nicolaes Tulp
>
> 55%  https://www.wikidata.org/wiki/Q5598  - Rembrandt  (creator)
>
>
> *Vladimir Query*
> Subject (query item):  Rembrandt
> Domain (proposed enhancement):
>
> Results (with no Domain field specified):
> 95%  https://www.wikidata.org/wiki/Q5598  - Rembrandt (creator)
>
> 90%  https://www.wikidata.org/wiki/Q990960  - Rembrandt (city in Buena
> Vista County, Iowa, United States)
>
> Thad
> https://www.linkedin.com/in/thadguidry/
>
>
> On Tue, Jun 25, 2019 at 8:45 AM Vladimir Alexiev <
> vladimir.alexiev@ontotext.com> wrote:
>
>> Hi Antonin! Some feedback on the paper:
>>
>> - services can and should score even in the absence of fields, based on
>> the importance and popularity of entities. E.g. a query for Paris or
>> Rembrandt should return the respective famous entities, not one of the
>> hundred other possible matches; inactive companies should be (and are)
>> downgraded, etc.
>> - if a dataset has multiple labels per entity (e.g. VIAF has over 100 for
>> Cranach), the service should take the nature of each label into account
>> - VIAF has life years (and, if I'm not mistaken, gender), which are
>> important to use.
>> - It even has Occupation and Nationality, which however are part of the
>> label and not normalized. (Our service extracts them to separate fields)
>> - that H2020 project is not related to ERC
>>
>> On Thu, Jun 20, 2019, 18:14 Antonin Delpeuch <antonin@delpeuch.eu> wrote:
>>
>>> Hi all,
>>>
>>> I have just made public a short survey of the existing reconciliation
>>> services and how their implementations relate to the techniques used in
>>> record linkage (the corresponding academic field):
>>>
>>> https://arxiv.org/abs/1906.08092
>>>
>>> This is the outcome of a project done with OpenCorporates to plan
>>> improvements to their own reconciliation service. The survey is therefore
>>> focused on the perspective of the service provider, rather than the end
>>> user, although the goal is of course to make the service more useful to
>>> data consumers. I hope this can help start the discussions (and give some
>>> definitions, as Ricardo suggested).
>>>
>>> I have outlined a few suggestions, mostly around the issue of scoring,
>>> which reflect my general feeling about this API: at the moment service
>>> providers are a bit clueless about how to score reconciliation candidates,
>>> and as a consequence users cannot really rely on them in general.
>>>
>>> It would be great to also study what users actually do with these
>>> services, to better understand their workflows. I am not sure how to
>>> approach that, given that OpenRefine workflows are generally not
>>> published. One possibility would be to analyze the logs of existing
>>> reconciliation services (such as the one for Wikidata). Let me know if
>>> you are interested in that sort of project.
>>>
>>> Any feedback is welcome of course!
>>>
>>> Cheers,
>>>
>>> Antonin
>>>
>>>
>>>

-- 
Vladimir Alexiev, PhD, PMP
Chief Data Architect
Sirma AI, trading as Ontotext: https://www.ontotext.com, LinkedIn
<https://www.linkedin.com/company-beta/208070>, Twitter
<https://twitter.com/ontotext>, Rate GraphDB
<http://www.capterra.com/database-management-software/reviews/157533/Graph%20DB/Ontotext/new>
Email: vladimir.alexiev@ontotext.com, skype:valexiev1
Mobile: +359 888 568 132, SMS: 359888568132@sms.mtel.net
Calendar:
https://www.google.com/calendar/embed?src=vladimir.alexiev@ontotext.com
Publications and CV: https://github.com/VladimirAlexiev/my

Received on Wednesday, 26 June 2019 08:57:18 UTC