Re: Survey of OpenRefine reconciliation services from Alan Buxton on 2019-06-26 (public-reconciliation@w3.org from June 2019)

From: Alan Buxton <alan.buxton@opencorporates.com>
Date: Wed, 26 Jun 2019 14:57:42 +0200
To: Vladimir Alexiev <vladimir.alexiev@ontotext.com>
Cc: Thad Guidry <thadguidry@gmail.com>, public-reconciliation@w3.org
Message-ID: <CAEBk1NtH0Ec4Xr8Jac38r2K7kMudapaVoWid=miPU5kT4sogeA@mail.gmail.com>
This is a really interesting discussion.

I like the idea of the user providing some kind of input about how they
want to score things. Different domains will have different views on
importance/relevance.
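
To make that concrete, here is a rough sketch of what user-controlled
scoring could look like, using Thad's Rembrandt example from below. It is
not how any existing service works: the Candidate class, the domain tags,
the rescore() function, the two weights and all the base scores are made up
for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    id: str
    name: str
    base_score: float              # the service's default score, in [0, 1]
    domains: frozenset = frozenset()  # hypothetical domain tags


def rescore(candidates, domain=None, in_weight=1.5, out_weight=0.6):
    """Return (id, score) pairs, best first.

    With no domain the default scores are used unchanged. With a domain,
    candidates tagged with it are boosted and all others are damped.
    Both weights are arbitrary illustration values.
    """
    ranked = []
    for c in candidates:
        score = c.base_score
        if domain is not None:
            weight = in_weight if domain in c.domains else out_weight
            score = min(1.0, score * weight)  # keep scores within [0, 1]
        ranked.append((c.id, score))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Invented base scores and tags, only meant to mirror the example queries.
CANDIDATES = [
    Candidate("Q5598", "Rembrandt", 0.95, frozenset({"art"})),
    Candidate("Q990960", "Rembrandt, Iowa", 0.90, frozenset({"place"})),
    Candidate("Q2246489", "The Storm on the Sea of Galilee", 0.60,
              frozenset({"art", "crime"})),
]

default_order = [cid for cid, _ in rescore(CANDIDATES)]
crime_order = [cid for cid, _ in rescore(CANDIDATES, domain="crime")]
```

With no domain the painter comes first; with domain="crime" the stolen
painting moves to the top, which is the behaviour Thad describes.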

On Wed, 26 Jun 2019, 10:57 Vladimir Alexiev, <vladimir.alexiev@ontotext.com>
wrote:

> In my experience, a user often needs to provide some "context" or global
> parameter to reconciliation,
> e.g. the local country (for place reconciliation), the jurisdiction (for
> company reconciliation), etc.
>
> The current protocol doesn't support that.
> So, for example, OpenCorporates recon takes the jurisdiction as part of the
> base service URL.
> "Global parameters" would be a promising extension of the protocol.
>
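
A small sketch of the difference, in case it helps the discussion. The
"queries" batch parameter is what the reconciliation API uses today; the
"context" parameter, the function names and the example.org URL are
assumptions of mine, not part of the protocol or of any real service.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit


def _batch(names):
    # Batch of queries keyed q0, q1, ... as in the existing protocol.
    return {f"q{i}": {"query": name} for i, name in enumerate(names)}


def url_scoped_request(base_url, jurisdiction, names):
    """Current workaround: the context is baked into the service URL."""
    query = urlencode({"queries": json.dumps(_batch(names))})
    return f"{base_url}/{jurisdiction}?{query}"


def parameter_scoped_request(base_url, context, names):
    """Hypothetical extension: one 'context' object shared by all queries."""
    query = urlencode({
        "queries": json.dumps(_batch(names)),
        "context": json.dumps(context),  # assumed parameter name
    })
    return f"{base_url}?{query}"


def decode(url):
    """Split a request URL into its path and its JSON-decoded parameters."""
    parts = urlsplit(url)
    params = {k: json.loads(v[0]) for k, v in parse_qs(parts.query).items()}
    return parts.path, params


BASE = "https://example.org/reconcile"  # placeholder, not a real endpoint
old_style = url_scoped_request(BASE, "gb", ["Acme Ltd"])
new_style = parameter_scoped_request(BASE, {"jurisdiction": "gb"}, ["Acme Ltd"])
```

In the first form a client needs one service URL per jurisdiction; in the
second the same URL serves every context, and further keys (country,
domain, ...) could be added without changing the endpoint.
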
> On Tue, Jun 25, 2019 at 5:30 PM Thad Guidry <thadguidry@gmail.com> wrote:
>
>> Vladimir,
>>
>> Do you think that "importance" and "what is popular" should be
>> controllable by the requesting user?
>>
>> What if the domain I'm working in is "Criminal activity"? Do you agree
>> that the importance or popularity scoring can then be skewed, and properly
>> should be?
>>
>> *Thad Query*
>> Subject (query item):  Rembrandt
>> Domain (proposed enhancement): Crime
>>
>> Results:
>> 90%  https://www.wikidata.org/wiki/Q2246489  -  The Storm on the Sea of
>> Galilee
>> 89%  https://www.wikidata.org/wiki/Q661378  - The Anatomy Lesson of Dr.
>> Nicolaes Tulp
>>
>> 55%  https://www.wikidata.org/wiki/Q5598  - Rembrandt  (creator)
>>
>>
>> *Vladimir Query*
>> Subject (query item):  Rembrandt
>> Domain (proposed enhancement):
>>
>> Results (with no Domain field specified):
>> 95%  https://www.wikidata.org/wiki/Q5598  - Rembrandt (creator)
>>
>> 90%  https://www.wikidata.org/wiki/Q990960  - Rembrandt (city in Buena
>> Vista County, Iowa, United States)
>>
>> Thad
>> https://www.linkedin.com/in/thadguidry/
>>
>>
>> On Tue, Jun 25, 2019 at 8:45 AM Vladimir Alexiev <
>> vladimir.alexiev@ontotext.com> wrote:
>>
>>> Hi Antonin! Some feedback on the paper:
>>>
>>> - services can and should score even in the absence of fields, based on
>>> the importance and popularity of entities. E.g. a query for Paris or
>>> Rembrandt should return the respective famous entities, not one of the
>>> hundred other possible matches; inactive companies should be (and are)
>>> downgraded, etc.
>>> - if a dataset has multiple labels per entity (e.g. VIAF has over 100 for
>>> Cranach), it should take label nature into account
>>> - VIAF has life years (and, if I'm not mistaken, gender), which are
>>> important to use.
>>> - It even has Occupation and Nationality, which however are part of the
>>> label and not normalized. (Our service extracts them to separate fields.)
>>> - that H2020 project is not related to ERC
>>>
>>> On Thu, Jun 20, 2019, 18:14 Antonin Delpeuch <antonin@delpeuch.eu>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I have just made public a short survey of the existing reconciliation
>>>> services and how their implementations relate to the techniques used in
>>>> record linkage (the corresponding academic field):
>>>>
>>>> https://arxiv.org/abs/1906.08092
>>>>
>>>> This is the outcome of a project done with OpenCorporates to plan
>>>> improvements to their own reconciliation service. The survey is therefore
>>>> focused on the perspective of the service provider, rather than the end
>>>> user, although the goal is of course to make the service more useful to
>>>> data consumers. I hope this can help start the discussions (and give some
>>>> definitions, as Ricardo suggested).
>>>>
>>>> I have outlined a few suggestions, mostly around the issue of scoring,
>>>> which reflect my general feeling about this API: at the moment service
>>>> providers are a bit clueless about how to score reconciliation candidates,
>>>> and as a consequence users cannot really rely on them in general.
>>>>
>>>> It would be great to also study what users actually do with these
>>>> services, to better understand their workflows. I am not sure how to
>>>> approach that, given that OpenRefine workflows are generally not
>>>> published. One possibility would be to analyze the logs of existing
>>>> reconciliation services (such as the one for Wikidata). Let me know if
>>>> you are interested in that sort of project.
>>>>
>>>> Any feedback is welcome of course!
>>>>
>>>> Cheers,
>>>>
>>>> Antonin
>>>>
>>>>
>>>>
>
> --
> Vladimir Alexiev, PhD, PMP
> Chief Data Architect
> Sirma AI, trading as Ontotext: https://www.ontotext.com, LinkedIn
> <https://www.linkedin.com/company-beta/208070>, Twitter
> <https://twitter.com/ontotext>, Rate GraphDB
> <http://www.capterra.com/database-management-software/reviews/157533/Graph%20DB/Ontotext/new>
> Email: vladimir.alexiev@ontotext.com, skype:valexiev1
> Mobile: +359 888 568 132, SMS: 359888568132@sms.mtel.net
> Calendar:
> https://www.google.com/calendar/embed?src=vladimir.alexiev@ontotext.com
> Publications and CV: https://github.com/VladimirAlexiev/my
>
Received on Wednesday, 26 June 2019 12:59:04 UTC