Re: Survey of OpenRefine reconciliation services from Antonin Delpeuch on 2019-06-29 (public-reconciliation@w3.org from June 2019)

From: Antonin Delpeuch <antonin@delpeuch.eu>
Date: Sat, 29 Jun 2019 10:01:44 +0200
To: public-reconciliation@w3.org
Message-ID: <7b82ad62-ab01-1be3-1873-75de6e1ed2f3@delpeuch.eu>

Hi Vladimir,

On 6/26/19 10:56 AM, Vladimir Alexiev wrote:
> In my experience, a user often needs to provide some "context" or
> global parameter to reconciliation.
> Eg local country (for place reconciliation), jurisdiction (for company
> reconciliation), etc.
>
> The current protocol doesn't support that.
> So eg OpenCorporates recon takes jurisdiction as part of the base
> service URL.
> "Global parameters" is a promising extension of the protocol.

I am not sure what you mean here: the protocol does support adding
properties to give some context to each reconciliation query.

OpenRefine's interface does not make it easy to add such global
properties: you have to create a new column containing the constant
property value. But in my opinion this is a limitation of OpenRefine
rather than of the protocol. For instance, the jurisdiction that
OpenCorporates accepts in the base service URL can also be specified
as a property, which gives the same results.
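
To make this concrete, here is a minimal sketch of a query batch in which
every query carries the same constant property. The property id
"jurisdiction_code", the value "gb" and the company names are assumptions
chosen for illustration, not taken from the OpenCorporates documentation;
build_queries is a hypothetical helper. The batch layout (keys q0, q1, ...
each with "query" and a "properties" list of pid/v pairs) follows the
reconciliation API as OpenRefine uses it.

```python
import json


def build_queries(names, constant_properties, type_id=None):
    """Build a reconciliation query batch where each query repeats the
    same "global" properties (e.g. a jurisdiction)."""
    batch = {}
    for i, name in enumerate(names):
        query = {
            "query": name,
            # Each constant is attached as an ordinary property of the query.
            "properties": [
                {"pid": pid, "v": value}
                for pid, value in constant_properties.items()
            ],
        }
        if type_id is not None:
            query["type"] = type_id
        batch[f"q{i}"] = query
    return batch


# Hypothetical example values: the pid and names are assumptions.
batch = build_queries(
    ["Acme Ltd", "Globex Corporation"],
    {"jurisdiction_code": "gb"},
)

# The batch is serialized to JSON and sent as the "queries" form field.
payload = {"queries": json.dumps(batch)}
```

A client can therefore emulate a global parameter by repeating the
property in each query, without any change to the service URL.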

Or did I misunderstand your suggestion?

Antonin

>
> On Tue, Jun 25, 2019 at 5:30 PM Thad Guidry <thadguidry@gmail.com
> <mailto:thadguidry@gmail.com>> wrote:
>
>     Vladimir,
>
>     Do you think that "importance" and "what is popular" should be
>     controllable by the requesting user?
>
>     What if the domain I'm working in is "Criminal activity" ... do
>     you agree that the importance or popularity scoring can now be
>     skewed, and properly should be?
>
>     *Thad Query*
>     Subject (query item):  Rembrandt
>     Domain (proposed enhancement): Crime
>
>         Results:
>         90%  https://www.wikidata.org/wiki/Q2246489  -  The Storm on
>         the Sea of Galilee
>         89%  https://www.wikidata.org/wiki/Q661378  - The Anatomy
>         Lesson of Dr. Nicolaes Tulp
>
>         55%  https://www.wikidata.org/wiki/Q5598  - Rembrandt  (creator)
>
>
>     *Vladimir Query*
>     Subject (query item):  Rembrandt
>     Domain (proposed enhancement): 
>
>         Results (with no Domain field specified):
>         95%  https://www.wikidata.org/wiki/Q5598  - Rembrandt (creator)
>
>         90%  https://www.wikidata.org/wiki/Q990960  - Rembrandt (city
>         in Buena Vista County, Iowa, United States)
>
>     Thad
>     https://www.linkedin.com/in/thadguidry/
>
>
>     On Tue, Jun 25, 2019 at 8:45 AM Vladimir Alexiev
>     <vladimir.alexiev@ontotext.com
>     <mailto:vladimir.alexiev@ontotext.com>> wrote:
>
>         Hi Antonin! Some feedback on the paper: 
>
>         - services can and should score even on the absence of fields,
>         based on the importance and popularity of entities. Eg a Paris
>         or Rembrandt should return the respective famous entities, not
>         one of the hundred other possible matches; inactive companies
>         should be (and are) downgraded, etc.
>         - if a dataset has multiple labels per entity (e.g. VIAF has
>         over 100 for Cranach), it should take label nature into account
>         - VIAF has life years (and, if I'm not mistaken, gender), which
>         are important to use.
>         - It even has Occupation and Nationality, which however are
>         part of the label and not normalized. (Our service extracts
>         them to separate fields) 
>         - that H2020 project is not related to ERC 
>
>         On Thu, Jun 20, 2019, 18:14 Antonin Delpeuch
>         <antonin@delpeuch.eu <mailto:antonin@delpeuch.eu>> wrote:
>
>             Hi all,
>
>             I have just made public a short survey of the existing
>             reconciliation services and how their implementations
>             relate to the techniques used in record linkage (the
>             corresponding academic field):
>
>             https://arxiv.org/abs/1906.08092
>
>             This is the outcome of a project done with OpenCorporates
>             to plan improvements to their own reconciliation service.
>             The survey is therefore focused on the perspective of the
>             service provider, rather than the end user, although the
>             goal is of course to make the service more useful to data
>             consumers. I hope this can help start the discussions (and
>             give some definitions, as Ricardo suggested).
>
>             I have outlined a few suggestions, mostly around the issue
>             of scoring, which reflect my general feeling about this
>             API: at the moment service providers are a bit clueless
>             about how to score reconciliation candidates, and as a
>             consequence users cannot really rely on them in general.
>
>             It would be great to also study what users actually do
>             with these services, to better understand their workflows.
>             I am not sure how to approach that, given that OpenRefine
>             workflows are generally not published. One possibility
>             would be to analyze the logs of existing reconciliation
>             services (such as the one for Wikidata). Let me know if
>             you are interested in that sort of project.
>
>             Any feedback is welcome of course!
>
>             Cheers,
>
>             Antonin
>
>
>
>
> -- 
> Vladimir Alexiev, PhD, PMP
> Chief Data Architect
> Sirma AI, trading as Ontotext: https://www.ontotext.com
> <https://www.ontotext.com/>, LinkedIn
> <https://www.linkedin.com/company-beta/208070>, Twitter
> <https://twitter.com/ontotext>, Rate GraphDB
> <http://www.capterra.com/database-management-software/reviews/157533/Graph%20DB/Ontotext/new>
> Email: vladimir.alexiev@ontotext.com
> <mailto:vladimir.alexiev@ontotext.com>, skype:valexiev1
> Mobile: +359 888 568 132, SMS: 359888568132@sms.mtel.net
> <mailto:359888568132@sms.mtel.net>
> Calendar:
> https://www.google.com/calendar/embed?src=vladimir.alexiev@ontotext.com
> Publications and CV: https://github.com/VladimirAlexiev/my
Received on Saturday, 29 June 2019 08:02:10 UTC