Re: Survey of OpenRefine reconciliation services

Hello,


On 20.06.2019 at 18:13, Antonin Delpeuch wrote:
> This is the outcome of a project done with OpenCorporates to plan improvements to their own reconciliation service. The survey is therefore focused on the perspective of the service provider, rather than the end user, although the goal is of course to make the service more useful to data consumers. I hope this can help start the discussions (and give some definitions, as Ricardo suggested).
>
> I have outlined a few suggestions, mostly around the issue of scoring, which reflect my general feeling about this API: at the moment service providers are a bit clueless about how to score reconciliation candidates, and as a consequence users cannot really rely on them in general.
>
> It would be great to also study what users actually do with these services, to better understand their workflows. I am not sure how to
> approach that, given that OpenRefine workflows are generally not published. One possibility would be to analyze the logs of existing
> reconciliation services (such as the one for Wikidata). Let me know if you are interested in that sort of project.
>

Here are some first experiences from past, current and upcoming usage and
implementations based on the OpenRefine reconciliation API standard in
some of my projects (mainly very domain-specific entities) in
investigative journalism, very specialized scientific research projects,
and some LODLAM work.

First: sorry for my English and for possibly wrong details. Some things
may already be specified but are used by me incorrectly or not at all
(little time these days for a cleaner post). I do not (yet) use the full
standard specification for everything, but rather as inspiration and for
aggregating a very custom mix of different APIs/methods. It works for me
and was a good starting point for my use cases; the focus was on a
running system, not full standard compatibility.

I use some parts of the standard for named entity linking: linking named
entities found in full text in order to tag unstructured text/documents
with entity IDs from domain-specific SKOS thesauri and knowledge
bases/ontologies/DBs/entity lists, for further graph analysis and
faceted search. Some first code is published at
https://github.com/opensemanticsearch/open-semantic-entity-search-api
and other repositories. Implementing more matching & scoring methods
and a UI for semi-automatic entity disambiguation / a semantic tagging
recommender for documents will be the next steps.

- As mentioned in some mails from others before: I often
investigate/research entities that are not the most popular, or (better,
in a crime context) not the best known or most frequently occurring
ones, but that share the same names. So context beyond the entity type
is important (see "Context(s)"). If I understand correctly, the
specified general query parameter "properties" already exists for this.

- As mentioned in some emails from others before and/or in the paper: a
single global score per entity is often not enough for me. For the
needed transparency of results and for better UIs that actually help
users, the service should return more granular weights/scores (based on
the query's context) and explanations for the different scoring
methods, rather than only one global score (see "More granular scoring /
explanations / signs").


Context(s)

I need to query or rank by contexts / (sub)domains (for filtering
and/or ranking), like:

- IDs of other entities or subjects of the context, e.g.
 - already disambiguated entity IDs
 - or SKOS thesaurus topic IDs
 (assigned in this context after/from disambiguation or semantic
 tagging by humans)

- Text, e.g. even the full text of an analyzed document/report/news article.
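
The two kinds of context above could be passed through the standard
"properties" array of a reconciliation query. A minimal sketch, assuming
hypothetical property IDs ("context_entity", "context_topic",
"context_text") that are my own naming, not part of any specification:

```python
import json

# Sketch of a reconciliation query batch that carries context in the
# "properties" array. The property IDs are hypothetical illustrations.
query_batch = {
    "q0": {
        "query": "Acme Ltd",
        "type": "Organisation",
        "properties": [
            # ID of an already disambiguated entity from the same document
            {"pid": "context_entity", "v": "entity:4711"},
            # SKOS thesaurus topic ID assigned during semantic tagging
            {"pid": "context_topic", "v": "topic:offshore"},
            # full text of the analyzed document as ranking context
            {"pid": "context_text", "v": "Report mentioning Acme Ltd ..."},
        ],
    }
}

# The reconciliation API expects the batch as a form parameter "queries"
payload = {"queries": json.dumps(query_batch)}
print(payload["queries"])
```

The service can then use the context properties for filtering or for
reranking candidates, while clients that do not send them still work.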


Pass/use additional Parameters

- Passing/using custom query parameters (often very specific /
stack-related) for each entity, like entity parameters (e.g. language or
location) or Solr parameters (e.g. enabling/disabling stemming, fuzzy
search by Levenshtein distance, or custom field weights). This is not a
problem, since in my custom implementation I can read additional HTTP
GET and/or POST parameters beside the OpenRefine queries, or embed them
in the query array (if I understand correctly, the query parameter
"properties" exists for this). That is much more powerful than the
OpenRefine queries with only the specified parameters.
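
As a sketch of what such a custom implementation could do internally:
merge a standard reconciliation query with extra stack-specific options
when building the backend (here: Solr) request. The field names, boosts
and extra options below are assumptions for illustration only:

```python
# Sketch: map one reconciliation query plus custom, non-standard options
# to Solr request parameters. Field names/weights are hypothetical.
def build_solr_params(recon_query, extra_params=None):
    params = {
        "q": recon_query["query"],
        "defType": "edismax",
        "qf": "label^10 altLabel^5 text",  # hypothetical field weights
    }
    if recon_query.get("type"):
        # restrict candidates to the requested entity type
        params["fq"] = "type:" + recon_query["type"]
    # merge additional HTTP/POST parameters passed beside the queries
    params.update(extra_params or {})
    return params

print(build_solr_params({"query": "Acme Ltd", "type": "Organisation"},
                        {"stemming": "false"}))
```

Because the extra parameters are simply merged in, clients that only
speak the plain specification are unaffected.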


Return fields / values

More granular scoring / explanations / signs

In my custom reconciliation API results I plan to return not only a
single score for each entity, but to explain more, to be more
transparent. Users and UIs can then use the analysis data and better
understand why something was recommended, which makes disambiguation
decisions easier. The upcoming recommender/disambiguation UI (an entity
tagger / entity linking / approval UI for (semi)automatic tagging of
documents that I am working on for Open Semantic Search) will show and
explain such signs/indications/hints to the user, dynamically, based on
the queried context(s).

Examples of different methods for which I need more information / a
more granular score about signs/indications:

- TF/IDF
For example TF/IDF-based results from Solr or Elasticsearch, so that
users can more easily understand/read/judge TF/IDF scores, as in
https://opensemanticsearch.org/solr-relevance-ranking-analysis
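
To illustrate the kind of per-term breakdown meant here, a from-scratch
TF/IDF sketch (a simplified formula, not the actual Solr/Elasticsearch
scoring, which uses BM25 and more factors; the toy corpus is invented):

```python
import math

# Toy corpus of tokenized documents, purely for illustration.
docs = [
    ["acme", "ltd", "fraud", "report"],
    ["acme", "inc", "annual", "report"],
    ["other", "company", "news"],
]

def tfidf_explanation(query_terms, doc, corpus):
    """Per-term TF/IDF weights a service could return beside the score."""
    n = len(corpus)
    explanation = []
    for term in query_terms:
        tf = doc.count(term) / len(doc)
        df = sum(1 for d in corpus if term in d)
        idf = math.log((n + 1) / (df + 1)) + 1  # smoothed idf
        explanation.append({"term": term, "tf": tf, "idf": idf,
                            "weight": tf * idf})
    return explanation

for part in tfidf_explanation(["acme", "fraud"], docs[0], docs):
    print(part)
```

Returning such a breakdown beside the global score lets the UI show the
user which terms actually caused the match.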


- Graph database / Knowledgebase
f.e. weights by graph queries from candidate to context (f.e. how
strong/weights/count of hops of (direct and indirect) connections in a
graph database)
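
A minimal sketch of such a hop-based weight, assuming a tiny invented
adjacency graph and an arbitrary 1/(1+hops) weighting; a real service
would run this as a query against the graph database instead:

```python
from collections import deque

# Hypothetical adjacency graph: candidate entity, people, topics, context.
graph = {
    "candidate:acme": {"person:smith", "topic:offshore"},
    "person:smith": {"candidate:acme", "context:doc42"},
    "topic:offshore": {"candidate:acme"},
    "context:doc42": {"person:smith"},
}

def hops(start, goal, adjacency):
    """Breadth-first search: number of hops from start to goal, or None."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None

def connection_weight(candidate, context_entity, adjacency):
    """Fewer hops between candidate and context means a stronger signal."""
    d = hops(candidate, context_entity, adjacency)
    return 0.0 if d is None else 1.0 / (1 + d)

print(connection_weight("candidate:acme", "context:doc42", graph))
```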


- Co-occurrence of entities
How similar the posted (con)text is to other texts connected to /
tagged with entities that were already disambiguated by humans or are
not ambiguous.

- ML
Classification by ML models

- Classification by text vectors / text similarity

For most of these, standard frameworks with their APIs (which I don't
want to reinvent, but integrate) are or will be used.
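
For the text-similarity methods above, a bag-of-words cosine similarity
sketch (real implementations would use a framework or embeddings; the
example texts are invented, and this only illustrates the kind of score
being produced):

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity of two texts as bag-of-words term-count vectors."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    common = set(va) & set(vb)
    dot = sum(va[t] * vb[t] for t in common)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

# Compare the posted context text with a text already tagged with the
# candidate entity (both invented for illustration).
context = "report about offshore accounts of acme ltd"
tagged_text = "acme ltd mentioned in offshore leaks report"
print(round(cosine_similarity(context, tagged_text), 3))
```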

This mix of (current and planned) methods is why I use an API inspired
by the OpenRefine API specification (so as not to reinvent each wheel,
even when not using the OpenRefine client): to mix/aggregate the
weights/scores of different matching/ranking/similarity/distance-measure/
statistical methods and of the context. For merely (fuzzy) searching the
IDs of given entity labels and for scoring, existing APIs like Solr or
Elasticsearch are much more powerful (if not using OpenRefine as a client).
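
Such aggregation could look like the following sketch: a weighted average
of the per-method scores becomes the single global score required by the
API, while the granular scores are kept as an explanation. The method
names and weights are arbitrary assumptions:

```python
# Sketch: aggregate per-method scores (each assumed in 0..1) into one
# global score while keeping the individual scores for transparency.
def aggregate_scores(method_scores, weights):
    total_weight = sum(weights[m] for m in method_scores)
    global_score = sum(score * weights[m]
                       for m, score in method_scores.items()) / total_weight
    return {
        "score": round(global_score, 3),   # single global score for the API
        "score_details": method_scores,    # granular scores as explanation
    }

candidate = aggregate_scores(
    {"tfidf": 0.8, "graph_hops": 0.33, "text_similarity": 0.57},
    {"tfidf": 1.0, "graph_hops": 2.0, "text_similarity": 1.0},
)
print(candidate)
```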


Dynamic result fields?

Since such analyses often depend on the queried context, and are
additionally very specific to the used stack and method (for example the
TF/IDF explanation of Solr or Elasticsearch), and can be scores, complex
structures or even a subgraph of connected entities, I'm not sure
whether the more general OpenRefine specification is the best place for
them. But for me, returning dynamic/custom fields/values (even complex
JSON objects) that explain hints/leads/weights/scores/paths, as a basis
for easier human disambiguation decisions, is important. These fields
are dynamic with respect to the query & context, not static like
additional fields of the entity record.

Since in my implementation I return fields beside the specified "id",
"name" and "type" in each "result" array, and all of this is very custom
anyway, I can manage static/dynamic fields based on property/field names
in my custom implementations. So for me there is no problem even without
an additional separation of dynamic and static fields/result variables.
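
As a sketch, one result entry could keep the specified fields and add
custom, query-dependent ones beside them. The extra field names
("score_details", "context_paths") and all values are my own invented
illustrations, not part of the specification:

```python
import json

# Hypothetical result entry: specified fields plus dynamic, context-
# dependent explanation fields for the disambiguation UI.
result = {
    "id": "entity:4711",
    "name": "Acme Ltd",
    "type": [{"id": "Organisation", "name": "Organisation"}],
    "score": 0.51,
    "match": False,  # leave the decision to the human disambiguation UI
    # dynamic, query/context-dependent explanation fields:
    "score_details": {"tfidf": 0.8, "graph_hops": 0.33},
    "context_paths": [["entity:4711", "person:smith", "context:doc42"]],
}

print(json.dumps({"q0": {"result": [result]}}, indent=2))
```

Clients that only know the specified fields can simply ignore the rest.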

Best regards
Markus

Received on Wednesday, 26 June 2019 13:24:42 UTC