- From: Markus Mandalka <w3c@mandalka.name>
- Date: Wed, 26 Jun 2019 15:03:10 +0200
- To: public-reconciliation@w3.org
Hello,

On 20.06.2019 at 18:13, Antonin Delpeuch wrote:
> This is the outcome of a project done with OpenCorporates to plan improvements to their own reconciliation service. The survey is therefore focused on the perspective of the service provider, rather than the end user, although the goal is of course to make the service more useful to data consumers. I hope this can help start the discussions (and give some definitions, as Ricardo suggested).
>
> I have outlined a few suggestions, mostly around the issue of scoring, which reflect my general feeling about this API: at the moment service providers are a bit clueless about how to score reconciliation candidates, and as a consequence users cannot really rely on them in general.
>
> It would be great to also study what users actually do with these services, to better understand their workflows. I am not sure how to approach that, given that OpenRefine workflows are generally not published. One possibility would be to analyze the logs of existing reconciliation services (such as the one for Wikidata). Let me know if you are interested in that sort of project.

Here are some first experiences from past, current and upcoming usage/implementations based on the OpenRefine reconciliation API standard in some of my projects (mainly involving very domain-specific entities) in investigative journalism, very specialized scientific research projects and some LODLAM work.

First a caveat: I may get some details wrong that are already specified but that I use differently or not yet at all. I have little time these days for a cleaner post, and I do not (yet) use the full standard specification for everything; I used it as inspiration and as a base for aggregating a very custom mix of different APIs/methods. My focus was on a running system, not full standard compatibility, but it works for me and was a good starting point for my use cases.

I use some parts of the standard for named entity linking, i.e. to tag unstructured text / documents with entity IDs from domain-specific SKOS thesauri and knowledge bases / ontologies / databases / lists of entities, for further graph analysis and faceted search. Some first code is published at https://github.com/opensemanticsearch/open-semantic-entity-search-api and other repositories. Implementing more matching & scoring methods and a UI for semi-automatic entity disambiguation (a semantic tagging recommender for documents) will be the next steps.

- Like mentioned in some mails from others before: I often investigate/research entities that are not the most popular, (yet) best known or most frequently occurring ones with a given name (in the context of crime, often deliberately so), so context beside the entity type is important (see "Context(s)" below). If I understand correctly, this is what the specified general query parameter "properties" is for (see the query sketch below).

- Like also mentioned in some earlier emails and/or the paper: a single global score per entity is often not enough for the transparency I need. Better UIs could help users with disambiguation if the service returned more granular (and context-dependent) weights / scores and explanations of the multiple different scoring methods, rather than one global score (see "More granular scoring / explanations / signs" below).
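To make the "properties" idea concrete, here is a minimal sketch (in Python) of how I pass disambiguation context alongside the entity label, roughly following the reconciliation query format. The service URL and the property IDs ("context_entity_id", "context_topic_id", "context_text") are made up for illustration; only the "queries" parameter and the "pid"/"v" property syntax come from the specification:

    import json
    import requests

    # One reconciliation query: the entity label plus context properties.
    queries = {
        "q0": {
            "query": "John Smith",
            "type": "Person",
            "properties": [
                # ID of an already disambiguated entity from the same document
                {"pid": "context_entity_id", "v": "http://example.org/entity/acme-ltd"},
                # SKOS thesaurus topic ID assigned earlier by a human tagger
                {"pid": "context_topic_id", "v": "http://example.org/topic/offshore-finance"},
                # or even the full text of the analyzed document / report / news article
                {"pid": "context_text", "v": "Full text of the analyzed document ..."},
            ],
        }
    }

    # The spec sends the batch of queries as a form parameter named "queries".
    response = requests.post(
        "http://localhost:8080/reconcile",  # placeholder URL for my custom service
        data={"queries": json.dumps(queries)},
    )
    print(response.json())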
Context(s)

I need to query or rank by contexts / (sub)domains (for filtering and/or ranking), like:

- IDs of other entities or subjects of the context, e.g. already disambiguated entity IDs, or SKOS thesaurus topic IDs assigned in this context by earlier disambiguation or by human semantic tagging
- text, even the full text of an analyzed document / report / news article

Pass/use additional parameters

I pass custom query parameters (often very specific / stack-related) for each entity, like entity parameters (e.g. language or location) or Solr parameters (e.g. enabling/disabling stemming, fuzzy search by Levenshtein distance, or custom field weights). This is not a problem, since in my custom implementation I can read additional HTTP GET and/or POST parameters beside the OpenRefine queries, or embed them in the query array (if I understand correctly, that is what the query parameter "properties" is for). This is much more powerful than the OpenRefine queries with only the specified parameters.

Return fields / values

More granular scoring / explanations / signs

In my custom reconciliation API results I plan to return not only a single score per entity, but to explain more and be more transparent, so that users and UIs can use the analysis data and better understand why something was recommended. That makes disambiguation decisions easier by showing and explaining signs / indications / hints (dynamic, based on the queried context(s)) in the coming recommender / disambiguation UI, an entity tagger / entity linking / approval UI for (semi)automatic tagging of documents that I am working on for Open Semantic Search.

Examples of methods for which I need more info / a more granular score about such signs / indications (a sketch of such a response follows after this section):

- TF/IDF: for example TF/IDF-based results from Solr or Elasticsearch, so the user can read and judge TF/IDF scores more easily, like in https://opensemanticsearch.org/solr-relevance-ranking-analysis
- Graph database / knowledge base: e.g. weights from graph queries from the candidate to the context (e.g. strength / weights / number of hops of direct and indirect connections in a graph database)
- Co-occurrence of entities: how similar the posted (con)text is to other texts connected to / tagged with already human-disambiguated or unambiguous entities
- ML classification by ML models
- Classification by text vectors / text similarity

For most of this, standard frameworks with their APIs (which I don't want to reinvent but to integrate) are or will be used. This mix of current and planned methods is why I use an API inspired by the OpenRefine API specification (not reinventing each wheel, even if not using OpenRefine as a client): to mix / aggregate the weights / scores of different matching / ranking / similarity / distance measure / statistical methods and of the context. For plain (fuzzy) search for IDs of given entity labels and for scoring alone, existing APIs like Solr or Elasticsearch are much more powerful on their own.
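As a minimal sketch of what such a result could look like (shown as a Python dict; the "scores" and "explanations" fields and their contents are made up by me and not part of the specification, only "id", "name", "type", "score" and "match" are):

    # Sketch of a single candidate in the "result" array.
    candidate = {
        "id": "http://example.org/entity/john-smith-3",  # hypothetical entity ID
        "name": "John Smith",
        "type": [{"id": "Person", "name": "Person"}],
        "score": 0.87,   # the single global score of the current spec
        "match": False,
        # Custom, non-specified extension: one sub-score per method,
        # so a UI can show and explain each sign/indication separately.
        "scores": {
            "tfidf": 12.4,                         # e.g. from Solr relevance ranking
            "graph": {"hops": 2, "weight": 0.6},   # graph distance candidate -> context
            "cooccurrence": 0.74,                  # similarity to texts tagged with known entities
            "text_similarity": 0.81,               # e.g. from text vectors / an ML model
        },
        # Custom, non-specified extension: human-readable explanations per method.
        "explanations": {
            "tfidf": "Solr explain output for the label match ...",
            "graph": "connected to context entity acme-ltd via 2 hops",
        },
    }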
Dynamic result fields?

Such analyses are often dependent on the queried context, but additionally very special / dependent on the used stack and method (for example the TF/IDF explanation of Solr or Elasticsearch), and the values can be scores, complex structures or a subgraph of connected entities. So I am not sure whether the more general OpenRefine specification is the best place for them. But for me, returning dynamic / custom fields and values (even complex JSON objects) that explain hints / leads / weights / scores / paths, as a base for easier human disambiguation decisions, is important. These fields are dynamic with respect to the query & context, not static like additional fields of the entity record. Since in my implementation I return fields beside the specified "id", "name" and "type" in each "result" array, and all of this is very custom anyway, I can manage static / dynamic fields based on their property / field names in my custom implementations; so for me there is no problem even without an additional separation of dynamic and static fields / result variables.

Best regards

Markus