- From: Erik Paulson <epaulson@unit1127.com>
- Date: Fri, 28 May 2021 18:34:06 -0500
- To: public-reconciliation@w3.org
- Message-ID: <CAKJO4n4VqxMc0De90-Qj41txCVh8KVK1dfDnNEfQZFgLG_dW0g@mail.gmail.com>
I wanted to explore the ML tuning weights issue discussed here:
https://github.com/reconciliation-api/specs/issues/30
specifically by adding a "session" to the reconciliation API protocol. I built a very basic version here:
https://github.com/epaulson/reconciliation-dedupe-test
It's just a strawman to get a discussion started, if there's any interest in going in this direction.

The server lets clients create persistent "sessions" that are unique to that client instance and upload additional training data for a session (which is not all that different from sending feedback about which match candidates were selected). If a client sends a reconcile request with a session ID, the server finds the latest training settings for that session and reconciles using those weights. If there is no session ID in the reconcile request, the server falls back to the default weights for matching. Whenever the client uploads more training data for a session, the server retrains that session's weights and uses them for future /reconcile requests in that session.

There's a lot more to think about in the protocol, such as security, error handling, and what states to represent or expose while the server is retraining on the new examples, but this was a start. (I know OpenRefine doesn't support anything like a reconciliation session yet, and you'd need a workflow like "star a bunch of examples that are pre-matched" so the reconcile action knows what to upload before sending candidates up. For now I just tested with a basic Python test script.)

Semi-related: I wanted something to test with that actually supports training and feedback/retraining, so I used the DeDupe Python library:
https://github.com/dedupeio/
My example server is a little wonky, in part because this is v0.01 and I just wanted to get something done. I also have some questions out to the DeDupe group about API usage, to understand how to better support a reconciliation-like workflow where you might not have many training examples at the start to tune your weights. The server is probably not far from being somewhat general purpose. It can be customized per dataset by using the dedupe gazetteer-example tool to train on a few examples and create a generic settings file, and by editing the 'fields' dictionary in the scripts.

The nice thing is that the team behind DeDupe is really, really good at entity matching and has a great library (and a nice SaaS offering if you want to match or deduplicate your datasets), so the DeDupe library could be very useful to other reconciliation API service implementers. DeDupe supports matching over multiple columns (in fact that's the usual use case), so supporting additional properties in a query is straightforward for the library (though I have it turned off in my v0.01 server).

I looked at building on gitonthescene's csv-reconcile. That server supports plugins, but the plugins only score actual pairs. DeDupe handles pair generation internally and does its own blocking, so it never considers most candidate pairs, which means DeDupe is probably not a good fit for csv-reconcile.

I'd love to talk more about getting training weights and example matches into the reconciliation service protocol, and about using DeDupe as a backend for reconciliation API servers.

Thanks,
-Erik
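
P.S. To make the session fallback described above concrete, here is a minimal Python sketch of the lookup. It is not taken from the test server: `SessionStore`, `DEFAULT_WEIGHTS`, and `toy_retrain` are hypothetical names, the store is in-memory only, and the retraining step is a placeholder rather than a call into the DeDupe library.

```python
import uuid

# Hypothetical default weights used when a request carries no session ID.
DEFAULT_WEIGHTS = {"name": 1.0}


def toy_retrain(examples):
    """Placeholder for real retraining; a real server would fit a model here."""
    return {"name": float(len(examples))}


class SessionStore:
    """In-memory map from session ID to training examples and latest weights."""

    def __init__(self):
        self._examples = {}  # session_id -> list of training examples
        self._weights = {}   # session_id -> most recently trained weights

    def create_session(self):
        """Create a new session and return its ID."""
        session_id = uuid.uuid4().hex
        self._examples[session_id] = []
        return session_id

    def add_training(self, session_id, examples, retrain=toy_retrain):
        """Append examples to a session and retrain its weights on all of them."""
        self._examples[session_id].extend(examples)
        self._weights[session_id] = retrain(self._examples[session_id])

    def weights_for(self, session_id=None):
        """Return the session's latest weights, or the defaults if there are none."""
        return self._weights.get(session_id, DEFAULT_WEIGHTS)


store = SessionStore()
sid = store.create_session()
store.add_training(sid, [("Acme Corp", "ACME Corporation"), ("IBM", "I.B.M.")])
session_weights = store.weights_for(sid)  # weights trained for this session
fallback_weights = store.weights_for()    # no session ID: default weights
```

In this sketch an unknown or not-yet-trained session ID also falls back to the defaults. Whether the protocol should do that or return an error is one of the open error-handling questions.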
Received on Tuesday, 1 June 2021 05:21:48 UTC