A server using the DeDupe library/exploring ML tuning weights from Erik Paulson on 2021-05-28 (public-reconciliation@w3.org from June 2021)

From: Erik Paulson <epaulson@unit1127.com>
Date: Fri, 28 May 2021 18:34:06 -0500
To: public-reconciliation@w3.org
Message-ID: <CAKJO4n4VqxMc0De90-Qj41txCVh8KVK1dfDnNEfQZFgLG_dW0g@mail.gmail.com>

I wanted to explore the ML tuning weights issue discussed here:
https://github.com/reconciliation-api/specs/issues/30

specifically, by adding a "session" to the reconciliation API protocol. I
built a very basic version here:
https://github.com/epaulson/reconciliation-dedupe-test

It's just a strawman to get a discussion started, if there's any interest
in going this direction. The server supports clients creating persistent
"sessions" that are unique to that client instance, and uploading
additional training data for that session (which is not all that different
from sending feedback about match candidates selected). If a client sends a
reconcile request with that session ID, the server finds the latest
training settings for that session and reconciles using those weights. If
there is no session ID in the reconcile request, the server falls back and
uses the default weights for the matching. The client can upload additional
training data for a session and the server will retrain the weights used
for that session and use them for future /reconcile requests for that
session.
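
To make the flow above concrete, here is a minimal in-memory sketch of the
session idea. It is not the code from my test server: the SessionStore
class, its method names, and the retrain() function are all hypothetical,
and retrain() is only a toy stand-in for whatever the matching library
would really do. Raising an error for an unknown session ID is also just
an assumption, since error handling is not settled.

```python
import uuid


def retrain(default_weights, examples):
    """Toy stand-in for real retraining (assumption, not a library call).

    Each example is a (field, is_match) pair; a match nudges that field's
    weight up and a non-match nudges it down, never below zero.
    """
    weights = dict(default_weights)
    for field, is_match in examples:
        delta = 0.1 if is_match else -0.1
        weights[field] = max(0.0, weights.get(field, 0.0) + delta)
    return weights


class SessionStore:
    """Hypothetical per-client session state kept by the server."""

    def __init__(self, default_weights):
        self.default_weights = dict(default_weights)
        self.sessions = {}  # session ID -> {"examples": [...], "weights": {...}}

    def create_session(self):
        # A new session starts out with a copy of the default weights.
        session_id = uuid.uuid4().hex
        self.sessions[session_id] = {
            "examples": [],
            "weights": dict(self.default_weights),
        }
        return session_id

    def add_training(self, session_id, examples):
        # Accumulate the uploaded examples, then retrain this session only.
        session = self.sessions[session_id]
        session["examples"].extend(examples)
        session["weights"] = retrain(self.default_weights, session["examples"])

    def weights_for(self, session_id=None):
        # No session ID in the request: fall back to the default weights.
        if session_id is None:
            return self.default_weights
        # Unknown session IDs raise KeyError here (an assumption).
        return self.sessions[session_id]["weights"]
```

A /reconcile handler would call weights_for() with whatever session ID the
request carried, so sessionless clients keep working unchanged.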

There's a lot more to think about in the protocol, such as security, error
handling, and what states to represent/expose while the server is retraining
on the new examples, but this was a start.

(I know OpenRefine doesn't support anything like a reconciliation session
yet, and you'd have to have a workflow like "star a bunch of examples that
are pre-matched" so the reconcile action could know what to upload before
sending candidates up, etc. For now I just tested with a basic Python test
script.)

Semi-related, I wanted to find something to test with that actually
supported training and providing feedback/retraining, so I used the DeDupe
Python library:
https://github.com/dedupeio/

My example server is a little wonky, in part because this was v0.01 and I
just wanted to get something done, and I've got some questions out to the
DeDupe group on some API usage to try to understand how to better support a
reconciliation-like workflow, where you might not have many training
examples at the start to tune your weights.

It's probably not that far from being somewhat general purpose, and
customizable on a per-dataset basis by using the dedupe gazetteer-example
tool to train on a few examples to create a generic settings file and
editing the 'fields' dictionary in the scripts.

The nice thing is the team behind DeDupe is really, really good at entity
matching and has a great library (and a nice SaaS offering if you want to
match or deduplicate your datasets), so the DeDupe library could be very
useful to other reconciliation API service implementers. DeDupe supports
matching over multiple columns (in fact that's the usual use case), so
supporting additional properties in a query is straightforward for the
library (though I have it turned off in my v0.01 server).
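
As a rough illustration of why extra properties map easily onto a
multi-column matcher, a reconciliation query can be flattened into one
record with a field per property. This is only a sketch: the
query_to_record() helper and the "name" field are hypothetical, and it
assumes the usual query shape with a "query" string and a "properties"
list of {"pid": ..., "v": ...} entries.

```python
def query_to_record(query, name_field="name"):
    """Flatten a reconciliation query into a flat field -> value record.

    Hypothetical helper: the main query string goes into name_field, and
    each property becomes its own field keyed by the property ID.
    """
    record = {name_field: query["query"]}
    for prop in query.get("properties", []):
        value = prop["v"]
        # A property value may be an entity reference such as {"id": "..."};
        # in that case keep just the identifier (an assumption of this sketch).
        if isinstance(value, dict):
            value = value.get("id")
        record[prop["pid"]] = value
    return record
```

The resulting record could then be handed to a matcher whose field
definitions use the same names as the property IDs.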

I looked at using gitonthescene's csv-reconcile to build from - that server
supports plugins, but the plugins only support scoring actual pairs, and
DeDupe internally handles pair generation and does its own blocking so it
doesn't consider most candidate pairs, so DeDupe is probably not a good fit
for csv-reconcile.
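
To show the mismatch, here is a toy sketch of blocking. It is not DeDupe's
actual blocking, and the three-letter block key, function names, and sample
records are all made up for illustration. The point is that the index
decides which candidates exist at all, so a plugin that is only handed
pairs to score never gets to make that decision.

```python
from collections import defaultdict


def block_key(name):
    # Toy blocking rule (assumption): first three characters, lowercased.
    return name.strip().lower()[:3]


def build_index(records):
    """Group record IDs by block key so lookups skip unrelated records."""
    index = defaultdict(list)
    for record_id, name in records.items():
        index[block_key(name)].append(record_id)
    return index


def candidates(index, query):
    """Return only the record IDs that share a block with the query."""
    return index.get(block_key(query), [])


records = {
    1: "Chicago Public Library",
    2: "Chicago Park District",
    3: "Madison Public Library",
}
index = build_index(records)
# Only the records in the query's block would go on to be scored.
shortlist = candidates(index, "chicago pub lib")
```

A pair-scoring plugin would instead be called once per (query, record)
pair chosen by the host server, which is the part DeDupe wants to own.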

I'd love to talk more about getting training weight/example matches into
the reconciliation service protocol, and to talk more about using DeDupe as
a backend for reconciliation API servers.

Thanks,

-Erik
Received on Tuesday, 1 June 2021 05:21:48 UTC