Re: A server using the DeDupe library/exploring ML tuning weights from Thad Guidry on 2021-06-01 (public-reconciliation@w3.org from June 2021)

From: Thad Guidry <thadguidry@gmail.com>
Date: Tue, 1 Jun 2021 09:22:58 -0500
To: Erik Paulson <epaulson@unit1127.com>
Cc: public-reconciliation@w3.org
Message-ID: <CAChbWaN-BWxZrrNTyomS1z+Yh7WLQU4mTYi5YNhUMT+Mnposfg@mail.gmail.com>

Erik,

Regarding security, I don't think we need anything directly in the API
itself other than supporting authentication as we've talked about in the
past in issue #26 <https://github.com/reconciliation-api/specs/issues/26>.
The reasoning would be that if folks want to have secure sessions, then that
could still be done over encrypted sessions via HTTPS, TLS, QUIC, etc.

Thanks so much for exploring the topic of ML tuning weights!
I do think there should be the ability to have encrypted authenticated
sessions (through whatever encrypted security protocol layer a
learning/recon service chooses to have with clients).

Have you thought about what, if anything, is still missing in the API to
support encrypted authenticated sessions?

Thad
https://www.linkedin.com/in/thadguidry/
https://calendly.com/thadguidry/


On Tue, Jun 1, 2021 at 12:22 AM Erik Paulson <epaulson@unit1127.com> wrote:

> I wanted to explore the ML tuning weights issue discussed here:
> https://github.com/reconciliation-api/specs/issues/30
>
> specifically, by adding a "session" to the reconciliation API protocol. I
> built a very basic version here:
> https://github.com/epaulson/reconciliation-dedupe-test
>
> It's just a strawman to get a discussion started, if there's any interest
> in going this direction. The server supports clients creating persistent
> "sessions" that are unique to that client instance, and uploading
> additional training data for that session (which is not all that different
> from sending feedback about match candidates selected.) If a client sends a
> reconcile request with that session ID, the server finds the latest
> training settings for that session and reconciles using those weights. If
> there is no session ID in the reconcile request, the server falls back and
> uses the default weights for the matching. The client can upload additional
> training data for a session and the server will retrain the weights used
> for that session and use them for future /reconcile requests for that
> session.
>
> There's a bunch more to think about in the protocol about security, error
> handling and what states to represent/expose while the server is retraining
> based on the new examples, etc, but this was a start.
>
> (I know OpenRefine doesn't support anything like a reconciliation session
> yet, and you'd have to have a workflow like "star a bunch of examples that
> are pre-matched" so the reconcile action could know what to upload before
> sending candidates up, etc. For now I just tested with a basic Python test
> script.)
>
> Semi-related, I wanted to find something to test with that actually
> supported training and providing feedback/retraining, so I used the DeDupe
> Python library:
> https://github.com/dedupeio/
>
> My example server is a little wonky, in part because this was v0.01 and I
> just wanted to get something done, and I've got some questions out to the
> DeDupe group on some API usage to try to understand how to better support a
> reconciliation-like workflow, where you might not have many training
> examples at the start to tune your weights.
>
> It's probably not that far from being somewhat general purpose, and
> customizable on a per-dataset basis by using the dedupe gazetteer-example
> tool to train on a few examples to create a generic settings file and
> editing the 'fields' dictionary in the scripts.
>
> The nice thing is the team behind DeDupe is really, really good at entity
> matching and has a great library (and a nice SaaS offering if you want to
> match or deduplicate your datasets), so the DeDupe library could be very
> useful to other reconciliation API service implementers. DeDupe supports
> matching over multiple columns (in fact that's the usual use case), so
> supporting additional properties in a query is straightforward for the
> library (though I have it turned off in my v0.01 server).
>
> I looked at using gitonthescene's csv-reconcile to build from - that
> server supports plugins, but the plugins only support scoring actual pairs,
> and DeDupe internally handles pair generation and does its own blocking so
> it doesn't consider most candidate pairs, so DeDupe is probably not a good
> fit for csv-reconcile.
>
> I'd love to talk more about getting training weight/example matches into
> the reconciliation service protocol, and to talk more about using DeDupe as
> a backend for reconciliation API servers.
>
> Thanks,
>
> -Erik
>
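
The session behavior described in the quoted message (per-session weights, retraining when more training data is uploaded, fallback to the default weights when no session ID is sent) could be sketched roughly as below. This is only an illustration and is not taken from the reconciliation-dedupe-test repository or from the DeDupe API: `SessionStore`, `toy_retrain`, and the weight dictionaries are hypothetical names, and the fallback for an unknown session ID is an assumption, since the message only covers the case of a missing ID.

```python
import uuid


def toy_retrain(examples):
    # Hypothetical stand-in for real model training: the "weight" simply
    # grows with the number of labelled examples seen so far.
    return {"name": 1 + len(examples)}


class SessionStore:
    """In-memory map from session ID to training examples and weights."""

    def __init__(self, default_weights):
        self.default_weights = dict(default_weights)
        self.sessions = {}

    def create_session(self):
        # Each client gets its own opaque session ID.
        session_id = uuid.uuid4().hex
        self.sessions[session_id] = {
            "examples": [],
            "weights": dict(self.default_weights),
        }
        return session_id

    def add_training(self, session_id, examples, retrain=toy_retrain):
        # Append the uploaded examples, then retrain this session's weights.
        session = self.sessions[session_id]
        session["examples"].extend(examples)
        session["weights"] = retrain(session["examples"])

    def weights_for(self, session_id=None):
        # No session ID: use the default weights. An unknown ID also falls
        # back to the defaults here (an assumption, not stated in the thread).
        session = self.sessions.get(session_id)
        if session is None:
            return self.default_weights
        return session["weights"]
```

A reconcile handler would call `weights_for(session_id)` with whatever session ID the request carries, so requests without a session keep using the default matching weights.
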
Received on Tuesday, 1 June 2021 14:23:50 UTC