Re: A server using the DeDupe library/exploring ML tuning weights

Hi Erik,

Thank you so much for investigating this!

About having the discussion here or on the mailing list, I think it's
great to have a thread about it on the mailing list. We have links in
both directions so it should not be too hard for people to find the
information they need.

I guess I should now find the time to make a similar prototype to
demonstrate how I would see matching features being used for a
client-side ML workflow (without adding any state to the server).

Best,

Antonin

On 03/06/2021 00:33, Erik Paulson wrote:
> Vladimir: I commented on that issue a few weeks back, and did
> reference it in my email. I always intended to put a pointer in that
> issue to this email, however, because that was my first post to the
> mailing list it took a couple of days before my email hit the archives
> so I couldn't get to it right after posting. Also, much of the email
> isn't germane to the issue being discussed - using DeDupe as the
> library for matching isn't really part of the protocol extension.
> However, there's now a link in the discussion to my message on the
> list. I'm happy to take the discussion to GitHub or to leave it here;
> I'll follow whatever the community norms are for discussions.
>
> Thad: I agree that many of the security/privacy issues will sort
> themselves out depending on how #26 goes. I'm not sure that I'd want
> to say that the "session" used in authentication should be reused for
> the matching. For example, say that we're using API keys, and I have
> one API key for a reconciliation API server. I may still want two
> different sessions for two different datasets - if I'm trying to
> reconcile a dataset of films against the knowledge base, and maybe in
> a different OR project I'm reconciling sports venues. I might be using
> the same API key for both, but I'd want to train two different
> matchers, one for films and one for sports venues, so under my setup,
> I'd have those as two different sessions.
>
> Again, happy to pick this back up on GitHub, to continue it here,
> or to split between them. Whatever folks would normally do!
>
> -Erik
>   
>
> On Tue, Jun 1, 2021 at 9:23 AM Thad Guidry <thadguidry@gmail.com
> <mailto:thadguidry@gmail.com>> wrote:
>
>     Erik,
>
>     Regarding security, I don't think we need anything directly in the
>     API itself other than supporting authentication as we've talked
>     about in the past in issue #26
>     <https://github.com/reconciliation-api/specs/issues/26>.
>     The reasoning is that if folks want to have secure sessions, then
>     that can still be done over encrypted sessions via HTTPS, TLS,
>     QUIC, etc.
>
>     Thanks so much for exploring the topic of ML tuning weights!
>     I do think there should be the ability to have encrypted
>     authenticated sessions (through whatever encrypted security
>     protocol layer a learning/recon service chooses to have with clients).
>
>     Have you thought about what is still missing in the API if
>     anything to support encrypted authenticated sessions?
>
>     Thad
>     https://www.linkedin.com/in/thadguidry/
>     https://calendly.com/thadguidry/
>
>
>     On Tue, Jun 1, 2021 at 12:22 AM Erik Paulson
>     <epaulson@unit1127.com <mailto:epaulson@unit1127.com>> wrote:
>
>         I wanted to explore the ML tuning weights issue discussed here:
>         https://github.com/reconciliation-api/specs/issues/30
>         <https://github.com/reconciliation-api/specs/issues/30>
>
>         specifically, by adding a "session" to the reconciliation API
>         protocol. I built a very basic version here:
>         https://github.com/epaulson/reconciliation-dedupe-test
>         <https://github.com/epaulson/reconciliation-dedupe-test>
>
>         It's just a strawman to get a discussion started, if there's
>         any interest in going this direction. The server supports
>         clients creating persistent "sessions" that are unique to that
>         client instance, and uploading additional training data for
>         that session (which is not all that different from sending
>         feedback about match candidates selected.) If a client sends a
>         reconcile request with that session ID, the server finds the
>         latest training settings for that session and reconciles using
>         those weights. If there is no session ID in the reconcile
>         request, the server falls back and uses the default weights
>         for the matching. The client can upload additional training
>         data for a session and the server will retrain the weights
>         used for that session and use them for future /reconcile
>         requests for that session. 
>
>         There's a bunch more to think about in the protocol about
>         security, error handling and what states to represent/expose
>         while the server is retraining based on the new examples, etc,
>         but this was a start. 
>
>         (I know OpenRefine doesn't support anything like a
>         reconciliation session yet, and you'd need a workflow
>         like "star a bunch of examples that are pre-matched" so the
>         reconcile action could know what to upload before sending
>         candidates up, etc. For now I just tested with a basic Python
>         test script.)
>
>         Semi-related, I wanted to find something to test with that
>         actually supported training and providing feedback/retraining,
>         so I used the DeDupe Python library: 
>         https://github.com/dedupeio/ <https://github.com/dedupeio/>
>
>         My example server is a little wonky, in part because this was
>         v0.01 and I just wanted to get something done, and I've got
>         some questions out to the DeDupe group on some API usage to
>         try to understand how to better support a reconciliation-like
>         workflow, where you might not have many training examples at
>         the start to tune your weights. 
>
>         It's probably not that far from being somewhat general
>         purpose, and customizable on a per-dataset basis by using the
>         dedupe gazetteer-example tool to train on a few examples to
>         create a generic settings file and editing the 'fields'
>         dictionary in the scripts.
>
>         The nice thing is the team behind DeDupe is really, really
>         good at entity matching and has a great library (and a nice
>         SaaS offering if you want to match or deduplicate your
>         datasets), so the DeDupe library could be very useful to other
>         reconciliation API service implementers. DeDupe supports
>         matching over multiple columns (in fact that's the usual
>         use case), so supporting additional properties in a query is
>         straightforward for the library (though I have it turned off
>         in my v0.01 server).
>
>         I looked at building on gitonthescene's csv-reconcile. That
>         server supports plugins, but the plugins only support
>         scoring actual pairs. DeDupe handles pair generation
>         internally and does its own blocking, so it never considers
>         most candidate pairs, which makes it a poor fit for
>         csv-reconcile.
>
>         I'd love to talk more about getting training weight/example
>         matches into the reconciliation service protocol, and to talk
>         more about using DeDupe as a backend for reconciliation API
>         servers.   
>
>         Thanks,
>
>         -Erik
>
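
For illustration, the session flow Erik describes in the quoted message (create a session, upload training data for it, reconcile with or without a session ID) could be sketched roughly as follows. This is a minimal in-memory sketch under stated assumptions, not the code from reconciliation-dedupe-test and not the DeDupe API: the names SessionStore, retrain, reconcile and DEFAULT_WEIGHTS are hypothetical, and the "training" step is a toy stand-in that weights each field by how well it separates the matched examples from the non-matched ones.

```python
import uuid
from difflib import SequenceMatcher

# Hypothetical default weights, used when a request carries no session ID.
DEFAULT_WEIGHTS = {"name": 1.0}


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def retrain(examples):
    """Toy stand-in for real training (assumption, not DeDupe's method).

    Each example is (query, candidate, is_match). A field's weight is the
    gap between its mean similarity on matches and on non-matches.
    """
    fields = set()
    for query, candidate, _ in examples:
        fields |= query.keys() & candidate.keys()
    weights = {}
    for field in sorted(fields):
        pos, neg = [], []
        for query, candidate, is_match in examples:
            if field in query and field in candidate:
                score = similarity(query[field], candidate[field])
                (pos if is_match else neg).append(score)
        mean_pos = sum(pos) / len(pos) if pos else 0.0
        mean_neg = sum(neg) / len(neg) if neg else 0.0
        weights[field] = max(mean_pos - mean_neg, 0.0)
    return weights or dict(DEFAULT_WEIGHTS)


class SessionStore:
    """In-memory stand-in for the server's per-session training state."""

    def __init__(self):
        self._examples = {}  # session_id -> accumulated training examples
        self._weights = {}   # session_id -> latest trained weights

    def create_session(self):
        session_id = uuid.uuid4().hex
        self._examples[session_id] = []
        return session_id

    def add_training(self, session_id, examples):
        if session_id not in self._examples:
            raise KeyError(f"unknown session: {session_id}")
        self._examples[session_id].extend(examples)
        # Retrain on everything uploaded so far for this session.
        self._weights[session_id] = retrain(self._examples[session_id])

    def weights_for(self, session_id=None):
        # No session ID, or a session with no training yet: use defaults.
        return self._weights.get(session_id, DEFAULT_WEIGHTS)


def reconcile(store, query, candidates, session_id=None):
    """Return (score, candidate) pairs, best first, using session weights."""
    weights = store.weights_for(session_id)
    total = sum(weights.values())
    scored = []
    for candidate in candidates:
        raw = sum(
            w * similarity(query.get(f, ""), candidate.get(f, ""))
            for f, w in weights.items()
        )
        scored.append((raw / total if total else 0.0, candidate))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Usage: one session per dataset, e.g. a "sports venues" session.
store = SessionStore()
venues = store.create_session()
lambeau = {"name": "Lambeau Field", "city": "Green Bay"}
soldier = {"name": "Soldier Field", "city": "Chicago"}
store.add_training(venues, [(lambeau, lambeau, True), (lambeau, soldier, False)])
ranked = reconcile(store, lambeau, [soldier, lambeau], session_id=venues)
```

In this sketch a session is independent of whatever credential authenticates the client, so a second create_session() call under the same API key would give a separately trained matcher (for example one for films and one for sports venues), which is the setup Erik argues for in his reply to Thad.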

Received on Thursday, 3 June 2021 06:45:48 UTC