- From: Erik Paulson <epaulson@unit1127.com>
- Date: Thu, 3 Jun 2021 13:49:42 -0500
- Cc: public-reconciliation@w3.org
- Message-ID: <CAKJO4n6uWna-5_AraAMFGEEqt4y07wfNbVGs55W1oH1t=sAtuw@mail.gmail.com>
Just to be clear, I don't think it's a bad thing if there might need to be
more state at the server - for some situations that's good and will let you
do a better job of matching. I hope that there can be a protocol where the
base case can be implemented completely statelessly at the server, with the
option to opt into features that need to store some state - the manifest can
tell clients what is and isn't supported on a particular server. That way,
clients that don't want to use the stateful features, or don't know about
them, can ignore them and work like they do today.

Also - if folks haven't tried it, the Gazetteer example from dedupe is
really easy to get started with; it runs in a terminal and it's just Python,
no server needed:
https://github.com/dedupeio/dedupe-examples
It's worth just seeing that workflow.

-Erik

On Thu, Jun 3, 2021 at 1:47 AM Antonin Delpeuch <antonin@delpeuch.eu> wrote:

> Hi Erik,
>
> Thank you so much for investigating this!
>
> About having the discussion here or on the mailing list, I think it's
> great to have a thread about it on the mailing list. We have links in both
> directions, so it should not be too hard for people to find the
> information they need.
>
> I guess I should now find the time to make a similar prototype to
> demonstrate how I would see matching features being used for a client-side
> ML workflow (without adding any state to the server).
>
> Best,
>
> Antonin
>
> On 03/06/2021 00:33, Erik Paulson wrote:
>
> Vladimir: I commented on that issue a few weeks back, and did reference it
> in my email. I always intended to put a pointer in that issue to this
> email; however, because that was my first post to the mailing list, it
> took a couple of days before my email hit the archives, so I couldn't get
> to it right after posting. Also, much of the email isn't germane to the
> issue being discussed - using DeDupe as the library for matching isn't
> really part of the protocol extension.
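The manifest opt-in described at the top of this message could plausibly be advertised like this; the `versions`, `name`, `identifierSpace`, and `schemaSpace` keys are the standard service manifest fields, while the entire `sessions` block (its name, shape, and endpoints) is a hypothetical sketch, not part of any spec:

```json
{
  "versions": ["0.2"],
  "name": "Example Reconciliation Service",
  "identifierSpace": "http://example.com/identifiers",
  "schemaSpace": "http://example.com/schema",
  "sessions": {
    "supported": true,
    "endpoints": {
      "create": "/sessions",
      "training": "/sessions/{session_id}/training"
    }
  }
}
```

A client that doesn't recognize the `sessions` key would simply ignore it and use the service statelessly, which is the backwards-compatibility property Erik is after.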
> However, there's now a link in the discussion to my message on the list.
> I'm happy to take the discussion to Github or to leave it here; whatever
> the community norms are for discussions, I'm happy to follow.
>
> Thad: I agree that many of the security/privacy issues will sort
> themselves out depending on how #26 goes. I'm not sure that I'd want to
> say that the "session" used in authentication should be reused for the
> matching. For example, say that we're using API keys, and I have one API
> key for a reconciliation API server. I may still want two different
> sessions for two different datasets - if I'm trying to reconcile a dataset
> of films against the knowledge base, and in a different OpenRefine project
> I'm reconciling sports venues, I might be using the same API key for both,
> but I'd want to train two different matchers, one for films and one for
> sports venues. So under my setup, I'd have those as two different
> sessions.
>
> Again, happy to pick this back up on Github, to continue it here, or to
> split between them - whatever folks would normally do!
>
> -Erik
>
> On Tue, Jun 1, 2021 at 9:23 AM Thad Guidry <thadguidry@gmail.com> wrote:
>
>> Erik,
>>
>> Regarding security, I don't think we need anything directly in the API
>> itself other than supporting authentication, as we've talked about in the
>> past in issue #26 <https://github.com/reconciliation-api/specs/issues/26>.
>> The reasoning would be that if folks want to have secure sessions, that
>> could still be done over encrypted connections via HTTPS, TLS, QUIC, etc.
>>
>> Thanks so much for exploring the topic of ML tuning weights!
>> I do think there should be the ability to have encrypted, authenticated
>> sessions (through whatever encrypted security protocol layer a
>> learning/recon service chooses to have with clients).
>>
>> Have you thought about what, if anything, is still missing in the API to
>> support encrypted authenticated sessions?
>>
>> Thad
>> https://www.linkedin.com/in/thadguidry/
>> https://calendly.com/thadguidry/
>>
>> On Tue, Jun 1, 2021 at 12:22 AM Erik Paulson <epaulson@unit1127.com>
>> wrote:
>>
>>> I wanted to explore the ML tuning weights issue discussed here:
>>> https://github.com/reconciliation-api/specs/issues/30
>>>
>>> specifically, by adding a "session" to the reconciliation API protocol.
>>> I built a very basic version here:
>>> https://github.com/epaulson/reconciliation-dedupe-test
>>>
>>> It's just a strawman to get a discussion started, if there's any
>>> interest in going this direction. The server supports clients creating
>>> persistent "sessions" that are unique to that client instance, and
>>> uploading additional training data for that session (which is not all
>>> that different from sending feedback about which match candidates were
>>> selected). If a client sends a reconcile request with that session ID,
>>> the server finds the latest training settings for that session and
>>> reconciles using those weights. If there is no session ID in the
>>> reconcile request, the server falls back to the default weights for the
>>> matching. The client can upload additional training data for a session,
>>> and the server will retrain the weights used for that session and use
>>> them for future /reconcile requests for that session.
>>>
>>> There's a bunch more to think about in the protocol - security, error
>>> handling, what states to represent/expose while the server is retraining
>>> on the new examples, etc. - but this was a start.
>>>
>>> (I know OpenRefine doesn't support anything like a reconciliation
>>> session yet, and you'd have to have a workflow like "star a bunch of
>>> examples that are pre-matched" so the reconcile action could know what
>>> to upload before sending candidates up, etc.
>>> For now, I just tested with a basic Python test script.)
>>>
>>> Semi-related: I wanted to find something to test with that actually
>>> supported training and providing feedback/retraining, so I used the
>>> DeDupe Python library:
>>> https://github.com/dedupeio/
>>>
>>> My example server is a little wonky, in part because this was v0.01 and
>>> I just wanted to get something done, and I've got some questions out to
>>> the DeDupe group on some API usage, to try to understand how to better
>>> support a reconciliation-like workflow, where you might not have many
>>> training examples at the start to tune your weights.
>>>
>>> It's probably not that far from being somewhat general purpose, and
>>> customizable on a per-dataset basis, by using the dedupe
>>> gazetteer-example tool to train on a few examples to create a generic
>>> settings file, and by editing the 'fields' dictionary in the scripts.
>>>
>>> The nice thing is that the team behind DeDupe is really, really good at
>>> entity matching and has a great library (and a nice SaaS offering if you
>>> want to match or deduplicate your datasets), so the DeDupe library could
>>> be very useful to other reconciliation API service implementers. DeDupe
>>> supports matching over multiple columns (in fact, that's the usual use
>>> case), so supporting additional properties in a query is straightforward
>>> for the library (though I have it turned off in my v0.01 server).
>>>
>>> I looked at using gitonthescene's csv-reconcile to build from - that
>>> server supports plugins, but the plugins only support scoring actual
>>> pairs, while DeDupe internally handles pair generation and does its own
>>> blocking, so it doesn't consider most candidate pairs. DeDupe is
>>> probably not a good fit for csv-reconcile.
>>>
>>> I'd love to talk more about getting training weights/example matches
>>> into the reconciliation service protocol, and about using DeDupe as a
>>> backend for reconciliation API servers.
>>>
>>> Thanks,
>>>
>>> -Erik
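The session behaviour described in Erik's June 1 message (per-session trained weights, with a stateless fallback to default weights when no session ID is sent) can be sketched in a few lines of Python. All names here are hypothetical stand-ins for whatever his prototype actually uses, and the retraining step is stubbed out rather than calling a real matcher such as DeDupe:

```python
# Sketch of a server-side session store for a reconciliation service.
# A reconcile request with no session ID gets the default weights (the
# stateless base case); a request with a known session ID gets weights
# "retrained" on that session's accumulated examples.

DEFAULT_WEIGHTS = {"name": 1.0}

class SessionStore:
    def __init__(self):
        # session_id -> {"examples": [...], "weights": {...}}
        self._sessions = {}

    def create_session(self, session_id):
        self._sessions[session_id] = {
            "examples": [],
            "weights": dict(DEFAULT_WEIGHTS),
        }

    def upload_training(self, session_id, examples):
        # Stand-in for retraining: a real server would hand the accumulated
        # examples to a matching library and store the learned settings.
        session = self._sessions[session_id]
        session["examples"].extend(examples)
        session["weights"] = {"name": 1.0 + 0.1 * len(session["examples"])}

    def weights_for(self, session_id=None):
        # Stateless base case: no (or unknown) session ID means defaults.
        if session_id is None or session_id not in self._sessions:
            return dict(DEFAULT_WEIGHTS)
        return self._sessions[session_id]["weights"]

store = SessionStore()
store.create_session("films-project")
store.upload_training("films-project", [("Alien", "Alien (1979 film)", True)])

print(store.weights_for())                 # stateless client: default weights
print(store.weights_for("films-project"))  # session-specific retrained weights
```

Two OpenRefine projects sharing one API key would simply hold two different session IDs here, which is the films-versus-sports-venues separation Erik describes earlier in the thread.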
Received on Thursday, 3 June 2021 18:50:53 UTC