- From: Antonin Delpeuch <antonin@delpeuch.eu>
- Date: Thu, 3 Jun 2021 08:44:44 +0200
- To: public-reconciliation@w3.org
- Message-ID: <0e6e4e94-d58c-59d8-7479-b1f863cdaf0f@delpeuch.eu>
Hi Erik,

Thank you so much for investigating this!

About having the discussion here or on the mailing list, I think it's great to have a thread about it on the mailing list. We have links in both directions so it should not be too hard for people to find the information they need.

I guess I should now find the time to make a similar prototype to demonstrate how I would see matching features being used for a client-side ML workflow (without adding any state to the server).

Best,
Antonin

On 03/06/2021 00:33, Erik Paulson wrote:
> Vladimir: I commented on that issue a few weeks back, and did
> reference it in my email. I always intended to put a pointer in that
> issue to this email, however, because that was my first post to the
> mailing list it took a couple of days before my email hit the archives
> so I couldn't get to it right after posting. Also, much of the email
> isn't germane to the issue being discussed - using DeDupe as the
> library for matching isn't really part of the protocol extension.
> However, there's now a link in the discussion to my message on the
> list. I'm happy to take the discussion to Github or to leave it here,
> whatever the community norms are for discussions I'm happy to follow.
>
> Thad: I agree that many of the security/privacy issues will sort
> themselves out depending on how #26 goes. I'm not sure that I'd want
> to say that the "session" used in authentication should be reused for
> the matching. For example, say that we're using API keys, and I have
> one API key for a reconciliation API server. I may still want two
> different sessions for two different datasets - if I'm trying to
> reconcile a dataset of films against the knowledge base, and maybe in
> a different OR project I'm reconciling sports venues. I might be using
> the same API key for both, but I'd want to train two different
> matchers, one for films and one for sports venues, so under my setup,
> I'd have those as two different sessions.
>
> Again, happy to pick this up back on Github or to continue this here,
> or to split between them. Whatever folks would normally do!
>
> -Erik
>
>
> On Tue, Jun 1, 2021 at 9:23 AM Thad Guidry <thadguidry@gmail.com> wrote:
>
>     Erik,
>
>     Regarding security, I don't think we need anything directly in the
>     API itself other than supporting authentication as we've talked
>     about in the past in issue #26
>     <https://github.com/reconciliation-api/specs/issues/26>.
>     The reasoning would be if folks want to have secure sessions then
>     that could still be done over encrypted sessions via HTTPS, TLS,
>     QUIC, etc.
>
>     Thanks so much for exploring the topic of ML tuning weights!
>     I do think there should be the ability to have encrypted
>     authenticated sessions (through whatever encrypted security
>     protocol layer a learning/recon service chooses to have with clients).
>
>     Have you thought about what is still missing in the API if
>     anything to support encrypted authenticated sessions?
>
>     Thad
>     https://www.linkedin.com/in/thadguidry/
>     https://calendly.com/thadguidry/
>
>
>     On Tue, Jun 1, 2021 at 12:22 AM Erik Paulson
>     <epaulson@unit1127.com> wrote:
>
>         I wanted to explore the ML tuning weights issue discussed here:
>         https://github.com/reconciliation-api/specs/issues/30
>
>         specifically, by adding a "session" to the reconciliation API
>         protocol. I built a very basic version here:
>         https://github.com/epaulson/reconciliation-dedupe-test
>
>         It's just a strawman to get a discussion started, if there's
>         any interest in going this direction.
>         The server supports clients creating persistent "sessions"
>         that are unique to that client instance, and uploading
>         additional training data for that session (which is not all
>         that different from sending feedback about match candidates
>         selected.) If a client sends a reconcile request with that
>         session ID, the server finds the latest training settings for
>         that session and reconciles using those weights. If there is
>         no session ID in the reconcile request, the server falls back
>         and uses the default weights for the matching. The client can
>         upload additional training data for a session and the server
>         will retrain the weights used for that session and use them
>         for future /reconcile requests for that session.
>
>         There's a bunch more to think about in the protocol about
>         security, error handling and what states to represent/expose
>         while the server is retraining based on the new examples, etc,
>         but this was a start.
>
>         (I know OpenRefine doesn't support anything like a
>         reconciliation session yet, and you'd have to have a workflow
>         like "star a bunch of examples that are pre-matched" so the
>         reconcile action could know what to upload before sending
>         candidates up, etc. For now I just tested with a basic python
>         test script)
>
>         Semi-related, I wanted to find something to test with that
>         actually supported training and providing feedback/retraining,
>         so I used the DeDupe Python library:
>         https://github.com/dedupeio/
>
>         My example server is a little wonky, in part because this was
>         v0.01 and I just wanted to get something done, and I've got
>         some questions out to the DeDupe group on some API usage to
>         try to understand how to better support a reconciliation-like
>         workflow, where you might not have many training examples at
>         the start to tune your weights.
>
>         It's probably not that far from being somewhat general
>         purpose, and customizable on a per-dataset basis by using the
>         dedupe gazetteer-example tool to train on a few examples to
>         create a generic settings file and editing the 'fields'
>         dictionary in the scripts.
>
>         The nice thing is the team behind DeDupe is really, really
>         good at entity matching and has a great library (and a nice
>         SaaS offering if you want to match or deduplicate your
>         datasets), so the DeDupe library could be very useful to other
>         reconciliation API service implementers. DeDupe supports
>         matching over multiple columns (in fact that's the usual
>         use case) so supporting additional properties in a query is
>         straightforward for the library (though I have it turned off
>         in my v0.01 server)
>
>         I looked at using gitonthescene's csv-reconcile to build from
>         - that server supports plugins, but the plugins only support
>         scoring actual pairs, and DeDupe internally handles pair
>         generation and does its own blocking so it doesn't consider
>         most candidate pairs, so DeDupe is probably not a good fit for
>         csv-reconcile.
>
>         I'd love to talk more about getting training weight/example
>         matches into the reconciliation service protocol, and to talk
>         more about using DeDupe as a backend for reconciliation API
>         servers.
>
>         Thanks,
>
>         -Erik
>
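
The session behaviour described in Erik's quoted message (per-session weights, fallback to defaults when no session ID is sent, retraining on each upload) could be sketched roughly as follows. This is only an illustration under stated assumptions: every name here (`SessionStore`, `retrain`, `reconcile`, `DEFAULT_WEIGHTS`) is hypothetical, it is neither the code in reconciliation-dedupe-test nor the DeDupe API, and the toy `retrain` function merely stands in for real model training.

```python
import uuid

# Hypothetical default field weights, used when a request carries no session ID.
DEFAULT_WEIGHTS = {"name": 1.0, "year": 1.0}


def retrain(examples):
    """Toy stand-in for real training (NOT DeDupe): weight each field by how
    often it agrees exactly in the pairs labelled as matches."""
    matches = [(a, b) for a, b, is_match in examples if is_match]
    if not matches:
        return dict(DEFAULT_WEIGHTS)
    return {
        field: sum(a.get(field) == b.get(field) for a, b in matches) / len(matches)
        for field in DEFAULT_WEIGHTS
    }


class SessionStore:
    """In-memory sessions, kept separate from whatever API key is in use."""

    def __init__(self):
        self._examples = {}  # session ID -> accumulated labelled pairs
        self._weights = {}   # session ID -> latest trained weights

    def create(self):
        session_id = uuid.uuid4().hex
        self._examples[session_id] = []
        self._weights[session_id] = dict(DEFAULT_WEIGHTS)
        return session_id

    def add_training(self, session_id, examples):
        # Each upload extends the session's examples and retrains its weights.
        self._examples[session_id].extend(examples)
        self._weights[session_id] = retrain(self._examples[session_id])

    def weights_for(self, session_id=None):
        # No session ID: fall back to the default weights.
        # An unknown session ID raises KeyError in this sketch.
        if session_id is None:
            return dict(DEFAULT_WEIGHTS)
        return self._weights[session_id]


def reconcile(store, query, candidates, session_id=None):
    """Rank candidates by the summed weights of the fields that agree with the query."""
    weights = store.weights_for(session_id)

    def score(candidate):
        return sum(w for f, w in weights.items() if query.get(f) == candidate.get(f))

    return sorted(candidates, key=score, reverse=True)


# One client (possibly one API key) holding two independent sessions.
store = SessionStore()
films = store.create()
venues = store.create()
store.add_training(films, [
    ({"name": "Alien", "year": 1979}, {"name": "Alien", "year": 1979}, True),
    ({"name": "Heat", "year": 1995}, {"name": "Heat", "year": 1996}, True),
])
ranked = reconcile(
    store,
    {"name": "Heat", "year": 1995},
    [{"id": "b", "name": "Other", "year": 1995}, {"id": "a", "name": "Heat", "year": 1996}],
    session_id=films,
)
```

Keeping the session ID separate from authentication mirrors the point made in the thread: the same API key can drive one matcher for films and another for sports venues, and training one session leaves the other on its own weights.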
Received on Thursday, 3 June 2021 06:45:48 UTC