Re: A server using the DeDupe library/exploring ML tuning weights

Vladimir: I commented on that issue a few weeks back, and did reference it
in my email. I always intended to add a pointer from that issue to this
email, but because that was my first post to the mailing list, it took a
couple of days for my email to reach the archives, so I couldn't link to it
right after posting. Also, much of the email isn't germane to the issue
being discussed: using DeDupe as the matching library isn't really part of
the protocol extension. There is now a link in the issue discussion to my
message on the list, though. I'm happy to take the discussion to GitHub or
to leave it here; I'll follow whatever the community norms are.

Thad: I agree that many of the security/privacy issues will sort themselves
out depending on how #26 goes. I'm not sure, though, that the "session" used
for authentication should be reused for the matching. For example, say that
we're using API keys, and I have one API key for a reconciliation API
server. I may still want two different sessions for two different datasets:
in one OpenRefine project I'm reconciling a dataset of films against the
knowledge base, and in another I'm reconciling sports venues. I might be
using the same API key for both, but I'd want to train two different
matchers, one for films and one for sports venues, so under my setup those
would be two different sessions.
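
To make the distinction concrete, here is a minimal sketch of one API key
owning several independent matching sessions. It is only an illustration
under stated assumptions, not the code in the test server and not part of
any spec: the names SessionStore, create, add_training and weights_for are
hypothetical, and retrain() is a stand-in that merely counts examples
rather than calling a real matching library.

```python
import uuid
from dataclasses import dataclass, field

# Hypothetical default "weights" used when a request carries no session ID.
DEFAULT_WEIGHTS = {"examples_seen": 0}


def retrain(examples):
    """Stand-in for real training; it only records how many examples exist."""
    return {"examples_seen": len(examples)}


@dataclass
class Session:
    api_key: str   # who owns the session (authentication)
    label: str     # e.g. "films" or "sports venues"
    examples: list = field(default_factory=list)
    weights: dict = field(default_factory=lambda: dict(DEFAULT_WEIGHTS))


class SessionStore:
    """Keeps matching sessions separate from the credential that owns them."""

    def __init__(self):
        self._sessions = {}

    def create(self, api_key, label):
        # One API key may create any number of sessions.
        session_id = uuid.uuid4().hex
        self._sessions[session_id] = Session(api_key, label)
        return session_id

    def add_training(self, api_key, session_id, examples):
        # New examples retrain only the session they were uploaded to.
        session = self._owned(api_key, session_id)
        session.examples.extend(examples)
        session.weights = retrain(session.examples)

    def weights_for(self, api_key, session_id=None):
        # No session ID: fall back to the default weights.
        if session_id is None:
            return dict(DEFAULT_WEIGHTS)
        return self._owned(api_key, session_id).weights

    def _owned(self, api_key, session_id):
        session = self._sessions.get(session_id)
        if session is None or session.api_key != api_key:
            raise KeyError("unknown session for this API key")
        return session
```

With this shape, a client holding a single key could create a "films"
session and a "sports venues" session, upload examples to one, and leave
the other (and any request without a session ID) on the default weights.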

Again, happy to pick this back up on GitHub, to continue it here, or to
split between them. Whatever folks would normally do!

-Erik


On Tue, Jun 1, 2021 at 9:23 AM Thad Guidry <thadguidry@gmail.com> wrote:

> Erik,
>
> Regarding security, I don't think we need anything directly in the API
> itself other than supporting authentication as we've talked about in the
> past in issue #26 <https://github.com/reconciliation-api/specs/issues/26>.
> The reasoning would be that if folks want to have secure sessions, then
> that could still be done over encrypted sessions via HTTPS, TLS, QUIC, etc.
>
> Thanks so much for exploring the topic of ML tuning weights!
> I do think there should be the ability to have encrypted authenticated
> sessions (through whatever encrypted security protocol layer a
> learning/recon service chooses to have with clients).
>
> Have you thought about what, if anything, is still missing in the API to
> support encrypted authenticated sessions?
>
> Thad
> https://www.linkedin.com/in/thadguidry/
> https://calendly.com/thadguidry/
>
>
> On Tue, Jun 1, 2021 at 12:22 AM Erik Paulson <epaulson@unit1127.com>
> wrote:
>
>> I wanted to explore the ML tuning weights issue discussed here:
>> https://github.com/reconciliation-api/specs/issues/30
>>
>> specifically, by adding a "session" to the reconciliation API protocol. I
>> built a very basic version here:
>> https://github.com/epaulson/reconciliation-dedupe-test
>>
>> It's just a strawman to get a discussion started, if there's any interest
>> in going this direction. The server supports clients creating persistent
>> "sessions" that are unique to that client instance, and uploading
>> additional training data for that session (which is not all that different
>> from sending feedback about match candidates selected.) If a client sends a
>> reconcile request with that session ID, the server finds the latest
>> training settings for that session and reconciles using those weights. If
>> there is no session ID in the reconcile request, the server falls back and
>> uses the default weights for the matching. The client can upload additional
>> training data for a session and the server will retrain the weights used
>> for that session and use them for future /reconcile requests for that
>> session.
>>
>> There's a bunch more to think about in the protocol about security, error
>> handling and what states to represent/expose while the server is retraining
>> based on the new examples, etc, but this was a start.
>>
>> (I know OpenRefine doesn't support anything like a reconciliation session
>> yet, and you'd have to have a workflow like "star a bunch of examples that
>> are pre-matched" so the reconcile action could know what to upload before
>> sending candidates up, etc. For now I just tested with a basic Python test
>> script.)
>>
>> Semi-related, I wanted to find something to test with that actually
>> supported training and providing feedback/retraining, so I used the DeDupe
>> Python library:
>> https://github.com/dedupeio/
>>
>> My example server is a little wonky, in part because this was v0.01 and I
>> just wanted to get something done, and I've got some questions out to the
>> DeDupe group on some API usage to try to understand how to better support a
>> reconciliation-like workflow, where you might not have many training
>> examples at the start to tune your weights.
>>
>> It's probably not that far from being somewhat general purpose, and
>> customizable on a per-dataset basis by using the dedupe gazetteer-example
>> tool to train on a few examples to create a generic settings file and
>> editing the 'fields' dictionary in the scripts.
>>
>> The nice thing is the team behind DeDupe is really, really good at entity
>> matching and has a great library (and a nice SaaS offering if you want to
>> match or deduplicate your datasets), so the DeDupe library could be very
>> useful to other reconciliation API service implementers. DeDupe supports
>> matching over multiple columns (in fact that's the usual use case), so
>> supporting additional properties in a query is straightforward for the
>> library (though I have it turned off in my v0.01 server).
>>
>> I looked at using gitonthescene's csv-reconcile to build from. That
>> server supports plugins, but the plugins only support scoring actual
>> pairs. DeDupe handles pair generation internally and does its own
>> blocking, so it never considers most candidate pairs, which makes DeDupe
>> probably not a good fit for csv-reconcile.
>>
>> I'd love to talk more about getting training weight/example matches into
>> the reconciliation service protocol, and to talk more about using DeDupe as
>> a backend for reconciliation API servers.
>>
>> Thanks,
>>
>> -Erik
>>
>

Received on Wednesday, 2 June 2021 22:34:40 UTC