Re: A server using the DeDupe library/exploring ML tuning weights

Just to be clear, I don't think it's a bad thing if there needs to be more
state at the server - for some situations that's good, and it will let you
do a better job of matching.

I hope that there can be a protocol that's possible to implement
completely statelessly at the server for the base case, with the option to
opt into features that need to store some state - the manifest can tell
clients what is and isn't supported on a particular server. That way,
clients that don't want to use the stateful features, or don't know about
them, can ignore those features and work like they do today.
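To make that concrete, here's a rough sketch of how a client could check a manifest for an optional stateful feature. The "extensions"/"sessions" fields are entirely hypothetical - nothing like them exists in the spec yet:

```python
# Sketch of a client checking a service manifest for optional stateful
# features. The "extensions" and "sessions" keys are hypothetical names,
# not part of the current Reconciliation Service API.
import json

manifest_json = """
{
  "name": "Example Reconciliation Service",
  "identifierSpace": "http://example.com/ids",
  "schemaSpace": "http://example.com/schema",
  "extensions": {
    "sessions": {"create": "/sessions", "feedback": "/sessions/{id}/training"}
  }
}
"""

manifest = json.loads(manifest_json)

def supports_sessions(manifest):
    """Return True if the server advertises the (hypothetical) session extension."""
    return "sessions" in manifest.get("extensions", {})

# Clients that don't know about the extension never look for it and keep
# using the stateless endpoints exactly as they do today.
print(supports_sessions(manifest))  # -> True for this example manifest
```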

Also - if folks haven't tried it, the Gazetteer example from dedupe is
really easy to get started with. It runs in a terminal and it's just
Python, no server needed:
https://github.com/dedupeio/dedupe-examples

It's worth just seeing that workflow.

-Erik


On Thu, Jun 3, 2021 at 1:47 AM Antonin Delpeuch <antonin@delpeuch.eu> wrote:

> Hi Erik,
>
> Thank you so much for investigating this!
>
> About having the discussion here or on the mailing list, I think it's
> great to have a thread about it on the mailing list. We have links in both
> directions so it should not be too hard for people to find the information
> they need.
>
> I guess I should now find the time to make a similar prototype to
> demonstrate how I would see matching features being used for a client-side
> ML workflow (without adding any state to the server).
>
> Best,
>
> Antonin
> On 03/06/2021 00:33, Erik Paulson wrote:
>
> Vladimir: I commented on that issue a few weeks back, and did reference it
> in my email. I always intended to put a pointer in that issue to this
> email; however, because that was my first post to the mailing list, it took
> a couple of days before my email hit the archives, so I couldn't do it
> right after posting. Also, much of the email isn't germane to the issue
> being discussed - using DeDupe as the library for matching isn't really
> part of the protocol extension. However, there's now a link in the
> discussion to my message on the list. I'm happy to take the discussion to
> GitHub or to leave it here - whatever the community norms are for
> discussions, I'm happy to follow.
>
> Thad: I agree that many of the security/privacy issues will sort
> themselves out depending on how #26 goes. I'm not sure I'd want to say
> that the "session" used in authentication should be reused for the
> matching. For example, say that we're using API keys, and I have one API
> key for a reconciliation API server. I may still want two different
> sessions for two different datasets - say I'm reconciling a dataset of
> films against the knowledge base, and in a different OpenRefine project
> I'm reconciling sports venues. I might be using the same API key for both,
> but I'd want to train two different matchers, one for films and one for
> sports venues, so under my setup I'd have those as two different sessions.
>
> Again, happy to pick this back up on GitHub, to continue it here, or to
> split between them. Whatever folks would normally do!
>
> -Erik
>
>
> On Tue, Jun 1, 2021 at 9:23 AM Thad Guidry <thadguidry@gmail.com> wrote:
>
>> Erik,
>>
>> Regarding security, I don't think we need anything directly in the API
>> itself other than supporting authentication, as we've talked about in the
>> past in issue #26 <https://github.com/reconciliation-api/specs/issues/26>.
>> The reasoning is that if folks want secure sessions, that could still be
>> done over encrypted connections via HTTPS, TLS, QUIC, etc.
>>
>> Thanks so much for exploring the topic of ML tuning weights!
>> I do think there should be the ability to have encrypted authenticated
>> sessions (through whatever encrypted security protocol layer a
>> learning/recon service chooses to have with clients).
>>
>> Have you thought about what, if anything, is still missing in the API to
>> support encrypted authenticated sessions?
>>
>> Thad
>> https://www.linkedin.com/in/thadguidry/
>> https://calendly.com/thadguidry/
>>
>>
>> On Tue, Jun 1, 2021 at 12:22 AM Erik Paulson <epaulson@unit1127.com>
>> wrote:
>>
>>> I wanted to explore the ML tuning weights issue discussed here:
>>> https://github.com/reconciliation-api/specs/issues/30
>>>
>>> specifically, by adding a "session" to the reconciliation API protocol.
>>> I built a very basic version here:
>>> https://github.com/epaulson/reconciliation-dedupe-test
>>>
>>> It's just a strawman to get a discussion started, if there's any
>>> interest in going this direction. The server supports clients creating
>>> persistent "sessions" that are unique to that client instance, and
>>> uploading additional training data for that session (which is not all
>>> that different from sending feedback about selected match candidates).
>>> If a client sends a reconcile request with that session ID, the server
>>> finds the latest training settings for that session and reconciles using
>>> those weights. If there is no session ID in the reconcile request, the
>>> server falls back to the default weights for the matching. The client
>>> can upload additional training data for a session, and the server will
>>> retrain the weights for that session and use them for future /reconcile
>>> requests for that session.
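A rough in-memory sketch of that session logic - the session store, the fallback to default weights, and retraining on new feedback. The weight format and the "retraining" step here are stand-ins for illustration; the real prototype is in the repo linked above:

```python
# Minimal sketch of the server-side session logic described above.
# DEFAULT_WEIGHTS and the retraining step are made-up stand-ins; a real
# server would refit an actual matcher (e.g. dedupe) on the new examples.
import uuid

DEFAULT_WEIGHTS = {"name": 1.0}

class SessionStore:
    def __init__(self):
        self._sessions = {}

    def create_session(self):
        """Create a persistent session unique to one client instance."""
        sid = str(uuid.uuid4())
        self._sessions[sid] = {"training": [], "weights": dict(DEFAULT_WEIGHTS)}
        return sid

    def add_training(self, sid, pairs):
        """Accept additional labeled pairs and 'retrain' the session weights."""
        session = self._sessions[sid]
        session["training"].extend(pairs)
        # Stand-in for retraining: nudge the weight by the training count.
        session["weights"]["name"] = 1.0 + 0.1 * len(session["training"])

    def weights_for(self, sid=None):
        """Weights for a reconcile request; no or unknown session ID falls
        back to the defaults, so stateless clients keep working."""
        if sid is None or sid not in self._sessions:
            return DEFAULT_WEIGHTS
        return self._sessions[sid]["weights"]

store = SessionStore()
sid = store.create_session()
store.add_training(sid, [("IBM", "International Business Machines", True)])
print(store.weights_for(sid))   # session-specific retrained weights
print(store.weights_for(None))  # default weights for sessionless requests
```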
>>>
>>> There's a bunch more to think about in the protocol about security,
>>> error handling and what states to represent/expose while the server is
>>> retraining based on the new examples, etc, but this was a start.
>>>
>>> (I know OpenRefine doesn't support anything like a reconciliation
>>> session yet, and you'd have to have a workflow like "star a bunch of
>>> examples that are pre-matched" so the reconcile action could know what to
>>> upload before sending candidates up, etc. For now I just tested with a
>>> basic Python test script.)
>>>
>>> Semi-related, I wanted to find something to test with that actually
>>> supported training and providing feedback/retraining, so I used the DeDupe
>>> Python library:
>>> https://github.com/dedupeio/
>>>
>>> My example server is a little wonky, in part because this was v0.01 and
>>> I just wanted to get something done, and I've got some questions out to the
>>> DeDupe group on some API usage to try to understand how to better support a
>>> reconciliation-like workflow, where you might not have many training
>>> examples at the start to tune your weights.
>>>
>>> It's probably not that far from being somewhat general-purpose and
>>> customizable on a per-dataset basis: use the dedupe gazetteer-example
>>> tool to train on a few examples, create a generic settings file, and
>>> edit the 'fields' dictionary in the scripts.
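For flavor, here's roughly what that per-dataset 'fields' definition looks like in the dedupe 2.x style (a list of variable dicts); the field names are hypothetical and the dedupe calls are shown commented out since they need the library installed and some labeled examples:

```python
# Sketch of the per-dataset 'fields' definition mentioned above, in the
# dedupe 2.x variable-dict style. The field names are hypothetical.
fields = [
    {"field": "name", "type": "String"},
    {"field": "address", "type": "String", "has missing": True},
]

# With the dedupe library installed, a server would then do roughly:
#   import dedupe
#   gazetteer = dedupe.Gazetteer(fields)
#   gazetteer.prepare_training(messy_data, canonical_data)
#   ... label a few examples, call gazetteer.train(), save a settings file.

# Swapping in a new dataset is then mostly a matter of editing this list.
for variable in fields:
    assert {"field", "type"} <= set(variable)
```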
>>>
>>> The nice thing is the team behind DeDupe is really, really good at
>>> entity matching and has a great library (and a nice SaaS offering if you
>>> want to match or deduplicate your datasets), so the DeDupe library could
>>> be very useful to other reconciliation API service implementers. DeDupe
>>> supports matching over multiple columns (in fact, that's the usual use
>>> case), so supporting additional properties in a query is straightforward
>>> for the library (though I have it turned off in my v0.01 server).
>>>
>>> I looked at using gitonthescene's csv-reconcile as a base - that server
>>> supports plugins, but the plugins only support scoring actual pairs,
>>> while DeDupe handles pair generation internally and does its own blocking
>>> so that it never considers most candidate pairs. DeDupe is probably not a
>>> good fit for csv-reconcile.
>>>
>>> I'd love to talk more about getting training weight/example matches into
>>> the reconciliation service protocol, and to talk more about using DeDupe as
>>> a backend for reconciliation API servers.
>>>
>>> Thanks,
>>>
>>> -Erik
>>>
>>

Received on Thursday, 3 June 2021 18:50:53 UTC