Re: Thoughts on the privacy problem

I should note that the unicode range approach I simulated is essentially
equivalent to this with group sizes of ~100, and it performed very well in
the simulations for number of matches. So I'm optimistic that we should be
able to find a smaller group size which performs well, while transferring a
much lower overall number of codepoints than the larger unicode range
groups.
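
As a rough illustration of the tradeoff (a toy model only, not the actual
simulation), the thing driving the numbers is that with fixed groups every
needed codepoint drags its whole group along, so the extra codepoints
transferred grow with the group size. The codepoint range and document below
are made up:

    import random

    def extra_codepoints(doc_codepoints, group_size, universe):
        # Partition the universe into consecutive groups of group_size, then
        # count everything transferred when each needed group is fetched whole.
        group_of = {cp: cp // group_size for cp in universe}
        needed_groups = {group_of[cp] for cp in doc_codepoints}
        transferred = sum(1 for cp in universe if group_of[cp] in needed_groups)
        return transferred - len(doc_codepoints)

    universe = list(range(0x4E00, 0x4E00 + 5000))   # pretend CJK-ish block
    doc = set(random.sample(universe, 200))         # pretend document
    for g in (10, 25, 50, 100):
        print(g, extra_codepoints(doc, g, universe))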

On Wed, Jun 7, 2023 at 6:41 PM Garret Rieger <grieger@google.com> wrote:

> Agreed that using deterministic noise (seeded from the content) is not
> viable for the reasons you mentioned.
>
> I like the idea of including groups of codepoints together. The
> specification actually had a mechanism for this in the very early days,
> where the code point mapping provided a list of codepoint groups that could
> be requested. This was primarily intended as a performance optimization to
> reduce the size of the codepoint sets that needed to be transmitted in the
> request. Eventually it was dropped once other techniques gave good enough
> compression of the sets. Given that it's potentially useful from the
> privacy side of things, we may want to bring it back.
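
For reference, the old mechanism could be sketched roughly like this (the
group list and names are invented, not taken from the early spec): the
client maps the codepoints it needs onto the published groups and requests
group indices, which keeps the request small.

    from typing import List, Set

    def groups_for_request(needed: Set[int], groups: List[Set[int]]) -> List[int]:
        """Indices of every published group containing a needed codepoint."""
        return [i for i, group in enumerate(groups) if group & needed]

    # Hypothetical group list carried by the font's codepoint mapping.
    groups = [
        {0x0041, 0x0042, 0x0043, 0x0044},   # A-D
        {0x00E9, 0x00E8, 0x00EA, 0x00EB},   # accented e's
    ]
    print(groups_for_request({0x0042, 0x00E9}, groups))   # -> [0, 1]
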
>
> I'll update my noise simulations to also simulate codepoint grouping and
> see how that fares. If it looks good then it might be a good solution to
> the privacy problem that also works with caching.
>
> On Tue, Jun 6, 2023 at 10:46 PM Skef Iterum <siterum@adobe.com> wrote:
>
>> I've been thinking about this morning's privacy/caching problem a bit
>> while staring off into the middle distance.
>>
>> Ultimately, I don't think seeding the randomization with some aspect of
>> the document (perhaps a checksum of the whole document, perhaps some
>> mapping from the specific set of codepoints and features) is a productive
>> way forward. If you have a randomization that is determined by the document
>> you can construct a mapping from documents you want to "spy on" to their
>> respective noised sets and check server logs against it.
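
To make that attack concrete, here is a sketch with an invented
document-seeded noising function: since the "noise" is a pure function of
the document, the observer can precompute each target document's noised set
and match it against logged requests.

    import hashlib
    import random

    def noised_set(doc_codepoints, noise_pool, k=50):
        # Stand-in for a document-seeded scheme: the noise is fully
        # determined by the document's codepoints.
        seed = hashlib.sha256(
            ",".join(map(str, sorted(doc_codepoints))).encode()).hexdigest()
        rng = random.Random(seed)
        return set(doc_codepoints) | set(rng.sample(sorted(noise_pool), k))

    def identify(logged_request, target_docs, noise_pool):
        # Attacker's table: document -> its fully predictable noised set.
        table = {name: noised_set(cps, noise_pool)
                 for name, cps in target_docs.items()}
        return [name for name, s in table.items() if s == logged_request]

    pool = set(range(0x3000, 0x3100))                  # made-up decoy pool
    targets = {"memo.html": {0x41, 0x42, 0x4E2D}, "wiki.html": {0x41, 0xE9}}
    observed = noised_set({0x41, 0x42, 0x4E2D}, pool)  # what the server logged
    print(identify(observed, targets, pool))           # -> ['memo.html']
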
>>
>> Now, one thought this brings up is that the possibility of such a mapping
>> isn't a problem as long as there are enough documents that share it. That
>> seems right, but I think it may also show that the randomization itself
>> isn't central to the design, and may actually work against it a little.
>>
>> Our problem is that some of the codepoints in the document may be
>> distinctive, especially in combination with the rest, and we don't want the
>> server to know about that combination. So we throw some other codepoints
>> (some of which may be distinctive) that aren't in the document into the
>> set. Now the server can't tell which were actually in the document and
>> which are ringers, and therefore there is a wider set of documents that
>> could map to that combination. That's what the graphs in Garret's
>> presentations are constructed to show.
>>
>> Note that I didn't have to say anything about randomization to describe
>> that. Indeed, if you just put all codepoints into groups of four somewhat
>> analogous codepoints (so not random, but grouping
>> obscure-but-still-somewhat-likely-to-appear-in-the-same-document codepoints
>> together) and then always request all four when any one is needed, you'd
>> accomplish (some of) the same thing. (And you'd also make your generated
>> files a bit more cacheable rather than less.)
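
A sketch of that grouping idea, with invented groups standing in for a real
choice of analogous codepoints: every codepoint belongs to one fixed group,
any request expands to the whole group, and the expanded result is the same
for every document touching that group (which is also what keeps it
cacheable).

    # Invented groups; a real grouping would pick ~4 analogous codepoints each.
    GROUPS = [
        {0x2E9B, 0x2E9C, 0x2E9D, 0x2E9E},
        {0xA726, 0xA727, 0xA728, 0xA729},
    ]
    GROUP_OF = {cp: frozenset(g) for g in GROUPS for cp in g}

    def expand(needed):
        """Expand a needed-codepoint set to the union of its fixed groups."""
        out = set()
        for cp in needed:
            out |= GROUP_OF.get(cp, {cp})   # ungrouped codepoints pass through
        return out

    print(sorted(expand({0x2E9B})))         # the server sees all four, not just one
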
>>
>> True random-seeded additional noise is a means to the same end, but
>> deterministic noise has too high a chance of leaving some distinctive
>> characteristic that picks out the source document. So maybe the thing to
>> do is think about other means of getting the aliasing we want that aren't
>> noise-based.
>>
>> Skef
>>
>

Received on Thursday, 8 June 2023 00:44:43 UTC