Thoughts on the privacy problem

I've been thinking about this morning's privacy/caching problem a bit while staring off into the middle distance.

Ultimately, I don't think seeding the randomization with some aspect of the document (perhaps a checksum of the whole document, perhaps some mapping from the specific set of codepoints and features) is a productive way forward. If the randomization is determined by the document, then anyone who can read the server logs can construct a mapping from the documents they want to "spy on" to their respective noised sets and check the logs against it.
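To make that concern concrete, here is a minimal Python sketch of the lookup. Everything in it is an assumption for illustration: the noise pool, the SHA-256 seed, the number of extra codepoints, and the function names are hypothetical and not taken from any actual proposal.

```python
import hashlib
import random

# Hypothetical pool that the noise codepoints are drawn from.
NOISE_POOL = range(0x4E00, 0x4E00 + 2000)

def noised_request(text, extra=8):
    """Codepoints the document needs plus document-seeded noise."""
    needed = {ord(ch) for ch in text}
    # Assumption: the seed is a checksum of the whole document.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    candidates = [cp for cp in NOISE_POOL if cp not in needed]
    return frozenset(needed | set(rng.sample(candidates, extra)))

def build_watchlist(documents):
    """Precompute noised set -> document name for documents of interest."""
    return {noised_request(text): name for name, text in documents.items()}

def match_log(watchlist, logged_requests):
    """Return the watched documents whose noised set shows up in the log."""
    return [watchlist[req] for req in logged_requests if req in watchlist]
```

Because the noise is a pure function of the document, the same document always yields the same set, so the added codepoints end up being part of the fingerprint rather than hiding it.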

Now, one thought this brings up is that the possibility of such a mapping isn't a problem as long as there are enough documents that share it. That seems right but I think it may also show that the randomization itself isn't central to the design, and may actually work against it a little.

Our problem is that some of the codepoints in the document may be distinctive, especially in combination with the rest, and we don't want the server to know about that combination. So we throw some other codepoints (some of which may be distinctive) that aren't in the document into the set. Now the server can't tell which were actually in the document and which are ringers, and therefore there is a wider set of documents that could map to that combination. That's what the graphs in Garret's presentations are constructed to show.

Note that I didn't have to say anything about randomization to describe that. Indeed, if you just put all codepoints into groups of four somewhat analogous codepoints (so not random, but grouping obscure-but-still-somewhat-likely-to-appear-in-the-same-document codepoints together) and then always request all four when any one is needed, you'd accomplish (some of) the same thing. (And you'd also make your generated files a bit more cachable rather than less.)
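A minimal Python sketch of the fixed-grouping idea, under stated assumptions: the helper names are hypothetical, and chunking an ordered list into consecutive runs of four is only a stand-in for whatever "somewhat analogous" grouping would really be chosen.

```python
GROUP_SIZE = 4

def build_groups(ordered_codepoints, size=GROUP_SIZE):
    """Map each codepoint to the fixed group it belongs to.

    Assumption: the input is already ordered so that neighbours are
    plausible companions; consecutive runs of `size` become the groups.
    """
    groups = {}
    for i in range(0, len(ordered_codepoints), size):
        group = frozenset(ordered_codepoints[i:i + size])
        for cp in group:
            groups[cp] = group
    return groups

def grouped_request(text, groups):
    """Request the whole group for every codepoint the text needs."""
    request = set()
    for ch in text:
        cp = ord(ch)
        # Codepoints outside the table are requested on their own.
        request |= groups.get(cp, frozenset({cp}))
    return frozenset(request)
```

With groups built from the eight codepoints for "a" through "h", a document containing only "a" and one containing only "d" produce the same request, which is the aliasing we are after, and that request is the same every time, which is what helps caching.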

Truly random additional noise is a means to the same end, but deterministic noise has too high a chance of carrying some distinctive characteristic that picks out the source document. So maybe the thing to do is think about other means of getting the aliasing we want that aren't noise-based.

Skef

Received on Wednesday, 7 June 2023 04:46:47 UTC