Re: [docs-and-reports] Principle: Don't use Entropy (#4) from Martin Thomson via GitHub on 2022-06-20 (public-patcg@w3.org from June 2022)

From: Martin Thomson via GitHub <sysbot+gh@w3.org>
Date: Mon, 20 Jun 2022 07:51:30 +0000
To: public-patcg@w3.org
Message-ID: <issue_comment.created-1160097619-1655711489-sysbot+gh@w3.org>

> If there were a privacy change that helped reduce at-scale privacy loss but kept the worst-case equal, I think we ought to consider making that change in a PATCG proposal.

As a motherhood statement, this is easy to agree with. I want to get at the basis for my concerns with this "at scale" metric though, because I think that any consideration of those effects needs to be secondary, and strictly so.

> [...] at-scale metrics [...] help represent whether pervasive tracking on the web can occur (e.g. many people tracked at once).

This is not a claim that I can agree with in the same way as the previous. The same goes for the economic aspect, which I don't consider to be a separate point; whether tracking is pervasive is a direct consequence of the economics. We see extensive tracking of online activity because it has become sufficiently efficient to do so relative to the value it provides.

The problem I see with basing decisions on an at-scale assessment is that any defense against tracking is fragile. Any individual person who has their identity linked across sites suffers an irrevocable loss of privacy. Once two sites have determined that user $u_a$ on site $a$ is the same as user $u_b$ on site $b$, no more information is needed. The two sites - or third parties present on both sites - can link the activities of that user on the two sites forever.

My understanding is that we're more or less resigned to the continuous release of information over time. With that, if a design allows sites to link identities for some users, then sites can accumulate cross-site linkages. Once a user is linked, they can be excluded from the set of users that participate in measurement, progressively improving the ability to link the identity of other users.

We are also committed to providing information that is scaled by site (with pairwise information revealing more information). That allows for transitive linkages to be created. Identity relationships are transitive across sites, so $u_a=u_b$ and $u_b=u_c$ together reveal that $u_a=u_c$ without $a$ and $c$ expending any of their tracking budget to learn that. More indirection adds to fragility, but it also allows for reinforcement of weaker linkages.
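
A minimal sketch of how such linkages compose, assuming each linkage is recorded as an equality between (site, user) pairs. The `IdentityLinks` class and the site and user labels are hypothetical, chosen only to mirror the $u_a$, $u_b$, $u_c$ notation; this is a plain union-find, not part of any proposal.

```python
class IdentityLinks:
    """Union-find over (site, user) pairs; an illustrative sketch only."""

    def __init__(self):
        self.parent = {}

    def find(self, node):
        # Register unseen nodes as their own root.
        self.parent.setdefault(node, node)
        # Walk to the root, halving the path as we go.
        while self.parent[node] != node:
            self.parent[node] = self.parent[self.parent[node]]
            node = self.parent[node]
        return node

    def link(self, x, y):
        # Record that x and y are the same person.
        self.parent[self.find(x)] = self.find(y)

    def same_person(self, x, y):
        return self.find(x) == self.find(y)


links = IdentityLinks()
links.link(("a", "u_a"), ("b", "u_b"))  # sites a and b link a user
links.link(("b", "u_b"), ("c", "u_c"))  # sites b and c link the same user

# Sites a and c never measured anything jointly, yet the link follows.
print(links.same_person(("a", "u_a"), ("c", "u_c")))
```

Each `link` call stands for one linkage that two sites paid for; the final query costs nothing beyond sharing what is already known.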

Though it might seem like this concern is about being able to target individuals, the concern is more rooted in providing high-fidelity information that crosses sites. Any system that creates clean demarcations between groups of users can be harnessed for tracking over time. Best case, sites aren't able to control how users are allocated into groups, so what information is released comes down to chance. Sites might be able to further refine groupings using external signals, like fingerprints, but this is still dependent on chance. Whether individuals are identifiable as a result will depend on how the population of visitors to the site is composed.

Proposals that don't give sites at least some control over how users are allocated to groupings are rare. For measurement at least, designs that offer no control to sites tend to be pretty useless. Proposals that give sites control over the allocation of users to groups allow for adaptive techniques that can progressively segment populations. Though groupings might be large, progressive divisions allow the identity of all users to be resolved in $\lceil\log_{G}(N)\rceil$ iterations for $N$ users and $G$ groupings. Even when $G$ is small and $N$ large, this doesn't take long. Dividing the population of the planet into just 2 groups each week provides information sufficient to uniquely identify every single person in 34 weeks.
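
To make the $\lceil\log_{G}(N)\rceil$ bound concrete, here is a small sketch that counts rounds with integer arithmetic. The function name and the population of 8 billion are assumptions for illustration. With that population, two groups per round gives 33 rounds; 34 binary rounds distinguish up to $2^{34}$ (about 17 billion) people, so the 34-week figure holds with some room to spare.

```python
def rounds_to_identify(population: int, groups: int) -> int:
    """Smallest k such that groups**k >= population."""
    if population < 1 or groups < 2:
        raise ValueError("need population >= 1 and groups >= 2")
    rounds, distinguishable = 0, 1
    # Each round multiplies the number of distinguishable users by `groups`.
    while distinguishable < population:
        distinguishable *= groups
        rounds += 1
    return rounds


ASSUMED_POPULATION = 8_000_000_000  # assumption, not a figure from the text

print(rounds_to_identify(ASSUMED_POPULATION, 2))   # two groups per round
print(rounds_to_identify(ASSUMED_POPULATION, 16))  # sixteen groups per round
```

This assumes an idealized adversary that can split every remaining group evenly in every round; real populations and real controls would be messier.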

This doesn't need to assume that the mechanisms we are considering are dedicated solely to tracking. Depending on the design, this sort of information leakage could be available for use in linking cross-site identity without impeding the use of the information for conversion measurement.

This is something of a redeeming characteristic of PCM: if you want to use PCM for tracking, you really need to dedicate all of its use to that end. Any use of PCM for tracking really eats into whatever you might gain from measuring conversions.

In comparison, the Chrome event-level proposal offers 2 billion identifiers per person, which offers plenty of opportunities to target each person many times. A true conversion can preempt any tracking attempt using trigger priority, simultaneously ensuring that conversion measurement works while performing tracking. Each conversion then just reduces the amount of tracking information that user provides. The filtering feature might make even prioritization unnecessary. The added randomization on each navigation event (p=0.0024 currently) is almost low enough that it can be ignored. Event sources (p=2.5e-6) can also be turned to this end, with 1=real, 0=tracking, with the observation that there is a third state where no report is generated. Careful sites can then run multiple tests to cancel this noise, up to the cap on the rate that reports can be generated.
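
As a rough illustration of why repeated tests cancel this noise, consider a deliberately simplified model: each report is independently replaced by noise with probability $p$, noise is always wrong (the worst case for the site), and the site takes a majority vote over several reports. This model and the function name are assumptions, not a description of how the Chrome proposal behaves in detail.

```python
from math import comb


def majority_wrong(p_noise: float, trials: int) -> float:
    """Probability that more than half of `trials` independent reports are noise."""
    need = trials // 2 + 1
    return sum(
        comb(trials, k) * p_noise**k * (1 - p_noise) ** (trials - k)
        for k in range(need, trials + 1)
    )


# p = 0.0024 is the navigation-event randomization rate quoted above.
for n in (1, 3, 5):
    print(n, majority_wrong(0.0024, n))
```

Under these assumptions the error rate falls by roughly two orders of magnitude for every two extra reports, so the cap on report rate, not the noise, is the practical limit.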

This leads neatly to my final point. The introduction of meaningful amounts of noise - as introduced by differential privacy - does change things. $\varepsilon=14$ isn't much noise[^1], so it can almost be ignored, but differential privacy with smaller values of $\varepsilon$ makes it difficult to attribute a single, unambiguous value to an individual. Maybe we are ultimately just talking about time-to-identification either way, but I like to think that we can resolve to pick a value of $\varepsilon$ that is a little better than that.
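
For a sense of scale, the sketch below evaluates the standard $\varepsilon$-differential-privacy bound $e^{\varepsilon}$ on the ratio of output probabilities, and the matching Bayesian bound on an observer's belief after one sample. The function names, the other $\varepsilon$ values, and the even prior are assumptions chosen for illustration.

```python
from math import exp


def max_likelihood_ratio(epsilon: float) -> float:
    """Upper bound on Pr(F(x)=S) / Pr(F(x')=S) under epsilon-DP."""
    return exp(epsilon)


def max_posterior(prior: float, epsilon: float) -> float:
    """Largest posterior belief in x over x' after one sample."""
    # Posterior odds are at most prior odds multiplied by e^epsilon.
    odds = prior / (1 - prior) * exp(epsilon)
    return odds / (1 + odds)


for eps in (0.1, 1, 3, 14):
    print(eps, max_likelihood_ratio(eps), max_posterior(0.5, eps))
```

At small $\varepsilon$ a single sample barely moves an even prior; at $\varepsilon=14$ the bound permits near certainty.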

[^1]: $\frac{Pr(F(x)=S)}{Pr(F(x')=S)} \le e^{14} \approx 1202604$ is a pretty big ratio of probabilities; you might say that it makes any hypothesis about $x$ over $x'$ fairly easy to advance with just one sample.

--
GitHub Notification of comment by martinthomson
Please view or discuss this issue at https://github.com/patcg/docs-and-reports/issues/4#issuecomment-1160097619 using your GitHub account

Received on Monday, 20 June 2022 07:51:32 UTC