Re: [w3ctag/design-reviews] Early design review for the Topics API (Issue #726)

Hey folks. Thanks for the discussion so far. Specific responses inline:

> The Topics API as proposed puts the browser in a position of sharing information about the user, derived from their browsing history, with any site that can call the API. This is done in such a way that the user has no fine-grained control over what is revealed, and in what context, or to which parties. It also seems likely that a user would struggle to understand what is even happening; data is gathered and sent behind the scenes, quite opaquely.

Note that the number of sites that can both call the API and receive an unfiltered response is quite small, because a caller has to have observed the user on a site about that topic in the past to get through the filter. The vast majority of sites that call the API will actually receive an empty list. For more details about this observer-based filtering, see [this part of the explainer](https://github.com/patcg-individual-drafts/topics#:~:text=only%20callers%20that%20observed%20the%20user%20visit%20a%20site%20about%20the%20topic%20in%20question%20within%20the%20past%20three%20weeks%20can%20receive%20the%20topic).
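To make that filter concrete, here is a minimal Python sketch of the observer gate described above (names and structure are my own for illustration, not Chrome's implementation; the "past three weeks" maps to three weekly epochs):

```python
# Minimal sketch of the observer-based filter (names and data layout are
# illustrative, not Chrome's implementation). Each epoch is one set of
# topics this caller observed the user visiting sites about.

EPOCHS_CHECKED = 3  # "past three weeks" from the explainer

def filter_topics(user_top_topics, caller_observations):
    """Return only those of the user's top topics that this caller observed
    the user visiting a site about, within the last EPOCHS_CHECKED epochs."""
    observed = set()
    for epoch_topics in caller_observations[-EPOCHS_CHECKED:]:
        observed |= epoch_topics
    return [t for t in user_top_topics if t in observed]

# A caller that never observed the user on a "/Sports" site gets an empty list:
print(filter_topics(["/Sports"], [set(), set(), set()]))        # []
# A caller that did observe it within the window receives the topic:
print(filter_topics(["/Sports"], [{"/Sports"}, set(), set()]))  # ['/Sports']
```

The point the sketch makes: the empty-list case is the common one, because most callers have never observed the user on a site about any of their top topics.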

Both users and websites can opt out of the Topics API, and clearing browsing history removes the cleared sites' influence on the user's generated topics. Generally speaking, UX is not part of the specification discussion. That said, Chrome settings already provide UX to opt out of individual topics that have been selected, and we're looking into UX for opting out of any given topic preemptively. Your criticisms all apply equally to third-party cookies, and in each case Topics offers a very large step forward in user understanding and control.


> The responses to the proposal from [Webkit](https://github.com/WebKit/standards-positions/issues/111#issuecomment-1359609317) and [Mozilla](https://github.com/mozilla/standards-positions/issues/622#issuecomment-1372979100) highlight the tradeoffs between serving a diverse global population, and adequately protecting the identities of individuals in a given population. Shortcomings on neither side of these tradeoffs are acceptable for web platform technologies.

It is important to point out the underlying physics that we all must adhere to: any proposal in this space (from any company) has some notion of a data-leakage rate built in, regardless of the choice of privacy mechanism. As time passes the leakage is additive, and eventually a cross-site identifier can be derived; it's only a matter of how long that takes. This applies to WebKit's PCM and Mozilla's IPA proposals as well: every API here is about trade-offs.
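As a back-of-envelope illustration of why the leakage is additive (the numbers and the bits-per-epoch model are purely illustrative, not measurements of any real API):

```python
import math

# Back-of-envelope model of additive leakage: if an API reveals roughly
# k bits of cross-site information per epoch, singling out one user among
# N takes about log2(N) / k epochs. All numbers are illustrative.

def epochs_to_identify(population, bits_per_epoch):
    return math.ceil(math.log2(population) / bits_per_epoch)

print(epochs_to_identify(1_000_000, 1))   # 20 epochs at 1 bit per epoch
print(epochs_to_identify(1_000_000, 20))  # 1 epoch: a stable ID leaks everything at once
```

A third-party cookie is the degenerate case: it leaks a full identifier in a single visit, so the "time to re-identification" is effectively zero.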

For the Topics API, our [study](https://github.com/patcg-individual-drafts/topics/blob/main/topics_analysis.pdf) suggests that it would take tens of weeks of revisiting the same two pages to re-identify the vast majority of users across those pages using only the data from the API. We consider that a substantial win in privacy compared to third-party cookies, where cross-site re-identification takes a single visit. We could switch from average-case to worst-case analysis (and crank up the random noise), but at a trade-off with utility. These kinds of analyses and trade-offs are what we expect to continue tuning going forward.
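A rough sketch of the noise knob being tuned here, assuming the explainer's ~5% random-topic rate (the taxonomy size and all names are illustrative):

```python
import random

NOISE_P = 0.05                                 # explainer's ~5% random-topic rate
TAXONOMY = [f"topic-{i}" for i in range(350)]  # taxonomy size is illustrative

def topic_for_epoch(real_top_topics, rng=random):
    """The topic a caller sees for one epoch: usually drawn from the user's
    real top topics, occasionally uniformly random for plausible deniability."""
    if rng.random() < NOISE_P:
        return rng.choice(TAXONOMY)
    return rng.choice(real_top_topics)
```

Raising `NOISE_P` lengthens the time to re-identification in the analysis above, but at a direct cost to how useful the returned topics are; that is the trade-off under discussion.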

> It's also clear from the positions shared by Mozilla and Webkit that there is a lack of multi-stakeholder support. We remain concerned about fragmentation of the user experience if the Topics API is implemented in a limited number of browsers, and sites that wish to use it prevent access to users of browsers without it (a different scenario from the user having disabled it in settings).

We’re interested in finding solutions to this use case, especially ones that garner multi-stakeholder support. That said, the concerns you raise about browser fragmentation do not seem to have prevented similar privacy-related launches in Mozilla or WebKit that increased fragmentation. And a Chrome migration from third-party cookies to an API like Topics will bring browser behavior much closer together, not drive it further apart.



> We are particularly concerned by the opportunities for sites to use additional data gathered over time by the Topics API in conjunction with other data gathered about a site visitor, either via other APIs, via out of band means, and/or via existing tracking technologies in place at the same time, such as fingerprinting.

If these sorts of covert tracking practices are in use, then the Topics API will not provide any new information at all — recall that any party that can recognize a person across the various sites in which the party is embedded already has a large superset of the information available to the Topics algorithm.

While extra correlations might be inferred beyond what the taxonomy provides, Topics has significantly better protections against inferring sensitive correlations than third-party cookies or alternative tracking technologies like fingerprinting, which remain possible across all browsers.


> Further, if the API were both effective and privacy-preserving, it could nonetheless be used to customise content in a discriminatory manner, using stereotypes, inferences or assumptions based on the topics revealed (eg. a topic could be used - accurately or not - to infer a protected characteristic, which is thereby used in selecting an advert to show). Relatedly, there is no binary assessment that can be made over whether a topic is "sensitive" or not. This can vary depending on context, the circumstances of the person it relates to, as well as change over time for the same person.

These concerns are also discussed in our explainer. In the end, what can be learned from these human-curated topics, derived from pages the user visits, is probabilistic and far less detailed than what cookies provide via precise cross-site identifiers. While imperfect, this is clearly better for user privacy than cookies. We understand that each user cares about different things, which is why we provide controls, including the ability to turn off specific topics or to turn off Topics entirely.
 
> Giving the web user access to browser settings to configure which topics can be observed and sent, and from/to which parties, would be a necessary addition to an API such as this, and go some way towards restoring agency of the user, but is by no means sufficient. People can become vulnerable in ways they do not expect, and without notice. People cannot be expected to have a full understanding of every possible topic in the taxonomy as it relates to their personal circumstances, nor of the immediate or knock-on effects of sharing this data with sites and advertisers, and nor can they be expected to continually revise their browser settings as their personal or global circumstances change.

The UX is still evolving here, but users can already opt out of the API as a whole and out of individual topics. I generally expect users who have sensitivities around Topics to disable the API entirely rather than ferret out individual concerns. You seem to be approaching this discussion from the perspective that third-party cookies simply do not exist on the web and that Topics is introducing these behaviors, whereas we’re weighing the substantial gain in privacy relative to where we are today with third-party cookies.

> A portion of topics returned by the API are proposed to be randomised, in part to enable plausible deniability of the results. The usefulness of this mitigation may be limited in practice; an individual who wants to explain away an inappropriate ad served on a shared computer cannot be expected to understand the low level workings of a specific browser API in a contentious, dangerous or embarrassing situation (assuming a general cultural awareness of the idea of targeted ads being served based on your online activities or even being "listened to" by your devices, which does not exist everywhere, but is certainly pervasive in some places/communities).

I wouldn’t expect users to understand the probabilistic deniability built into the privacy technology we use today either. That said, you seem to be suggesting that personalized advertising in general is bad because someone might look over the user’s shoulder or use their computer, and the user might be embarrassed. I’d note that 1) sharing a computer has far greater embarrassment potential than this API, 2) personalized advertising comes about in many ways (first-party data, contextual data, inferences, geo-IP, etc.), and 3) personalized advertising is often wrong today, even with the much more powerful third-party cookies.
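To put a number on the deniability point, here is a small Bayesian sketch (the ~5% noise rate comes from the explainer; the taxonomy size and prior are illustrative assumptions):

```python
# Rough Bayesian look at "plausible deniability" from the random-topic noise.
# The ~5% noise rate is from the explainer; taxonomy size and the prior
# p_real are illustrative assumptions, not measurements.
NOISE_P, TAXONOMY_SIZE = 0.05, 350

def p_was_noise(p_real):
    """Posterior probability that an observed topic was the random draw,
    given prior probability p_real that it is genuinely a top topic."""
    p_noise = NOISE_P / TAXONOMY_SIZE    # this exact topic appearing by chance
    p_genuine = (1 - NOISE_P) * p_real   # this topic appearing for real
    return p_noise / (p_noise + p_genuine)

print(p_was_noise(0.0))  # 1.0: a topic the user never had is always attributable to noise
print(round(p_was_noise(0.2), 4))  # common topics for this user are much harder to deny
```

Whether users can articulate this math is a separate question from whether the deniability exists; the noise makes "that topic was random" literally true some of the time.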

I appreciate your feedback and remain open to any suggestions you have for how the API might improve.



-- 
Reply to this email directly or view it on GitHub:
https://github.com/w3ctag/design-reviews/issues/726#issuecomment-1501975149

Message ID: <w3ctag/design-reviews/issues/726/1501975149@github.com>

Received on Monday, 10 April 2023 15:47:29 UTC