[w3ctag/design-reviews] Early design review for the Topics API (Issue #726)

Braw mornin' TAG!

I'm requesting a TAG review of the Topics API.

The intent of the Topics API is to provide callers (including third-party ad-tech or advertising providers on the page that run script) with coarse-grained advertising topics that the page visitor might currently be interested in. These topics will supplement the contextual signals from the current page and can be combined to help find an appropriate advertisement for the visitor.
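As a concrete illustration, a caller might query the API from script roughly as follows. This is a hedged sketch: the `browsingTopics()` method name and return shape are taken from the explainer and may evolve.

```javascript
// Sketch of how an ad-tech caller might query the Topics API.
// Hedged: the method name and return shape come from the explainer
// and may change; treat this as illustrative, not normative.
async function getAdTopics(doc) {
  // Feature-detect: browsers that don't implement the API simply
  // won't expose the method, and callers should degrade gracefully.
  if (typeof doc.browsingTopics !== 'function') return [];
  try {
    // Resolves to an array of coarse-grained topic objects observed
    // for this caller's site, e.g. [{ topic: 3, ... }].
    return await doc.browsingTopics();
  } catch (e) {
    // e.g. the API is blocked by permissions policy on this page.
    return [];
  }
}
```

Callers would then combine the returned topics with contextual signals from the page when selecting an ad.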


  - Explainer¹ (minimally containing user needs and example code): https://github.com/jkarlin/topics

  - User research: [url to public summary/results of research]
  - Security and Privacy self-review²: See below
  - GitHub repo (if you prefer feedback filed there): https://github.com/jkarlin/topics

  - Primary contacts (and their relationship to the specification):
      - Josh Karlin, jkarlin@, Google
      - Yao Xiao, xyaoinum@, Google
  - Organization/project driving the design: Chrome Privacy Sandbox
  - External status/issue trackers for this feature (publicly visible, e.g. Chrome Status): https://chromestatus.com/feature/5680923054964736


Further details:

  - [x] I have reviewed the TAG's [Web Platform Design Principles](https://www.w3.org/TR/design-principles/)
  - The group where the incubation/design work on this is being done (or is intended to be done in the future): Either WICG or PATCG
  - The group where standardization of this work is intended to be done ("unknown" if not known): unknown
  - Existing major pieces of multi-stakeholder review or discussion of this design: Lots of discussion on https://github.com/jkarlin/topics/issues/, and a white paper on fingerprintability analysis: https://github.com/jkarlin/topics/blob/main/topics_analysis.pdf 
  - Major unresolved issues with or opposition to this design: We believe that the proposed API leans heavily towards user privacy in the privacy/utility tradeoff, as it should. But, the API’s utility isn’t yet clear. Until we try the API in an experiment, we can’t know for sure how the API will perform. Some changes are likely going to be needed. Knobs we may tweak include, but are not limited to, topics in the taxonomy, weights of the topics in the taxonomy, how a site might suggest topics for itself, and how we might get topic data from more places than just the domain (e.g., from the url if there is some signal that the url is privacy safe to parse).
  - This work is being funded by: Chrome

You should also know that...

This API was developed in response to feedback that we (Chrome) received on our first interest-based advertising proposal, FLoC. That feedback came from the TAG, other browsers, advertisers, and our users. We appreciate this feedback, and look forward to your thoughts on this API.

At the bottom of this issue are both the security survey responses and responses to questions the TAG asked about [FLoC](https://github.com/w3ctag/design-reviews/issues/601), answered here in terms of Topics.


We'd prefer the TAG provide feedback as:

  ☂️ open a single issue in our GitHub repo **for the entire review**


## Self Review Questionnaire: Security & Privacy
### 2.1. What information might this feature expose to Web sites or other parties, and for what purposes is that exposure necessary?
* It exposes one of the user’s top-5 topics from the previous week to the caller, if the calling context’s site also called the Topics API for the same user on a page about that topic in the past three weeks. This is information that could otherwise have been obtained using third-party cookies. The part that might not have been obtainable with third-party cookies is that this is a top topic for the user, which is more global knowledge than any single third party may have been able to ascertain.
* 5% of the time the topic is uniformly random.
* The topic comes from a taxonomy. The initial proposed taxonomy is here: https://github.com/jkarlin/topics/blob/main/taxonomy_v1.md

* The topic returned (if one of the top 5, and not the random topic) is chosen at random among the top 5, and the choice is fixed per calling top-frame site. So if any frame on a.com calls the API, it might get the topic at index 3, while callers on b.com might get the topic at index 1 for that week. This reduces cross-site correlation/fingerprintability.
* Topics are derived only from sites the user visited that called the API.
* Topics are derived only from the domain of the site, not its URL or content, though this may change depending on utility results.
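The per-site selection described above can be sketched as a deterministic choice among the user's top 5 topics, keyed by (site, epoch), plus a 5% uniformly random response. This is an illustrative simulation only, not the browser's implementation; the hash, RNG, and state handling here are assumptions made for testability.

```javascript
// Illustrative simulation of per-site topic selection (not the real
// implementation). Repeat calls from the same site in the same epoch
// (week) get the same topic, while different sites may get different
// ones. 5% of the time a uniformly random topic from the whole
// taxonomy is returned instead; the RNG is injectable for testing.
function topicForSite(topTopics, site, epoch, taxonomySize, rand = Math.random) {
  if (rand() < 0.05) {
    // 5% chance: uniformly random topic ID from the whole taxonomy.
    return 1 + Math.floor(rand() * taxonomySize);
  }
  // Simple deterministic hash of (site, epoch) -> index into top 5.
  let h = 0;
  const key = `${site}|${epoch}`;
  for (const ch of key) h = (h * 31 + ch.codePointAt(0)) >>> 0;
  return topTopics[h % topTopics.length];
}
```

The key property this models: a.com and b.com can each get a stable topic for the week, but those topics need not match, which limits cross-site joining of the signal.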


### 2.2 Do features in your specification expose the minimum amount of information necessary to enable their intended uses?
Yes. The entire design of the API is to minimize the amount of information about the user that is exposed in order to provide for the use case. We have also provided a theoretical (and applied) analysis of the cross-site fingerprinting information that is revealed: https://github.com/jkarlin/topics/blob/main/topics_analysis.pdf



### 2.3. How do the features in your specification deal with personal information, personally-identifiable information (PII), or information derived from them?

The API intentionally provides some information about the user to the calling context. We’ve reduced the ability to use this information as a global identifier (cross site fingerprinting surface) as much as possible.

### 2.4. How do the features in your specification deal with sensitive information?

Sensitive information is reduced by allowing only topics in the taxonomy that Chrome and the IAB have deemed not sensitive (the topics in the proposed initial taxonomy are derived from the two organizations’ respective advertising taxonomies).

This does not mean that topics in the taxonomy, or groups of topics learned about the user over time, cannot be correlated with sensitive topics. This may be possible.

### 2.5. Do the features in your specification introduce new state for an origin that persists across browsing sessions?
The API provides some information about the user’s browsing history, and this is stored in the browser. The filtering mechanism, which provides a topic to a calling context only if that context has observed the user on a page about that topic in the past, also stores data. This could be used to learn whether the user has visited a specific site in the past (which third-party cookies can do quite easily today), and we’d like to make that hard. There may be interventions that the browser can take to detect and prevent such abuses.

### 2.6. Do the features in your specification expose information about the underlying platform to origins?
No.

### 2.7. Does this specification allow an origin to send data to the underlying platform?
The top-frame site’s domain is read to determine a topic for the site. 


### 2.8. Do features in this specification enable access to device sensors?
No.

### 2.9. Do features in this specification enable new script execution/loading mechanisms?
No.

### 2.10. Do features in this specification allow an origin to access other devices?
No.

### 2.11. Do features in this specification allow an origin some measure of control over a user agent’s native UI?
No.

### 2.12. What temporary identifiers do the features in this specification create or expose to the web?
The topics that are returned by the API. They are per-epoch (week), per-user, and per-site, and are cleared when the user clears state.


### 2.13. How does this specification distinguish between behavior in first-party and third-party contexts?
The topic is only returned to the caller if the calling context’s site has also called the API, for the same user, on a domain about that topic in the past three weeks. So whether the API returns anything depends on the calling context’s site.
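The observation-based filter described above can be modeled as a per-caller lookup. The sketch below is an illustrative model under stated assumptions (integer epoch numbers, a three-epoch window), not the browser's actual data structures.

```javascript
// Illustrative model of the observation-based filter (not the real
// implementation). A topic is released to a caller only if that
// caller observed the user on a page about that topic within the
// last three epochs (weeks).
class TopicFilter {
  constructor() {
    // Map: caller site -> Map: topic -> last epoch observed
    this.observations = new Map();
  }
  recordObservation(callerSite, topic, epoch) {
    if (!this.observations.has(callerSite)) {
      this.observations.set(callerSite, new Map());
    }
    this.observations.get(callerSite).set(topic, epoch);
  }
  canReceive(callerSite, topic, currentEpoch) {
    const seen = this.observations.get(callerSite)?.get(topic);
    return seen !== undefined && currentEpoch - seen <= 3;
  }
}
```

The design consequence is that a caller with no prior footprint on topic-relevant sites learns nothing, which is the mechanism several later answers in this review rely on.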

### 2.14. How do the features in this specification work in the context of a browser’s Private Browsing or Incognito mode?
The API returns an empty list in incognito mode. We feel that this is safe because there are many reasons that an empty list might be returned. e.g., because the user is new, because the user is in incognito, because the site has not seen this user on relevant sites with the associated topics in the past three weeks, because the user has disabled the API via UX controls.

This is effectively the same behavior as the user being new, so this is basically the API working the same within incognito mode as in regular mode. We could have instead returned random topics in incognito (and for new users) but this has the deleterious effect of significantly polluting the API with noise. Plus, we don’t want to confuse users/developers by having the API return values when they expect it not to (e.g., after disabling the API).

### 2.15. Does this specification have both "Security Considerations" and "Privacy Considerations" sections?
There is no formal specification yet, but the explainer goes into detail on the privacy considerations. The primary security consideration is that the API reveals information beyond third-party cookies, in that learning a topic means that the topic is one of the user’s top topics for the week.

### 2.16. Do features in your specification enable origins to downgrade default security protections?
No.

### 2.17. How does your feature handle non-"fully active" documents?
No special considerations.


## Responses to questions from the FLoC TAG review, as they apply to Topics
### Sensitive categories
> The documentation of "sensitive categories" visible so far are on google ad policy pages. Categories that are considered  "sensitive" are, as stated, not likely to be universal, and are also likely to change over time. I'd like to see:
> * an in-depth treatment of how sensitive categories will be determined (by a diverse set of stakeholders, so that the definition of "sensitive" is not biased by the backgrounds of implementors alone);
> * discussion of if it is possible - and desirable (it might not be) - for sensitive categories to differ based on external factors (eg. geographic region);
> * a persistent and authoritative means of documenting what they are that is not tied to a single implementor or company;
> * how such documentation can be updated and maintained in the long run;
> * and what the spec can do to ensure implementers actually abide by restrictions around sensitive categories.
> Language about erring on the side of user privacy and safety when the "sensitivity" of a category is unknown might be appropriate.

A key difference between Topics and cohorts is that the Topics taxonomy is human-curated, whereas cohorts were the result of a clustering algorithm and had no obvious meaning. The advantage of a topics-based approach is that we can help to clarify which topics are exposed. For instance, the initial taxonomy we intend to use includes topics that appear in both the IAB’s content taxonomy and Google’s advertising taxonomy. This ensures that at least two separate entities have reviewed the topics for sensitive categories. Assuming that the API is successful, we would be happy to consider a third-party maintainer of the taxonomy that incorporates both relevant advertising interests and up-to-date sensitivities.


### Browser support
> I imagine not all browsers will actually want to implement this API. Is the result of this, from an advertisers point of view, that serving personalised ads is not possible in certain browsers? Does this create a risk of platform segmentation in that some websites could detect non-implementation of the API and refuse to serve content altogether (which would severely limit user choice and increase concentration of a smaller set of browsers)? A mitigation for this could be to specify explicitly 'not-implemented' return values for the API calls that are indistinguishable from a full implementation.

> The description of the experimentation phase mentions refreshing cohort data every 7 days; is timing something that will be specified, or is that left to implementations? Is there anything about cohort data "expiry" if a browser is not used (or only used to browse opted-out sites) for a certain period?

As always, it is up to each browser to determine which use cases and APIs it wishes to support. Returning empty lists is completely reasonable, though a caller could still use the UA string to determine whether the API is really supported. I’m not sure that there is a good solution here.

In regards to the duration of a topic, I think that is likely to be per-UA.

In the Topics API, we help ensure that each topic has a minimum number of users by returning a uniformly random response 5% of the time.


### Opting out
> I note that "Whether the browser sends a real FLoC or a random one is user controllable" which is good. I would hope to see some further work on guaranteeing that the "random" FLoCs sent in this situation does not become a de-facto "user who has disabled FLoC" cohort.
> It's worth further thought about how sending a random "real" FLoC affects personalised advertising the user sees - when it is essentially personalised to someone who isn't them. It might be better for disabling FLoC to behave the same as incognito mode, where a "null" value is sent, indicating to the advertiser that personalised advertising is not possible in this case.
> I note that sites can opt out of being included in the input set. Good! I would be more comfortable if sites had to explicitly opt in though.
> Have you also thought about more granular controls for the end user which would allow them to see the list of sites included from their browsing history (and which features of the sites are used) and selectively exclude/include them?
> If I am reading this correctly, sites that opt out of being included in the cohort input data cannot access the cohort information from the API themselves. Sites may have very legitimate reasons for opting out (eg. they serve sensitive content and wish to protect their visitors from any kind of tracking) yet be supported by ad revenue themselves. It is important to better explore the implications of this.

The current plan is for the Topics API to return an empty list in incognito mode. 

Sites opt in by using the API. If the API is not used, the site will not be included. Sites can also prevent third parties from calling the API on their site via permissions policy.
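For example, a site could deny all third parties the ability to call the API with a response header along these lines. This is a sketch: the `browsing-topics` feature name is taken from the explainer and may change before standardization.

```http
Permissions-Policy: browsing-topics=()
```

A site that never calls the API and sends such a header both contributes nothing to topic computation and blocks embedded parties from calling the API on its pages.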

In regards to granular controls, we feel that this is possible with Topics (less so with FLoC), and we expect to expose via UX the topics that are being returned and to allow users to opt out of the API completely or to disable individual topics.

The API is designed to facilitate ecosystem participation: calling the API is both the way to contribute to the ecosystem and the way to receive value from it. We do not want sites to be able to get topics without also supporting the ecosystem.
 
### Centralisation of ad targeting
> Centralisation is a big concern here. This proposal makes it the responsibility of browser vendors (a small group) to determine what categories of user are of interest to advertisers for targeting. This may make it difficult for smaller organisations to compete or innovate in this space. What mitigations can we expect to see for this?
> How transparent / auditable are the algorithms used to generates the cohorts going to be? When some browser vendors are also advertising companies, how to separate concerns and ensure the privacy needs of users are always put first?
 
The Topics API helps to address broad, topic-based advertising. For more niche interests, we suggest the use of alternative sandbox APIs like FLEDGE.
In terms of transparency, the API is implemented in open-source code, the design is occurring on GitHub with an active community, and the ML model used to classify topics will be available for anyone to evaluate.
 
### Accessing cohort information
> I can't see any information about how cohorts are described to advertisers, other than their "short cohort name". How does an advertiser know what ads to serve to a cohort given the value "43A7"? Are the cohort descriptions/metadata served out of band to advertisers? I would like an idea of what this looks like.
 
With Topics, a topic’s name in the taxonomy is its semantic meaning.


### Security & privacy concerns
> I would like to challenge the assertion that there are no security impacts.
> * A large set of potentially very sensitive personal data is being collected by the browser to enable cohort generation. The impact of a security vulnerability causing this data to be leaked could be great.

In Chrome, the renderer is only aware of the topic for the given site. The browser stores information about which callers were on each top-level site, and whether the API was called. This is significantly better than the data stored for third-party cookies.

> * The explainer acknowledges that sites that already know PII about the user can record their cohort - potentially gathering more data about the user than they could ever possibly have access to without explicit input from the user - but dismisses this risk by comparing it to the status quo, and does not mention this risk in the Security & Privacy self-check.

The Topics API, unlike FLoC, only allows a site to learn topics if the caller has observed the user on a site about that topic. So it is no longer easy to learn more about the user than they could have without explicit input from the user. 

> * Sites which log cohort data for their visitors (with or without supplementary PII) will be able to log changes in this data over time, which may turn into a fingerprinting vector or allow them to infer other information about the user.

Topics is more difficult to use as a cross-site fingerprinting vector because different sites receive different topics during the same week. We have a white paper studying the impact of this: https://github.com/jkarlin/topics/blob/main/topics_analysis.pdf

Logging data over time does still increase knowledge about the user however. We’ve limited this as much as we think is possible.
 
> * We have seen over past years the tendency for sites to gather and hoard data that they don't actually need for anything specific, just because they can. The temptation to track cohort data alongside any other user data they have with such a straightforward API may be great. This in turn increases the risk to users when data breaches inevitably occur, and correlations can be made between known PII and cohorts.

The filtering mentioned above (returning a topic only if the calling context observed the user on a site about that topic) significantly cuts down on this hoarding. It’s no longer possible for an arbitrary caller on a page to learn the user’s browsing topics.

> * How many cohorts can one user be in? When a user is in multiple cohorts, what are the correlation risks related to the intersection of multiple cohorts? "Thousands" of users per cohort is not really that many. Membership to a hundred cohorts could quickly become identifying.

There are only 349 topics in the proposed Topics API, and 5% of the time a uniformly random topic is returned. We expect there to be significantly more users per topic than there were in FLoC.
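As a rough back-of-envelope on the identifiability concern, a single noiseless topic observation from a 349-entry taxonomy reveals at most log2(349) bits about the user; the 5% random responses and per-site topic decorrelation reduce the usable cross-site signal below that bound. The detailed analysis is in the linked white paper; this is only the crude upper bound.

```javascript
// Back-of-envelope upper bound: bits of information revealed by one
// noiseless topic observation, given a 349-topic taxonomy.
const taxonomySize = 349;
const bitsPerObservation = Math.log2(taxonomySize);
// Roughly 8.4 bits as a ceiling. The 5% uniformly random responses
// and per-site decorrelation (different sites see different top-5
// picks) push the usable cross-site signal below this bound.
```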



Received on Friday, 25 March 2022 19:26:44 UTC