[meetings] Agenda Request - Cross-Device Attribution (#58) from Martin Thomson via GitHub on 2022-06-03 (public-patcg@w3.org from June 2022)

From: Martin Thomson via GitHub <sysbot+gh@w3.org>
Date: Fri, 03 Jun 2022 08:17:53 +0000
To: public-patcg@w3.org
Message-ID: <issues.opened-1259606878-1654244271-sysbot+gh@w3.org>

martinthomson has just created a new issue for https://github.com/patcg/meetings:

== Agenda Request - Cross-Device Attribution ==
## Agenda+: Cross-Device Attribution

We've had a little bit of discussion on the subject of attribution already and the topic of being able to cross contexts seems to be a regular sticking point. I'd like to spend a little more time on the topic in an attempt to drive it to ground.

We probably won't reach a firm conclusion until we're further along with the technological piece, but it might help to understand the problem somewhat better.

There are a few aspects that will need some more discourse to resolve, but I'll try to give a short overview. Emphasis on "try". This turns out to be hard.

If we spend some time on this in a meeting, I'm happy to provide a presentation to help frame the discussion.

### Shape of Solutions

There are a number of potential approaches here. Keeping our discussion out of the details can still be useful. I'll stick to the aggregated designs too (as I've said: the event-level options don't provide adequate privacy protection).

The general idea is that there is some means of recognizing source and trigger events as coming from the same person, even if they originate on different devices. The system has some means of connecting them so that the events can be counted, no matter where they originate.

An abstract system is probably good enough. Matching of events in the browser or OS looks different from matching that happens in an aggregation service. In the former, devices might use some sort of synchronization service to ensure that events collected on one device can be matched on another (cf. [browser sync mechanisms](https://github.com/WICG/attribution-reporting-api/blob/main/cross_device.md)). In the latter, there needs to be some means of tying devices together so that events can be matched by the aggregation service (cf. IPA's match key). Each has its own implementation challenges and dynamics, but we can set those aside for the moment and concentrate just on the overall shape, which they both share.
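
To make that shared shape a little more concrete, here is a minimal sketch of the abstract system. Everything in it is an assumption made for illustration: the `Event` fields, the `match_key` string standing in for whatever links a person's devices, and the rule that a trigger counts if any earlier source exists for the same person. It ignores aggregation, noise, and where the matching actually runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    match_key: str  # hypothetical stand-in for whatever links one person's devices
    device: str
    kind: str       # "source" (ad shown) or "trigger" (conversion)
    timestamp: int


def count_attributed(events, cross_device=True):
    """Count trigger events that follow a source event for the same person.

    With cross_device=False, a source only matches triggers on the same device.
    """
    def person_key(e):
        return e.match_key if cross_device else (e.match_key, e.device)

    # Earliest source timestamp seen for each person (or person+device).
    earliest_source = {}
    for e in events:
        if e.kind == "source":
            k = person_key(e)
            if k not in earliest_source or e.timestamp < earliest_source[k]:
                earliest_source[k] = e.timestamp

    return sum(
        1
        for e in events
        if e.kind == "trigger"
        and earliest_source.get(person_key(e), float("inf")) < e.timestamp
    )


events = [
    Event("alice", "phone", "source", 1),    # ad seen on the phone
    Event("alice", "laptop", "trigger", 5),  # conversion on the laptop
    Event("bob", "laptop", "trigger", 3),    # conversion with no prior ad
]
print(count_attributed(events, cross_device=True))
print(count_attributed(events, cross_device=False))
```

With these three events, the cross-device count is 1 and the same-device count is 0: the only attributable conversion happens on a different device from the ad.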

### Utility

The utility of cross-device attribution has probably been well-enough established already. People routinely use multiple devices. Not all of them, but plenty. Being able to perform attribution when an ad is shown on a different device to the ultimate conversion seems like it would do a lot to improve the quality of any attribution system.

Cross-device attribution is an area where existing attribution systems suffer. It is hard to track someone across devices when you are using a design based on tracking activity. Primary identifiers can help cover some of the shortfall here, but that is still less complete.

A common criticism of this feature that might be worth addressing here is that - in the context of removing third-party cookies - this is unlike other aspects of the work we're doing. This provides *new* capabilities rather than filling in a gap that was created by making tracking less feasible. In part at least, our mission is to [make advertising better](https://patcg.github.io/charter.html#mission), which doesn't limit our efforts to simply back-filling holes. Also, it is very likely that any system we produce will be functionally worse than tracking-based options, so considering improvements is probably worthwhile provided that we can stick to our privacy goals.

### Privacy

At least intuitively, releasing more information translates to worse privacy. It seems obvious that enabling cross-device attribution is worse for privacy. The net effect is to make actions that weren't previously traceable, traceable.

However, to the extent that any system we produce is able to provide real privacy protections, it is not clear to me that rendering more attributions available for consideration is a net loss of privacy. It might even be better.

For these aggregated designs, we are generally relying on differential privacy as a basis for understanding protections. We use differential privacy to make strong assurances about how the privacy of an individual who contributes to an aggregate is affected. This requires that we protect contributions from individuals by adding noise proportional to the ***maximum amount they might contribute***. While we understand that information release over time is effectively unbounded - a shortcoming and open research question both - we can still bound the information release in any given period.
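
To make the scaling concrete, here is a sketch that assumes the Laplace mechanism. Nothing here commits to a particular mechanism, and the symbols $f$, $\Delta$, and $b$ are introduced only for illustration. If $f$ is a count over a dataset $D$ and $\Delta$ is the most that one person can change it in a given period, then releasing

$$\tilde{f}(D) = f(D) + \mathrm{Lap}(b), \qquad b = \frac{\Delta}{\varepsilon}$$

gives $\varepsilon$-differential privacy for that release. The noise has standard deviation $\sqrt{2}\,b$, so doubling $\Delta$ at a fixed $\varepsilon$ doubles the noise.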

Understanding the maximum amount of information that might be released about someone is tricky if you deliberately avoid learning about what information is contributed by a person. Events from multiple devices that are about the same person cannot be treated as independent. The system needs to adjust for that possibility. It needs to scale noise up to cover for the maximum number of possible contributions. This is bad for utility because you amplify noise.

Most people only have one device, but some people have a lot. Should those users be known as the same person to sites, those sites might get more cross-site information than we might intend if we don't scale the noise properly. But differential privacy protections are based on the maximum contribution, so protecting that information for those users means scaling noise in proportion to the maximum number of devices any user might have.
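
A rough way to see the utility cost, again assuming Laplace noise: if each device can contribute up to some cap and the system cannot tell which devices belong to the same person, the worst-case sensitivity is that cap times the largest number of devices it is willing to assume. The function name and parameters below are hypothetical.

```python
def laplace_scale(per_device_cap: float, max_devices: int, epsilon: float) -> float:
    """Laplace scale b = sensitivity / epsilon.

    Sensitivity is taken as per_device_cap * max_devices, the worst case in
    which every one of those devices belongs to the same person.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if max_devices < 1:
        raise ValueError("max_devices must be at least 1")
    if per_device_cap < 0:
        raise ValueError("per_device_cap must be non-negative")
    return per_device_cap * max_devices / epsilon


# Noise scale grows linearly with the assumed maximum number of devices.
for devices in (1, 2, 5, 10):
    print(devices, laplace_scale(1.0, devices, epsilon=1.0))
```

Under these assumptions, covering a user with ten devices means ten times the noise for everyone, including the single-device majority.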

Look at the options.

![image](https://user-images.githubusercontent.com/67641/171804907-1c802a76-5428-4533-a457-0ff6b48232cb.png)

In the simple case, the answer could be simple, but treating every device as independent leads to cases where the sites can link (at either end) to get more information from the system. That's shown in the second case where attribution happens on both devices in a manner that the system treats as independent, but the sites know is not. A system that is ignorant of the link between the actions of a single person cannot compensate except by making assumptions, which either hurt utility (by spoiling results with noise) or privacy (by allowing too many contributions and effectively increasing $\varepsilon$, which weakens the guarantee).
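
As a sketch of the privacy side of that trade-off, assume Laplace noise and let $k$ be the number of devices that sites can link to one person (both assumptions for illustration). If the system calibrates noise for a single device's contribution $\Delta_1$ at a nominal $\varepsilon$, the noise scale is $b = \Delta_1 / \varepsilon$. A person who contributes through $k$ devices has a true sensitivity of $k \Delta_1$, so the privacy loss that person actually experiences is

$$\varepsilon_{\text{eff}} = \frac{k \Delta_1}{b} = k \varepsilon$$

That is, the protection is $k$ times weaker than the nominal parameter suggests. Scaling the noise to $b = k \Delta_1 / \varepsilon$ restores the nominal guarantee, at the cost of $k$ times more noise for every query.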

Cross-device attribution adds new information (the third option shown). Only where sites were previously able to correlate use with a single person was it possible to perform attribution in either of the last two cases. But it also ensures that the system is aware of contributions from the same person, which allows it to compensate for multiple contributions (the second and fourth). In our discussions on the design of IPA, we have looked at providing a configurable cap on the contribution of each person using this information, which would let sites tune queries to optimize for better coverage (to allow for more conversions) at a cost of more noise or to get less coverage of different events in exchange for less noise.
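
The configurable cap could look something like the following. This is a minimal sketch, not IPA's actual protocol: it runs in the clear rather than inside an aggregation service, assumes Laplace noise, and the names `capped_total` and `noisy_capped_total` and the `(match_key, value)` event shape are all hypothetical.

```python
import random
from collections import defaultdict


def capped_total(events, cap):
    """Sum attributed conversions, limiting each person to at most `cap`.

    `events` is an iterable of (match_key, value) pairs with non-negative
    values; events that share a match_key are treated as one person,
    whichever device they came from.
    """
    per_person = defaultdict(int)
    for match_key, value in events:
        per_person[match_key] += value
    return sum(min(total, cap) for total in per_person.values())


def noisy_capped_total(events, cap, epsilon, rng=None):
    """Capped total plus Laplace noise with scale cap / epsilon."""
    if cap <= 0 or epsilon <= 0:
        raise ValueError("cap and epsilon must be positive")
    rng = rng or random.Random()
    scale = cap / epsilon
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with that scale.
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return capped_total(events, cap) + noise


# One person ("a") converts three times across devices; "b" converts once.
events = [("a", 1), ("a", 1), ("a", 1), ("b", 1)]
for cap in (1, 2, 5):
    print(cap, capped_total(events, cap), "noise scale:", cap / 1.0)
```

A lower cap discards some of person "a"'s conversions but needs less noise. A higher cap keeps them all but scales the noise for everyone.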

Probably the only other thing to say here is that this is only possible if *someone* is able to link activity across devices. Any system we have won't be able to catch them all. That's probably OK - if no one knows the common link between devices, it probably can't be used - to improve or worsen privacy. But as long as someone can link devices to the same person, then that information might be used to undermine privacy.

So the privacy story here is fairly clear in terms of dynamics. It's more a question of what we want to have happen.

There are opportunities in cross-device attribution for really tightening the $\varepsilon$ screws and getting a better understanding of information release.

There is also a risk involved with giving sites new cross-site information that crosses contexts, particularly when we know that we can't rely on differential privacy exclusively.

### Competition

There are quite a few things to think about here, none of which are clear-cut for me.

Some actors (often large ones) are already in a position to do cross-device attribution in *some* cases. Primary identifiers (email, telephone numbers), sign-ons, platform interconnects, and various tricks (IP addresses...naughty) can link device use to the same person. Making that information available in an API could have the effect of reducing the competitive advantage of having this information.

A criticism of IPA is that it provides information about the device graphs of users. Not as much for specific users - that is protected by both aggregation and differential privacy - but the API could reveal information about how the user base of any match key provider extends across multiple devices. That is, though it doesn't reveal specific information about individuals, nor does it reveal anything about reach for single-device users, it does reveal information about the users that use multiple devices with that service in the aggregate. Any entity choosing to provide a match key then needs to decide whether this information is proprietary or not.

The same basic analysis applies to cross-device sync of any matching information. The only difference is that fewer actors are involved in providing the cross-device linkage. Platforms (OS or browser) that synchronize events across devices might expose the extent to which they are able to do this synchronization through any aggregates. Though information about user counts is known, this is new information that would be released.

In both cases, the question is whether the utility gains - and maybe improved privacy - justify release of this information.

In both cases also, providing this information provides a benefit to all other actors in the form of better attribution, even if they don't have access to cross-device information of their own. The thesis behind IPA was that the benefit in terms of better attribution would motivate some entities to make this information available for use. That is, the improved attribution might be worth the loss of competitive advantage, at least for some.

The other thing to maybe consider is whether attribution across devices provides any different benefit by virtue of how it occurs. I don't have any data to base this hunch on, but last-touch, same-device attribution would seem to favour search over display.

All of this is to say that this aspect of the question is complex and probably warrants a bunch more discussion.

Please view or discuss this issue at https://github.com/patcg/meetings/issues/58 using your GitHub account

--
Sent via github-notify-ml as configured in https://github.com/w3c/github-notify-ml-config

Received on Friday, 3 June 2022 08:17:55 UTC