

On the Robustness of Topics API to a Re-Identification Attack

Nikhil Jha

Politecnico di Torino, Torino, Italy
nikhil.jha at polito.it

Martino Trevisan

Università degli Studi di Trieste, Trieste, Italy
martino.trevisan at dia.units.it





Emilio Leonardi

Politecnico di Torino, Torino, Italy
emilio.leonardi at polito.it

Marco Mellia

Politecnico di Torino, Torino, Italy
marco.mellia at polito.it





Abstract

Web tracking through third-party cookies is considered a threat to users’ privacy and is expected to be abandoned in the near future. Recently, Google proposed the Topics API framework as a privacy-friendly alternative for behavioural advertising. With this approach, the browser builds a user profile based on navigation history, which advertisers can access. The Topics API could become the new standard for behavioural advertising; it is therefore necessary to fully understand its operation and identify possible limitations. This paper evaluates the robustness of the Topics API to a re-identification attack in which an attacker reconstructs the user profile by accumulating the user’s exposed topics over time, to later re-identify the same user on a different website. Using real traffic traces and realistic population models, we find that the Topics API mitigates but cannot prevent re-identification, as there is a sizeable chance that a user’s profile is unique within a website’s audience. Consequently, the probability of correct re-identification can reach 15–17% in a pool of 1,000 users. We offer the code and data we use in this work to stimulate further studies and the tuning of the Topics API parameters.

Keywords

Web Privacy, Anonymity, Behavioral Advertising, Topics API

1 Introduction

In the current web ecosystem, targeted or behavioural advertising lets providers monetize their content by collecting and processing personal data to build accurate user profiles. Among the available techniques, web tracking is the most widespread [7, 15, 16]. It relies heavily on third-party cookies, which allow tracking platforms to follow the same user across different websites. The mechanism can be summarized as follows: when a user visits a website, a tracker installs a third-party profiling cookie on the user’s client. This cookie contains a unique identifier that lets the tracker identify the user on subsequent visits. When the user visits a second website that embeds the same tracker, the cookie is sent to the tracker, as specifications mandate that cookies are handled on a per-domain basis. The tracker thus learns that the same user has visited both websites. Using this mechanism, trackers gather information on users and build profiles describing their interests. Profiles are offered to advertisers so that they can customize the content of the displayed ads. In some cases, tracking platforms employ more sophisticated and privacy-intrusive techniques such as browser fingerprinting or ID synchronization [18, 23].

This massive data collection has created tension between users and the ads ecosystem [10, 15, 24]. Some browsers, such as Mozilla Firefox and Apple Safari, have already started battling third-party cookies. Leading researchers and industries are studying new paradigms that are more respectful of users’ privacy. These new proposals have one common feature: the replacement of third-party cookies and tracking with other techniques that let the user control and limit the amount of disclosed personal information.

First, Google proposed the Federated Learning of Cohorts (FLoC) [21]. In FLoC, users are clustered into cohorts according to their interests, which each user’s browser computes from the user’s recent activity.
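The third-party cookie mechanism summarized earlier in this section can be illustrated with a minimal sketch. It is a toy model under stated assumptions, not the behaviour of any real tracker or browser: the `Tracker` and `Browser` classes, their methods, and the `*.example` domain names are all hypothetical and introduced only for illustration.

```python
import uuid
from collections import defaultdict


class Tracker:
    """Toy third-party tracker embedded by several websites (hypothetical)."""

    def __init__(self):
        # Cookie identifier -> list of first-party sites where it was seen.
        self.visits = defaultdict(list)

    def handle_request(self, cookie, first_party_site):
        # First contact: mint a unique identifier, i.e. "install" the cookie.
        if cookie is None:
            cookie = uuid.uuid4().hex
        self.visits[cookie].append(first_party_site)
        return cookie


class Browser:
    """Toy browser that stores one cookie per tracker domain (hypothetical)."""

    def __init__(self):
        self.cookie_jar = {}

    def visit(self, site, embedded_trackers):
        for domain, tracker in embedded_trackers.items():
            # Cookies are keyed by the tracker's domain, so the same
            # identifier is sent whichever website embeds the tracker.
            cookie = self.cookie_jar.get(domain)
            self.cookie_jar[domain] = tracker.handle_request(cookie, site)


tracker = Tracker()
embedded = {"tracker.example": tracker}

alice = Browser()
alice.visit("news.example", embedded)
alice.visit("shop.example", embedded)

bob = Browser()
bob.visit("news.example", embedded)

alice_id = alice.cookie_jar["tracker.example"]
bob_id = bob.cookie_jar["tracker.example"]

# The tracker can now link both of Alice's visits to one identifier.
profile = tracker.visits[alice_id]
```

In this sketch the tracker never learns who Alice is, yet the shared identifier is enough to link her visits to the two websites, which is the property that cookie-free proposals aim to remove.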
In the proponents’ intentions, FLoC should have prevented tracking, as every user was “hidden” in their cohort. However, the approach has been criticised [22]. The main issue is that, while a user could hide inside a cohort for a short period of time, the sequence of cohorts they belonged to over time could work as an increasingly unique identifier. Eventually, Google replaced FLoC with a new proposal called the Topics API.

With the Topics API, the browser is in charge of building the user’s profile based on the navigation history. Websites can ask for a privacy-preserving version of such profiles to serve targeted advertisements or services. Among the mechanisms in place, the Topics API returns at most one topic per epoch (a week), and replaces 5% of actual topics with randomly chosen ones. The Topics API framework has the potential to become the new standard for behavioural advertising and replace the current, contested web tracking system based on third-party cookies. In the first quarter of 2024, Google will deprecate the use of cookies for 1% of Chrome users, to “support developers in conducting real world experiments that assess the readiness and effectiveness of their products without third-party cookies”[1]. It is thus urgent to fully understand the operation of the Topics API, and independent


This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license visit https://creativecommons.org/licenses/by/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA. Proceedings on Privacy Enhancing Technologies YYYY(X), 1–13 © YYYY Copyright held by the owner/author(s). https://doi.org/XXXXXXX.XXXXXXX


  1. https://privacysandbox.com/news/the-next-stages-of-privacy-sandbox-general-availability, accessed on June 9, 2023
