Website Data Transparency in the Browser

Data collection by websites and their integrated third parties is often not transparent. We design privacy interfaces for the browser to help people understand who is collecting which data from them. In a proof of concept browser extension, Privacy Pioneer, we implement a privacy popup, a privacy history interface, and a watchlist to notify people when their data is collected. For detecting location data collection, we develop a machine learning model based on TinyBERT, which reaches an average F1 score of 0.94. We supplement our model with deterministic methods to detect trackers, collection of personal data, and other monetization techniques. In a usability study with 100 participants, 82% found Privacy Pioneer easy to understand and 90% found it useful, indicating the value of privacy interfaces directly integrated into the browser.


INTRODUCTION
Openness and transparency are cornerstones of data protection and the right to privacy. Per the OECD's fair information practice principles [58], "[t]here should be a general policy of openness". Further, "the existence and nature of personal data, and the main purposes of their use, as well as the identity and usual residence of the data controller" should be known. Per the GDPR [28], "transparency requires that any information [...] be easily accessible and easy to understand, [...] in particular, information to the data subjects on the identity of the controller and the purposes of the processing." However, the reality of privacy on the web is different. Many people feel a lack of transparency and control over what data is collected from them and by whom [13]. The web privacy problem is a transparency problem [26].
When people visit a website, it is generally not their intent to interact with ad networks, data brokers, or other third parties who collect their data on the site for tracking and advertising purposes. Recent results suggest that, while third party tracking is omnipresent, many people are unaware of it [49]. This state of affairs poses challenges for the viability of notice and choice. While privacy policies are intended to provide notice and make data collection practices more transparent, they are not fit for this purpose. They take too long to read [53]. They also do not always accurately reflect how data is processed [84,86]. A recent survey posits that genuine, informed consent at scale may not be possible [74].
Privacy policies have value as reference documents for regulators to hold site owners accountable. However, privacy labels, permission notifications, and other short-form notices are more informative for everyday use [23,45]. In this work, we show how short-form notices can be generated automatically from web traffic while the user is browsing. We believe dynamic notices - showing in real-time which data is going where - are more informative than static descriptions of abstract privacy practices that "may" happen. When presented with their data in the "My Google Activity" dashboard, a third of the participants in a recent study were surprised by its scope and detail but at the same time viewed the data collection as more beneficial [29]. Seeing data collection live is a more faithful representation of actual privacy practices than the often diverging descriptions in privacy policies or labels [47,48,84,86].
We design and implement privacy interfaces to dynamically identify who is receiving which data at runtime in the browser. Our interfaces are intended for direct browser integration. Extraneous software would have limited reach, functionality, and usability. We understand our work as a step towards automating notice and choice in the browser by making notices backwards-traceable to analyzed code [83]. As a proof of concept we implement our interfaces in a Firefox browser extension, Privacy Pioneer. In addition to pattern-based detection, we use a machine learning model to classify unstructured data, in our case location data, for which the web traffic context plays a significant role. By doing so, we show the potential improvements obtainable from machine learning models while still accounting for the constrained browser environment.
With this study, we hope to contribute towards making websites' data collection practices more transparent: (1) We design and implement privacy interfaces in a browser extension, Privacy Pioneer, to show how the dynamic analysis of data collection practices by websites can be directly integrated into the web browser. Our machine learning model for identifying location data reaches an average F1 score of 0.94. We supplement our model with deterministic methods to detect trackers, collection of personal data, and other monetization techniques. (§3) (2) In a usability study with 100 participants, we evaluate the comprehensibility and utility of Privacy Pioneer's privacy interfaces - a privacy popup, privacy history, and watchlist notifications - to help people understand the data collection practices of the sites they visit, including third parties. Overall, 82% of the participants found Privacy Pioneer easy to understand and 90% found it useful, indicating the value of privacy interfaces directly integrated into the browser. (§4)

BACKGROUND AND RELATED WORK

Data Transparency on the Web
Engaging people with the profiles from their web journey can create more trustworthy and positive experiences with targeted ads [5]. After all, people's understanding of how targeted ads work is often based on inaccurate folk models [79]. Profiles from long-term tracking can be constructed via a topic modeling algorithm run client-side on the data of the trackers people encounter when browsing the web [77].
To that end, it is the goal of various browser extensions, desktop apps, mobile apps, and websites to make ad tracking, browser fingerprinting, and other privacy-invasive behavior more transparent. The closest work to ours is Solitude [36], a desktop app for inspecting web or mobile app traffic and notifying people when their data, e.g., a location or email address, is collected. Unlike Solitude, which is a standalone app, our goal is to build transparency functionality directly into the browser, making it broadly available. We use a lightweight approach within the browser environment without relying on a VPN or web proxy as required by Solitude. Usability is critical for improving transparency for average people. Privacy Pioneer integrates into the browser environment, creating a privacy history similar to a browsing history. It displays the practices of the current site in a popup and can display a browser notification upon a site collecting data. Beyond Solitude's deterministic techniques, which we also use (§3.2), e.g., matching known tracker URLs to Firefox's integrated Disconnect Tracker Protection lists [19], we leverage a machine learning model to disambiguate people's data from site data (§3.4).
We build on ideas of existing browser extensions, in particular, Lightbeam [54], Ghostery [32], Privacy Badger [25], and DuckDuckGo Privacy Essentials [22]. We complement their data recipient-based approach with a data category-based approach. Notably, a recent study suggested that the category of data being tracked is more important than who the web trackers are [79]. Indeed, data categories matter (§4.3.2). Someone may be fine with a site knowing their interest in listening to audiobooks, but they would rather not have their phone number or email address collected. In such cases people may appreciate a notification that a site just collected or is about to collect a piece of sensitive information. In a recent study on personal privacy assistants such a notification feature was ranked the highest [73]. We explore it here as well through our watchlist interface. (Solitude's classification accuracy and computational performance are not reported.)

Web Traffic Analysis
We perform traffic analysis to extract useful and sensitive information from observed network traffic [62]. For the web, OpenWPM provided a measurement infrastructure to detect, quantify, and characterize emerging online tracking behaviors, such as browser fingerprinting [27]. OpenWPM was later extended to detect invisible login forms that trigger autofilling of saved user credentials, exfiltration of social network data, and other privacy-invasive practices [2]. In addition to OpenWPM, OmniCrawl, a similar infrastructure, was used to find that the third party advertising and tracking ecosystem on mobile browsers is similar to that of desktop browsers [12].
Our work is based on foundational techniques for analyzing web messages for various data tracking practices, such as browser fingerprinting [3,24,57], the use of tracking pixels [38,66], and the collection of location data [6]. Email addresses typed into forms can be collected by third party scripts even when people leave the site without submitting the form, which is especially concerning as email addresses are commonly used as identifiers for constructing profiles over time [69]. Browser fingerprinting, tracking pixel usage, location data collection, and data collected from form-field entries are all surfaced in Privacy Pioneer's interfaces.
In light of third party cookies being phased out on all major browsers [11], we expect to see a rise in browser fingerprinting. The accuracy of detecting browser fingerprinting can be improved via machine learning methods [42]. Those can also improve the detection of phishing sites [4] or malicious sites in general [50].
Here we make use of a machine learning model to identify location data collection, in particular, to detect an individual's city, latitude, longitude, region, and ZIP code. Making sense of such location data often requires the broader context of the HTTP message in which it occurs to disambiguate, for example, whether a city is where the user is located or where the site owner can be contacted.

Privacy Dashboards
Privacy dashboards can help people to review and control the data collected about them [65], e.g., Blacklight allows people to enter the URL of any site to learn about its privacy practices [52]. Privacy dashboards are also increasingly built directly into the browser, e.g., Firefox's Protections Dashboard [56]. Here we are exploring three different privacy interfaces: (1) a privacy popup displaying data collection practices of the current site and its integrated third parties, (2) a privacy history over all browsing sessions, and (3) a privacy watchlist that keeps track of a user's custom keywords and triggers notifications upon sites collecting those (§3.1.3). Considering tracking data, interest data, and raw technical data, most participants in a recent user study found interest data to be the most informative [75]. Displaying it in a usable way is key [75].

Privacy Notices
Making privacy notices usable is a major challenge. Various design dimensions, e.g., the timing of notices, should be accounted for [67]. Poli-see explored how to best visualize privacy notices [34]. Concise and salient representations are promising [23]. "Nutrition labels" for privacy have been discussed [45], particularly in the IoT space [15,63], and are used on app stores. Apple's privacy labels were found to be useful, though prone to misconceptions [80] and sometimes inaccurate and misleading [48]. Some apps were shown to violate their label by transmitting data without declaring it [47]. The same was shown for apps' privacy policies [84,86].
Automated and dynamic privacy notice generation can help. For example, to align apps' privacy policies with their actual data practices, policies can be, at least partially, generated from their code [81]. This dynamic analysis is also what we are pursuing here. Privacy Pioneer observes the actual behavior of a site, analyzes it, and creates a label for it. This dynamic creation has the advantage of giving people a much more accurate, concrete, and up-to-date picture of what is happening with their data compared to a static and abstract notice. It also opens up opportunities for personalized privacy notices based on user characteristics [46,64]. Automatically generating privacy information has been fruitful, e.g., for answering privacy questions or assigning privacy icons [35].

PRIVACY PIONEER IMPLEMENTATION
The web browser is the natural instrument for notifying people about the data collection practices of the websites they visit.

Architecture
Our definition of data collection encompasses both legal and surreptitious data collection by first and third parties.
3.1.1 Goals, Requirements, and Non-goals. We want to design and implement privacy analysis functionality and interfaces for use in the browser to make data collection practices of websites transparent to web users as they browse the web. As far as possible, the data analysis and interfaces should not interrupt people's browsing or impact the browser's computational performance. As far as possible, the analysis should also work locally without data disclosure. It is not our goal to achieve comprehensive coverage of all collected data, all third parties, or all web traffic. Rather, we want to evaluate the effectiveness of our overall approach. While we make use of various methods to identify potentially privacy-invasive practices, e.g., browser fingerprinting, our goal is not the improvement of individual methods. We take a holistic view of the detected practices and aim to surface them in a usable way in the browser. It is also not our goal to provide a choice mechanism for the detected practices, though detecting them will also enable choice (§5.2).
3.1.2 Privacy Analysis Overview. Figure 1 shows an overview of Privacy Pioneer's architecture. Using APIs available in Firefox for listening to and filtering HTTP messages, those messages are searched for data collected by websites and integrated third party scripts using probabilistic and deterministic methods. In particular, location data collection is detected by a machine learning model. Other data categories are detected deterministically by known attributes and string matching using regular expressions and URL lists. For the latter we use the Firefox-integrated Disconnect Tracker Protection lists [19]. Privacy Pioneer searches for relevant data in the following HTTP message (request and response) elements:
• HTTP Headers - Request and Response URL: the URL present in the request or response
It is not necessary to decrypt encrypted HTTP messages as those are available in plaintext when accessed through the browser APIs. Once evidence for data collection is found, it is analyzed, and the analysis results are stored locally in the browser. If the evidence supports a positive classification, the detected practice is ready to be displayed in the privacy interfaces (Figure 2).

Deterministic Analysis
To identify data collection practices, Privacy Pioneer makes use of both deterministic and probabilistic analysis methods. The deterministic analysis is based on three methods: URL list matching, regular expression matching, and attribute-based matching.

Analysis Methods.
For the monetization categories - advertising, analytics, and social networking - Privacy Pioneer matches their URLs against the Disconnect Tracker Protection lists [19]. These lists are included in Firefox's Enhanced Tracking Protection. Specifically, the webRequest.onHeadersReceived API exposes the urlClassification object that indicates the type of tracking associated with a request, if any. Data for personal categories - email addresses, phone numbers, street addresses, and user-entered custom keywords - is identified based on regular expression matches. As personal category data is more diverse than static monetization URLs, we leverage data formats, e.g., the email address format, to increase the identification accuracy for such data.
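As a rough illustration of this regex-based matching, the sketch below scans a message body for email addresses and US-style phone numbers. The patterns are simplified stand-ins, not Privacy Pioneer's actual expressions, which cover more categories and formats.

```python
import re

# Illustrative patterns only: the extension's real regular expressions
# (e.g., for street addresses and custom keywords) are more extensive.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
US_PHONE_RE = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")

def find_personal_data(http_body):
    """Return regex matches for personal data categories in a message body."""
    return {
        "email": EMAIL_RE.findall(http_body),
        "phone": US_PHONE_RE.findall(http_body),
    }

body = 'POST /track {"user": "jane.doe@example.com", "tel": "617-555-0123"}'
find_personal_data(body)
# → {'email': ['jane.doe@example.com'], 'phone': ['617-555-0123']}
```

Leaning on a well-defined data format, as here for email addresses, is what keeps the false-positive rate of this category low.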

Support Precision
To supplement the extension's offering, we also added support for detecting tracking categories, such as browser fingerprints, tracking pixels, and IP addresses, via specialized regular expressions. Privacy Pioneer looks for an IP address in the body of an HTTP message, which is an indicator that it is used for tracking, as opposed to an IP address in the header, which is used to deliver the message to the correct recipient. To identify browser fingerprinting and tracking pixels we use both attribute- and list-based identification methods. A tracking pixel is identified if it is included in a manually curated URL list of known tracking pixels or if a set of four attributes is detected: (1) an image file, (2) with height and width properties set to 0 or 1, (3) containing the word "pixel," and (4) containing a "?" character. For browser fingerprinting, including canvas fingerprinting, we follow a similar approach. We identify fingerprinters statically based on a list of known fingerprinting URLs (sourced from the urlClassification object) or dynamically based on function calls to fingerprinting libraries, such as Fingerprint2, or use of APIs, like WebGL.
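The four-attribute test for tracking pixels can be sketched as a simple predicate. This is a minimal illustration of the heuristic described above, not the extension's implementation; the complementary URL-list check is omitted.

```python
def looks_like_tracking_pixel(url, width, height):
    """Heuristic sketch of the four-attribute test: an image file,
    0/1-sized, whose URL contains the word "pixel" and a "?" character."""
    is_image = url.split("?")[0].lower().endswith((".gif", ".png", ".jpg"))
    tiny = width in (0, 1) and height in (0, 1)
    return is_image and tiny and "pixel" in url.lower() and "?" in url

looks_like_tracking_pixel("https://ads.example.com/pixel.gif?uid=42", 1, 1)  # → True
looks_like_tracking_pixel("https://example.com/logo.png", 200, 60)           # → False
```

Requiring all four attributes at once is what separates invisible tracking images from the ordinary images a page legitimately serves.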

Classification Performance.
To evaluate the performance of our deterministic classifiers we created a test set (the deterministic test set) of 56 sites that would have a high probability of positive instances of the various data categories Privacy Pioneer is intended to detect. We created the deterministic test set by randomly selecting one technology for each category - advertising, analytics, social networking, browser fingerprinting, and tracking pixel - from the Disconnect Tracker Protection lists in Firefox or our own URL lists. For the IP address category we randomly selected one IP-to-location API based on a web search. Then, for each technology, we searched BuiltWith [7] for sites that integrate it and randomly picked 8 sites for inclusion in our deterministic test set, for a total of 48 sites. The remaining 8 sites are search engines, selected from a ranked list [31], to test for entered email addresses, phone numbers, street addresses, and custom keywords. We also tested for data from these categories by interacting with the other 48 sites, to the extent possible, by signing up for a site with an email address, password (i.e., a custom keyword), phone number, and street address.
Table 1 shows the performance of our deterministic classifiers with a manual inspection of the observed web traffic as ground truth.

Location Data.
While regular expressions work well for identifying collection of personal data, such as email addresses (Table 1), they perform worse for collection of location data, such as ZIP codes (Table 2). Identifying a pattern as a user's ZIP code or other location is often dependent on the context in which it appears. Whether a location is the user's location or the location of the visited site cannot be determined solely by matching characters and formats. Another challenge is that SVG paths often have patterns that resemble location data, leading to a significant number of false positives. The problem is less pronounced for latitudes and longitudes, possibly because those specify a smaller geographical area in a more distinct format. Locations also lack distinctive attributes compared to, say, tracking pixels, which usually occur in an image file, making it difficult to apply attribute-based identification. Given these challenges of deterministically identifying location data collection, we apply a machine learning model to perform a probabilistic analysis. Using a machine learning model instead of rigid rules is also beneficial for identifying location data in different formats, e.g., ZIP codes from different countries, as well as evolving or changing location data formats.

Location Dataset Creation
For developing and evaluating the performance of our machine learning model we created a location dataset for detecting the presence of people's location data in HTTP messages.

Data Collection.
[Figure 3 caption: We created our location dataset by performing two web crawls during which we captured HTTP messages containing location data as detected by regular expressions. In the post-processing phase we masked the identified location data to avoid over-fitting the model and removed SVG paths from the dataset as those were clear false positives. We then sampled a random set of 5,472 HTTP messages and imported them into Doccano [20] for the subsequent annotation. Note that a site may generate multiple HTTP messages as it serves images, style sheets, scripts, and other resources.]
Figure 3 shows an overview of the dataset creation process. To ensure we would have high coverage of the variety of location data formats, we performed web crawls connecting to 65 VPNs from 35 countries. ZIP code formats, for instance, differ from country to country (US: 12345, Japan: 123-4567, India: 123456). Thus, diversifying our dataset to include multiple countries' formats helped ensure that our model would not over-fit to any specific country's format. The protocol for crawling on one VPN was as follows:
(1) Connect to the VPN.
(2) Query the ipstack IP-to-location API to retrieve the VPN's city, latitude, longitude, region, and ZIP code.
(3) Connect in sequence to a set of websites from the Tranco list [60]. The time-out for each site was set to 15 seconds.
(4) Identify location data in HTTP messages based on regular expressions matching the VPN's city, latitude, longitude, region, or ZIP code.
(5) Save the HTTP messages that were successfully matched for post-processing and annotation.
We parallelized our crawls with eight browsers at a time via the browser automation framework Puppeteer [33] and the puppeteer-cluster library [21]. The two crawls differed in the websites visited. The first crawl covered, for each VPN, the top 500 most popular websites globally per the Tranco list. The goal was to capture data from sites with a high volume of traffic as many people would be exposed to their privacy practices. The second crawl covered the top 100 most popular sites of the country where the VPN was located, according to the Tranco list. We associated a site with a country based on its top-level country domain. This crawl aimed to capture the privacy practices of popular websites accounting for a diverse set of localized data formats. In total, both crawls generated 98,643 HTTP messages potentially containing location data.
After merging the HTTP messages from the two crawls we post-processed them by (1) masking locations, as identified by our regular expressions, and (2) removing SVG paths. We masked locations in the dataset, e.g., replacing "Boston" with the label "<TARGET-CITY>", to avoid over-fitting our model to specific locations. While SVG paths can contain numbers that look like ZIP codes, for example, such matches would always be false positives. Then, we randomly sampled 5,472 HTTP messages for annotation. Our goals for the annotation were to obtain even distributions of data instances (1) across the different location data categories and (2) across the different VPNs. With a target of at least 1,000 data instances per category, we sampled an equal amount of instances per VPN. If a data category was more common across the VPNs, we sampled fewer data instances per VPN; if it was less common, we sampled more.
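The masking step can be sketched as a simple substitution. This is illustrative only: it replaces known target values, whereas the real pipeline works from the positions of the regular expression matches.

```python
import re

def mask_locations(message, targets):
    """Replace each detected location value with a category placeholder,
    e.g., "Boston" -> "<TARGET-CITY>", to prevent the model from
    over-fitting to specific locations."""
    for category, value in targets.items():
        message = re.sub(re.escape(value), f"<TARGET-{category.upper()}>", message)
    return message

msg = '{"city": "Boston", "zip": "02134"}'
mask_locations(msg, {"city": "Boston", "zip": "02134"})
# → '{"city": "<TARGET-CITY>", "zip": "<TARGET-ZIP>"}'
```

Masking forces the model to learn from the surrounding message context rather than from memorized place names.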

Data Annotation.
The set of 5,472 HTTP message instances was imported into Doccano [20], an open source annotation tool we set up. The dataset consisted of instances from all five location data categories (city: 1,115, latitude: 1,068, longitude: 1,051, region: 1,078, and ZIP code: 1,160). Before importing the dataset we truncated each message instance to 250 characters before and after the regular expression match (or fewer characters if the message was shorter or the match occurred near one of the message ends). These truncated message instances provide sufficient context for why the match occurred and prevent length bias. Each instance could be a true or false positive. A true positive means that the regular expression match correctly identified an instance of data collection. A false positive means that it identified an instance incorrectly, for example, if a news article mentioned the name of the city where the VPN was located. For each category, 10% of the data, selected randomly, was annotated by three authors, reaching an inter-annotator agreement between 0.79 and 0.83 as measured by Krippendorff's alpha (Table 3). These levels of agreement indicate that the 10% triple-annotated data is sufficiently reliable to serve as the test set for our machine learning model (the probabilistic test set). The rest of the annotated data was used for training (80%) and validation (10%), with each data instance being annotated by a single author.
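The truncation step above amounts to slicing a context window around the match. A minimal sketch, assuming the match's start and end offsets are known:

```python
def truncate_around_match(message, start, end, window=250):
    """Keep up to `window` characters of context on each side of a
    regular expression match located at message[start:end]."""
    return message[max(0, start - window):min(len(message), end + window)]

msg = "x" * 1000 + "02134" + "y" * 1000
snippet = truncate_around_match(msg, 1000, 1005)
len(snippet)  # → 505 (250 + 5-character match + 250)
```

Slicing with max/min bounds handles matches near the start or end of a message, where fewer context characters are available.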

Probabilistic Analysis
To overcome the challenges of classifying location data collection in HTTP messages we developed a machine learning model.

Machine Learning Baseline.
We began our model development by exploring a lightweight machine learning baseline. Using our data, we trained SVM classifiers with bags-of-words [44]. We tuned the classifiers starting from the default set of hyperparameters [68]. They performed much better than our regular expressions (Table 2). These baseline results demonstrate that machine learning classifiers are the right direction for improving the accuracy of identifying location data collection in HTTP messages. We set out to further increase the classification performance with a deep learning model. Figure 4 shows an overview of our model development and predictions at runtime.
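The bag-of-words representation underlying the SVM baseline maps each message to a fixed-length count vector over a vocabulary. A toy sketch (the vocabulary here is illustrative; the real feature space is derived from the training corpus):

```python
from collections import Counter

def bag_of_words(message, vocabulary):
    """Fixed-length token-count vector over a vocabulary: the feature
    representation an SVM classifier can then be trained on."""
    counts = Counter(message.lower().split())
    return [counts[token] for token in vocabulary]

vocab = ["zip", "city", "lat", "lon"]
bag_of_words("city = boston city zip = 02134", vocab)  # → [1, 2, 0, 0]
```

Because the vector discards word order, this baseline captures which terms co-occur with a match but not how they relate, which is part of why context-heavy categories benefit from the deep model.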

Identifying Candidate Instances.
Before applying the probabilistic analysis Privacy Pioneer first identifies candidate instances of location data collection based on regular expression matches.
Only matched HTTP messages are considered for the probabilistic analysis while unmatched messages are discarded. Given the regular expressions' perfect recall (Table 2), it would be rare to miss positive instances. To provide sufficient context for our model, we found that truncated messages with 250 characters before and after a regular expression match yield good results (§3.3.2). If a message was shorter to begin with, we padded it. Truncating and padding each message to a standard length before feeding it into our model also prevents length bias. To indicate padding to the model we use an attention mask.
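The padding and attention mask construction can be sketched as follows (token IDs are illustrative; the real pipeline uses the WordPiece tokenizer's IDs):

```python
def pad_with_mask(token_ids, max_len, pad_id=0):
    """Pad a token sequence to max_len and build the attention mask:
    1 marks a real token, 0 marks padding the model should ignore."""
    n_pad = max_len - len(token_ids)
    padded = token_ids + [pad_id] * n_pad
    mask = [1] * len(token_ids) + [0] * n_pad
    return padded, mask

pad_with_mask([101, 7592, 102], 5)
# → ([101, 7592, 102, 0, 0], [1, 1, 1, 0, 0])
```

Every input thus reaches the model at the same length, while the mask keeps the padded positions from contributing to the attention computation.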

3.4.3 Selecting a Pre-trained Model. Our analysis is based on a pre-trained model. As a starting point, we selected the Bidirectional Encoder Representations from Transformers (BERT) family of models [17], specifically BERT-Base, as implemented in Python via the Hugging Face [78] and PyTorch [59] libraries. BERT models are pre-trained on a large corpus of natural language data and can be further trained for domain-specific tasks. However, given its file size of 450MB, it proved challenging to integrate BERT-Base into Privacy Pioneer under the constraints of the browser environment. Thus, we explored TinyBERT [43], which, compared to BERT-Base, is smaller and faster with a file size of 59MB. On our data, averaged across the five categories, the TinyBERT model classifies an instance 12.6x faster than the BERT-Base model.

Tokenization.
For tokenization we use a WordPiece tokenizer.
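The core idea of WordPiece is greedy longest-match-first subword splitting. A toy sketch with an illustrative vocabulary (real WordPiece vocabularies have tens of thousands of entries and more elaborate unknown-token handling):

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first subword split, the idea behind the
    WordPiece tokenizer (toy vocabulary; unknowns collapse to [UNK])."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            # Non-initial pieces carry the "##" continuation prefix.
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        if end == start:  # no subword matched at this position
            return ["[UNK]"]
        start = end
    return pieces

vocab = {"lat", "##itude", "zip"}
wordpiece("latitude", vocab)  # → ['lat', '##itude']
```

Subword splitting lets the model handle tokens it never saw whole during pre-training, which is common in URL and parameter strings.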
When we converted our model from Python to JavaScript for running it in the browser, we implemented the equivalent tokenizer [14] in Privacy Pioneer.

Hyperparameter Tuning.
Using a hyperparameter optimization library, we found that the most impactful parameters for classification accuracy were batch size and learning rate. Our best performing model was trained for 50 epochs, with early stopping, using a batch size of 8, a learning rate of 5 * 10^-6, and a weight decay of 0.1.

Multitask Model.
To maximize model efficiency, we explored using a multitask model for analyzing data of all five location data categories. This method has the advantage of reducing the number of models from five category-specific models to a single one that can analyze inputs of all categories. To evaluate this method we trained BERT-Base and TinyBERT models on a training set drawn from all five categories. We added the category to the beginning of each training instance to indicate to the model which category of location data it is looking at. Table 4 shows the results of our evaluation.
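The multitask input construction amounts to prefixing each instance with its category. A minimal sketch (the exact prefix format, here "category: message", is illustrative):

```python
def multitask_input(category, message):
    """Prepend the location data category so a single model can handle
    inputs of all five categories (prefix format is an assumption)."""
    return f"{category}: {message}"

multitask_input("zip", "code=<TARGET-ZIP>&country=US")
# → 'zip: code=<TARGET-ZIP>&country=US'
```

At runtime, the same prefix tells the shared model which of the five classification tasks a given truncated message belongs to.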
3.4.7 Classification Performance. For the most part, the Multi models perform on par with or within a few percentage points of the Singles models, of which the BERT-Base Singles perform best. Both BERT-Base Multi and TinyBERT Multi exhibit similar performance with an average accuracy of 94%, indicating that we can rely on the 7.6x smaller model for our classification tasks. TinyBERT Multi yields a 49% and 16% relative F1 score improvement over the regular expressions and the SVM baseline, respectively (Table 2). Comparing the SVM baseline to the TinyBERT Multi results (Table 5), we observe a substantial F1 score increase for city (0.52 to 0.84), region (0.71 to 1.00), and ZIP code (0.74 to 1.00). The performance improved less for latitude (0.92 to 0.94) and stayed the same for longitude (0.91). We believe that the performance improvements for city, region, and ZIP code are largely based on our model's ability to disambiguate the context in which they occur, e.g., whether a city occurs in a news article or designates the user's location. Context seems to matter less for identifying the more distinct formats of latitudes and longitudes, whose regular expression-based F1 scores were already higher with 0.77 and 0.73, respectively (Table 2). Overall, we believe that the application of machine learning models will have its greatest impact for the identification of generic data categories that do not have a distinctive format and for which, consequently, only their context reveals their purpose of use.
[Figure 5 caption: We found a number of messages with over 100,000 characters (left). Those created a substantial amount of work, defined as a percentage of the total characters that Privacy Pioneer would be searching through (middle). However, they only exhibited few instances of data collection, i.e., privacy labels created (right).]

Knowledge Distillation.
We tried to improve the classification performance of our TinyBERT Multi model via knowledge distillation [37]. To that end, we used the BERT-Base models trained on the 5,472 annotated instances to programmatically annotate all remaining unannotated data, leading to 98,643 annotated instances. This significantly larger set of annotated data was then fed as training data into a fresh set of models with TinyBERT as the pre-trained model. However, given the already close performance between the teacher and the student models when trained directly, distillation did not improve classification performance overall. Thus, we kept the TinyBERT Multi model, trained on the 5,472 annotated instances, as our final model.
3.4.9 Model Integration. As shown in Figure 4, to integrate our model into Privacy Pioneer we converted it from PyTorch in Python to TensorFlow in Python [1]. Then, we used the tfjs-converter to convert it from TensorFlow in Python to TensorFlow in JavaScript for use in TensorFlow.js [70], a JavaScript library with a set of APIs for running TensorFlow models in the browser or server-side. Upon installation of Privacy Pioneer, our model is served from GitHub and downloaded to an IndexedDB instance in the browser. This approach enables efficient usage of the model at runtime as it is immediately available locally at all times and across all browsing sessions. Also, the use of an IndexedDB instance helps ensure user privacy by storing the model and all analyzed data locally.

Computational Performance
Analyzing HTTP messages dynamically at runtime can decrease computational performance and impact usability. We implemented various heuristics to reduce the analysis workload by filtering out messages and message parts that are likely irrelevant for detecting data collection practices:
• Do not analyze messages exceeding 100,000 characters.
• Only analyze the following webRequest.ResourceTypes:
- image (e.g., used for tracking pixels)
- script (e.g., used for browser fingerprinting scripts)
- sub_frame (e.g., iframes for loading third party sites)
- xmlhttprequest (can contain any type of data)
• Only analyze the request body, response body, and selected headers as those can contain user-specific data (§3.1.2).
Applying these heuristics resulted in substantially decreased workloads with minimal information loss (Figure 5). Comparing Privacy Pioneer's runs with and without these heuristics via Apple's Activity Monitor showed a decrease in Firefox's WebExtension CPU usage by an average of 52%, from 12.96% to 6.26%, across three runs. The performance evaluation was run on a 2023 MacBook Pro with an Apple M2 Pro processor, 16 GB RAM, and no user programs running besides Firefox and Activity Monitor.
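The first two filters above can be sketched as a single cheap predicate applied before any deterministic or model-based analysis runs (a simplified illustration; the extension applies these checks in JavaScript via the webRequest API):

```python
# Resource types worth analyzing, per the heuristics described above.
ANALYZED_TYPES = {"image", "script", "sub_frame", "xmlhttprequest"}
MAX_CHARS = 100_000  # skip oversized messages: much work, few findings

def should_analyze(resource_type, body):
    """Return True only if a message passes both workload filters."""
    return resource_type in ANALYZED_TYPES and len(body) <= MAX_CHARS

should_analyze("script", "fp.js payload")       # → True
should_analyze("stylesheet", "body{margin:0}")  # → False
should_analyze("image", "x" * 200_000)          # → False
```

Rejecting messages before analysis is what drives the CPU savings reported above, since the filters run in constant time per message.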
To evaluate the performance cost of adding Privacy Pioneer to Firefox we randomly sampled 50 sites from the Tranco list, visited the sites with and without Privacy Pioneer turned on, repeated the process for a total of three runs, and then measured the time to load a site using Firefox's Network Monitor, which has a load variable that records when a resource finished loading [30]. The average time to load a site with Privacy Pioneer was 2.09 seconds while the average time to load a site without it was 1.93 seconds, thus, adding 0.16 seconds for an 8% increase. We find the additional load time tolerable given the transparency gain. The performance evaluation was run on a 2019 MacBook Pro with a 1.4 GHz Quad-Core Intel Core i5 processor, 16 GB RAM, and with no user programs running besides Firefox.
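The reported overhead follows directly from the two averages; a quick helper makes the arithmetic explicit (the function name is ours, not from the paper):

```javascript
// Absolute and relative load-time overhead from per-run averages (seconds).
function overhead(withExt, withoutExt) {
  const avg = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const added = avg(withExt) - avg(withoutExt);
  return { added, percent: (100 * added) / avg(withoutExt) };
}

// With the paper's averages: overhead([2.09], [1.93]) → ~0.16 s added, ~8% increase.
```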

USABILITY OF PRIVACY PIONEER
We tested the privacy interfaces we designed and implemented in Privacy Pioneer in an online usability study. 9 We structured our survey questions around our core inquiry of web transparency with the specific goals of determining the comprehensibility and usefulness of the privacy interfaces in Privacy Pioneer. As part of these goals we seek to inform the priorities and opportunities for the future design of privacy interfaces.

Experimental Setup
We recruited participants for our study on the crowdworking platform Prolific [61].
4.1.1 Eligibility Criteria. The eligibility criteria to participate in our study were: (1) having installed or willingness to install Firefox on a laptop or desktop computer, (2) fluency in English, (3) United States residency, (4) a 100% approval rate for previous tasks on Prolific, (5) completion of at least 50 previous tasks on Prolific, and (6) a minimum age of 18 years. As Privacy Pioneer is only available on Firefox its use as part of the study was mandatory. For criteria (2)-(6) we relied on the information provided by Prolific.

Study Procedure
We signed up a total of 100 participants. Participants first answered a few general web privacy and technology questions (Q2-Q12) and then engaged in three guided tasks with our browser extension, after which they shared their experience (Q13-Q47). 10 Our survey contained one attention check question, which all participants answered correctly. After participants had installed our browser extension from the Firefox Add-ons store, we asked them to perform the three tasks.
The first task asked participants to identify trackers on a website via the privacy popup, then change their Firefox settings to block them, and finally confirm via the privacy popup that the trackers are indeed no longer active. The second task asked participants to identify trackers across sites via the privacy history interface. Specifically, we gave them a set of five different sites. After visiting each site they should check their privacy history to identify the tracker that was active on multiple sites. The third task asked participants to enable notifications via the privacy watchlist to be fired when a visited site shares a custom keyword with another site.
To ensure that participants actually performed the tasks we asked them to send us screenshots and a file that logged their interface interactions, e.g., which Privacy Pioneer buttons they pressed and which sites they visited. For each participant we created a personalized upload link to an anonymous cloud drive that we shared via the Prolific messaging system, which we also used to troubleshoot technical problems or clarify questions that participants had. After finishing their tasks and answering the survey questions participants submitted their work and were given a completion code they could enter on Prolific to receive their compensation.
10 The set of survey questions and tasks is shown in Appendix 8.4.

Ethical Considerations.
We received IRB approval for our study. At sign-up we let participants know who we are and how they could contact us. We explained the study purpose, i.e., to find out how websites' data collection and sharing practices can be made more transparent for web users. We further explained that the study consists of (1) installing and using our Firefox browser extension and (2) answering survey questions and submitting files about their usage and data privacy in general. We provided them a list of the categories of data that we would request from them. 11 We explained that the data would be stored at our organizations and our service providers using current best practices, that it would not be disclosed except in aggregate form, and that we would retain a copy after the study for record-keeping purposes. During the study, data was only stored locally on participants' computers. We then asked them to submit their data via an anonymous cloud drive using their Prolific ID, a pseudonym by which we identified participants. We explained to them that they could view and delete any data before their submission and that they could withdraw from the study at any time. At sign-up, we also explained to participants that the IP-to-location service IPinfo will receive their IP address. They had to manually enable this functionality in the extension, at which time they were notified once more. For their participation we paid each participant $9, the amount recommended by Prolific for a study like ours.

Sample Representativeness
Our study sample is partially representative of the US population (Table 6). We compared our participants' demographics to the US population demographics derived from the 2021 American Community Survey [8-10]. The age distribution of our participants aligns with the expected figures (sample median: 38.0 years; population median: 38.8 years [16]). We also find our sample to be representative with regards to ethnicity. However, we note disparities in terms of sex, student status, and employment status. Our sample has more male (sample: 66%) than female participants. Our study also attracted fewer students compared to the national proportion (sample: 12%; population: 25% [10]), which may be attributable to our study's exclusion of minors, who constitute a large portion of the US student body. As to employment, we note a large proportion of unemployed individuals (sample: 15%; population: 4% [9]), likely due to Prolific being a paid crowdworking platform.
We asked participants which operating system and web browser they primarily use. Utilizing market share statistics for US desktop users, we find that our sample includes a disproportionately high percentage of Windows users (sample: 81%; population: 59% [72]) and a correspondingly lower percentage of macOS users (sample: 18%; population: 32% [72]). The browser distribution shows a substantially higher Firefox usage rate (sample: 23%; population: 5% [71]) while featuring very few Safari users (sample: 5%; population: 21% [71]). The general under-representation of Apple users with regards to both browsers and OS, as well as the over-representation of Firefox users, may well be a consequence of the conditions of our study, which required participants to make use of Privacy Pioneer as a Firefox-exclusive extension.
As indicated by the relatively higher Firefox usage rate, our sample skews towards people with an interest in protecting their privacy. Also, as the self-rating of tech-savviness indicates, with 59% of the participants believing that they are either tech-savvy or very tech-savvy (Figure 6), our sample seems to skew towards advanced web users. These trends are also confirmed by 76% of participants answering that they are using some form of privacy software (Q8). 12 On the other hand, there are 24% reporting not to use any privacy software and 41% who do not consider themselves particularly tech-savvy. Thus, we also have a contingent of participants who may be less concerned about their privacy or who need help in understanding how their data is being collected.

Usability Evaluation
The results of our usability study suggest that privacy interfaces that show which data is collected by whom can help people understand websites' data collection practices. Our results further suggest that such interfaces should be directly integrated into the web browser in an easy-to-use, informative, and actionable form. 13
4.3.1 Insufficiency of Privacy Protection and Data Transparency. A majority of participants expressed concern about their privacy on the web. 60% disagreed or strongly disagreed with the statement "I generally feel confident that my privacy is protected on the web" (Q2) (Figure 7). However, a number of participants felt they have measures at their disposal to protect their privacy (Q3) (Figure 7). 41% disagreed or strongly disagreed with the statement "There is not much I can do to protect my privacy on the web." This view, however, is contingent on whether or not participants reported using privacy software (Q8). 14 Only 27% of participants who reported not using privacy software disagreed or strongly disagreed with the statement in Q3, that is, felt able to protect their privacy. The largest proportion of such participants, 41%, answered "Neutral," suggesting a potential lack of knowledge on how well their privacy is protected on the web. Indeed, 43% of participants overall disagreed or strongly disagreed with the statement "I generally feel that I have a good understanding of what data websites collect from me and with whom they share it" (Q4) (Figure 7). These results point to a lack of data transparency on the web. They are particularly noteworthy as 59% of participants in our study rate themselves as tech-savvy or very tech-savvy (Q12) (Figure 6).
12 As we asked participants to submit screenshots of the interfaces we presented to them, we have an indication that their use of such software, if any, did not interfere with their use of Privacy Pioneer.
13 The findings in this section are based on summary statistics from the data collected in our usability study. We evaluate correlations using the Kendall Tau coefficient. The correlations are derived from 22 survey questions (Appendix 8.4). We compared the answer distributions for each question pair for a total of 231 pairwise comparisons. The set of pairwise comparisons includes all linear scale questions (per Appendix 8.4), except for the attention check question, Q29. We treat Q6 ("Do you care whether a website shares the following of your data for ad purposes?") as 8 separate questions, one for each of the 7 data categories and one that sums the other 7 for each participant. Also included in the pairwise comparisons is the multiple choice question Q45 ("How likely is it that you would recommend Privacy Pioneer to a friend or colleague?"). Of all comparisons, 59 were significant (p <= 0.05, corrected for multiple tests with the Benjamini-Yekutieli procedure) and 53 additionally had a correlation coefficient higher than 0.3, which we consider as the minimum value to indicate a moderate correlation. For testing the goodness of fit of non-ordinal data we use the Chi-square test.
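The statistical machinery behind these correlation results can be sketched compactly: Kendall's tau (in its tau-b form, which handles the ties typical of Likert-scale answers) and the Benjamini-Yekutieli cutoff used to correct the 231 pairwise tests. This is a minimal sketch under those assumptions, not the analysis code used in the study.

```javascript
// Kendall's tau-b: concordant vs. discordant pairs, correcting for ties.
function kendallTauB(x, y) {
  let concordant = 0, discordant = 0, tiesX = 0, tiesY = 0;
  for (let i = 0; i < x.length; i++) {
    for (let j = i + 1; j < x.length; j++) {
      const dx = x[i] - x[j], dy = y[i] - y[j];
      if (dx === 0 && dy === 0) continue;   // tied in both: excluded from both terms
      else if (dx === 0) tiesX++;           // tied in x only
      else if (dy === 0) tiesY++;           // tied in y only
      else if (dx * dy > 0) concordant++;
      else discordant++;
    }
  }
  // Denominator: sqrt((pairs not tied in y) * (pairs not tied in x)).
  const denom = Math.sqrt(
    (concordant + discordant + tiesX) * (concordant + discordant + tiesY)
  );
  return (concordant - discordant) / denom;
}

// Benjamini-Yekutieli: reject p_(i) <= i * alpha / (m * c(m)),
// where c(m) = sum_{k=1..m} 1/k; returns the significant p-values.
function benjaminiYekutieli(pValues, alpha = 0.05) {
  const m = pValues.length;
  const c = Array.from({ length: m }, (_, k) => 1 / (k + 1)).reduce((a, b) => a + b, 0);
  const sorted = [...pValues].sort((a, b) => a - b);
  let k = 0;
  sorted.forEach((p, i) => { if (p <= ((i + 1) * alpha) / (m * c)) k = i + 1; });
  return sorted.slice(0, k);
}
```

A coefficient above 0.3 from `kendallTauB` would then count as a moderate correlation per the threshold stated above.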

Data and Recipient Categories Matter.
It is the purpose of our interfaces to provide people with specific, yet comprehensible, information on who is collecting which categories of data. We note a clear imperative for providing such granular detail. We asked participants about different categories of personal data being shared for ad purposes, and they conveyed their opinions towards each as one of five tiers of caring, ranging from "Do not care at all" to "Care a lot" (Q6) (Figure 8). 75% of participants used at least 3 tiers of caring across the 7 different data categories we listed, thus, indicating that sentiments varied between different categories for many. Likewise, categories such as participants' interests showed little consensus on whether or not sharing mattered: 27% of participants cared to some degree, 47% did not, and 26% expressed a neutral view.
A similarly broad spectrum of views exists for the categories of data recipients. While 41% of participants were opposed to all sharing of interest data, 59% had more granular preferences (Figure 9). Participants' nuanced privacy preferences could ultimately provide a new avenue for ad personalization (§5.3). When asked if they would use an ad/tracking blocker capable of selectively allowing certain sites to receive certain data (Q9), 59% of participants responded "I would use it if it would keep the sites free," while another 28% said they would continue using some existing ad/tracking blocker. Participants, thus, held diverse privacy preferences regarding different organizations and data categories, especially when they had to consider paying for content.

Privacy Interfaces Have Promise to Help People Understand Who Collects Which Data. Participants' perceptions of the utility and clarity of the three privacy interfaces were broadly favorable (Figure 10). Overall, participants expressed agreement or strong agreement that the popup was the easiest to understand (93%), followed by the history interface (81%), and then the watchlist (65%). Most participants ranked the usefulness of the interfaces in the same order with 93%, 90%, and 80%, respectively, agreeing or strongly agreeing. These results are not entirely independent of participants' reported tech-savviness. For the privacy history interface we note a statistically significant and moderate correlation between how participants rated their tech-savviness and the degree to which they found the interface useful and understandable. 15 The popup showed a marginally weaker correlation. 16 The watchlist correlation was not statistically significant. 17 These results suggest that less tech-savvy participants struggled more with Privacy Pioneer's interfaces while overall impressions remain favorable.
When asked to enter the keyword "Batman" in the watchlist and identify whether it was shared per the third guided task, 76% of the participants picked the correct answer (Figure 11). The relatively weaker reception of the watchlist compared to the popup and privacy history may have been influenced by participants being expected to track a dummy keyword, which may have left them wondering about the purpose of the watchlist. The lesser understanding of the watchlist's purpose, as opposed to its functionality, could be a reason for the low correlation between Q12 (tech-savviness) and Q36 (comprehensibility of the watchlist interface) and Q37 (usefulness of the watchlist interface), respectively. However, overall, Privacy Pioneer was regarded by a majority of participants as useful with 90% agreeing or strongly agreeing regardless of how they rated their pre-existing understanding about website data collection and sharing (Figure 12).

Figure 11: Participants were asked to enter the keyword "Batman" into their watchlist, search on imdb.com for "Batman," and check the popup to identify any sharing of the keyword. To confirm their understanding, the survey showed a popup (right), for which 76% of the participants selected the correct answer (left).

Figure 12: 90% of participants rated Privacy Pioneer overall as useful independently of their understanding of data collection and sharing practices. There was no significant correlation between the level of understanding of privacy practices and the perceived utility of Privacy Pioneer (Kendall Tau test, coefficient = -0.019, p = 1.000). However, there was a correlation between understanding Privacy Pioneer (Q39) and recommending it to friends or colleagues (Q45) (Appendix 8.6, Figure 22).

16 [...] coefficient = 0.274, p = 0.057 (barely insignificant). Q12 (tech-savviness) vs. Q24 (usefulness of the popup): Kendall Tau test, coefficient = 0.297, p = 0.026.
17 p > 0.05 for both usefulness and understanding.

Privacy Interfaces Should Be Directly Integrated in the Browser. Most participants found it desirable to have the tested privacy interfaces directly integrated in the browser. The popup received the highest rate of approval (85%) (Figure 13). 17% of participants expressed that they would continue using Privacy Pioneer on Firefox (11% expressed that they would not) (Q44). 65% of participants expressed that they would continue using Privacy Pioneer if it were available for their main browser (7% expressed that they would not). Across questions, a small number of participants conveyed dissatisfaction over the performance of Privacy Pioneer's interfaces, particularly as to the loading time of the popup. 8% of participants reported technical issues during the study, some of which pertained to the study as opposed to Privacy Pioneer (Q42). When asked about improvement suggestions, 5% of participants focused on improvements relating to performance or tediousness. On what they liked, 52% responded they found the interfaces easy to use and understand. 50% valued the information shown. 20% expressed excitement and had no improvement suggestions. Still, some participants suggested to simplify language (18%), create tutorials and make functionality more clear (15%), and provide interface improvements (15%). Participants were particularly in favor of the inclusion of better usage instructions, such as a video tutorial. Some participants also would have wanted a better explanation of the trackers and what can be done with the information (10%) or added features to protect privacy (5%). This finding indicates that people want both transparency and control, that is, both notice and choice.

DISCUSSION
We developed Privacy Pioneer's privacy interfaces to help improve data transparency in the web ecosystem.

Automating Notice for More Transparency
Our interfaces are intended to help people become aware of the data collection practices they are subjected to. Many privacy laws are based on notice and choice. However, in its current form notice and choice has proven to be ineffective. Automating notice in the browser holds the promise of making websites' privacy practices more transparent. Our results suggest that browser-driven dynamic analysis of websites' data collection practices is technically feasible with good accuracy and tolerable computational effort (§3). Among the interfaces we studied, the privacy popup (Figure 2), which displays data collection practices of the currently visited site, was perceived as the most useful and easiest to understand with 93% of participants agreeing or strongly agreeing (Figure 10). These levels of agreement compare favorably to the general perception of privacy policies as bloated and unreadable [53]. Since the generated notices are backwards-traceable to the dynamically analyzed code [83], the identified practices can be precise, up-to-date, and shown in context. Participants' general preference for the popup indicates that privacy notices should be made easily accessible without requiring people to navigate to a hard-to-find privacy settings page or perform multiple clicks to access a privacy interface.

Enabling People to Make Privacy Choices
Transparency of data collection practices is a necessary prerequisite for enabling people to act on these practices and exercise their privacy rights. Informed choice demands notifications that are clear, comprehensive, and actionable. 43% of study participants stated that they did not have a good understanding of the data collection and sharing practices they are subjected to on the web (Figure 7). Indeed, many people lack awareness about who is collecting which data from them. If people are not aware, they have no reason to act. Therefore, transparency is important. More transparency would not only enable people to act but also strengthen the validity of their choices. They could exercise their privacy rights more intentionally. Otherwise, site owners may try to argue that people's unawareness of what they are declaring, for example, by sending opt out signals [82,85], allows them to ignore people's requests as legally irrelevant. This argument should be preempted. Surfacing who is collecting which data may also have the side effect of motivating site owners to implement good privacy practices as their sites' behaviors become more visible and, thus, subject to regulatory scrutiny. In this context, regulators can use Privacy Pioneer to identify data collection practices of sites they may want to investigate.

Personalizing Ads and Preserving Privacy
Participants expressed a variety of preferences as to which data categories they would be willing to share and the categories of organizations that could receive the data (§4.3.2). 53% of participants did not care much or at all if their visit to a website would be shared for ad purposes; the same is true for 52% as to their ZIP code locations and for 47% as to their interests (Figure 8). 29% of participants would allow the sharing of interests with big tech companies, 23% with social media companies, and 23% with ad networks (Figure 9). These results suggest that making people aware of applicable privacy practices and giving them the choice to opt out does not necessarily mean the end of all personalized advertising. Opt outs could be selective based on data categories and organizations receiving the data. As participants were generally able to effectively navigate Privacy Pioneer's interfaces and to understand them (§4), such a selective choice seems feasible. Overall, there is a place for personalized ads to the extent that any underlying data usage is transparent, with usable choice, and performed in a privacy-preserving way.

LIMITATIONS
Our privacy analysis methods and interfaces in Privacy Pioneer (§3) as well as our usability study (§4) are subject to various limitations:
• Browser APIs: The browser APIs used by Privacy Pioneer (§3.1.2) are not available in all browsers, e.g., the API necessary for capturing HTTP response data, webRequest.filterResponseData, is only available in Firefox. However, this limitation would not apply if browser vendors would integrate the proposed functionality directly in the browser, which is ultimately our goal.
• Encoded Data: While browser APIs allow us to capture unencrypted data, some data may be encoded. As a proof of concept we decode Hexcode SHA-256 and Base64 SHA-256 email address formats. For comprehensive coverage, more encodings for more data categories would be necessary.
• Deterministic Analysis Methods: Privacy Pioneer's deterministic analysis (§3.2.1) is limited by its rule-based nature and manual curation. E.g., URL list matching for identifying ad networks depends on the Disconnect Tracker Protection lists. Ad networks incorrectly added to the lists would be flagged while incorrectly omitted ones would not be.
• False Positives and Negatives: Both Privacy Pioneer's deterministic and probabilistic analyses may lead to false positives and negatives (Tables 1, 2, and 5). To enable the identification of false positives Privacy Pioneer provides context in the form of HTTP message snippets on which analysis results are based (Figure 2). The tradeoff between precision and recall is adjustable via hyperparameter tuning (§3.4.5). Before relying on results, e.g., for regulatory enforcement actions, they should be verified manually.
• IP Address Disclosure to IPinfo: To alleviate privacy and security concerns Privacy Pioneer processes all data locally except for the user's IP address, which is sent to IPinfo (§3.1.4). Privacy Pioneer displays a notification about the IP address disclosure. IPinfo told us that they store no data beyond the IP address and the number of times it made a request. They also said that the data is kept for one year and neither used by IPinfo nor shared with any third party. It would be possible to implement such an IP-to-location API from scratch [39], which is not our focus here.
• Location Dataset Creation: We created our location dataset (§3.3) by crawling the homepages of websites. Thus, it does not contain data from non-homepage pages. We further associated a site with a country based on its top-level country domain, which is only an approximation.
• Interaction with Other Software: Depending on the use of ad/tracking blockers, VPNs, and other software, Privacy Pioneer's analysis results may differ. However, we are not aware of any interaction of Privacy Pioneer with any other software that would break Privacy Pioneer or such software.
• Self-reporting and Positive Framing of Survey Questions: Answers to the questions of our usability study (§4) should be interpreted in light of their nature as being self-reported. Also, for questions asking participants whether they agree with a statement, people generally feel more enticed to agree than to disagree.

CONCLUSIONS
Privacy Pioneer's analysis methods and privacy interfaces demonstrate that the analysis of websites' data collection practices can be accurately performed in the browser to surface dynamic privacy notifications. If people understand which data is being collected from them and by whom, they can use technical measures to better protect their data and meaningfully exercise their privacy rights. The increased transparency could also be a motivating factor for website operators to improve their privacy practices. It is our goal to make instruments of notice and choice more usable. Our usability study primarily evaluated study participants' first impressions and whether they could effectively navigate the interfaces we presented to them. Further exploration, such as a longer and non-directed usability study, is needed to determine how people would engage with the analysis results and interfaces in real life. It would also be interesting to analyze websites' data collection practices broadly across a set of sites and over time.

Figure 1 :
Figure 1: High-level architectural overview of the Privacy Pioneer browser extension for Firefox.

Figure 2 :
Figure 2: The popup interface showing that data was collected for monetization, location, and tracking purposes (left).After clicking the location card, detailed information about the collected data, including the context of the HTTP message, will be available (right).

Figure 3 :
Figure 3: We created our location dataset by performing two web crawls during which we captured HTTP messages containing location data as detected by regular expressions. In the post-processing phase we masked the identified location data to avoid over-fitting the model and removed SVG paths from the dataset as those were clear false positives. We then sampled a random set of 5,472 HTTP messages and imported them into Doccano [20] for the subsequent annotation. Note that a site may generate multiple HTTP messages as it serves images, style sheets, scripts, and other resources.

Figure 4 :
Figure 4: Our model development and predictions at runtime. During training the model evaluated itself with the validation set. The probabilistic test set was held out for the classification performance evaluation.

Figure 5 :
Figure 5: We collected the HTTP messages from a random set of 50 sites from the Tranco list. We found a number of messages with over 100,000 characters (left). Those created a substantial amount of work, defined as the percentage of the total characters that Privacy Pioneer would be searching through (middle). However, they only exhibited few instances of data collection, i.e., privacy labels created (right).

Figure 6 :
Figure 6: On a scale of 1 (not tech-savvy at all) to 5 (very tech-savvy), participants overall gravitate towards higher tech-savviness.

Figure 7 :
Figure 7: Participants' attitudes towards privacy on the web based on responses to survey questions Q2, Q3, and Q4. Most participants did not feel confident that their privacy is protected on the web (Q2). Participants were also generally hesitant to claim a good understanding of data collection and sharing practices (Q4). Many, however, felt that they can do something to protect their privacy (Q3).

Figure 8 :
Figure 8: Participants' data sharing preferences for ad purposes broken down by data category. The extent to which participants care depends on the category of data being shared.

Figure 9 :
Figure 9: Participants' preferences for sharing interest data are organization-dependent. We asked participants to select all that apply or "None."

Figure 10 :
Figure 10: The degree to which participants found the privacy interfaces they encountered easy to understand (top) and useful (bottom) as well as their respective overall ratings of Privacy Pioneer.

Figure 13 :
Figure 13: The percentage of participants that recommended each interface be directly integrated into the browser. We asked participants to select all that apply.

4.3.5 Privacy Interfaces Should Be Easy to Use, Informative, and Actionable. We asked participants to share what they liked about our privacy interfaces and what improvements they would suggest.

Figure 19 :
Figure 19: Example of the privacy popup interface.

Figure 20 :
Figure 20: Example of the privacy history interface.

Figure 21 :
Figure 21: Example of the privacy watchlist interface.

Table 1 :
Classification performance of our deterministic classifiers running on the deterministic test set. F1 scores of at least 0.96 for all but one category indicate that the deterministic approach is reliable for the analyzed categories.

Table 2 :
Classification performance of our deterministic location data collection classifiers based on regular expressions and of our SVM-based location data collection classifiers running on the probabilistic test set ( §3.3.2).

Table 4 :
Classification performance of various models (TinyBERT Multi, BERT-Base Multi, Distilled Multi, TinyBERT Singles, BERT-Base Singles, Distilled Singles) running on the probabilistic test set containing instances of city, latitude, longitude, region, and ZIP code (support = 533). Singles are sets of models for classifying data for each location data category individually while Multi models classify data from all categories. The performance metrics are averaged across all five categories.

Table 5 :
Classification performance of the TinyBERT multitask model for the five location data categories running on the probabilistic test set.

Table 6 :
Participant demographics. 100 participants completed our study. Some did not provide data for all categories: 2 for Age Range, 0 for Sex, 2 for Race/Ethnicity, 11 for Student status, 13 for Employment status, 0 for Browser, and 0 for Operating System. Percentages are adjusted for any omissions. All data is from Prolific except for Browser and Operating System, which we asked participants to provide in our survey.

Table 9 :
Watchlist input validation regex explanation and examples.

8.2 TinyBERT Multitask Model Classification Performance in JavaScript

Table 10 :
Classification performance of the TinyBERT Multitask model after converting it to JavaScript running on the probabilistic test set.
8.3 Privacy Interface Screenshots