A Bilingual Longitudinal Analysis of Privacy Policies Measuring the Impacts of the GDPR and the CCPA/CPRA

Privacy policies are the main mechanism for websites to describe their practices in collecting and processing visitors’ personal data. Their format and content are subject to legal requirements that have changed due to recent privacy regulations, including the General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), and the California Privacy Rights Act (CPRA). Studying how privacy policies are adapted to such regulatory change can help identify shortcomings in implementing the law and inform future legislative initiatives. Existing work in this area mostly studied effects of the GDPR on privacy policies or the “Do Not Sell My Personal Information” link mandated by the CCPA. Methodologically, insights were mainly drawn from English-language privacy policies using keyword-based analyses or machine learning classifiers. In this work, we address this research gap and conduct a bilingual study of privacy policies in English and German that investigates the effects of the GDPR and CCPA/CPRA on privacy policy content, using established methods from corpus linguistics that are language-independent and do not rely on keyword lists or classifiers that may date quickly. We find that, unlike for the GDPR, the CCPA’s requirements were not yet widely implemented when it first became enforceable but only with its amendment, the CPRA. Before that, websites used more than 60 variants of the “Do Not Sell” link instead of the mandated wording and did not prominently reference individual rights granted by the CCPA/CPRA. While companies outside California and the US did adapt their disclosures to the CCPA/CPRA, this was limited to English-language policies and did not spill over to policies in German. For GDPR enforcement, we find websites to increasingly rely on legitimate interests to justify data collection, raising concerns about whether individuals’ interests in the privacy of their personal information are still sufficiently considered.


INTRODUCTION
In light of pervasive data collection through digital services, including mobile devices, the Internet of Things (IoT), and the modern Web, recent years have seen jurisdictions across the globe update existing or pass new privacy legislation. Three prominent examples are the EU's General Data Protection Regulation (GDPR) [21] and, for the US state of California, the California Consumer Privacy Act (CCPA) [71] and its extension, the California Privacy Rights Act (CPRA) [72]. Albeit quite different in scope and approach, the common goal of these laws is to create higher standards for the protection of personal information in an increasingly interconnected environment. Regulatory instruments for this include the requirement of a legal basis for data collection, transparency mechanisms that require companies to disclose their data processing practices, and providing people with individual rights regarding how companies process and use their personal information.
New privacy legislation coming into force provides researchers with the unique opportunity to study how service providers adapt to such regulatory change, to identify obstacles towards compliance, and to provide regulators with insights for future regulatory efforts. On the Web, the established way for websites to inform visitors about their privacy-related practices and let them acknowledge these practices is through privacy notices, such as privacy policies and consent notices. When new privacy regulations such as the GDPR and CCPA became enforceable, companies updated their privacy policies to comply with new transparency requirements and inform customers about their data rights [15]. Techniques used by privacy researchers to identify updates in online privacy policies include statistical analysis of text features such as changes in sentence, word, and syllable counts [7,44] or hashing of sentences and measuring their number of changes [15]. Approaches to measuring content change, such as changes in data retention in privacy policies, include machine learning and deep learning classifiers [44] or searching for keywords related to the enforced privacy regulation [4,15,76].
While most of this prior work concerns changes in privacy policies around the GDPR enforcement date, there is a notable lack of work regarding the effects of the CCPA/CPRA. Existing research in this area has focused on how websites implement the Act's requirement to allow Californians to opt out of the sale of their personal information [57,75] and how people perceive these mechanisms [29,57], while, to the best of our knowledge, a more in-depth content analysis of post-CCPA privacy policies is missing. Additionally, the vast majority of existing work in privacy policy content analysis, including work not explicitly focusing on changes due to new legislation, only considered policies in the English language, which leaves privacy disclosure practices in large parts of the world underexplored. Only recently has the analysis of non-English privacy policies started to receive attention, with existing work either exploring privacy policies at one or two points in time [5,39] or focusing on descriptive statistics and the prevalence of specific key phrases [15] or corpus creation and annotation [2,14].
We address this research gap by conducting a diachronic bilingual content analysis of privacy policies in English and German, as these are the two most widely spoken languages in Europe as a first or second language [18] and we are familiar with both. Using established natural language processing (NLP) techniques, we revisit the enforcement of both the GDPR and the CCPA/CPRA and investigate how the content of websites' privacy policies has changed as these regulations became effective or enforceable, examining three corpora of privacy policies in English and German to assess how new privacy laws affect privacy disclosures on the Web. In summary, we contribute to privacy policy research as follows:
• We investigate how the content of privacy policies changed under the GDPR and the CCPA/CPRA after they became enforceable (May 25, 2018 for GDPR; July 1, 2020 for CCPA) and effective (January 1, 2023 for CPRA) by examining wording, phrases, association strength, and legal terms. Applying the same methods to multiple privacy regulations coming into effect allows for direct comparison of their effects.
• We provide further insights into CCPA adoption over time by analyzing variants of the "Do Not Sell" link and their evolution. We also provide first insights into how websites adapt to the CPRA.
• To foster multilingual privacy policy analysis, we study the effects of privacy laws on privacy policy content in the two most frequently spoken languages in the EU, English and German, in a longitudinal analysis. While we find that policies in both languages adopt the terminology of new privacy laws, German policies more explicitly refer to concrete provisions, while English policies are more descriptive.
On a methodological level, our results show that established NLP methods such as keyness analysis and topic modeling are well suited to identify prominent topics in both English and German privacy policies, as well as how they evolve over time, especially in reaction to new legislation. Our approach complements prior approaches to privacy policy content analysis, while not relying on lists of key phrases or deep learning models that would need to be updated or retrained if the regulatory environment changes. At the same time, our analysis supports the development of deep learning models by providing results against which their output can be compared.

BACKGROUND
As background for our work we outline the privacy laws of interest and the provisions expected to impact websites' privacy policies.
Legislative goals and process. In the EU, the General Data Protection Regulation (GDPR) [21] was the first in a proposed series of regulations to update and harmonize privacy laws across member states on a high level. It was passed in 2016 and became effective on May 25, 2018. The US state of California introduced the California Consumer Privacy Act (CCPA; Section 1798.100 of the California Civil Code [CCC] [71]) to strengthen the data privacy rights of consumers within state boundaries. It was passed on June 28, 2018, became effective on January 1, 2020, and became enforceable on July 1, 2020. Its guarantees were later expanded by the California Privacy Rights Act (CPRA) [72], a California ballot proposition that was approved on November 3, 2020, and took effect on January 1, 2023. Its enforcement was scheduled for July 1, 2023, but was delayed by a Californian superior court until at least March 29, 2024 [51]. Despite their common goal of protecting the personal information of individuals, the CCPA/CPRA and the GDPR differ in several important points [35,58], including the scope of protected and regulated parties, the definition of personal data and the lawfulness of its processing, individual rights, and the magnitude of fines.
Regulated entities. The GDPR and the CCPA/CPRA use similar characteristics to define the entities protected and bound by the respective laws but with differences in terminology and scope. The GDPR distinguishes between "data subjects," "data controllers," and "data processors." Data subjects are already identified or identifiable natural individuals regardless of their residency. According to Article 4 GDPR, data controllers decide on the reasons for which and the methods by which personal data is processed. The data processor, usually a third party, processes personal data on behalf of the data controller and according to their instructions. The CCPA/CPRA uses the term "consumer" for the data subject, with the difference of including only Californian residents, while data controllers and data processors are "businesses" and "service providers," respectively.
Applicability. According to Article 3, the GDPR applies to both data controllers and processors if they are established in the EU and process personal data or, if established only outside the EU, they offer services to data subjects in the EU. In contrast, according to its Section 1798.140(c)(1), the CCPA/CPRA does not bind all businesses and service providers but only those who (1) do business in California (2) with Californian residents and (3) either (i) buy, sell, or share the personal data of at least 100,000 consumers or only collect the personal data of at least 50,000 consumers, or (ii) had a gross annual revenue of at least US $25 million in the preceding year, or (iii) generate at least 50 % of their annual revenue from selling or sharing personal information.
Permissibility of data collection and processing. The two laws differ fundamentally in the conditions under which they allow the processing of individuals' personal information. Under the GDPR, the processing of personal data is only lawful if at least one of the six legal bases in Article 6 GDPR applies, two of which are freely given, specific, and unambiguous consent to the data processing and necessity for the data controller's legitimate interests. By contrast, the CCPA follows an opt-out approach in its Section 1798.120 by providing Californians with the right to opt out of the sale of their personal information. Section 1798.135 establishes that consumers must be made aware of this right through "a clear and conspicuous link on the business's Internet homepage, titled 'Do Not Sell My Personal Information,' to an Internet Web page that enables a consumer [...] to opt out of the sale of [their] personal information." Further, consumers' associated rights under this section must be described in an online privacy policy or "[a]ny California-specific description of consumers' rights" (Section 1798.135(a)(2) CCC). As "sell" is defined in Section 1798.140(t)(1) as communicating a consumer's personal information "for monetary or other valuable consideration," a business sharing consumers' personal data for any benefit can be understood as a sale. Thus, this provision widely requires commercial websites to provide a "Do Not Sell" link and associated disclosures in their privacy policy. The CPRA amended Sec. 1798.135(a)(1) CCC to require the wording "Do Not Sell or Share My Personal Information." It also introduced a new right to limit the use of sensitive personal information (SPI), including racial origin and ethnicity, religious, political, and philosophical beliefs, sexual orientation and activity, financial information, and health status and history. Companies must publish a second link on their home page titled "Limit the Use of My Sensitive Personal Information," which can be combined with the "Do Not Sell or Share" link into a "single, clearly-labeled link."
Transparency requirements. The GDPR also lays down extensive transparency requirements for processors of personal data. Article 12 requires that data subjects be informed about the processing of their personal data "in a concise, transparent, intelligible, and easily accessible form, using clear and plain language." Article 13 specifies what information needs to be provided, including contact information, the purposes and legal basis for the processing, and the data subject's rights regarding their personal data. As IP addresses are considered personal data under the GDPR [20] and websites typically store them at least temporarily in web server logs, the requirement for a privacy policy and associated disclosures under the GDPR widely applies to websites.
Individual rights. The GDPR and the CCPA/CPRA grant individuals certain rights regarding knowledge and control of how companies use their personal data. These include the "right to know" what data companies have collected about them and the "right of deletion" of collected and processed data, which grants the "right to be forgotten." The GDPR's "right of rectification" initially did not appear in the CCPA but was introduced by the CPRA. It allows affected individuals to ask data controllers to correct inaccurate information or complement incomplete personal data.

RELATED WORK
In this work, we build upon earlier findings from web privacy measurements and privacy policy analysis to create a corpus of website privacy policies and study how their content evolved in response to the GDPR and the CCPA/CPRA.
Privacy policy analysis. Previous work has extensively studied online privacy policies, including their prevalence [55], readability [48,64], and user perception [44]. Recent research in this area has focused on automated content analysis, extraction, and summarization of data practices using natural language processing (NLP) and machine learning (ML) techniques [5,31,45,73,78], with some focusing on longitudinal aspects [1,76]. One particular challenge is the high frequency of changes, which makes it difficult to trace the evolution of privacy policy content over time. Our work contributes to overcoming this challenge by drawing on larger, bilingual corpora and extensive topic modeling.
Effects of privacy laws on privacy disclosures. Other work that more specifically focused on the effects of new privacy legislation on privacy policies was conducted when the GDPR came into effect in 2018. Degeling et al. [15] monitored changes in privacy policies on European websites over the course of 2018 and found an average increase in the prevalence of privacy policies of 4.9 % and an increase in the average length of 18.0 %. Content-wise, they identified an increase in the prevalence of GDPR-related terminology, especially that related to user rights and legal bases of processing, but did not conduct a more thorough content analysis. Linden et al. [44] analyzed changes in presentation, textual features, coverage, compliance, and specificity of 6,278 English-language privacy policies between January 2016 and May 2019. They confirmed an increase in average length and also found improvements in user experience, topic coverage, and specificity, though most of these improvements only concerned policies targeted at an EU audience. Wagner [76] conducted a longitudinal analysis of around 50,000 privacy policies from 1996 to 2021, using archival data and methods from ML and NLP to study data practices and the rights granted to users and reserved for companies over time. While she found some types of personal data to be less often collected after the introduction of the GDPR and the CCPA, there was an increase in the collection of location and implicitly collected data, as well as data sharing with unnamed third parties, and website visitors often lack a meaningful choice in how their personal data is used.
Multiple studies measured the prevalence of cookie consent notices as a more recent transparency mechanism for a website's data processing practices. They found that many notices do not offer sufficient choice to deny data collection, do not have a backend that properly implements the visitor's selection, or use dark patterns to nudge visitors into consenting to all data processing [15,47,56,74].
Effects of the CCPA on websites. The CCPA's requirement to provide a "Do Not Sell" link (see Section 2) was among the first effects of this law to be investigated by web privacy research. O'Connor et al. [57] manually and automatically analyzed popular US websites in July 2020 and January 2021 for how implementations of this requirement evolved after the CCPA became enforceable. They already found deviations from the mandated wording of the "Do Not Sell" link, which they partially attributed to deceptive purposes, but unlike this work they did not track the evolution of specific wordings over time. They also noticed widespread use of dark patterns to make the "Do Not Sell" link less visible on websites and, through two user studies, found that these techniques decreased interaction rates and hindered website visitors from exercising their right to opt out of the sale of their personal information. Van Nortwick and Wilson [75] measured the prevalence and implementation of "Do Not Sell My Personal Information" items on 497,870 English-language websites from the Tranco website ranking and a list of domains known to be third-party trackers or advertisers. They found "Do Not Sell" links on 9,838 sites, with a slow increase in adoption between July/August and November/December 2020, and partially attributed the low adoption rates to the CCPA not applying to the majority of websites (see Section 2). After the initially permissible alternative wording "Do Not Sell My Info" had been removed from the CCPA proposal in December 2020, the study found that only a few websites had updated their "Do Not Sell" links accordingly. The links were found to be often placed in website footers, where they are poorly visible and/or hidden from non-Californian visitors via dynamic link hiding. Our work adds to these findings by also investigating the prevalence and structure of alternative wordings for the "Do Not Sell" link and their evolution over time, including the new wording mandated by the CPRA.
Proposed approaches other than a link to implement the "Do Not Sell" requirement include icons [29] and browser-based mechanisms [26,79]. For the latter approach, Global Privacy Control (GPC) [26], the California Attorney General has expressed that websites are legally obliged to treat the GPC signal sent by the browser as a "Do Not Sell" request under the CCPA [70].
Studying the effects of the CCPA beyond this requirement, Chen et al. [12] analyzed 95 privacy policies from popular websites across the United States to evaluate the clarity and effectiveness of CCPA disclosures. They concluded that information relevant to the consumer was often obfuscated and unclear.

APPROACH
In this work, we address these research gaps by conducting a bilingual diachronic analysis of how the GDPR and the CCPA/CPRA affected privacy policies on the Web. We examine and compare their content regarding textual characteristics and topics. This section provides an overview of our study design. We describe our preliminary CCPA study, which was intended to give a first impression of how websites adapt to this law, followed by the methods used in our main analyses for an in-depth investigation of the GDPR's and CCPA/CPRA's effects on privacy policy content. Figure 1 illustrates the methods and data corpora used.

Preliminary CCPA Study
In September 2019, we conducted a pre-study to understand if and how websites were preparing to adapt to the CCPA and whether they already contained CCPA-related privacy mechanisms and disclosures. For this pre-study, we investigated a combined set of domains from two different website rankings to get a broad first impression of websites' privacy practices. To account for local developments, we used the Alexa top list of popular websites [3] from September 2019, as Alexa provided website popularity rankings by region. We added to the pre-study domain set the 500 most popular websites for the US states of New York and California, as well as those for Germany, Australia, India, and Israel. These specific regions were selected for their geographic variety and economic strength. To account for global popularity, we added to the pre-study set of websites the top 500 domains on the Tranco ranking [41] from September 27, 2019 (ID: 3QNL). At that time, Tranco had started to evolve into the most popular website ranking for research purposes. To simulate a resident in the location of each country-specific domain top list, we connected to VPN servers in the respective regions. Our setup on a university server automatically established a connection to the VPN server, performed website scraping, and disconnected immediately after that task was finished. We used the Open Web Privacy Measurement (OpenWPM) framework [19] to crawl popular websites for privacy policies, CCPA-related web pages, and homepages. The homepages were searched for policy links with keywords pointing towards privacy policies in English and German. To identify links hinting at California privacy notices, we searched for URLs containing a combination of "California" and "privacy" or "California" and "right" in German and English. We filtered the downloaded web pages for duplicates, leading to a data set of 11,559 web pages belonging to 2,523 domains. The results of this preliminary study are described in Section 5.1.
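As an illustration of the URL-based search for California privacy notices described above, the following minimal Python sketch checks whether a URL contains one of the keyword combinations. The function name is hypothetical, and the German keywords are assumptions chosen for illustration, since the exact German terms are not listed here.

```python
from urllib.parse import unquote

# Keyword pairs that must both occur in a URL. The English pairs follow the
# description above; the German pairs are illustrative assumptions.
KEYWORD_PAIRS = [
    ("california", "privacy"),
    ("california", "right"),
    ("kalifornien", "datenschutz"),
    ("kalifornien", "recht"),
]


def is_california_notice_url(url):
    """Return True if the URL contains both keywords of at least one pair."""
    normalized = unquote(url).lower()
    return any(a in normalized and b in normalized for a, b in KEYWORD_PAIRS)
```

Substring matching of this kind is deliberately permissive (for example, "right" also matches "rights"), so candidate links found this way would still need to be checked for false positives.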

Data Collection
Our main analyses use three different corpora, as described below.

Analyzed Corpora.
Investigating the impact of privacy regulations requires longitudinal corpora of privacy policies.In each of our analyses, we use distinct privacy policy corpora and/or domain popularity rankings, depending on the focus of the analysis.
GDPR Corpus. This multilingual corpus by Degeling et al. [15] provides a longitudinal snapshot of websites' privacy and cookie policies before and after the enforcement of the GDPR and was created to find evidence for GDPR-related changes on websites. For 28 European countries, the 500 most popular domains according to the Alexa website ranking were visited 15 times in the period between December 2017 and December 2018. Each month, one crawl was conducted, except for May, when three crawls were conducted to capture GDPR enforcement effects in a more fine-grained way. The websites' homepages were searched for links containing specific keywords that typically occur in the links of privacy and cookie policies. The raw data set consists of 127,328 web pages with privacy statements in 24 different languages and provides the basis of our main study regarding GDPR effects.
CCPA/CPRA Corpus. To investigate CCPA-related modifications, we selected the most popular 100K domains of the research-oriented Tranco list [41] as of November 5, 2019 (ID: GVWK). From December 2019 to July 2020, we performed a total of 15 crawls, visiting each of these domains and downloading their website's homepage as well as any identified privacy or cookie policy. In a second set of crawls in February 2021, we visited the homepages of the top 10K Tranco domains using the list from January 31, 2021 (ID: WQW9) and downloaded their privacy and cookie policies. To capture the privacy policy landscape after the CPRA had taken effect, we conducted a third set of crawls in January and February 2023, revisiting the top 100K domains on the Tranco list from December 23, 2022 (ID: 829V9). For all crawls, we used a server located in California to simulate the geolocation of Californian residents instead of connecting to VPN servers. Overall, the crawls resulted in a raw data set of 1,458,802 privacy and cookie policy pages and 1,309,003 homepages as a basis for our main study of CCPA/CPRA effects.
Wagner's Corpus. The longitudinal corpus of privacy policies by Wagner [76] consists of 645,124 English-language privacy policies dating from December 1996 to 2021, collected using the Internet Archive's Wayback Machine. The domains of this corpus were selected by combining the top 1K domains and randomly selected domains ranked between 1K and 10K from the Tranco lists from October 1, 2019 (ID: JL9Y) and March 31, 2021 (ID: ZLZG), as well as the Alexa top 1K domains for 2010-2021, Alexa top 500 between 2003 and 2009, and Alexa top 100 for 2002, resulting in a total of 4,997 domains. We use this corpus to compare and validate the results of our study on English privacy policy texts. For consistency, we extracted the privacy policies from this corpus that covered the same periods of time as each of the GDPR and CCPA/CPRA corpora, which yielded 101,181 privacy policies.

Domain Intersection.
We detected that the Tranco lists we used to create the CCPA/CPRA corpus were subject to fluctuations, i. e., domains did not consistently show up in the ranking over time. We found only 48.5 % of the top 100K domains on the used Tranco lists (see Section 4.2.1) to be persistent (see Appendix F). To conduct a thorough longitudinal analysis, the investigated domain sets must be cleaned of such fluctuations. Hence, we created intersecting corpora, i. e., subsets for each corpus and language that only contain privacy policies of policy domains present at all points in time covered by the respective longitudinal corpus. This procedure resulted in 479 and 138 intersecting domains for the English and German privacy policies of the GDPR corpus, respectively. Similarly, 1,946 and 42 intersecting domains were identified for English and German privacy policies in the CCPA/CPRA corpus. We applied the same method to the Wagner corpus, resulting in 542 and 655 intersecting domains for the collection time frames of the GDPR and CCPA/CPRA corpora, respectively. Appendix C lists the top intersecting policy domains based on the Tranco list from December 22, 2022 (ID: 82V9V), which we used for our crawls in early 2023. While this process of cleaning fluctuations reduced the size of the corpora, it ensured the consistency of data over time for longitudinal analyses.
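The construction of an intersecting corpus can be sketched in a few lines of Python. This is a minimal illustration under assumed data structures: policies are represented as hypothetical (crawl_id, domain, text) tuples, and the function would be applied separately per corpus and language.

```python
from collections import defaultdict


def intersecting_corpus(policies):
    """Keep only policies whose domain appears in every crawl.

    `policies` is an iterable of (crawl_id, domain, text) tuples; this record
    layout is an assumption made for illustration.
    """
    policies = list(policies)
    domains_per_crawl = defaultdict(set)
    for crawl_id, domain, _text in policies:
        domains_per_crawl[crawl_id].add(domain)
    if not domains_per_crawl:
        return set(), []
    # Domains present at all points in time covered by the corpus.
    persistent = set.intersection(*domains_per_crawl.values())
    kept = [p for p in policies if p[1] in persistent]
    return persistent, kept
```

For example, if a.com has a policy in both of two crawls and b.com only in the first, only the two policies of a.com are retained.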

Text Preprocessing
For all corpora, we applied the best practices for privacy policy preprocessing identified in our earlier work [34]. Following these, we used the Boilerpipe text extractor with the NumWordsRulesExtractor setting [38] to obtain the plain text of privacy policies from web pages, determined the languages of the texts by applying a majority voting scheme on the results of multiple language detection libraries, and identified non-privacy policies by applying trained classifiers [34] that achieved F1 scores of 99.1 % and 99.8 % for English and German. For each data collection time point, we manually inspected a random sample of 10 % of the downloaded policies and the final output for correctness and found no issues.
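The majority voting over language detectors can be illustrated as follows. The sketch assumes that each detection library has already returned a language code for a text; requiring a strict majority and returning None otherwise is an assumption made for illustration, as the tie-breaking rule is not specified here.

```python
from collections import Counter


def majority_language(votes):
    """Return the language code chosen by a strict majority of detectors.

    `votes` holds one language code per detection library. If no language
    receives more than half of the votes, None is returned (assumed rule).
    """
    if not votes:
        return None
    language, count = Counter(votes).most_common(1)[0]
    return language if count > len(votes) / 2 else None
```

With three detectors voting "en", "en", and "de", the text would be labeled English; a one-to-one split between two detectors would leave it unlabeled.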
Previous research has shown that segmentation of legal texts requires more sophisticated approaches, as standard NLP toolkits are challenged by the complex structure of privacy policies [25,68]. We obtained the best qualitative results for tokenization and part-of-speech tagging for both English and German privacy policies from the SoMaJo library [62] combined with its part-of-speech tagger SoMeWeta [61]. Manual inspection showed that they performed best among the tested NLP toolkits in (1) correctly stripping excessive punctuation, such as bullets of a bulleted list concatenated with the first token of a list item, (2) handling of punctuation symbols in references to laws and regulations, and (3) not splitting tokens containing intra-term hyphens by default.
For further sanitization, we used the Spacy library [52] for lemmatization and replaced email addresses, phone numbers, and URLs in the policy texts with placeholders using Textacy [17] and the token tags of SoMaJo and SoMeWeta. Privacy policies may also contain the names of brands and organizations. To remove bias, we replaced them with a placeholder. We identified these names using the named entity recognition (NER) functionality of Spacy [52], Stanza [63], and Flair [43] cumulatively and manually excluded falsely identified names. Appendix G shows examples for the preprocessing step. Finally, we removed duplicate policies for each data collection time point. Appendix L depicts the composition of the original and final sanitized corpora. The final number of privacy policies in English and German and the number of privacy policies for the intersecting domains in each corpus are shown in Table 1.
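To illustrate the placeholder step, the sketch below replaces email addresses, URLs, and phone numbers with placeholders. It is a simplified regex-based stand-in rather than the Textacy- and tag-based procedure described above; the patterns and placeholder strings are assumptions made for illustration.

```python
import re

# Simplified patterns; real policy texts need more robust handling.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL_RE = re.compile(r"(?:https?://|www\.)[^\s,;)]+")
PHONE_RE = re.compile(r"\+?\d[\d\s()/-]{6,}\d")


def replace_with_placeholders(text):
    """Replace emails, URLs, and phone numbers with hypothetical placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = URL_RE.sub("<URL>", text)
    text = PHONE_RE.sub("<PHONE>", text)
    return text
```

Applied to "Contact privacy@example.com or visit https://example.com/privacy, phone +1 555 123 4567.", the function returns "Contact <EMAIL> or visit <URL>, phone <PHONE>."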

Text Mining
After preparing the corpora, we mined the privacy policy texts for keyphrases, co-occurrences, wordings, and relevant topics.

Keyness Analysis.
Keyness analysis aims to identify terms that stand out when comparing corpora. In corpus linguistics, the compared corpora are referred to as the reference corpus (R) and the target corpus (T). This analysis is performed by measuring whether the frequency of a term in the target corpus stands out statistically compared to its frequency in the reference corpus. In other words, the null hypothesis $H_0$ is defined as there being no difference between the frequency of a term in the compared corpora. The typical statistical measure for this comparison is the log-likelihood ratio ($G^2$) value, computed as follows [9]:

$$G^2 = 2 \left( O_{11} \ln \frac{O_{11}}{E_{11}} + O_{21} \ln \frac{O_{21}}{E_{21}} \right)$$

For our keyness analysis, we measure n-grams, with n ranging from 1 to 5. Occurring n-grams are counted only once per policy text, i. e., we consider n-gram types and not n-gram frequencies. $O_{11}$ and $O_{21}$ refer to the observed frequencies of an n-gram in the target and reference corpus, respectively. For an n-gram $w$, $E_{11}$ and $E_{21}$ are the expected n-gram frequencies in the target and reference corpus and are calculated as follows:

$$E_{11} = \frac{N_T \, (O_{11} + O_{21})}{N_T + N_R}, \qquad E_{21} = \frac{N_R \, (O_{11} + O_{21})}{N_T + N_R}$$

where $N_T$ and $N_R$ denote the number of n-grams in the target and reference corpus, respectively. Since log-likelihood is sensitive to corpora of different sizes [77], we used the Bayesian Information Criterion (BIC), which is calculated as $BIC = G^2 - \ln(N)$, where $N = N_T + N_R$ stands for the combined number of n-grams in both corpora. A BIC value larger than 2 ($p < 0.0018$) indicates positive evidence against $H_0$, while a value larger than 10 ($p < 0.0000024$) indicates very strong evidence against $H_0$. If very strong evidence exists, an n-gram is considered to be more associated with the target corpus if $O_{11} > E_{11}$ and the normalized frequency in the target corpus is higher. The normalized frequency refers to the frequency per million n-grams to ensure the comparability of corpora of different sizes. To remove noise for this analysis, we filtered n-grams starting and ending with connector words (e. g., then, too, ...), containing placeholders for email addresses, phone numbers, or URLs, as well as n-grams with a lower normalized frequency than 10 per million. Unless stated otherwise, we report on n-grams with BIC > 10, indicating very strong evidence against $H_0$.
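The keyness computation described above can be sketched as follows. This is a minimal illustration rather than our analysis pipeline: the function and parameter names are hypothetical, the two-term form of the log-likelihood ratio is assumed, and the corpus sizes are passed in as total n-gram counts.

```python
import math


def keyness(o_target, o_reference, n_target, n_reference):
    """Return (G2, BIC, expected target frequency) for one n-gram.

    o_target / o_reference: observed n-gram frequencies in the target and
    reference corpus; n_target / n_reference: total n-grams per corpus.
    """
    n_total = n_target + n_reference
    observed_total = o_target + o_reference
    # Expected frequencies under H0 (no difference between the corpora).
    e_target = n_target * observed_total / n_total
    e_reference = n_reference * observed_total / n_total
    g2 = 0.0
    for observed, expected in ((o_target, e_target), (o_reference, e_reference)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    g2 *= 2
    bic = g2 - math.log(n_total)
    return g2, bic, e_target


def is_key_in_target(o_target, o_reference, n_target, n_reference):
    """Very strong evidence (BIC > 10) and over-representation in the target."""
    _g2, bic, e_target = keyness(o_target, o_reference, n_target, n_reference)
    return bic > 10 and o_target > e_target
```

For an n-gram observed 150 times in a target corpus and 50 times in a reference corpus of one million n-grams each, the expected frequency is 100 in both corpora and the BIC clearly exceeds 10, so the n-gram would count as key for the target corpus.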
While statistical evidence of a difference in frequency is a necessary condition of keyness [77], it is insufficient to indicate prominence. While the BIC value indicates the presence or absence of statistical evidence against $H_0$, it is not a measure of effect size, i. e., the magnitude of the difference between the normalized frequencies of an n-gram across the compared corpora [24]. Hence, to complement this metric, we calculated the log ratio to determine effect size [30]:

$$\mathrm{LR}(w) = \log_2 \frac{NF_{T}(w)}{NF_{R}(w)}$$

where $NF_{T}(w)$ and $NF_{R}(w)$ indicate the normalized frequencies of n-gram $w$ in the target and reference corpus per million, respectively. Each additional point of the log ratio score signifies a doubling of the disparity between the two corpora for the considered n-gram. If a term was absent from one corpus, a tiny value (0.00000000000000000001) was used instead of zero. To provide the normalized rate difference of an n-gram in case of its absence in one of the compared corpora, we report the difference coefficient (DiffC), calculated as [33,42]:

$$\mathrm{DiffC}(w) = \frac{NF_{T}(w) - NF_{R}(w)}{NF_{T}(w) + NF_{R}(w)}$$

The difference coefficient ranges between +1 (if an n-gram appears only in the target corpus) and -1 (if an n-gram appears only in the reference corpus). A DiffC of 0 indicates no difference in the normalized frequencies of an n-gram in the compared corpora.

Keyness analysis on corpora from the same field but different points in time can shed light on shifts in language and terminology usage. In our case, we investigate such shifts with regard to the most significant terms associated with privacy policies before and after the regulatory regimes of the GDPR and CCPA became enforceable and the CPRA became effective. This required us to split our data into reference and target corpora at a specific point in time presumed to be a turning point in websites' decisions to adapt their privacy policies to regulatory change. Determining the turning point for this type of analysis is not trivial, as privacy policies might have changed several months before and after the enforcement dates of the respective laws. Previous work provides varying evidence of when websites started to adapt to new privacy legislation. For the GDPR, the points in time that saw the most changes in websites' privacy policies were found to be around the GDPR enforcement date in late May 2018 [4], one month before [15], and June 2018 [44]. Wagner's longitudinal study found a peak in the total number of unique privacy policy texts for 2020 and attributed this to updates due to the CCPA [76]. Despite these differences, the identified times of maximum change all hovered around the respective enforcement date, so we decided to use the final enforcement date of each regulation (or the effectiveness date, if not enforced at the time of writing; GDPR: May 25, 2018; CCPA: July 1, 2020; CPRA: January 1, 2023) as the turning point to split our corpora into pre- and post-enforcement subcorpora for our analyses. In each comparison, the pre-enforcement subcorpus is the reference corpus, and the post-enforcement subcorpus is the target corpus. We used the CCPA/CPRA subcorpora from February 2021 and early 2023 as additional target corpora to present an updated picture of changes in the privacy policy landscape.
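To make the two effect-size measures from this section concrete, the following minimal Python sketch computes the log ratio and the difference coefficient from raw counts. It is an illustration under stated assumptions, not our analysis code: the function names are hypothetical, and the stand-in constant for absent n-grams is only meant to be on the order of the tiny value quoted above.

```python
import math

# Assumption: a tiny stand-in frequency for n-grams absent from a corpus,
# on the order of the value quoted in the text.
ABSENT = 1e-20


def normalized_frequency(count: int, corpus_size: int) -> float:
    """Occurrences per million units of the corpus."""
    return count / corpus_size * 1_000_000


def log_ratio(nf_target: float, nf_reference: float) -> float:
    """Binary logarithm of the ratio of normalized frequencies."""
    # A zero frequency is replaced by the stand-in so the ratio is defined.
    nf_target = nf_target or ABSENT
    nf_reference = nf_reference or ABSENT
    return math.log2(nf_target / nf_reference)


def diff_coefficient(nf_target: float, nf_reference: float) -> float:
    """Difference coefficient in [-1, +1]; 0 means equal frequencies."""
    total = nf_target + nf_reference
    if total == 0:
        return 0.0  # n-gram absent from both corpora
    return (nf_target - nf_reference) / total
```

For example, an n-gram with 40 occurrences per million in the target corpus and 10 in the reference corpus has a log ratio of 2 (two doublings), while an n-gram that appears only in the target corpus has a DiffC of +1.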

Co-Occurrence & Dependency Analysis. Previous work has identified terms commonly associated with privacy policy texts and terms defined in privacy regulations. These terms were either identified manually, e. g., by reviewing privacy policies and regulations [15], or semi-automatically via guided topic modeling [37] or unsupervised topic modeling followed by consulting domain experts [69]. Our analyses go beyond this and leverage a co-occurrence analysis from the field of corpus linguistics [9], which allows us to discover contiguous phrases with common terms in privacy policies such as "collect," "process," and "share" to identify data practices exclusively associated with these terms in privacy policies. In our study, this analysis is not limited to identifying the most significant co-occurrences but also covers changes in statistical collocation strength in privacy policies over time. This requires statistical measures that a) measure the exclusivity of the co-occurrences, b) are independent of corpus size, and c) result in scores that make future analyses comparable with the current state. Therefore, we chose the log Dice score [67], which is calculated as:

$$\mathrm{logDice}(x, y) = 14 + \log_2 \frac{2 f_{xy}}{f_x + f_y}$$

where $f_{xy}$ is the number of co-occurrences of two words $x$ and $y$ in a predefined window of tokens and $f_x$ and $f_y$ are the number of occurrences of $x$ and $y$ in the corpus, respectively. The theoretical values of log Dice range between 0 and 14.
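As an illustration of the log Dice score, the following Python sketch counts co-occurrences within a window of tokens to the right of each word and scores a word pair. The window size, the right-sided window, and all function names are assumptions made for this example and do not describe our actual configuration.

```python
import math
from collections import Counter


def window_cooccurrences(tokens, window=3):
    """Count word frequencies and ordered co-occurrences within `window` tokens to the right."""
    word_freq = Counter(tokens)
    pair_freq = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            pair_freq[(left, right)] += 1
    return word_freq, pair_freq


def log_dice(f_xy, f_x, f_y):
    """log Dice score: 14 + log2(2 * f_xy / (f_x + f_y))."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


tokens = "we collect personal data and we share personal data".split()
word_freq, pair_freq = window_cooccurrences(tokens, window=1)
score = log_dice(
    pair_freq[("personal", "data")], word_freq["personal"], word_freq["data"]
)
```

In this toy input, "personal" and "data" always occur together, so the pair reaches the theoretical maximum of 14; a pair that shares only half of its occurrences scores one point lower.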
We parsed the privacy policies using spaCy to identify dependency bi-grams that are syntactically connected via a direct head-dependent relationship, i. e., the direct object of a verb. These direct objects consist of noun chunks instead of single nouns to make them semantically more meaningful. Examples of such head-dependent bi-grams are collect_personal information and share_aggregated data.

Topic Changes. The evolution of content in privacy policies over time is one of the less explored topics in privacy policy research. The traditional method to observe topic change over time is Dynamic Topic Modeling (DTM) as developed by Blei and Lafferty [8]. Drawbacks of DTM include that it does not account for topics appearing or disappearing over time and that it requires predetermining the number of topics ($k$) in the corpus. We experimented on each of our corpora to determine $k$ statistically [6,11,16,27] using the ldatuning package [54]. The privacy policies were preprocessed as described in Section 4.3 and segmented into subtopic passages using TextTiling [32]. The resulting number of passages per crawl and language is listed in Table 2. For each corpus and language, we trained 25 models using Latent Dirichlet Allocation (LDA), starting with 20 topics and increasing their number to 500. The resulting extreme values of four statistical tests on these LDA models converged between 440 and 500 topics as the optimal range, as shown in Appendix I. In the end, due to the aforementioned drawback of DTM, this traditional method was not able to infer a concrete number of topics and provide topical insights. Hence, we utilized a more suitable topic modeling method.
BERTopic [28] is an alternative to traditional topic modeling that does not require the number of topics in advance. In this method, each text is converted into vector embeddings using sentence-BERT [65], followed by reducing the dimensionality of these embeddings to cluster semantically similar texts. Dimensionality reduction and clustering are performed using the UMAP [50] and HDBSCAN [49] algorithms, respectively. To obtain the representative terms for each topic, a modified procedure for Term Frequency Inverse Document Frequency (TF-IDF) is applied that calculates the most important words per topic. This procedure, class-based TF-IDF, treats the passages inside a cluster as a single text. As a result, the score of each word $t$ within a class (cluster) $c$ is calculated as:

$$W_{t,c} = tf_{t,c} \cdot \log\left(1 + \frac{A}{tf_{t}}\right)$$

where $tf_{t,c}$ refers to the frequency of word $t$ within the cluster $c$, $A$ represents the average number of words per cluster, and $tf_{t}$ indicates the frequency of word $t$ across all clusters.
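The class-based TF-IDF weighting can be sketched in a few lines of Python. This is a simplified illustration of the formula as stated in the text (term frequency in the cluster, multiplied by the logarithm of one plus the average cluster size divided by the term's overall frequency), not BERTopic's implementation; the function name and the toy clusters are hypothetical.

```python
import math
from collections import Counter


def class_tfidf(clusters):
    """Score each word per cluster; `clusters` maps a label to a list of tokenized passages."""
    # Treat all passages inside a cluster as a single text.
    class_tf = {
        label: Counter(token for passage in passages for token in passage)
        for label, passages in clusters.items()
    }
    # Frequency of each word across all clusters.
    total_tf = Counter()
    for counts in class_tf.values():
        total_tf.update(counts)
    # Average number of words per cluster.
    avg_words = sum(total_tf.values()) / len(class_tf)
    return {
        label: {
            word: tf * math.log(1 + avg_words / total_tf[word])
            for word, tf in counts.items()
        }
        for label, counts in class_tf.items()
    }


toy_clusters = {
    "cookies": [["cookie", "browser"], ["cookie", "consent"]],
    "rights": [["right", "access", "consent"]],
}
scores = class_tfidf(toy_clusters)
```

In this toy example, "cookie" receives the highest score in the first cluster because it is frequent there and absent elsewhere, whereas "consent" is down-weighted in both clusters because it occurs in each of them.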
BERTopic allows for dynamic topic modeling that considers new topics appearing over time by first generating a general topic model. Then, the class-based TF-IDF representation is recalculated for each cluster at each point in time. This way, topic representations at each point in time can be calculated without the need to train separate models.

4.4.4 Measuring CCPA-related Terminology. The CCPA states in its Section 1798.135 that a link with the exact wording of "Do Not Sell My Personal Information" must be present on homepages and privacy policy pages of websites. In the corpora, we searched for specific items satisfying this unique requirement. As first investigations hinted at the absence of this phrasing, we crafted the regular expressions in Appendix A to capture similar wordings.

RESULTS
In the following, we report on the results of the preliminary study, followed by the main study. We present the significant shifts in terminology, legal references, and user rights, as well as topic trends in privacy policies after the enforcement of the GDPR and CCPA and the evolution of CCPA/CPRA-related wordings on homepages. We also show the first effects of the CPRA in early 2023.

Preliminary CCPA Analysis
As described in Section 4.1, we accessed 2,523 domains from six specific locations to collect their privacy policies and search their homepages for "Do Not Sell" links. Overall, 8,305 privacy policies were retrieved, of which 488 (less than 6 %) contained CCPA-related disclosures, as determined by the presence of the string "CCPA". Table 3 provides an overview of the number of collected privacy policies by VPN server location and how many of them included CCPA disclosures. We cannot observe any strong trend that website visitors from California would see CCPA-related content in privacy policies more often than non-Californians. We also analyzed the presence of CCPA-related terminology in the privacy policies by websites' top-level domains (TLD). The highest prevalence was observed in privacy policies from .com domains, where 424 out of 5,111 policies (8.3 %) contained CCPA-related disclosures. 21 out of 442 (4.8 %) privacy policies with a .org TLD and 4 out of 224 privacy policies (1.8 %) with the .co.il TLD featured CCPA content. The remaining TLDs included .de, .com.au, .us, and .ca.us, whose privacy policies did not contain disclosures related to the CCPA.
Finally, we searched the domains' homepages for links containing the phrase "Do Not Sell My Personal Information." At the time of this pre-study in October 2019, none of the inspected homepages had a link with this exact wording, hinting at websites not having taken preparations for the CCPA back then.
In October 2019, websites were not yet prepared for the CCPA: Although 5.8 % of the inspected privacy policies included CCPA-related disclosures, no homepage contained a link with the exact wording "Do Not Sell My Personal Information."

Keyness Analysis
As discussed in Section 4.4.1, the GDPR, CCPA/CPRA, and Wagner's corpora were split based on the enforcement / effectiveness dates of the GDPR, CCPA, and CPRA to find terms that occurred more often after the enforcement dates based on statistical evidence.
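The split into reference and target corpora can be pictured with the following minimal Python sketch. The turning points are the dates given in Section 4.4.1; the data layout (pairs of crawl date and policy text), the function name, and the decision to assign a policy crawled exactly on the turning point to the target corpus are assumptions made for illustration.

```python
from datetime import date

# Turning points as stated in Section 4.4.1.
TURNING_POINTS = {
    "GDPR": date(2018, 5, 25),
    "CCPA": date(2020, 7, 1),
    "CPRA": date(2023, 1, 1),
}


def split_corpus(policies, regulation):
    """Split (crawl_date, text) pairs into reference (before) and target (on or after)."""
    turning_point = TURNING_POINTS[regulation]
    reference = [text for crawl_date, text in policies if crawl_date < turning_point]
    target = [text for crawl_date, text in policies if crawl_date >= turning_point]
    return reference, target


# Hypothetical crawl dates and placeholder texts.
policies = [
    (date(2017, 12, 1), "policy before GDPR enforcement"),
    (date(2018, 6, 15), "policy after GDPR enforcement"),
    (date(2018, 12, 1), "later policy"),
]
reference, target = split_corpus(policies, "GDPR")
```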

GDPR Enforcement.
The analysis of the English GDPR corpus demonstrated an increase in the usage of phrases that refer to the individual rights of data subjects under the GDPR, which aligns with the findings of previous work [76]. Examples of such phrases are restrict_processing, object_to_processing_of_personal, right_to_withdraw_consent, and rectification. Furthermore, the right to data portability (Article 20 GDPR) is reflected in the increased frequency of readable_format and machine_readable after the GDPR enforcement date. The increased prevalence of phrases such as compliance_with_legal and comply_with_legal_obligation hints at compliance with the law becoming increasingly important for data controllers. In addition, the log ratio of phrases such as perform_contract (3.06), legitimate_interest (2.71), base_on_consent (2.18), contractual_obligation (0.94), and legal_obligation (0.67) provides evidence for how much more often data controllers grounded their data processing on each of the legal bases for data collection or processing in Article 6 after the GDPR went into effect. For comparison against existing corpora, Table 4 lists further statistically significant phrases and their log ratio in the GDPR and Wagner corpora.
Comparing these insights with the phrases that increased most significantly in the German privacy policies after GDPR enforcement paints a different picture. The most prominent phrases are references to individual GDPR articles. Examples include article_16, article_17, as well as many phrases containing article_6 and its paragraphs and enumerated subcases; detailed statistics are included in Appendix B. German privacy policies directly referencing GDPR provisions and those in English using a more descriptive approach could be rooted in different legal traditions: The common law system prevalent in the English-speaking world has legal precedent as its main source of law, while the civil law system that governs, among other jurisdictions, Germany and much of Europe, focuses on legal codes. Thus, legal texts in German are more likely to directly reference a law's individual provisions. In addition, more than 70 statistically significant phrases in the German corpus included the term "process", e. g., object to processing (log ratio 6.98), process restriction (3.30), and legal basis for processing (3.04), which reflects the increased importance of transparency and accountability of data controllers about data processing after GDPR enforcement in German privacy policies. In comparison, we found around 20 such statistically significant phrases in English privacy policies.
After GDPR enforcement, German privacy policies more prominently referenced concrete GDPR provisions to provide a legal basis for data processing, while English privacy policies favored a more descriptive approach.

CCPA Enforcement & CPRA Taking Effect. The n-grams listed in Table 5 indicate the frequency changes between the pre-CCPA and the July 2020 subcorpora, the latter of which was collected [...]. Comparing the pre-CCPA and February 2021 subcorpora yields a similar picture, which also holds for the Wagner-2021 corpus: Changes in the occurrence of CCPA-related terms are either not statistically significant or their log ratios do not differ from those of the July 2020 comparison. However, the comparison of the phrases between the pre-CCPA policies and those collected in Jan. and Feb. 2023 shows a substantial, statistically significant increase in the occurrence of the four CCPA consumer rights, as well as of the two consumer rights newly added by the CPRA, the right to correction of personal information and the right to limit the use of sensitive personal information. Table 6 compares the metrics for these rights between the pre-CCPA reference subcorpus and the July 2020, Feb. 2021, and Jan. & Feb. 2023 target subcorpora, showing a statistically significant increase in the occurrence of these phrases in Jan. and Feb. 2023. Values with a BIC > 10 lead to rejecting $H_0$ (no difference in frequency). The higher normalized frequencies in the target corpora (NFT) compared to the normalized frequencies in the reference corpus (NFR) and the positive difference coefficient values (DiffC) indicate a higher association with the three target corpora. This phenomenon could be due to the CPRA taking effect on January 1, 2023 and websites preparing for its enforcement, originally planned for July 1, 2023. These early adjustments to privacy policies to include consumer rights under the CPRA before its enforcement date indicate the willingness of businesses to comply with the CPRA.
CCPA consumer rights appeared significantly more often in English privacy policies in early 2023, especially the two consumer rights newly added by the CPRA.Such increases were not observable for CCPA consumer rights in July 2020.

Co-Occurrence & Dependency Analysis
Comparing the co-occurrence strength in English privacy policies before and after the enforcement dates (or, if not enforced at the time of writing, the effectiveness dates) of the GDPR and CCPA/CPRA, as described in Section 4.4.2, revealed several notable changes. Similarly to Wagner's work [76], we identified an increased co-occurrence of collect_precise location data in the January & February 2023 privacy policies. We additionally observed an increased exclusivity in co-occurrence strength for, e. g., collect_voice data, use_algorithm-based technology, and personalize_child, which would require more in-depth investigation. Due to space constraints, we list the observed occurrences of common verbs in privacy policies in Appendix H.

Topic Changes over Time
In Section 4.4.3, we described how we applied BERTopic to identify topic changes in privacy policies over time. Table 7 presents the number of topics identified by BERTopic in each corpus. The most likely reason for the difference in the number of topics between the GDPR and CCPA/CPRA English corpora and Wagner's corpus is that the latter was compiled using a combination of top-ranked domains from the Alexa and Tranco lists, while the GDPR and CCPA/CPRA corpora used only Alexa and Tranco, respectively. In addition, our CCPA/CPRA English corpus extends to early 2023 and includes more intersecting domains and privacy policies.
For each corpus and language, we first summarize the most frequently emerging topics, independent of the time aspect. Then we look at the topic distribution over time to highlight past and current topical trends in privacy policies. Each topic distribution is L2-normalized to allow for easier comparison of the magnitude of change over time between the topics. The trends in topics identified for Wagner's corpus are included in Appendix J for comparison.

5.4.1 GDPR Enforcement. As previously described, the GDPR corpus was compiled using data from December 2017 to December 2018. Examining the top 10 topics in the English GDPR corpus by frequency of occurrence, these topics cover common broad privacy practices such as cookies and using, sharing, and protecting personal information, as well as more specific topics and data processing purposes such as interest-based advertising, improvement of products and services, and promotional emails and text messages.
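The L2 normalization of topic distributions mentioned at the start of this section can be illustrated as follows. This sketch assumes that the normalization is applied to each topic's vector of frequencies over the crawl dates; the function name and the toy frequencies are hypothetical.

```python
import math


def l2_normalize(frequencies):
    """Scale a topic's frequencies over time to unit Euclidean length."""
    norm = math.sqrt(sum(f * f for f in frequencies))
    if norm == 0:
        return [0.0 for _ in frequencies]  # topic never occurs
    return [f / norm for f in frequencies]


# Toy example: a topic with 3 passages in one crawl and 4 in the next.
normalized = l2_normalize([3, 4])
```

After normalization, a rare topic and a frequent topic that grow by the same relative amount produce the same curve, which makes the magnitude of change comparable across topics.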
Looking at the effects before and after the GDPR enforcement date, Figure 2 depicts topics that follow a common trend. All of them had low occurrence prior to the GDPR enforcement date, which increased afterwards. One of these topics is "legitimate interests," which is one of the six legal bases of data collection and processing in Article 6 GDPR. What constitutes "legitimate interest" under the GDPR still requires interpretation by European courts and, consequently, still is the subject of recent research about deceptive design and potentially unfaithful data practices [36] five years after the enforcement of the GDPR [40].
In comparison, the overall top 10 topics in the German GDPR corpus differ and include cookies and their definition, opt-out cookies and links, the legal basis for pre-contractual data processing, and pseudonyms in user profiles. Companies' legitimate interests while protecting rights and freedoms of the affected individual are also among the trending topics after the GDPR enforcement date, as shown in Figure 3, along with the legal basis for processing personal data prior to entering a contract (Article 6(1)(b) GDPR). Reviewing the corresponding policy text clarified that companies use this legal basis to be able to communicate with customers and process their personal data prior to establishing a contractual relationship or to conduct credit investigations before providing financial services.

CCPA Enforcement & CPRA Taking Effect. The top 10 topics in the English CCPA/CPRA corpus, which covers the time between December 2019 and February 2023, reveal a different pattern. These topics include updates to privacy policies, security measures, the purpose of using information, functional cookies, third-party hyperlinks, and the potential requirement to disclose information to law enforcement authorities. Regarding the effects of CCPA/CPRA enforcement, we found that more topics in relation to these regulations have been included since December 2019. This supports our previous observation in Section 5.2.2, a significant increase of phrases in privacy policies referring to CCPA/CPRA-related rights. Figure 4 shows this upward trend for CCPA/CPRA-related topics that include core principles of these laws, such as the option to opt out of the sale of personal information, response time and format of verifiable consumer requests, and individual rights of Californians. A concerning trend is the continuous rise of references to "legitimate interests," also found in prior work [76].

The top 10 topics for the German privacy policies in the CCPA/CPRA corpus include the usage and definition of cookies; the rights to access, rectification, and limitation of data processing; the double opt-in process for newsletter registration via confirmation emails; and encryption of personal data sent through contact forms. Another top 10 topic regards opt-out cookies, whose occurrence has been decreasing since February 2021, as shown in Figure 5. The likely cause is the new German Telecommunications-Telemedia Data Protection Act (German abbr.: TTDSG) [10], which came into force in December 2021. Implementing the EU's ePrivacy Directive, Section 25 TTDSG only allows storing (or accessing already stored) information on an end user's device if the user has provided consent based on clear and comprehensive information, unless storing or accessing the information is, from a technical perspective, strictly necessary to provide a service explicitly requested by the user. The requirement for active, informed consent makes opt-out mechanisms not TTDSG-compliant [66], which explains privacy policies mentioning opt-out cookies less often.
Passages referring to legitimate interests of data subjects have been trending in English privacy policies since 2018.German privacy policies have been mentioning opt-out cookies less frequently since February 2021, possibly due to the TTDSG, which became effective in Germany in December 2021.

"Do Not Sell" Link Over Time
In our main study, the search for "Do Not Sell" links on the homepages of the domains analyzed for the CCPA/CPRA corpus paints a different picture compared to the pre-study. Table 8 shows an increase in the appearance of "Do Not Sell My Personal Information" over time, particularly a monotonic increase of 1.85 percentage points from December 2019 to the second half of January 2020, and a total of 3.57 percentage points to July 2020. The most likely reason is the initially declared CCPA enforcement date of January 1, 2020 and its postponement to July 1, 2020. Between July 2020 and January 2023, another 4.61 percent of the homepages included the wording required by the CCPA.
We also inspected the homepages for the link wording mandated by the CPRA, which added the aspect of sharing: "Do Not Sell or Share My Personal Information." Before 2023, no homepage in our CCPA/CPRA corpus contained this required link, while 1.85 % of homepages had added it in January 2023 and 3.12 % in February 2023. This indicates that website owners were gradually adapting to the CPRA's requirements and preparing for its enforcement.
Although many websites used different wordings for their "Do Not Sell" link in 2020, the prevalence of the exact wording stipulated by the CCPA/CPRA increased with each crawl, while nonstandard wordings gradually disappeared from homepages. We observed 60 wordings, listed in Appendix D, which differed in 1) use of acronyms or abbreviations ("PI" for "personal information" or "info" for "information"), 2) substitution of the term "information" with "data," 3) appending words referring to the legislation mandating this link or its applicability, such as "CA," "California," or "CCPA," and / or 4) capitalization of all characters.
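The four kinds of deviation listed above suggest how a tolerant matcher for link texts could look. The following Python sketch is a hypothetical illustration only; it is not one of the regular expressions from Appendix A, and the pattern, labels, and function name are assumptions.

```python
import re

EXACT_CCPA = "Do Not Sell My Personal Information"
EXACT_CPRA = "Do Not Sell or Share My Personal Information"

# Hypothetical pattern: case-insensitive, optional "or share", optional
# "personal", and "information" replaced by "info", "data", or "PI".
VARIANT = re.compile(
    r"\bdo\s+not\s+sell"
    r"(?:\s+or\s+share)?"
    r"\s+my"
    r"(?:\s+personal)?"
    r"\s+(?:information|info|data|pi)\b",
    re.IGNORECASE,
)


def classify_link(text):
    """Return 'cpra_exact', 'ccpa_exact', 'variant', or None for a link text."""
    normalized = " ".join(text.split())  # collapse runs of whitespace
    if normalized == EXACT_CPRA:
        return "cpra_exact"
    if normalized == EXACT_CCPA:
        return "ccpa_exact"
    if VARIANT.search(normalized):
        return "variant"
    return None
```

Separating exact matches from variants mirrors the distinction drawn in this section between the mandated wording and nonstandard wordings; suffixes such as "(CA)" or "CCPA" are tolerated because the pattern is searched for anywhere in the link text.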
In early 2023, websites had started to adopt the CPRA wording for the "Do Not Sell" link, while back in 2020 they used 60 different wordings instead of the one mandated by the CCPA.

Global Effects of the CCPA/CPRA
As outlined in Section 2, the CCPA/CPRA could affect any company that does business in California or deals with Californians, even if it is based in another US state or country. Consequently, we were interested in how many companies in our CCPA/CPRA corpus resided in California or offered services to Californians from elsewhere and had prepared themselves for the CCPA/CPRA coming into effect. As prior work has identified "spillover effects" of the GDPR on privacy practices in other jurisdictions [5], this raised the question whether the CCPA/CPRA have also led to changes in privacy policies in languages other than English, in our case German. As a metric for whether a company had adapted its privacy disclosures to California regulations, we used the presence of "Do Not Sell" mechanisms. To determine where companies were based, we used the Free Company Dataset from People Data Labs [59], which contains metadata on 12 million companies in the world, including company name, website, country, and region. 3,138 of the 4,674 domains (67.1 %) in our English-language CCPA/CPRA subcorpus were listed in the Free Company Dataset. This does not indicate a shortcoming of this data set, as not all domains necessarily belong to companies. 509 of these 3,138 domains (16.02 %) were linked to companies located in California. None of the 56 domains in the German CCPA/CPRA subcorpus belonged to companies from California. The higher number of domains compared to intersecting privacy policy domains is due to companies owning multiple TLDs, such as Google, Blogspot, or ESPN. Redirects to privacy policies with different domains are also common [15].
Our analysis shows that websites of companies with German privacy policies did not contain "Do Not Sell" links on their homepages or in their privacy policies at any time. However, many companies with English privacy policies inside and outside the US in early 2023 included "Do Not Sell" links on their homepages or declared not to sell or share personal information in their privacy policies. Outside the US, we identified 153 such companies from 35 countries around the world, including 48 in the UK, followed by 17 companies in Germany, 14 in Canada, 11 in India, and 6 in France and the Netherlands, as illustrated in Figure 6. Most of them are in industries such as the Internet, computer software, IT and services, publishing, online media, and marketing and advertising. As previous work has suggested [75], these types of companies residing outside the US are more likely to target an international audience and, thus, to collect personal data from Californian residents, as opposed to, e. g., the real estate industry. We made similar observations for companies within the United States, where we identified 489 companies with attributable states and 20 with non-attributable states in early 2023 with "Do Not Sell" links on their homepages or declarations not to sell or share personal information in their privacy policies. As expected, most of these companies (158) resided in California, followed by companies in New York (104), Massachusetts (27), Texas (22), and Washington (20). The higher number of "Do Not Sell" links and declarations on New York companies' websites compared to other US states could be explained by New York being the country's finance and investment center, while California is a tech hub, both attractive environments for company headquarters.
With regard to the definition of companies required to comply with the CCPA/CPRA (see Section 2), a relevant limitation of the Free Company Dataset is the lack of data on annual revenue and on the number of unique individuals affected by data collection. Therefore, we cannot investigate the compliance of companies with the CCPA/CPRA.
The CCPA/CPRA had an impact not only on transparency regarding the data practices of American companies but also on those of companies in 35 other countries around the world, as evidenced by the presence of "Do Not Sell" links on their websites in early 2023.

DISCUSSION & LIMITATIONS
In this section, we discuss our findings, their implications for future privacy policy analyses, and limitations of our approach.
Diachronic bilingual privacy policy analysis. By applying statistical and modern analysis methods and dividing the English and German GDPR and CCPA/CPRA corpora and the English Wagner corpus by the dates of three important privacy regulations, we strengthened the extensive findings of previous research and discovered new trends and topics in English privacy policies by comparing them with our findings in German privacy policies in fixed sets of domains. The bilingual comparison of the two most commonly used languages in the European Union sheds light on language-specific developments, such as English privacy policies increasingly referring to legitimate interests, while German privacy policies abandoned the concept of opt-out cookies. To the best of our knowledge, our work is the first to provide longitudinal insights into the changes in German privacy policies over a data collection period of almost two years, though our available German data is still too limited for a thorough comparison of CCPA/CPRA effects. We also provided first insights into the CPRA taking effect.
Benefits of our methods. Measuring whether predefined terms from multilingual word lists occurred in privacy policies [15], as well as applying trained deep learning classifiers based on the annotated OPP-115 corpus from 2016 [1,13,31,44,53,76], are well-established methods in privacy policy analysis. The keyness analysis employed in this paper is language-independent and allows for closer investigation of changes in privacy policies independent of (incomplete) word lists or machine-learning models, which might output false positives and negatives or require discarding trained models due to low precision [76]. Therefore, this method withstands the test of time and can be employed in future research. Our settings for keyness analysis are the strictest in the field of corpus linguistics [46,77]. These settings include the defined value of 10 for the BIC and a normalized minimum of 10 occurrences per million n-grams. A unique setting of our keyness analysis is to consider the occurrence of each n-gram at most once per policy, which enables us to map the number of occurrences of each n-gram to the number of inspected policies. While LDA topic modeling has been used to identify new topics in privacy policies after GDPR enforcement [69], we used a modern topic modeling technique over time based on sentence-BERT to discover new trends in privacy policies, thus providing privacy policy researchers with new tools.
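The setting of counting each n-gram at most once per policy can be illustrated with a short Python sketch. The tokenized toy policies, the function names, and the underscore-joined n-gram notation are assumptions for this example rather than excerpts from our code.

```python
from collections import Counter


def ngrams(tokens, n):
    """Contiguous n-grams, joined with underscores as in the phrases reported above."""
    return ["_".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


def policy_frequency(policies, n=2):
    """Number of policies in which each n-gram occurs at least once."""
    counts = Counter()
    for tokens in policies:
        # A set removes repeated occurrences within the same policy.
        counts.update(set(ngrams(tokens, n)))
    return counts


toy_policies = [
    ["personal", "information", "personal", "information"],
    ["personal", "information"],
]
counts = policy_frequency(toy_policies)
```

Here the bi-gram personal_information is counted twice, once per policy, although it occurs three times in total, so the count can be read directly as a number of policies.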
Our recommendations. Addressing regulators and enforcement agencies, we strongly suggest that future regulations incorporate more concrete requirements for implementing transparency and control mechanisms. New regulations should be accompanied by additional guidelines in non-legal language. Concrete examples or even sample code could provide further guidance. As evidenced by our initial observation of 60 wordings for the "Do Not Sell" link, this still does not guarantee that affected businesses are initially legally compliant, but it would be a step towards providing them with practical and unambiguous guidelines. In this light, CPRA guidelines now allowing for unspecified combinations of the two links could make compliance more difficult for websites. Another example of how more concrete legal requirements could boost compliance is the possible effect of the TTDSG on German privacy policies regarding opt-out cookies, as the GDPR had deliberately not clarified the usage of cookies, but deferred them to a future ePrivacy Regulation. We recommend that affected businesses ask regulators to provide clear and practical guidance on how to implement legal requirements concretely, as these businesses are directly bound by data protection legislation.
Limitations of our analysis. Although our focus on intersecting privacy policy domains (see Section 4.2) restricts the size of our data set, this allows for a thorough comparison of changes over the enforcement of three crucial privacy regulations. The small set of domains for the CCPA/CPRA corpus is, besides the aforementioned fluctuations in the used Tranco lists, caused by our February 2021 data collection, in which only the top 10K domains of the Tranco list were visited, in contrast to the top 100K domains for all other website crawls. Moreover, we are aware of the data gap in our CCPA/CPRA corpus for the second half of 2021 and 2022. Filling data gaps from archives like the Internet Archive's Wayback Machine was not feasible, as we collected the homepages and privacy policies using servers in California to simulate the location of Californian residents. However, the analyses have shown clear longitudinal results and trends for the privacy policies of consistently present domains, and we do not expect the privacy policies of domains only occasionally appearing on the Tranco lists during this data gap to have a significant influence on our results.
Code availability. During the longitudinal analysis of our corpora, we developed customized code and expanded existing libraries to perform our analysis. To enable the privacy policy research community to perform similar analyses for the enforcement of future privacy regulations, we make our code available on GitHub.

CONCLUSION & FUTURE WORK
In this work, we analyzed how the enforcement of the GDPR and CCPA and the CPRA taking effect influenced the language of privacy policies. We conducted text and topic modeling analyses based on modern linguistic standards, providing more details on their effect sizes in privacy policies while confirming previous findings that the enforcement of the GDPR was reflected in the privacy policies around the time it came into effect.
Our findings indicate that, for the CCPA, significant changes in the texts and topics of privacy policies mainly occurred after the CPRA had become effective on January 1, 2023. Earlier, we had observed widely differing wordings for the "Do Not Sell (or Share) My Personal Information" link, while over time the wording mandated by the CCPA/CPRA had become more widespread. This illustrates that, even when laws clearly state very specific requirements, companies can find it difficult to implement them in practice, which points to a need for regulators to provide further guidance. The topic modeling over time showed a gradually rising and concerning trend in the usage of "legitimate interests" since 2021 as the legal basis of data processing in privacy policies.
At the time of writing, the CPRA (California Privacy Rights Act) is due to be enforced in 2024.Other US states that have started to follow suit and passed their own state privacy laws that will soon become effective include Utah, Colorado, Connecticut, Virginia, Iowa, Indiana, Tennessee, Montana, and Texas [60].We encourage privacy researchers to draw inspiration from our work and collect longitudinal data to observe the effect of these legislations on companies' privacy practices, including their websites and privacy policies, and observe how they affect privacy policy language.

A REGULAR EXPRESSIONS TO DETECT "DO NOT SELL" LINK VARIANTS
The following listing shows the regular expressions that we used to search for "Do Not Sell or Share" links on homepages and in privacy statements.
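As a rough illustration of this kind of pattern matching, a minimal Python sketch might look as follows. It is not the set of expressions used in the study: the pattern `DO_NOT_SELL` and the helper `find_link_wordings` are hypothetical and cover only a handful of plausible English wordings.

```python
import re

# Hypothetical pattern for illustration only; it does not reproduce the
# expressions used in the paper and covers just a few English variants.
DO_NOT_SELL = re.compile(
    r"""
    \b(?:do\s+not|don['’]t)\s+sell      # Do Not Sell / Don't Sell
    (?:\s*(?:or|and|/)\s*share)?        # optional: or Share (CPRA wording)
    \s+my\s+
    (?:personal\s+)?                    # the word Personal is sometimes dropped
    (?:information|info|data)\b
    """,
    re.IGNORECASE | re.VERBOSE,
)


def find_link_wordings(anchor_texts):
    """Return the anchor texts that look like a Do Not Sell link."""
    hits = []
    for text in anchor_texts:
        normalized = " ".join(text.split())  # collapse runs of whitespace
        if DO_NOT_SELL.search(normalized):
            hits.append(normalized)
    return hits
```

Applied to the anchor texts extracted from a homepage, such a helper returns the matching wordings, which can then be counted per crawl to track how the variants evolve over time.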

B GDPR ARTICLES IN GERMAN PRIVACY POLICIES
The following table shows the keyness statistics for references to GDPR articles in the German GDPR corpus after the GDPR enforcement date compared to privacy policies before that date. BIC = Bayesian information criterion, LR = log ratio, PercDiff = percentage points difference, DiffC = difference coefficient, NF = normalized frequency per million in T (target corpus) and R (reference corpus).
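To make these abbreviations concrete, the following Python sketch computes the measures for a single item from raw frequencies. It assumes the definitions commonly used in corpus linguistics (a two-term log-likelihood with BIC approximated as LL minus the log of the combined corpus size, and a percentage difference relative to the reference corpus); the exact formulas behind the reported table may differ, and the function name `keyness` is hypothetical.

```python
import math


def keyness(freq_t, size_t, freq_r, size_r):
    """Keyness measures for one item in a target (T) vs. reference (R) corpus.

    Assumed definitions (not confirmed by the paper):
      NF       = frequency per million tokens
      LR       = log2(NF_T / NF_R)
      PercDiff = (NF_T - NF_R) * 100 / NF_R
      DiffC    = (NF_T - NF_R) / (NF_T + NF_R)
      BIC      = LL - ln(size_T + size_R), with a two-term log-likelihood LL
    """
    if freq_t <= 0 or freq_r <= 0:
        raise ValueError("zero frequencies need smoothing before taking logs")
    nf_t = freq_t / size_t * 1_000_000
    nf_r = freq_r / size_r * 1_000_000
    total = size_t + size_r
    # Expected frequencies if the item were equally common in both corpora.
    expected_t = size_t * (freq_t + freq_r) / total
    expected_r = size_r * (freq_t + freq_r) / total
    ll = 2 * (freq_t * math.log(freq_t / expected_t)
              + freq_r * math.log(freq_r / expected_r))
    return {
        "NF_T": nf_t,
        "NF_R": nf_r,
        "LR": math.log2(nf_t / nf_r),
        "PercDiff": (nf_t - nf_r) * 100 / nf_r,
        "DiffC": (nf_t - nf_r) / (nf_t + nf_r),
        "BIC": ll - math.log(total),
    }
```

Under these assumed definitions, a log ratio of 1 corresponds to a doubling of the normalized frequency in the target corpus relative to the reference corpus.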

C TOP INTERSECTING DOMAINS
The following table presents the top domains in our CCPA/CPRA corpus for both English and German that we found to be intersecting over all data collection time points, ranked according to the Tranco list.

G EXAMPLES OF PRIVACY POLICY PREPROCESSING
Tables 14 and 15 show examples of how we preprocessed the text of the privacy policies as described in Section 4.3, more concretely for the topic modeling analysis. In the English example below, company names and an email address were replaced with the placeholders "COMPANYNAME" and "REPLACEDEMAIL," respectively. In the German example on the next page, a URL was additionally replaced with "REPLACEDURL." The white spaces before and after the placeholders are inserted intentionally to prevent accidental concatenation of words with punctuation or placeholders. They do not cause any problems with text processing. The line breaks in the preprocessed text indicate the result of the TextTiling algorithm, i.e., the point where the text segment was split into two tiles. Lemmatization was not applied to the input texts of BERTopic, as this might have resulted in imprecise sentence-BERT embeddings. For the keyness analysis, the texts were afterwards lemmatized with the spaCy library and tokenized with the SoMaJo and SoMeWeTa libraries as described in Section 4.3.
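The placeholder substitution described above can be sketched in a few lines of Python. This is an assumption-laden illustration rather than the released code: the regular expressions, the function name `replace_placeholders`, and the way company names are supplied are hypothetical, while the placeholder strings follow the example in Table 14.

```python
import re

# Simplified, hypothetical patterns; real-world policies need broader coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL = re.compile(r"(?:https?://|www\.)\S*[\w/]")


def replace_placeholders(text, company_names):
    """Replace emails, URLs, and company names with padded placeholders."""
    # Emails first, so that the URL and company patterns cannot split them.
    text = EMAIL.sub(" REPLACEDEMAIL ", text)
    text = URL.sub(" REPLACEDURL ", text)
    # Longest names first, so that a short name does not cut a longer one.
    for name in sorted(company_names, key=len, reverse=True):
        pattern = rf"\b{re.escape(name)}\b"
        text = re.sub(pattern, " COMPANYNAME ", text, flags=re.IGNORECASE)
    return text
```

The surrounding spaces in each replacement string mirror the intentional padding mentioned above, so a placeholder never fuses with adjacent punctuation or words.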

H NOUN CHUNK BIGRAM DEPENDENCIES
In the following, we report the dependency bigrams that a) occur in the compared sub-corpora, b) are statistically significant in the target corpus (Fisher's exact test [23], p < 0.05 with Benjamini-Hochberg correction [22]), c) appear at least ten times per million in the target corpus, and d) have a minimum positive change in log Dice value of 1, equivalent to a doubling of co-occurrence exclusivity strength [67].

share: your personal information, user-generate content, your location, de-identify information, any personal data, non-personally identifiable information, video, personal information, any personal data, your usage activity, your photo, your data, your phone number, my personal information, non-personal information, any sensitive personal information, your name and mailing address, visitor' personal information, photo, certain device identifier, subscriber record, their contact list, hashed version

use: remarketing service, facial recognition, their own tracking technology, your payment card information, location-base service, local device storage, payment information, different technology, personal information, your postal mailing contact information, fully automate algorithm-base technology, web chat service, optional service, location information, professional or employment relate personal information, device's setting app, facial recognition technology, automate system, tracking technologies, digital service, unique identifier, location-base service, your content, research, usage data, http cookie, publicly available information, publisher network websites, authorized agent, certain information, visit information, sensitive personal information, functionality, previously collect information
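Two of the selection criteria in this appendix, the log Dice score and the Benjamini-Hochberg correction, can be sketched in plain Python as follows. This is a minimal illustration using the standard textbook definitions; the function names are hypothetical, and the p-values would in practice come from Fisher's exact test, which is not reimplemented here.

```python
import math


def log_dice(f_xy, f_x, f_y):
    """logDice = 14 + log2(2 * f_xy / (f_x + f_y)).

    f_xy is the co-occurrence frequency of head and dependent,
    f_x and f_y are their individual frequencies.
    """
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if it is significant under BH."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            significant[idx] = True
    return significant
```

Because the score is logarithmic to base 2, doubling the co-occurrence frequency while keeping the individual frequencies fixed raises the log Dice value by exactly 1, which is the threshold used in criterion d).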

K EXTENDED RESULTS OF THE KEYNESS ANALYSIS
For completeness and due to space constraints, we present more results of the keyness analysis (Section 5.2) in the following.

Figure 1: Overview of the used methods.

Figure 2: Topic trends in the English GDPR corpus.

Figure 3: Topic trends in the German GDPR corpus.

Figure 6: The distribution of companies by country in the world (left, N=153) and US state (right, N=489) whose websites' homepages or privacy policies contained "Do Not Sell" links or statements in early 2023.

Figure 9: Sample outputs of determining the number of topics for the English corpora.

Table 1: Number of unique privacy policies in each corpus.

Table 2: Corpus statistics on the number of privacy policies (PP), passages (Psg), and average (Avg) passages per policy. Only policies of intersecting domains over each corpus are listed.

Table 3: Privacy policies collected from 2,523 domains in the preliminary CCPA analysis, categorized by access location.

Table 4: Log ratio values of English phrases with statistically significant occurrence increase after GDPR enforcement. An extended version is included in Appendix K.

[...] the CCPA enforcement date of July 1, 2020. The log ratio values of the phrases opt_out_of_sale and right_to_opt indicate an increase of 26.1 and 20.4 percentage points, respectively, and are relatively small compared to the observed effect sizes after GDPR enforcement. The occurrence of californian_resident has increased by 14 percentage points, which is comparable to Wagner reporting a 20 percentage point increase in mentioning Californians [76]. In the German policies, we could not observe any terms with statistically significant changes after the CCPA enforcement date.

Table 5: Log ratio values of English phrases with statistically significant occurrence increase after CCPA enforcement.

Table 6: Keyness statistics for the phrases related to CCPA/CPRA consumer privacy rights in the English corpora.

Table 7: Number of topics identified by BERTopic in each corpus. The number of topics was set to auto and the minimum number of documents per topic to 20.

Table 8: Prevalence and evolution of the most common wordings for the "Do Not Sell" link on websites' homepages in the CCPA/CPRA corpus over time. The full table listing all discovered wordings can be found in Appendix D.

Table 9: Occurrence of legal references in German privacy policies after the GDPR enforcement date.

Table 10: Top-ranked intersecting privacy policy domains of the German and English CCPA/CPRA corpora based on the Tranco list from December 22, 2022 (ID: 82V9V).

Table 13: Ranking fluctuations in the top 115 domains of the Tranco lists used in this paper.

Table 14: Original and preprocessed text fragments of the English privacy policy of yahoo.com in February 2023.

Raw text as extracted by Boilerpipe:
Authorized Agent\nYou may use an authorized agent to submit a request to opt-out of sale, request to know, request to correct, or request to delete on your behalf. If you choose to use an authorized agent to exercise any such rights under the CCPA, you will need to provide the authorized agent written signed permission to act on your behalf. Please direct your authorized agent to email us at california_privacy@yahooinc.com where they will receive instructions on how to submit a request on your behalf.\nAuthorized Agent Request to Opt-Out of Sale\nUsers with Registered Accounts\nAuthorized agents submitting a request on a user\'s behalf to request opt-out of sale of the user\'s personal information must provide Yahoo evidence of the authorized agent\'s power of attorney or an authorization signed by the consumer showing the agent is authorized by the consumer to act on the consumer\'s behalf.\nUsers without Registered Accounts\nRequests to opt out of sale for non-registered users must come from the device on which the user wishes to opt out of sale. As a result, an authorized agent will need to submit the request to opt out of sale from the applicable device. Due to the nature of Yahoo\'s services, we are only able to opt a non-registered user out of sale if such user takes such action on the device on which the user wishes to opt out.\nAuthorized Agent Request to Know or Request to Delete\nIf you choose to use an authorized agent to exercise your request to know or request to delete on your behalf, Yahoo will require you to verify your identity directly with Yahoo and confirm directly with Yahoo that you provided the authorized agent permission to submit the request on your behalf.\nIf you have provided the authorized agent with power of attorney pursuant to California Probate Code sections 4121 to 4130, the above instructions do not apply. Prior to releasing any personal information to the authorized agent or honoring a deletion request, Yahoo will require verification from the authorized agent of power of attorney to act on your behalf.\nWe may deny a request from an authorized agent that does not submit proof that they have been authorized by you to act on your behalf.

Preprocessed text:
Authorized Agent You may use an authorized agent to submit a request to opt-out of sale, request to know, request to correct, or request to delete on your behalf. If you choose to use an authorized agent to exercise any such rights under the CCPA, you will need to provide the authorized agent written signed permission to act on your behalf. Please direct your authorized agent to email us at REPLACEDEMAIL where they will receive instructions on how to submit a request on your behalf. Authorized Agent Request to Opt-Out of Sale Users with Registered Accounts Authorized agents submitting a request on a user's behalf to request opt-out of sale of the user's personal information must provide COMPANYNAME evidence of the authorized agent's power of attorney or an authorization signed by the consumer showing the agent is authorized by the consumer to act on the consumer's behalf. Users without Registered Accounts Requests to opt out of sale for non-registered users must come from the device on which the user wishes to opt out of sale. As a result, an authorized agent will need to submit the request to opt out of sale from the applicable device. Due to the nature of COMPANYNAME's services, we are only able to opt a non-registered user out of sale if such user takes such action on the device on which the user wishes to opt out. Authorized Agent Request to Know or Request to Delete If you choose to use an authorized agent to exercise your request to know or request to delete on your behalf, COMPANYNAME will require you to verify your identity directly with COMPANYNAME and confirm directly with COMPANYNAME that you provided the authorized agent permission to submit the request on your behalf. If you have provided the authorized agent with power of attorney pursuant to California Probate Code sections 4121 to 4130, the above instructions do not apply. Prior to releasing any personal information to the authorized agent or honoring a deletion request, COMPANYNAME will require verification from the authorized agent of power of attorney to act on your behalf. We may deny a request from an authorized agent that does not submit proof that they have been authorized by you to act on your behalf.

Table 16: Sanitized lemmatized form of the noun chunk dependents with the highest increment in relative frequency per million in the post-GDPR corpus (after GDPR enforcement in May 2018).

[...]: image, sensitive data, search, image, voice data, financial transaction data, unique identifier, diagnostic data, account data, name and contact data, technical data, personal data, user data, any special category, your personal data, information and report website usage statistic, functionality cookie, web traffic data, billing information, uri address, certain data, traffic data, specific location, any sensitive information / special category, email address, limited business contact information, usage data, your address book and calendar meeting information, "email header" information

send: promotional communication, your activity history, specific instruction, sms, short snippet, your entire document, customer data, content, customer communication, marketing material, marketing email, your personal information, direct marketing communication, query, letter, service message, your personal data, third-party direct marketing communication, our newsletter, assessment, your contact detail, special signal, service communication, data subject access request, important service-relate message, renewal notice, important notice, your user information, some notification, account deletion request, singular email

share: your data, limited account, limited aggregated information, some de-identify data, your confidential information, snippet, all, non-public additional information, your personal data, certain data, personal data section, your feedback, video, aggregated data, certain amount, your email, user location data, aggregated insight, aggregated data, your contact detail, their name, member, your thought, payment partner, my data, agent, services, cookie

use: your personal data, your work, automate process, various tool, app, third-party service, automate system, contact information, browser-base cookie control, particular word, de-identify device, tracking protection, connected service, browser, query, your keyboard, your voice data, control, personalization, global navigation satellite system, purchase, your calendar, third-party app, product, performance, account, web beacon, similar technology, tool, device, our products, our data request feature, global navigation satellite systems, content

Table 17: Sanitized lemmatized form of the noun chunk dependents with the highest increment in relative frequency per million in privacy policies after CCPA enforcement in July 2020.

Table 18: Sanitized lemmatized form of the noun chunk dependents with the highest increment in relative frequency per million in privacy policies in February 2021 compared to the pre-CCPA corpus (enforcement in July 2020).

[...]: your direct input, access information, user information, follow required diagnostic data, follow additional information, california consumer privacy act, persistent identifier, diagnostic data, customer information, text, sample, contact and payment data, require diagnostic data, financial transaction data, installation date, performance, usage data, device and usage data, image, public information, search, certain category, personal data, child's online contact information, de-identify location data, "email header" information, motion activity, hardware capability, unique cookie id, usage and system operation data, any special category, additional website usage data

disable: other analytic tool, access, your account, ability, app's use, connect service, personalized feature, certain type, camera app's access, local device storage, interest-base ad, any user identification code, option, syncing, your access, feature, notification, display, automatic content recognition, see ad

enable: archiving, web application, optional diagnostic data, functionality, your web experience, share usage data, inclusion, virtual reality experience, certain account feature, tool, bulk consent feature, motion, fast loading, restore, other people, market research, voice command, mixed reality experience

personalize: advertising, publisher site, your feed or job recommendation, child

reject: cookies, change

send: your activity history, service announcement, your voice data, periodic promotional or informational email, diagnostic data, invitation, e-mail address, unique browser id, informational message, specific instruction, unsolicited commercial email, link, notice, direct message, informational communication, feedback, information request, connection request, spam, service notification, text message, report, location data, unsolicited email, notification, log, your search query, copy, error report, information, relate communication, instruction

share: result, account data, your profile data, your location, anything, your account, some de-identify data, all category, your e-mail address, your own list, non-personally identifiable information, common behavior, optional diagnostic data, all, you control, relevant data, video, common attribute, insight, publisher content, those, offer, relevant third party, limited, aggregated information, link, your phone screen, identification, common component, service provider, your phone number, your private personal data, non-personal information, our service, any sensitive personal information, someone else's creative content

use: your contact, automate process, multiple account, payment instrument, facial recognition, laptop, clear browse history, appropriate safeguard, your payment card information, local device storage, payment information, device-base recognition, your postal mailing contact information, web form, publishers' site, your tweet, subscription services, voice and text data, professional or employment relate personal information, device's setting app, automatic scan technology, opt-out tool, facial recognition technology, member' data, your keyboard, automate system, first name, digital service, error report, unique identifier, account, device-base speech recognition, log data

Table 19: Sanitized lemmatized form of the noun chunk dependents with the highest increment in relative frequency per million in privacy policies in January and February 2023, after the CPRA took effect, compared to the pre-CCPA corpus (enforcement in July 2020).

[...]: age, your direct input, process, certain personal data, geolocation data, california consumer privacy act, precise location data, persistent identifier, diagnostic data, identifier, payment and billing information, contact data, anonymous usage data, sensitive personal information, device identifier, purpose, website, commercial information, child's online contact information, metadata, unique cookie id, limited information, traffic, device information, business or commercial purpose, billing, category

disable: feature, option, app's use, connect service, personalized feature, your account, local device storage, syncing, notification, camera app's access, some cookie, automatic content recognition, functional cookie

enable: web application, secure login, gps feature, our advanced security setting, functionality, purpose, website, virtual reality experience, basic feature, collection, your continue use, employment and education data, fast loading, our user, marketing use, interest-base content, contact, feature, other people, specific functionality, service, voice chat, market research, gps location-base service, voice command, mixed reality experience

personalize: our communication, information, your use, child

reject: cookies

send: personalized offer, your voice data, any personal data, newsletter, invitation, advertising, electronic communication, informational message, link, service-relate message, email marketing, informational communication, transactional and administrative email, important update, marketing and promotional communication, service-relate communication, event invitation, spam, text message, periodic email, administrative email, transactional message, unsolicited email, notification, transactional email, compliance, commercial email

Table 20: Log ratio (LR) and percentage difference (PercDiff) values of English phrases with a statistically significant increase in occurrence after GDPR enforcement.