Evolution of Composition, Readability, and Structure of Privacy Policies over Two Decades

Privacy policies outline data collection and sharing practices followed by an organization, together with choice and control measures available to users to manage the process. However, users have often needed help reading and understanding such documents, even though they are written in a natural language. The fundamental problems with privacy policies persist despite advancements in privacy design, frameworks, and regulations. To identify the causes of privacy policies being persistently challenging to comprehend, it is vital to investigate historical policy patterns and understand the evolution of privacy policies concerning information packaging and presentation. To this end, we create a sentence-level classifier to conduct a large-scale longitudinal analysis on different privacy policies from 130,604 organizations, totaling approximately one million policies from 1997 to 2019. We annotate 10,717 sentences from 115 policies in the OPP-115 corpus to implement the classifier and then use those annotations to train the XLNet and BERT classifiers. Results from our analysis reveal that specific data practice categories experience more frequent policy changes than others, making it challenging to track relevant information over time. In addition, we discover that every category has distinct composition, readability, and structural issues, which are exacerbated when categories frequently co-occur in a document. Based on our observations, we provide recommendations for policy articulation and revision to make privacy policy documents conform to better coherence and structure.


INTRODUCTION
Privacy policies are legal documents that communicate practices relating to consumer data collection, management, use, and sharing. They serve as the primary means to inform users about the data collected from them and the controls that are provided to manage this process while being compliant with associated rules and standards [19]. However, despite being written in a natural language, there are several obstacles with privacy policies that prevent the general public from effectively utilizing them to make informed privacy-related decisions. Along these lines, the readability and clarity of privacy policies are significant concerns and are often introduced by how policies are written and structured. The average length of privacy policies is over 2500 words, and they are typically difficult to read and comprehend [43]. This makes users less likely to try to read or understand what is written in the policies.
In order to improve the utilization of privacy policies, recent advancements in automation, machine learning, and deep learning have led to the creation of tools that let users access the information contained in a privacy policy without requiring additional help from policy writers [55]. Classification of privacy policy texts is the most popular approach in this context, which enables users to selectively obtain a high-level overview of a policy in terms of pre-defined category labels [4,29,46,47,55,66]. The labels encompass presumed data practices of consumer importance, such as data collection, sharing, choice and control, regulatory conformance, security, and data retention, among others. However, while the automated classification of policies helps with readability and comprehension, the inherent issues with privacy policies persist. Prior studies have highlighted the challenges associated with readability [23,24,34,44,65], ambiguity [40,53,54], and accessibility [28,32,34] of information. The difficulties associated with information presentation have also changed over time, with frequent policy revisions required to reflect changes in data practices. Additionally, each category of information in a policy uses a different style of language and has its unique set of issues. Depending on how the categories are organized in a policy, these categorical descriptions, when combined, create the challenges and issues that privacy policies display.
Categorization of policy texts is yet to be used to investigate the particular problems related to each category, how these categories interact with one another, and how they have changed over time. Therefore, it is essential to look into historical policy patterns to identify the causes of privacy policies being persistently challenging to grasp. Thus, this study explores the following questions to examine how privacy policies have evolved over the last two decades.
• How has categorical information evolved in privacy policies over time?
• What is the general pattern for information coherence and organization within each category?
• How do different types of information in a policy interact with one another to communicate data practice information?
• What particular aspects of information gain prominence as privacy issues become more concerning?
We aim to address these questions by leveraging a sentence-level classifier that can individually categorize the sentences in a privacy policy. First, the classifier is applied to over one million policies, spread over 130,604 organizations, dating from 1997 to 2019. After that, we analyze each policy's categorical information to understand how privacy policy semantics and categorical structures have changed over time. We then identify the general trends in information organization and coherence within each category and whether they have made policies more accessible. Through this process, we contribute to expanding our understanding of the challenges in the domain of usable privacy policies, summarized in the following items.
(1) This work contributes to the advancement in knowledge of the semantic and categorical evolution of privacy policies, as determined by current trends in policy articulation and amendment. Our extensive longitudinal investigation from a categorical information standpoint reveals aspects of privacy policy trends.
(2) We identify practices in place that hinder effective notice and choice through analysis of the trend in readability, coherence, entropy, and inter-category relation over time, as introduced by policy revisions.
(3) Our results reveal category-specific problems and the inter-category relationships that compound categorical problems, thereby contributing to the complexity of privacy policies. We also assess the results within the purview of the General Data Protection Regulation (GDPR), which demonstrates that information completeness does not imply effective policy communication.
(4) We draw attention to the classification of privacy policies at the sentence level when examining privacy policies and exemplify how such a granular analysis can reveal crucial factors that influence the usability of a privacy policy.
Note that we analyze the largest existing privacy policy corpus with an automated approach to draw comprehensive evolution patterns. This yields a much closer representation of policy trends than a selectively chosen smaller corpus, which may not include all aspects of privacy policy articulation in existence over the years and therefore risks inaccurate representations. Preparing a corpus for qualitative study to obtain any statistically significant evolution patterns would mean manually annotating a significant portion of policies across different websites and multiple years. We believe quantitative analysis with a classifier is better suited for a task of such magnitude.
The remainder of the paper is organized as follows. Section 2 presents related work in the field. The methodology and evaluation metrics we use in the analysis are described in Section 3. Section 4 presents the observed tendencies in privacy policies, followed by a discussion in Section 5. Finally, we conclude in Section 7 after discussing limitations and future work in Section 6.

RELATED WORK

Privacy Policy Challenges
Privacy policies face accessibility, readability, comprehension, and ambiguity challenges from various stakeholder perspectives. Evaluation of privacy policies using empirical readability metrics reveals that the majority of the population is unable to comprehend privacy policies, which call for at least a college-level reading proficiency [23,24,34,44]. The number of different ways a policy can be interpreted makes privacy policies ambiguous. In order to give organizations more flexibility, policymakers frequently use ambiguity or vagueness [54]. Occasionally, even legal and policy professionals struggle to agree on a clear-cut policy composition [53,54]. Ambiguity is further introduced by the need to have complete information in these policies [13,40]. In addition, many users need help retrieving specific information from a policy [28], and sometimes even locating the policy on a website [30,32,34]. Most of these prior works point to critical challenges of privacy policies; however, they focused on analyzing a single snapshot of the policies. Our longitudinal analysis explores whether the challenges have been persistent over time.

Privacy Regulations and Recommendations
Several agencies have proposed methods and regulations to increase the transparency of data access practices in an organization [15]. For instance, the National Telecommunications and Information Administration offers guidelines for establishing policies for mobile applications [48], while the European Article 29 Working Party offers recommendations for IoT devices [7]. On the other hand, the General Data Protection Regulation (GDPR), which applies to data processing platforms in Europe, requires greater transparency in privacy policies [64]. In addition, the Federal Trade Commission (FTC) recommends adopting clear and straightforward privacy laws [16]. Other U.S. national security regulations, including the Video Privacy Protection Act (VPPA), the GLBA and HIPAA Privacy Rules governing finance and healthcare, respectively, the COPPA rule governing children's information, and others, all place restrictions on how privacy policies are written [50].
Even with these privacy regulations, data protection principles sometimes need to be clarified, and manual conformance verification can be error-prone [15,58,62]. Additionally, the design of regulations overlooks the cognitive frame of personal intrusion to comprehend privacy issues [45,50]. Solutions have been proposed to address such issues. However, these methods frequently experience a lack of acceptance, as is the case with P3P machine-readable privacy policies [2,8,17,18] and their extensions, alternative privacy formats [14,21,26,61], and graphical practice icons [20,27,35,51]. Thus, it is imperative to explore how the policies are designed and what we can learn from them from a bird's eye perspective for the various stakeholders.

NLP for Privacy Policies
Natural language processing (NLP) has been adopted as the preferred approach to extract pertinent privacy information from a policy document. NLP solutions work directly on policies currently present in most organizations without requiring additional cooperation from the organization. In the NLP application field for privacy policies, a variety of subjects are addressed, including information extraction [6,10–12,31,59], content summarization [69], automated question-answering [56], and information alignment [41,52]. Among the NLP-based solutions, the classification of privacy policy texts is the most researched sub-domain.
Classification in Privacy Policies. Since the feasibility of text classification in privacy policies was established by Ammar et al. [4], we have seen many studies aimed at enhancing policy classification models through the training and testing of various machine learning models [42,66,67]. Following the development of neural network and deep learning models, we have seen an increase in the use of these models in the privacy domain as well, to further advance the capabilities of segment (paragraph) categorization tools [29,46,47]. Although segment classifiers are used for most categorization in the privacy policy domain, there is no established method for segmentation. Additionally, segment classifications cannot adequately reflect the more precise information available at the sentence level [1]. However, few studies have looked at sentence classification. For the categorization of both segments and sentences, Liu et al. tested SVM, LR, and CNN models [42]. Other researchers have used sentence categorization to perform specific tasks, including determining whether a sentence describes a user choice instance [9,36,57,60]. Although NLP-based methods like categorization aim to make information in current policies accessible, they may also aid in deconstructing privacy policies which collectively pose challenges to users. Thus, we analyze how privacy policy material has changed over time with respect to composition, readability, and structural changes.

METHOD
A privacy policy comprises categorical information, with each category conveying a specific type of information. We begin our study by implementing a sentence-level classifier that can identify the data practice category contained in each policy sentence. Compared to a paragraph-level classifier, a sentence-level classifier can provide a better overview of the information organization in a policy, especially when a paragraph can contain a mix of information [1].

Sentence Classification
3.1.1 Training Data. We use sentences from policies in the OPP-115 corpus to implement a sentence-level classifier. OPP-115 is a corpus of 115 website policies and has 12 high-level data practice categories annotated by legal experts [66]. The annotation scheme and tools were created after carefully considering labeling techniques for policy segments that crowd workers could use to produce in-depth policy annotations [67]. The high-level categories are "First-Party Collection/Use (FPCU)," "Third-Party Sharing/Collection (TPSC)," "User Choice/Control (UCC)," "User Access, Edit, and Deletion (UAED)," "Data Retention (DR)," "Data Security (DS)," "Policy Change (PC)," "Do Not Track (DNT)," "International and Specific Audiences (ISA)," and "Other." "Introductory/Generic (IG)," "Practice Not Covered (PNC)," and "Privacy Contact Information (PCI)" are the three subcategories that make up the "Other" category. "Practice Not Covered" refers to ambiguous descriptions that cannot be confidently tagged with any other category. Please refer to Appendix A for brief descriptions of these categories.
Since the OPP-115 corpus has only segment (paragraph) annotations and attribute annotations for partial sentences in a segment, we manually annotated the 10,717 sentences in OPP-115 with the 12 established high-level categories for segments. Table B1 lists the frequency of sentences in each category. The sentences were annotated by a trained qualitative researcher, and any discrepancies were resolved by another annotator. For annotating the sentences, we did exclusive coding with single-label categorization. Privacy policies are consumer facing and our goal is to analyze the effectiveness of these policies for users; thus, our coding is not done from a legal perspective.
3.1.2 Learning Models: BERT and XLNet. We selected BERT and XLNet for training and evaluation to decide on our final sentence-level classifier. In recent studies, BERT and XLNet have both surpassed previously benchmarked CNN-based models such as Polisis [29] in policy classification [1,46,47]. Additionally, pre-trained models for BERT and XLNet can be fine-tuned with the downstream task, such as training a custom word embedding. BERT is a transformer-based deep learning model that can accurately capture the contextual relationships between words and subwords [22,63]. An encoder reads the text input for the transformers, and then the task output is predicted by a decoder. BERT is by nature bidirectional in terms of encapsulating contexts since the entire string of words is read at once, capturing both the left and right context of each word. XLNet is also a transformer-based model that gathers context from both forward and backward directions but also considers all permutations of a sequence of tokens [68].

Training and Evaluation.
We train BERT and XLNet classifiers using the FastBert package¹. The training computer contained an Intel(R) Xeon(R) E5-1620v4 3.50GHz CPU, 8GB of RAM, and an NVIDIA RTX A5000 GPU. Transformer-based deep learning requires sufficient CUDA cores to train the models and to produce predictions. We used the 8,192 CUDA cores and 256 Tensor cores of the NVIDIA RTX A5000 GPU for the training. We modified the FastBert learner to utilize mixed precision training, which significantly boosts computational efficiency by using half precision (16-bit) for most operations and higher precision (32-bit and 64-bit) in critical parts of the network. We followed the approach adopted by Mustapha et al. while configuring BERT and XLNet for training, with batch sizes of 8, a default learning rate of 10^-3, and a total of 5 training epochs [47]. A single training epoch in XLNet took 5.8 minutes on our hardware, whereas a single training epoch in BERT took 3.2 minutes.
We utilized a standard 10 × 9 nested cross-validation approach with a 9:1 train/test split for method selection. A 10 × 9 nested cross-validation utilizes 10 different train/test splits of the data set. Each training set is then used to perform a 9-fold cross-validation, and the best model (determined by a loss metric) gets evaluated on the test set. This gives us 10 estimates of the method's performance, which we average to determine a final value. The average precision (Pr), recall (Re), and F1-score (F1) of the two methods are shown in Table 1. Given that we trained 90 models for each method, the nested cross-validation for XLNet took ≈44 hours and for BERT ≈24 hours. While BERT has better precision than XLNet in a few categories, including "Data Retention," "Data Security," and "Do Not Track," the micro- and macro-averages demonstrate that XLNet marginally outperforms the BERT classifier. As a result, we selected XLNet as the method of choice for our sentence classifier. Finally, we trained a new instance of XLNet using all the annotated data and subsequently used this model for sentence-level classification. We observe that categories such as "Introductory/Generic" and "Data Retention" have far fewer instances in the corpus. When used with nested cross-validation, the examples become more sparse in the training data. This can lead to poor performance of a model for low-frequency categories. As a result, observations relating to these categories may have a larger margin of error.
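The nested cross-validation procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the training code used in the study: the function names (`nested_cv`, `k_fold_indices`) and the callbacks `train_fn`, `loss_fn`, and `score_fn` are hypothetical placeholders, and a trivial majority-label "model" stands in for BERT or XLNet.

```python
import random
from statistics import mean


def k_fold_indices(indices, k):
    """Split a list of indices into k disjoint folds of nearly equal size."""
    return [indices[i::k] for i in range(k)]


def nested_cv(n_samples, train_fn, loss_fn, score_fn, outer_k=10, inner_k=9, seed=0):
    """Run an outer_k x inner_k nested cross-validation.

    train_fn(train_idx) -> model
    loss_fn(model, val_idx) -> float (lower is better; used for model selection)
    score_fn(model, test_idx) -> float (reported performance)
    Returns (mean outer score, number of models trained).
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    outer_folds = k_fold_indices(indices, outer_k)

    outer_scores, models_trained = [], 0
    for o, test_idx in enumerate(outer_folds):
        # All other outer folds form the training portion (a 9:1 split for outer_k=10).
        outer_train = [i for f, fold in enumerate(outer_folds) if f != o for i in fold]
        inner_folds = k_fold_indices(outer_train, inner_k)

        best_model, best_loss = None, float("inf")
        for v, val_idx in enumerate(inner_folds):
            inner_train = [i for f, fold in enumerate(inner_folds) if f != v for i in fold]
            model = train_fn(inner_train)
            models_trained += 1
            loss = loss_fn(model, val_idx)
            if loss < best_loss:
                best_model, best_loss = model, loss

        # Only the best inner model is evaluated on the held-out outer fold.
        outer_scores.append(score_fn(best_model, test_idx))
    return mean(outer_scores), models_trained


# Toy stand-in for a classifier: always predict the majority training label.
labels = [i % 2 for i in range(200)]


def train_majority(train_idx):
    ones = sum(labels[i] for i in train_idx)
    return 1 if ones * 2 >= len(train_idx) else 0


def error_rate(model, idx):
    return sum(labels[i] != model for i in idx) / len(idx)


def accuracy(model, idx):
    return 1.0 - error_rate(model, idx)


avg_score, n_models = nested_cv(len(labels), train_majority, error_rate, accuracy)
```

With the default 10 outer and 9 inner folds, the sketch trains 90 models, which matches the count reported for each method.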
Our evaluation reveals that the classifier has high precision and recall for most categories. Table 1 shows a detailed breakdown of the classifier's performance with respect to different categories in the test data. For some categories such as "Privacy Contact Information," the classifier shows relatively lower performance in comparison to other categories. The lower performance in classification may be attributed to the fact that when linguistic characteristics pertaining to a specific category overlap with other categories, a classifier has difficulties resolving label ambiguities and creates misclassifications. Nonetheless, since the final XLNet model is trained on the entire corpus, we expect its performance to be "slightly" better.
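For readers unfamiliar with the per-category, micro-averaged, and macro-averaged metrics referred to in this section, the following sketch shows one common way to compute them for single-label predictions. The function names and the toy labels are illustrative assumptions; macro F1 is taken here as the unweighted mean of per-category F1 scores, which may differ from the convention used to produce Table 1.

```python
from collections import Counter


def per_class_counts(gold, pred):
    """Count true positives, false positives, and false negatives per label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed
    return tp, fp, fn


def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_micro(gold, pred):
    """Return (per-class metrics, macro average, micro average)."""
    tp, fp, fn = per_class_counts(gold, pred)
    classes = sorted(set(gold) | set(pred))
    per_class = {c: prf(tp[c], fp[c], fn[c]) for c in classes}
    # Macro: every category counts equally, regardless of its frequency.
    macro = tuple(sum(m[i] for m in per_class.values()) / len(classes) for i in range(3))
    # Micro: pool the counts, so frequent categories dominate.
    micro = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return per_class, macro, micro


gold = ["FPCU", "FPCU", "FPCU", "TPSC", "TPSC", "DR"]
pred = ["FPCU", "FPCU", "TPSC", "TPSC", "TPSC", "FPCU"]
per_class, macro, micro = macro_micro(gold, pred)
```

The toy example also illustrates why low-frequency categories such as "Data Retention" weigh heavily on the macro-average: a single missed sentence drives that category's scores to zero.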

Privacy Policy Evaluation
With the implemented XLNet sentence classifier, we investigate the evolution of content in privacy policies regarding categorical composition, readability, and structural changes over revisions.

Data for Analysis.
In our study, we analyze the Princeton Privacy Crawl (PPCrawl) corpus, which was compiled by Amos et al. using a crawler that locates, downloads, and extracts historical privacy policies from the Internet Archive's Wayback Machine [5]. PPCrawl² is a repository of 1,071,488 English language privacy policies from 130,604 different websites, organized by policy date and website Alexa rank. PPCrawl contains policies from 1997 to 2019, although the collection does not contain a policy version for every company in every year. Due to this, PPCrawl's number of policies varies between the years; the number of policies from each continent for each year is shown in Figure 1.
We first split the 1,071,488 policies into sentences using the sentence tokenizer in NLTK and then use our XLNet classifier to categorize the sentences of each policy. Due to the data's magnitude, processing and categorizing the entire corpus took about 22 days. We performed a small verification exercise to observe the quality of the XLNet predictions on this unseen data. For this, we verified the correctness of the predictions for each sentence (1,858 sentences in total) in the latest available policies of the top-10 Alexa-ranked websites in PPCrawl, as well as on 2,000 randomly sampled sentences. We observed a macro-average precision and recall of 0.89 and 0.92 respectively in the former, and 0.93 and 0.95 respectively in the latter. Categories such as "Introductory/Generic" and "Data Retention" have F1-scores above 0.90 in both instances. Tables B2 and B3 list the detailed metrics from these exercises.
We treat unavailable policies for an organization in a given year as missing data and ignore those instances when computing statistics for the year.

Composition.
We calculate the proportion of each category in each policy in order to study the categorical composition of privacy policies over time. Additionally, we calculated the length of each sentence and established the typical sentence length for each policy category. To better understand ignored categories and their effects on a policy's construction, we also examine the fraction of policies in a year with missing categories.
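The three composition measures can be illustrated with a short sketch. It assumes that a classified policy is represented as a list of (category, sentence) pairs and that sentence length is measured in whitespace-separated words; both are assumptions made for the example, as are the function names and the toy policies.

```python
from collections import Counter, defaultdict


def composition(policy):
    """Share of a policy's sentences that falls into each category."""
    counts = Counter(category for category, _ in policy)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def mean_sentence_length(policy):
    """Average sentence length (in words, by assumption) per category."""
    lengths = defaultdict(list)
    for category, sentence in policy:
        lengths[category].append(len(sentence.split()))
    return {category: sum(v) / len(v) for category, v in lengths.items()}


def missing_fraction(policies, categories):
    """Fraction of policies in which each category does not appear at all."""
    n = len(policies)
    return {
        c: sum(1 for p in policies if all(category != c for category, _ in p)) / n
        for c in categories
    }


policy_a = [
    ("FPCU", "We collect your email address."),
    ("FPCU", "We use cookies."),
    ("DS", "Data is encrypted in transit."),
    ("UCC", "You may opt out at any time."),
]
policy_b = [
    ("FPCU", "We collect usage data."),
    ("TPSC", "We share data with advertising partners."),
]

shares = composition(policy_a)
lengths = mean_sentence_length(policy_a)
missing = missing_fraction([policy_a, policy_b], ["FPCU", "DS", "DR"])
```

Applied to all policies available in a given year, `missing_fraction` gives the per-year share of policies that omit a category.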

Semantic Change with WMD.
Semantic change describes how a text's meaning changes over time. Privacy policies may introduce a semantic change in the event of a revision. We detect the semantic change and quantify the extent of the change using Word Mover's Distance (WMD). WMD measures the disparity between two text documents as the smallest distance in a vector space between embedded words in one document and embedded words in the other [37]. We embed a policy's text using Polisis's [29] privacy-specific word embedding rather than XLNet's contextualized word embedding. Depending on the context of a word's appearance, contextualized word embedding could have a different vector for the same word. Since it is probable that in policy reform, a simple restructuring of language without any genuine change in meaning can occur, this will lead to a different embedding for the same information and create a positive WMD (indicating false semantic change). Thus, we employ a static word embedding specific to privacy policies, which will always have the same embedding for the same information.
In the analysis, we determine the semantic differences between the current policy of an organization and the next newer version of that organization's policy that is available in PPCrawl. These differences are computed between texts belonging to the same category to assess the changes at a categorical level.
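To make the role of a static embedding concrete, the sketch below computes the relaxed Word Mover's Distance, a cheaper lower bound on WMD in which each word simply moves to its nearest counterpart in the other document. It is a swapped-in simplification for illustration, not the exact optimal-transport WMD used in the analysis, and the two-dimensional `TOY_EMBEDDING` is invented for the example rather than taken from the Polisis embedding.

```python
import math
from collections import Counter


def nbow(tokens, embedding):
    """Normalized bag-of-words weights over tokens that have an embedding."""
    counts = Counter(t for t in tokens if t in embedding)
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def relaxed_wmd(tokens_a, tokens_b, embedding):
    """Relaxed WMD: a lower bound on the exact Word Mover's Distance."""
    a, b = nbow(tokens_a, embedding), nbow(tokens_b, embedding)
    if not a or not b:
        raise ValueError("both documents need at least one embedded word")

    def one_sided(src, dst):
        # Each source word sends all of its mass to the closest target word.
        return sum(
            weight * min(euclidean(embedding[s], embedding[d]) for d in dst)
            for s, weight in src.items()
        )

    return max(one_sided(a, b), one_sided(b, a))


# Hypothetical static embedding: one fixed vector per word.
TOY_EMBEDDING = {
    "collect": (1.0, 0.0),
    "share": (-1.0, 0.0),
    "location": (0.0, 1.0),
    "data": (0.0, 0.5),
}

old = ["we", "collect", "location", "data"]
reworded = ["location", "data", "we", "collect"]  # same words, new order
changed = ["we", "share", "location", "data"]     # different practice

distance_reworded = relaxed_wmd(old, reworded, TOY_EMBEDDING)
distance_changed = relaxed_wmd(old, changed, TOY_EMBEDDING)
```

Because every word has a single fixed vector, reordering the same words gives a distance of zero, while replacing "collect" with "share" gives a positive distance. A contextualized embedding would not guarantee the first property.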

Flesch Reading Ease.
We rate each policy's readability for each category using the Flesch reading ease score. A Flesch reading ease score of more than 90 is considered "very easy" to read, while numbers below 30 imply "very confusing" texts (see Table B4 for intermediate levels). The score is computed from the average sentence length (words per sentence) and the average number of syllables per word of a text. The Flesch reading ease score has been used in the past for readability assessment of entire privacy policies [23,24,34,44]. However, it is possible that the wording used in some specific categories reduces the readability score of the entire document. In order to ascertain the readability trend for each category, we compute the score separately for each category.
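The standard Flesch reading ease formula is 206.835 − 1.015 × (words / sentences) − 84.6 × (syllables / words). The sketch below applies it per category. The syllable counter is a rough vowel-group heuristic chosen for brevity, so its scores will deviate from those of dictionary-based tools; the function names and sample sentences are likewise illustrative assumptions.

```python
import re


def count_syllables(word):
    """Heuristic syllable count: vowel groups, minus a silent trailing 'e'."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)


def flesch_reading_ease(sentences):
    """Flesch reading ease for a list of sentences."""
    words = [w for s in sentences for w in re.findall(r"[A-Za-z']+", s)]
    if not sentences or not words:
        raise ValueError("need at least one sentence containing words")
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


def readability_by_category(policy):
    """Score each category separately from (category, sentence) pairs."""
    grouped = {}
    for category, sentence in policy:
        grouped.setdefault(category, []).append(sentence)
    return {c: flesch_reading_ease(s) for c, s in grouped.items()}


sample_policy = [
    ("UCC", "You can opt out."),
    ("UCC", "We use your data."),
    ("DS", "Notwithstanding the aforementioned provisions, personally "
           "identifiable information may be disclosed to affiliated organizations."),
]
scores = readability_by_category(sample_policy)
```

In this toy policy the short "User Choice/Control" sentences score far higher (easier) than the single long "Data Security" sentence, which is the kind of per-category difference a whole-document score would hide.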

Altieri's Spatial Entropy.
A key component of information discourse is the categorical organization of data; better organization of categories in a policy leads to better information accessibility. Therefore, less information uncertainty exists when similar categorical information is collated in a policy, i.e., when it is organized categorically. Traditionally, information uncertainty is computed using Shannon's entropy. Shannon's entropy is formally defined as the expected value of an information function that measures the amount of information about each category. However, the spatial locations of information are also significant when studying the degree of uncertainty in category placement. Shannon's entropy cannot recognize the significance of space when information uncertainty must be evaluated over a spatial region. In order to assess the uncertainty in the way that categories are organized in a policy, we use Altieri's spatial entropy [3]. Spatial entropy measures the distribution of categories across a document. For example, if other categories are interleaved between "User Choice/Control" practices, then all user choice and control descriptions are scattered between other descriptions, which will ultimately increase the entropy of the category (indicating disorganized or unaligned practice descriptions). Altieri's spatial entropy combines residual entropy and mutual information. While residual entropy measures the amount of information in one variable after the effect of another variable is taken into account, mutual information measures the information that two variables share. It is outside the scope of this paper to go into details about how the two values are calculated, but interested readers can refer to the original work [3]. We used the sentence number to specify where a specific piece of categorical information is located in a document. We computed the metric for each category using the SpatialEntropy library³.
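As a rough illustration of the decomposition, the sketch below treats the sentence index as a one-dimensional location, forms all unordered pairs of sentence categories (the variable Z), bins the pairs into distance classes (the variable W), and splits the entropy of Z into spatial mutual information and spatial residual entropy. It is a simplified reading of Altieri's framework written for this example; it does not reproduce the SpatialEntropy library's API, and the distance breaks and function names are assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def entropy(probs):
    """Shannon entropy (natural log) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def altieri_decomposition(labels, breaks):
    """Decompose the entropy of category pairs by distance class.

    labels: category of each sentence, in document order (at least two).
    breaks: upper bounds of the distance classes, e.g. [1, 5] gives
            (0, 1], (1, 5], and (5, infinity).
    Returns (H(Z), spatial mutual information, spatial residual entropy).
    """
    def distance_class(d):
        for k, upper in enumerate(breaks):
            if d <= upper:
                return k
        return len(breaks)

    pair_counts, by_class = Counter(), {}
    for (i, a), (j, b) in combinations(enumerate(labels), 2):
        z = tuple(sorted((a, b)))          # unordered category pair
        k = distance_class(j - i)          # distance in sentences
        pair_counts[z] += 1
        by_class.setdefault(k, Counter())[z] += 1

    total = sum(pair_counts.values())
    p_z = {z: c / total for z, c in pair_counts.items()}
    h_z = entropy(p_z.values())

    mutual_info = residual = 0.0
    for counts in by_class.values():
        n_k = sum(counts.values())
        p_w = n_k / total
        cond = {z: c / n_k for z, c in counts.items()}
        residual += p_w * entropy(cond.values())
        mutual_info += p_w * sum(p * math.log(p / p_z[z]) for z, p in cond.items())
    return h_z, mutual_info, residual


grouped = ["UCC"] * 3 + ["FPCU"] * 3
interleaved = ["UCC", "FPCU"] * 3

h_grouped, mi_grouped, res_grouped = altieri_decomposition(grouped, breaks=[1])
h_inter, mi_inter, res_inter = altieri_decomposition(interleaved, breaks=[1])
```

Both layouts contain the same sentences, so the non-spatial entropy H(Z) is identical; only the split between mutual information and residual entropy reflects where the sentences are placed, which is the property plain Shannon entropy cannot capture.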

Self-Attention Based Coherence.
Coherence in information is an essential aspect of information discourse. Making broad connections between various textual components is necessary for reading. A key evaluating factor is the consistency of a policy's various components. Unrelated sections would be found in a poorly-written policy, whereas relevant sections with closely related terminology would be found in a well-written policy. For example, a sentence such as "We collect location information from the user," followed by "The location information is used to recommend useful services in the locality," is less coherent than "We collect location information from the user to recommend useful services in the locality." Coherence in the language used thus determines the overall connection of information, as opposed to entropy, which determines the organization of content in a policy.
We employ Li et al.'s self-attention-based entity coherence evaluation metric to track long-distance relationships between words and produce a coherence score [39]. Word embeddings are first obtained from Stanford's openly available 50-dimensional GloVe embedding [49], and a vector with values between 0 and 1 related to each word's location is used as the position encoding. The input matrix to self-attention is then formed using the word embedding and position embedding to capture the associations between each pair of words throughout a policy. Finally, the connection between word pairs is represented as a series of word vectors and input into an LSTM (Long Short-Term Memory) neural network. A fully connected layer with a nonlinear activation function calculates the final coherence score.
We make use of the implementation of this technique in the LingFeat package [38]. However, instead of focusing on one category at a time, we compute the coherence score utilizing a complete policy. This is because only taking into account text from one category at a time leaves out content from other categories and misrepresents the coherence of a section of text.

Co-occurrence Matrix.
Our analysis also examines how one category relies on another to convey privacy-focused information. We examine each paragraph (segment) of each PPCrawl policy to understand the relationships between various categories. Typically, privacy policy segments consist of one or more categories. The co-occurrence counts of categories are calculated using the category of each sentence in a paragraph. For instance, we increase the count for the "First-Party Collection/Use" and "Introductory/Generic" category pair if statements from both categories are in the same paragraph. This gives us a category co-occurrence matrix for each of the policies. The evolution of the relationship between categories is then examined using these matrices.
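A minimal sketch of the counting step is given below. It assumes a policy is represented as a list of paragraphs, each a list of sentence categories, and that a category pair is counted once per paragraph no matter how many sentences of each category the paragraph contains; the text above does not specify this detail, so it is an assumption of the example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(paragraphs):
    """Count, per policy, how many paragraphs contain each category pair.

    paragraphs: list of paragraphs, each a list of sentence categories.
    Returns a Counter keyed by alphabetically ordered category pairs.
    """
    counts = Counter()
    for categories in paragraphs:
        # Deduplicate so a pair is counted at most once per paragraph.
        for pair in combinations(sorted(set(categories)), 2):
            counts[pair] += 1
    return counts


policy = [
    ["IG", "FPCU", "FPCU"],    # introduction mixed with collection practices
    ["FPCU", "TPSC", "UCC"],   # collection, sharing, and choice together
    ["DS"],                    # single-category paragraph adds no pairs
]
matrix = cooccurrence(policy)
```

The resulting counter is a sparse form of the symmetric co-occurrence matrix; `matrix[("FPCU", "IG")]` is the count for the "First-Party Collection/Use" and "Introductory/Generic" pair mentioned above.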

RESULTS
We computed the evaluation metrics discussed in Section 3.2 on the PPCrawl corpus and present here some trends and observations based on those metrics⁴. We discuss these observations further in Section 5.
[...] in 2019) is steadily declining. The second most under-represented category, "User Access, Edit, and Deletion," further highlights the lack of control users have over their data once it has been collected. In over 50% of the policies from 2000 to 2017, this category was left unaddressed. With the implementation of the GDPR's right-to-access obligation, we see that after 2017, more than 60% of websites address this category, increasing to more than 70% in 2019. The plot also reveals that ≈97% of websites cover "User Choice/Control" practices in 2019, illustrating that most websites communicate some form of control choices.

Semantic Change.
Semantic change refers to changes in the meaning of words in a sentence. In privacy policies, such changes can emerge during a policy revision. Any diff-based technique may be used to identify changes in sentence additions or deletions, as seen in [5]. To further understand the prevalence of semantic change, using the WMD metric, we examine the proportion of policies that alter their categorical content each year (Figure 5). Note that semantic change for a policy in a given year is computed with respect to the latest available prior version. We normalize these statistics by the number of policies available in the given year; that is, the proportion for a year is the number of changed policies divided by the number of policies available in that year. We also look at the distribution of this metric over the years (Figure 6).
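The normalization can be sketched as follows, assuming the corpus is held as a mapping from organization to its available policy versions by year. The function name, the toy data, and the equality-based `toy_distance` (a stand-in for WMD on one category's text) are hypothetical.

```python
def yearly_change_proportion(versions, distance):
    """Proportion of available policies per year that changed semantically.

    versions: {organization: {year: policy_text}}; years may be missing.
    distance: function(old_text, new_text) -> non-negative float, where a
              value above zero is treated as a semantic change.
    """
    changed, available = {}, {}
    for by_year in versions.values():
        previous = None
        for year in sorted(by_year):
            available[year] = available.get(year, 0) + 1
            # Compare against the latest prior version that is available,
            # even if it is more than one year older.
            if previous is not None and distance(previous, by_year[year]) > 0:
                changed[year] = changed.get(year, 0) + 1
            previous = by_year[year]
    return {year: changed.get(year, 0) / n for year, n in sorted(available.items())}


def toy_distance(old, new):
    """Stand-in for WMD: zero if the texts are identical, one otherwise."""
    return 0.0 if old == new else 1.0


versions = {
    "a.example": {2016: "v1", 2018: "v2", 2019: "v2"},  # no 2017 snapshot
    "b.example": {2018: "p1", 2019: "p2"},
}
proportions = yearly_change_proportion(versions, toy_distance)
```

Years in which an organization has no snapshot contribute to neither the numerator nor the denominator, in line with treating unavailable policies as missing data.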
The WMD values are relative and have no standard unit for reference; zero denotes no change, and relatively higher values denote more significant changes than earlier versions. Since PPCrawl does not always have a policy for a specific organization every year, to calculate the semantic change in a policy revision, we contrast the policy with the most recent version available before the said policy. Figure 5 shows that "User Choice/Control" is the category that experiences the second most frequent semantic change. In "User Choice/Control," semantic modifications are made to 26% of the policies on average; the most significant percentage was seen in 2018 when over 30% of the policies underwent semantic changes. Figure 6 illustrates the magnitude of these changes, which is relatively minimal except in 2018 and 2019. As a result, users interested in regulating access to their data should carefully revisit policies after revision since the opt-in or opt-out options (e.g., links) presented to them may change regularly with minor adjustments. Figure 5 also shows a high correlation (0.99) between alterations in "First-Party Collection/Use" and "Third-Party Sharing/Collection" statements. However, in contrast to first-party practices, third-party practices see relatively more semantic shifts (Figure 6). The interdependence between the two activities raises the risk of confusing data collection and sharing (with third parties) due to the lack of differentiation and the possibility of ambiguity between the two types of practices.
The least commonly changed category is "Data Retention"; however, compared to other category modifications, the semantic shift in data retention has a larger magnitude (as indicated by the IQR for "Data Retention" in Figure 6). This suggests that businesses may abruptly adjust their retention policy in a significant way. It is worth noting that data retention statements such as "Once it is no longer necessary for us to retain your personal information, we will dispose of it securely according to our data retention and deletion policies" (eBay 2018 policy) may introduce ambiguity as to what factors determine that personal information is no longer necessary. The second least commonly changed category is "User Access, Edit, and Deletion," and the magnitude of the change is second only to "Data Retention." "Introductory/Generic" statements are changed frequently yet have the smallest magnitude of change. This suggests that a category that is modified infrequently is likely to see larger changes when it is modified. An average of 11% of policies reporting changes to their data security methods did so with minor semantic alterations, except for adjustments brought on by the GDPR in 2018. On average, only 10% of the policies updated their data security methods each year from 2002 to 2018. Statements relating to "International and Specific Audiences" have consistently seen changes over the years, despite regulations governing them being introduced infrequently. This indicates a continual effort across multiple organizations to correctly parse regulatory text and satisfy the stated requirements.

Comprehensibility of Privacy Policies
Elements such as readability, information coherence, and organization influence how approachable a privacy policy is to a user, whether it is easy to understand and access, and ultimately whether users can keep track of a policy's points in context with related information required to comprehend the described practices.

Readability.
The readability of a privacy policy is among its most essential aspects. Figure 7 depicts the category-wise proportion of policies in a year that fall into different reading levels. For most policies, "First-Party Collection/Use" and "Third-Party Sharing/Collection" have a difficult readability level. However, since 2000, the percentage of confusing policies has steadily risen for both categories. "Data Security" practices have the highest percentage of confusing texts throughout the years, which may be due to the use of technical language in their descriptions. In contrast, the categories of "User Access, Edit, and Deletion" and "Do Not Track" indicate progress over the years, with a drop in the proportion

Controlled Modification of Policy Text.
We discovered that whenever a service provider expands its data practices, a corresponding description is also included in the policy. As features are developed, data practices change, sometimes leading to the removal of older features and the accompanying data practices [55]. Therefore, it is preferable to modify the current policy's language as little as possible to include the new practices and eliminate the outdated ones. Our investigation demonstrates that adding a new practice to a revision is indeed possible by changing only an existing sentence. For example, consider the two sentences from Yahoo's 2002 privacy policy: "For some financial products and services, we may also ask for your address, Social Security number, and information about your assets." and "We collect information about your transactions with us and with some of our business partners, including information about your use of financial products and services that we offer." These could have been merged into a single sentence such as, "We may also gather your address, Social Security number, and details about your assets with us and some of our business partners for certain financial products and services, along with transaction information and service usage." This might have averted the introduction of a new sentence in the policy revision and improved readability by removing the need for readers to piece together information from two statements in different places. We see two benefits of being disciplined in how a policy is modified. Firstly, the policy length will increase only minimally, limiting the time it takes to read the policy. Secondly, altering the current material will help remove outdated policy practices, which could not be eliminated by merely introducing new descriptions. As a result, outdated information is not accidentally shared.

Coherent Information.
While some policies show high levels of content coherence, the vast majority have low levels of coherence among the many statements written to support practices. Therefore, it is essential to rewrite policies to integrate relevant materials to create coherent explanations of practices. For instance, the two sentences from Facebook's 2017 policy, "We collect information from or about the computers, phones, or other devices where you install or access our Services, depending on the permissions you have granted" and "We may associate the information we collect from your different devices, which helps us provide consistent Services across your devices", are combined into a single sentence in Facebook's 2019 policy, "We collect information from and about the computers, phones, connected TVs and other web-connected devices you use that integrate with our Products, and we combine this information across different devices you use." The material was made more coherent by simply integrating the two linked statements.

Categorical Policy Issues
Examining the whole policy is a common way to determine if privacy policies are usable. However, our research findings indicate that each privacy category has a unique set of issues, resulting in poor notice and decision-making when combined.

User Control and Choice Consistency.
Our findings demonstrate that the most commonly modified policy feature is the user's control and choices. It is reasonable for such a change to happen in response to altered first- or third-party behavior. However, we see regular, modest changes in "User Choice/Control," which are separate from "First-Party Collection/Use" and "Third-Party Sharing/Collection." This suggests that while choice and control descriptions depend on the two categories, changes in those categories do not dictate a change in the user's choice and control. This disconnect between control and choice and the first- and third-party practices suggests that control and choice are not regarded as integral parts of the end-to-end process and are treated as secondary goals, resulting in policy revisions. The implementation of privacy compliance should be open, transparent, and planned across all employed processes [33]. Control and choice revisions may be minimized by "User Choice/Control" policies that are more consistently implemented and founded on approved data collection, usage, and sharing methods. One method for maintaining such consistency is to have a fixed link to a separate page that lists user choice, control, and policy, instead of frequently changing opt-in/out weblinks in the policy text [34]. In addition, having a static page that lists all the control links will avoid the need to consult a policy to find the most recent link.

Minimizing Introductory and Generic Statements.
Generic and introductory statements are frequently mixed with other categories, which affects the categorical organization of the policies. Additionally, the coherence among related descriptions can suffer if generic statements are inserted between related statements. Generic statements are meant to make the policies easier to use, but they may instead obfuscate the vital information provided by other categories. According to our analysis, "Introductory/Generic" statements change the most frequently, with minor changes in magnitude but rising entropy values over time. Therefore, it is necessary to reduce the use of introductory and generic statements to improve privacy policies. These statements exist only to facilitate the effective communication of other categories. However, if the other category statements are made to stand alone as complete statements, the need for "Introductory/Generic" statements can be reduced to a minimum.

Dissociation of Categories.
Our analysis demonstrates that privacy categories frequently have strong relationships with one another and frequently depend on one another to describe practice information. This method of articulation aims to set a particular category in context; however, it risks adding one category's problems to the description of a different category. For instance, adding a long first-party practice description as context for the choice or control may make the already challenging "User Choice/Control" statements even more difficult to find. As a result, the policy as a whole becomes increasingly difficult to access.
Furthermore, combining other categories also introduces ambiguity in the description, complicating policy usability. For example, the categories in a pair, such as "User Choice/Control" and "User Access, Edit, and Deletion," or "First-Party Collection/Use" and "Third-Party Sharing/Collection," represent distinct concepts yet become ambiguous in the description due to high correlation. It can be challenging to distinguish between two concepts when descriptions of the two concepts in a policy are highly co-occurring.

GDPR Impact
The percentage of policies missing information specific to a given category significantly decreased across all categories in the wake of the GDPR implementation in 2018. Nevertheless, even after 2018, the categories of "User Access, Edit, and Deletion" and "Data Retention," which directly align with the GDPR's "right of access," "right to rectification," "right to erasure," and "right to be informed about the retained data policy," continue to be the most neglected among all the categories. This implies that while GDPR has improved several policies, a significant number of policies still require attention to disclose information fully.
Despite GDPR's positive impact on the openness of privacy policy information disclosure, the organization of information within each category unfortunately deteriorated. Both the "Data Retention" and "User Access, Edit, and Deletion" categories saw a rise in entropy after 2019, a sign of increased disorder. Prior to this, the categories had a definite placement in a policy, despite the practice descriptions having less transparency. The readability of the categories also declined.
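To make the link between entropy and placement concrete, the sketch below computes Shannon entropy over the positions at which a category's sentences appear in a policy. The entropy measure used in this study is defined elsewhere in the paper; the equal-width position bins, the `bins` parameter, and the toy label sequences here are assumptions chosen only for illustration.

```python
import math
from collections import Counter


def positional_entropy(labels, category, bins=5):
    """Shannon entropy (in bits) of where `category` sentences fall.

    labels is the sequence of per-sentence category labels of one policy.
    The policy is split into `bins` equal-width position bins (assumption).
    """
    total_sentences = len(labels)
    positions = [i for i, label in enumerate(labels) if label == category]
    if not positions:
        return 0.0
    bin_counts = Counter(
        min(i * bins // total_sentences, bins - 1) for i in positions
    )
    count = len(positions)
    return -sum(
        (c / count) * math.log2(c / count) for c in bin_counts.values()
    )


# Hypothetical policies with ten labeled sentences each.
clustered = ["Other"] * 8 + ["Data Retention"] * 2
scattered = [
    "Data Retention", "Other", "Other", "Data Retention", "Other",
    "Other", "Data Retention", "Other", "Other", "Data Retention",
]
entropy_clustered = positional_entropy(clustered, "Data Retention")
entropy_scattered = positional_entropy(scattered, "Data Retention")
```

Under these assumptions, a category confined to one part of the policy yields zero entropy, while the same number of sentences spread across the document yields a higher value, which corresponds to the loss of a definite placement.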
"Data Retention" and "User Access, Edit, and Deletion" also co-occur alongside "User Choice/Control," "Data Security," "First-Party Collection/Use," "Third-Party Sharing/Collection," and "Introductory/Generic" sentences. This indicates that "Data Retention" and "User Access, Edit, and Deletion" lack a specific role concerning privacy practices and are often described in a nonstandardized context that introduces ambiguity in a policy. "User Choice/Control" sentences also saw a similar trend. People desire fine-grained control when disclosing their information [25]. Although the number of policies without "User Choice/Control," which was already relatively low before GDPR, did not change significantly due to GDPR, the organization and readability of these choice descriptions were adversely affected. In addition to this being the most frequently changed category, post-GDPR website policies made these descriptions even more challenging to comprehend.
The structure of each policy category has generally declined with the implementation of GDPR, making policy communications more disorganized. In addition, while the length of privacy policies rose dramatically as a result of GDPR, the readability and consistency of the material have remained the same. Consequently, the sole beneficial effect of GDPR was to offer users more information; nonetheless, the user may still need help finding this information.

LIMITATIONS AND FUTURE WORK
We presented results from our analysis of privacy policies from 1997 to 2019 in the PPCrawl corpus, spanning over 20 years. However, the lack of policies post-2019 is a limitation of this work. The availability of post-2019 policies will provide a better overview of how organizations continue to address regulations such as GDPR and whether efforts are underway to make privacy policies more approachable. Additionally, in practice, policies might vary considerably depending on the nature of the business. For instance, privacy policies communicating practices of a social media organization are articulated differently than privacy policies referring to banking or financial domains. We have not considered domain-specific analysis for this work. A business domain-specific selective examination of policies may also highlight characteristics that set different firms apart and reveal the problems and inclinations they are likely to face. From a methodological perspective, correctly identifying policy statements about low-frequency categories is challenging. While deep-learning methods such as XLNet have demonstrated potential in identifying most categories, alternative approaches such as ensemble modeling coupled with cost-sensitive learning may be required to tackle issues with low availability of examples in specific categories.

CONCLUSION
Privacy policies are the primary means of distributing information on privacy practices and notifications to consumers. This study provides the results of a large-scale, longitudinal, category-based analysis of privacy policies spanning more than 20 years. We provide a holistic overview of the problems with privacy policies at a categorical level and track their evolution using a sentence-level classifier (XLNet). The implemented classifier aided in analyzing the composition, semantics, and structure of privacy policies over time. While specific categories see more frequent changes than others, we saw an overall rise in the informational completeness of privacy policies, positively reinforcing the transparency of these policies. However, the frequent changes make it challenging to trace the modifications implemented over time.
Additionally, we found that each category has its own unique set of readability and structural issues, and these category-specific problems are exacerbated by inter-category dependencies. Finally, it is concerning to note that, even though the problems in these categories are getting worse, a policy's textual content has a continuously low degree of coherence, with little to no evidence of an effort to improve comprehensibility. We offer some suggestions to improve the state of each category, such as the dissociation of categories and minimization of generic sentences, intending to keep privacy policies more approachable. This study's findings can also help develop better policies by adopting category-specific articulation practices and adhering to practices that do not incrementally make policies more challenging as they undergo various revisions.

A DATA PRACTICE CATEGORIES
The 12 data practice categories used in the classification have the following generic meaning [66].