On the Quality of Privacy Policy Documents of Virtual Personal Assistant Applications

The app ecosystem built around virtual personal assistant (VPA) services has flourished in recent years. In response to increasingly stringent data protection regulations, VPA service providers require application developers to include a privacy policy that declares their data handling practices. These privacy policies serve as the de facto agreement between developers and users, and may be taken as the basis for resolving conflicts in the event of a data breach. Therefore, it is essential that privacy policy documents are crafted in a clear, easy-to-understand, and unambiguous way. In this work, we conduct the first systematic study on the quality of privacy policies in the VPA app domain. We identify four metrics that make the quality of a privacy policy measurable: timeliness, availability, completeness, and readability. We then develop QuPer, which extracts the meta features (e.g., update history) and linguistic features (e.g., sentence semantics) from privacy policies, and assesses their quality. Our analysis reveals that the quality of privacy policies in the VPA app domain is concerning. For instance, only 1.17% of privacy policies completely cover all contents that are regarded as privacy concerns by legislation (e.g., GDPR Article 13) and relevant literature. Our findings are expected to raise an alert among VPA app developers and provide them with guidelines for creating high-quality privacy policy documents.


INTRODUCTION
AI (artificial intelligence)-backed virtual personal assistant (VPA) services, such as Amazon Alexa [8] and Google Assistant [32], have gained tremendous popularity in recent years. Centered around them, an ecosystem similar to the one among mobile applications, which has proven a big success in the last decade, is growing rapidly. The VPA services enable third-party developers to create VPA applications (or apps for short), e.g., skills in Amazon Alexa and actions in Google Assistant, and release them through app stores. This allows users to easily enable and use the apps through their smart speakers. According to a recent report from Statista [77], VPA services have reached a global user base of billions.
The openness of this ecosystem raises serious privacy concerns, though. Dishonest VPA apps can appear in the app store, and once installed, can collect sensitive user information, such as location, name, age, and gender. Although current VPA services require developers to declare the permissions [9] whenever their skills request access to personal data, some malicious apps can still bypass this and gather information at runtime, as revealed by recent studies [35,81]. These privacy threats not only put users at risk, but also pose substantial challenges for app developers and VPA service providers, particularly given that many countries and regions have enacted stringent data protection legislation, e.g., the well-known European Union (EU) General Data Protection Regulation (GDPR) [40]. Any privacy breach can result in significant penalties to the data controllers and processors. For example, on July 16, 2021, Amazon was fined 746 million euros by Luxembourg's National Data Protection Commission for failing to comply with GDPR in protecting its users' data [10].
VPA services have indeed taken steps to mitigate privacy concerns. Taking Amazon Alexa as an example, skill developers are required to release a privacy policy document that discloses how their skills handle user data, including access, collection, use, and sharing of user data [10]. Nonetheless, the enforcement of this requirement remains problematic. A recent study [80] shows that out of 65,195 Alexa skills collected, only 21,063 skills provide privacy policies. Even when a skill does provide one, its quality is worrisome. Additional studies [16,81,82] find that many of the available privacy policies are incomplete, contain inaccessible resources or technical terms, or use a language that does not match the skill's supported languages. These issues make it considerably harder for users to understand the privacy policies of the skills they use.
The privacy concerns of the VPA app ecosystem have also drawn considerable attention from the research community. A line of research has been dedicated to detecting the runtime information gathering behaviors of VPA apps [35,47,79]. Some recent studies [81,82] propose to examine the compliance between VPA apps' data handling behaviors and the statements in their privacy policies. However, two complementary questions remain open: what the common problems of VPA apps' privacy policy documents are, and how to guide VPA app developers in writing a high-quality privacy policy.
Our Work. In this work, we conduct the first systematic study to assess the quality of VPA apps' privacy policies. We first formulate a taxonomy to break down the quality of privacy policy documents into measurable metrics. To this end, we resort to two major data regulations, i.e., EU GDPR [40] and California Consumer Privacy Act (CCPA) [59], the standards working groups of European Data Protection Board (EDPB) that form privacy policy guidelines [27,28], and the literature on privacy policies in VPA and other domains like mobile apps and websites [45,50,51,57,71,74]. We summarize their quality concerns and propose four quality metrics, including timeliness, availability, compliance of disclosure, and readability.
Based on our taxonomy, we develop a framework named QuPer to assess the quality of the privacy policy, using machine learning and natural language processing (NLP) techniques. It automatically collects and synthesizes meta features (e.g., update history) and linguistic features (e.g., the semantics of sentences) of the given privacy policy. By analyzing these features, it can align the updates to the release and revision of an app (i.e., the timeliness), track the availability of the linked resources and required multilingual versions (i.e., the availability), assess the coverage of the contents required by legislation like GDPR Article 13 (i.e., the completeness), and evaluate its writing style (i.e., the readability).
We conduct a large-scale study with QuPer on all 65,195 available Amazon Alexa skills, the apps of the most popular VPA service, to understand the landscape of the quality of privacy policies among modern VPA apps. QuPer reveals that the current status of the privacy policy quality is concerning. Only 5,473 privacy policies are well formatted, and only 1.17% (64/5,473) have complete contents. The privacy policy documents refer to 200,183 external links, but 28,745 (14.3%) have become invalid. More than half of the analyzed privacy policies have issues that make them hard for users to read and understand.

Contributions. The main contributions of this work are as follows.
• Understanding the quality of the VPA app privacy policy. We conduct the first comprehensive study on the quality of the VPA privacy policy. Our work proposes four quality metrics to measure different aspects, from writing styles and temporal features to contents and semantics.
• A systematic approach to privacy policy quality measurement. We develop QuPer, which can automatically extract meta features and textual features of the privacy policy. It features a two-step document processing method that derives context-sensitive semantics from the sentences. This endows QuPer with the capability of inferring fine-grained information to determine content coverage and readability.
• Revealing the status quo of privacy policy quality in existing VPA apps. We study the landscape of privacy policy quality among Alexa skills. Our findings reveal that the current status of privacy policy quality remains concerning. Our work should raise an alert to VPA app developers, and encourage store operators to take actions for quality assurance. It can be extended for policy quality auditing in other domains.

A TAXONOMY OF QUALITY METRICS
To effectively assess the quality of privacy policies, it is necessary to establish a set of measurable metrics. In this section, we take the first step to construct a taxonomy of quality metrics. Since there is not a comprehensive list for both VPA and other domains, we resort to three sources to summarize the concerns on the privacy policy quality (Section 2.1), and based on them, we formulate the taxonomy for our assessment (Section 2.2).

Identifying Public Concerns on Privacy Policy Quality

2.1.1 Sources for identification. As user privacy protection becomes increasingly important, legislators, standard working groups, and the research community have made efforts to establish privacy policy guidelines or examine privacy-related documents. We thus turn to these sources to summarize their concerns regarding the quality of privacy policies, in order to create our taxonomy.
Data Regulations. We first turn to two major data regulations, i.e., the EU GDPR [40] and the CCPA [59].
Guidelines. Besides data regulations that mainly provide high-level principles, we also consider guidelines from relevant working groups. In particular, we look into those provided by Article 29 Working Party [28], which is a team made up of 28 national regulators from multiple European Union countries. This team is dedicated to the protection of individuals with regard to the processing of their personal data, and aims to promote data protection by enforcing GDPR. In 2018, it was reconstituted as the EDPB (European Data Protection Board), gaining more legal weight to push through decisions.
We review the Guidelines on transparency under Regulation 2016/679 document [27] from this working party, given that transparency is relevant to the GDPR principle of disclosing the data handling practices to the data owner. We mainly focus on Articles 12 and 13 of this document, which are under “Clear and plain language” (p. 8 in [27]), as they present the requirements for the writing of privacy policies. More specifically, Article 12 requires that the privacy policy should be in clear and plain language. The information provided in the privacy policy should be simple and easy to understand, and the use of complex sentences and grammar should be avoided. Article 13 proposes that language qualifiers such as “may”, “might”, “some”, “often”, and “possible” should be avoided in the privacy policy. It also requires that when the data controller is targeting data subjects in one or more languages, a set of privacy policies in these respective languages should be provided.
Literature. We refer to the literature to recognize quality metrics from the concerns of the research community. We start with Micheti et al. [57] since this is an early study on the recommendations for drafting privacy policies (with a specific target of young people). It proposes guidelines for privacy policy writing based on user studies and reveals the concerns from the perspective of users, to complement those summarized from legal documents. We then track other publications that cite it. This has yielded 13 publications on the impact of GDPR on privacy policy study [11,12,51,53,70,74], privacy policy corpora [6,61,71], and privacy policy language modeling [34,37,45,50].
2.1.2 Quality concerns. We review these collected materials, and collect the cases that are considered as quality concerns (QCs), as detailed below.
QC1. Noncompliance of disclosure. GDPR Article 13 [41] and Article 29 Working Party [28] clearly state the extent of information to be disclosed to the data subject. Two studies [48,51] particularly focus on the policies' contents in terms of the disclosure of personal data collection. They identify several issues of noncompliance between the disclosed information and regulations. For example, users are not fully notified about what information is collected by the application [54]. Therefore, we identify the noncompliance of disclosure as one of the quality concerns that we aim to investigate in this work.
QC2. Out-of-date information. Many skills adopt a short release cycle to enable faster response to market changes and customer needs [19], and their data handling behaviors may be altered often. California Consumer Privacy Act (CCPA) [59] requires that a privacy policy should be updated in a timely manner (at least once every 12 months) to reflect the company's most recent practices. Besides the CCPA requirement, a user study [43] reveals that whether the privacy policy is kept up-to-date concerns the public as well.
QC3. Inaccessible resources. Many skills resort to external sources in their privacy policies to provide auxiliary information. For example, some policies include links to a third-party website, to direct users to the entity with whom the personal information is shared. However, developers may fail to update these links when the external party has disabled the website link, such that the users may lose relevant information. Several studies [54,58] have revealed that such failures are not uncommon and have raised users' concerns. Therefore, the accessibility of links in the privacy policies is identified as a quality concern in our work.
Another resource we take into consideration in this work is a skill's multi-language versions. According to Article 29 Working Party [28], “a translation in one or more other languages should be provided where the controller targets data subjects speaking those languages”. The skills that declare a list of supported languages in their descriptions are thus supposed to also provide a set of privacy policies in these languages.
QC4. Obscure, complex, and lengthy texts. Several studies express users' difficulty with the language of the privacy policies [56,57,73]. Features that mostly affect readers' comprehension include grammatical and syntactic features, which refer to words and sentence structures in the privacy policies, and organizational features, which refer to document characteristics such as document length and logical order of information presentation [55,57].

The Taxonomy
With the identified quality concerns, we generalize each of them into a quality assessment taxonomy. This results in four high-level metrics, including compliance of disclosure (mapping to QC1), timeliness (QC2), availability (QC3), and readability (QC4). We further break down each of them into items that could be measured with automatic techniques.

Compliance of Disclosure.
The compliance of disclosure of a privacy policy document refers to the extent to which its disclosed information fulfills the requirements of regulations, store operators, and users. In this regard, GDPR Article 13 [40] presents the contents that are required to be included in a privacy policy. Several recent studies [11, 34, 51–53, 74, 81] also summarize types of contents derived from data regulations or user studies. We summarize these types into 11 significant components (detailed in Table 1), and QuPer assesses the compliance of a privacy policy based on its coverage of these components.
Coverage of components. Several studies have attempted to identify significant privacy concerns by conducting user studies [34], or analyzing requirements of data regulations [51–53, 74, 81]. By reviewing these studies, we have identified eight essential components of a privacy policy that are their common concerns, including Access, Choice, Collect, Cookies, Purpose, Retention, Security, and Share. Second, we turn our consideration to the spectrum of VPA apps' user groups. Prajapati et al. [60] reveal that the lower uptake of smartphones among children and the elderly can be primarily attributed to intricate Human-Computer Interaction (HCI) models, rather than limitations in their cognitive abilities. Given that VPA apps boast a conversational UI that inherently demands minimal learning aptitudes and behavioral comprehension, they tend to attract a higher proportion of children users than their mobile counterparts, as shown in a recent work [42]. We thus include Children [70] as a required component. Third, as skills are open to a wide range of regions (some of which are under the protection of specific regulations, e.g., CCPA for US California users), Region [21] is included. Fourth, as skills are typically updated relatively frequently, we include Update. Finally, Provider is included, as the Alexa store obligates developers to disclose their contacts, to enable users to request deleting collected information.

Timeliness.
For timeliness, three aspects are assessed.
Time to release. Once an app is released, the associated privacy policy is supposed to be available. Therefore, we assess the difference between the first release time of the skill and that of its privacy policy.
Update frequency. A skill may undergo many updates, and some updates may introduce changes in its data handling practices. Accordingly, its privacy policy should be updated to reflect the change. We thus monitor the whole life cycle of the skill, and check whether its privacy policy is kept up-to-date.

Table 1 (excerpt). Components of a privacy policy.
Component | Description | Example
CHILDREN | [...] | We are in compliance with the requirements of COPPA (Children's Online Privacy Protection Act), we do not collect any information from anyone under 13 years of age.
REGION | Protection mechanisms for some special regions | If you are a resident of the state of California, we will abide by the regulations of CalOPPA when handling your information.
UPDATE | Whether the privacy policy will be updated | Please note that this Privacy Policy may be periodically updated. Please refer to our website for the latest Privacy Policy that is in force.
PROVIDER | Contact information of the privacy policy provider | If you have questions of your personal data, you may raise them at any time by contacting us at: xxx@gmail.com.
RETENTION | How long the skill will keep user data | We will retain user-provided data for as long as you use Bathroom Sidekick and for a reasonable time thereafter. We will retain Automatically Collected information for up to 24 months and thereafter may store it in aggregate.
DATA_USE | How the skill will use user data | The information we collect is used to improve our website in order to better serve you.
Adaptability and agility to incidents. Some events may stimulate developers to update the privacy policies. Security incidents, e.g., a data breach, may also raise alerts to developers for the update. We thus analyze the relevance of the releases and updates on privacy policies with the occurrence of known security incidents.
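The three timeliness aspects above reduce to simple date arithmetic once release and update timestamps are available. The following is a minimal sketch, not QuPer's implementation: the function names are hypothetical, the 365-day staleness limit is an assumption modeled on the CCPA 12-month requirement, and the 60-day incident window is an arbitrary illustrative value.

```python
from datetime import date

def release_lag_days(skill_release: date, policy_release: date) -> int:
    """Time to release: positive if the policy appeared after the skill,
    negative if the policy predates the skill."""
    return (policy_release - skill_release).days

def is_stale(last_update: date, today: date, max_age_days: int = 365) -> bool:
    """Update frequency: flag a policy whose last update is older than
    max_age_days (assumed limit, modeled on the CCPA 12-month rule)."""
    return (today - last_update).days > max_age_days

def updated_after_incident(policy_updates, incident: date, window_days: int = 60) -> bool:
    """Agility to incidents: True if any update falls within window_days
    after the incident (window size is a hypothetical choice)."""
    return any(0 <= (update - incident).days <= window_days for update in policy_updates)
```

A negative lag corresponds to a policy that was written before the skill existed, for instance one reused from another service of the same developer.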

Availability.
We consider the availability of the following two resources according to the identified quality concerns.
Link validity. This refers to the accessibility of the external links.
Coverage of languages. This refers to whether a skill provides its privacy policy in every language that it claims in its list of supported languages.
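Both availability checks can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the tool's actual code: treating any 2xx or 3xx response as a valid link is our assumption, and the helper names are hypothetical.

```python
import urllib.error
import urllib.request

def fetch_status(url: str, timeout: float = 10.0):
    """Return the HTTP status code for url, or None if it cannot be opened."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code          # e.g. 404 for a removed page
    except (OSError, ValueError):
        return None              # DNS failure, timeout, malformed URL

def is_link_valid(status_code) -> bool:
    """Assumed rule: 2xx and 3xx responses count as accessible."""
    return status_code is not None and 200 <= status_code < 400

def missing_languages(supported, policy_languages):
    """Languages the skill claims to support but has no policy version for."""
    normalize = lambda items: {item.strip().lower() for item in items}
    return sorted(normalize(supported) - normalize(policy_languages))
```

Some servers reject HEAD requests, so a practical crawler would likely fall back to GET before declaring a link invalid.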

Readability.
As identified in the quality concerns, we aim to assess a privacy policy's readability from the two features that mostly affect users' comprehension.
Grammar and Syntax. This focuses on the effect of textual elements (e.g., words and sentences) on comprehension. We include most of the representative features discussed in relevant studies [28,57].
Avoiding double negatives. A double negative means that a statement uses two negative elements to produce a positive force. For example, “We will not share your information with organizations or institutions that we do not work with” actually means “We will share your information with organizations or institutions if we work with them.” According to a relevant study [57], the latter one is easier to comprehend.
Avoiding obscure language qualifiers. Article 29 [28] requires that “Language qualifiers such as 'may', 'might', 'some', 'often' and 'possible' should also be avoided”. For example, the statement “We may use your personal data for research purposes” violates this requirement.
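The two keyword-level checks above (double negatives and language qualifiers) can be illustrated with plain token matching. This is a minimal sketch: the qualifier list is the one quoted from Article 29, while the negation list is a hypothetical stand-in for the vocabulary lists a real implementation would use.

```python
import re

# Qualifiers quoted from the Article 29 guideline.
QUALIFIERS = {"may", "might", "some", "often", "possible"}
# Hypothetical negation vocabulary, for illustration only.
NEGATIONS = {"not", "no", "never", "none", "neither", "nor", "cannot", "without"}

def tokens(sentence: str):
    """Lowercased word tokens; apostrophes are kept so "don't" stays whole."""
    return re.findall(r"[a-z']+", sentence.lower())

def find_qualifiers(sentence: str):
    """Return every obscure qualifier that occurs in the sentence."""
    return [tok for tok in tokens(sentence) if tok in QUALIFIERS]

def count_negations(sentence: str) -> int:
    """Count negation words, including contractions ending in n't."""
    return sum(1 for tok in tokens(sentence) if tok in NEGATIONS or tok.endswith("n't"))

def has_double_negative(sentence: str) -> bool:
    """Flag sentences with two or more negative elements."""
    return count_negations(sentence) >= 2
```

Pure keyword matching over-approximates: it also flags the month name "May", or two unrelated negations in a long sentence, so flagged sentences would still need a syntactic or manual check.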
Locating the main idea of the sentence at the beginning. Sentences are easier to read and comprehend when the main idea occurs at the beginning [57]. For example, “We do not share personal information such as name, address, email address, or phone number with others” can be better understood than “Personal information such as name, address, email address, or phone number is not shared with third parties by us”.
Organization and Structure. This focuses on the document features, such as the length and logical order of the information.
Text structure. Structural features that may affect readability include the number of sentences in a paragraph, the number of words in a sentence, and the number of syllables in a word, as revealed by [29,67]. These features represent the difficulty level of the paragraph, the sentence, and the word, respectively.
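Structural features of this kind feed standard readability formulas. As an illustration, the sketch below computes words per sentence together with two well-known indices that Section 7.1 adopts, the Automated Readability Index (ARI) and LIX. The tokenization is deliberately naive and is an assumption of this sketch; syllable-based scores such as FRES are omitted because they need a syllable counter.

```python
import re

def split_sentences(text: str):
    """Naive split on sentence-final punctuation followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_words(text: str):
    return re.findall(r"[A-Za-z0-9']+", text)

def words_per_sentence(text: str) -> float:
    return len(split_words(text)) / len(split_sentences(text))

def ari(text: str) -> float:
    """ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43."""
    words = split_words(text)
    chars = sum(ch.isalnum() for word in words for ch in word)
    return 4.71 * chars / len(words) + 0.5 * words_per_sentence(text) - 21.43

def lix(text: str) -> float:
    """LIX = words / sentences + 100 * (words longer than 6 letters) / words."""
    words = split_words(text)
    long_words = sum(1 for word in words if len(word) > 6)
    return words_per_sentence(text) + 100.0 * long_words / len(words)
```

Both functions assume non-empty input with at least one word; a production version would guard against empty documents.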
Logical order of information. The content should be arranged in a logical order [57]. For example, the COLLECT section in Table 1, which builds the context of what information is being collected, should appear before the SHARE section, which describes how the skill shares user information, to facilitate readers' understanding.
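Once each section has been assigned a component label, an ordering requirement such as "COLLECT before SHARE" can be checked mechanically. The sketch below assumes the labels arrive in document order; only the (COLLECT, SHARE) pair comes from the text, and the function name is hypothetical.

```python
def order_violations(section_labels, precedence):
    """Return the (earlier, later) pairs whose required order is violated.

    section_labels: component labels in the order the sections appear.
    precedence: pairs (a, b) meaning "a should first appear before b".
    Pairs with a missing component are skipped, since that is a coverage
    issue rather than an ordering issue.
    """
    first_position = {}
    for index, label in enumerate(section_labels):
        first_position.setdefault(label, index)
    return [
        (a, b)
        for a, b in precedence
        if a in first_position and b in first_position
        and first_position[a] > first_position[b]
    ]

# The only ordering rule stated in the text.
RULES = [("COLLECT", "SHARE")]
```

For example, `order_violations(["SHARE", "COLLECT"], RULES)` reports the single violated pair, while a document that lists COLLECT first yields an empty list.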

OVERVIEW OF QUPER
We embed the proposed quality metrics into a framework named QuPer, to assess the quality of privacy policies. In this section, we briefly describe the process of privacy policy collection (Section 3.1) and the assessment techniques for each quality metric (Section 3.2).

Data collection and preprocessing
We obtain a list of all skills available in the Alexa skill store from the dataset used in a recent study [81]. It includes 65,195 skills, of which 21,063 skills provide links to their privacy policies. We filter out the skills that provide duplicate privacy policy links, and 9,136 skills are left. We use a crawler to scrape the privacy policy documents through the obtained links. During the crawling, 584 URLs cannot be opened and 1,245 URLs return a 404 Not Found page. As such, 7,307 privacy policy documents are obtained. Among the obtained documents, some are not related to privacy policies (e.g., a company home page). We thus filter out those that do not include the keyword “privacy” or “user information”. To ensure the validity of the filter, we manually check 200 randomly selected policy links. The results are listed in Table 2. Overall, 6,430 privacy policies are kept and form the cohort for our timeliness and availability assessment.
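The filtering steps above can be sketched as follows. This is an illustrative reconstruction, not the authors' crawler: exact-match deduplication and case-insensitive keyword matching are assumptions, and the function names are hypothetical. The last lines re-derive the reported document count from the reported failures.

```python
KEYWORDS = ("privacy", "user information")

def dedupe_by_policy_link(skills):
    """Keep the first skill for each distinct privacy policy URL.

    skills: iterable of (skill_id, policy_url) pairs. Exact string
    comparison is an assumption; the paper does not state how URLs
    are normalized.
    """
    seen, kept = set(), []
    for skill_id, policy_url in skills:
        if policy_url not in seen:
            seen.add(policy_url)
            kept.append((skill_id, policy_url))
    return kept

def looks_like_privacy_policy(page_text: str) -> bool:
    """Keyword filter: keep pages mentioning either keyword (case-insensitive)."""
    lowered = page_text.lower()
    return any(keyword in lowered for keyword in KEYWORDS)

# Sanity check on the reported numbers: 9,136 unique links, minus
# 584 unreachable URLs and 1,245 pages returning 404.
obtained_documents = 9_136 - 584 - 1_245
```

The arithmetic gives 7,307 documents, matching the figure in the text; the keyword filter then reduces this to 6,430.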
We further build a crawler to retrieve the skill home pages and use the Beautiful Soup library [2] to extract the language version information of privacy policies from the home pages for the assessment of the availability of supported languages. Among the 6,430 privacy policies, we further filter out the privacy policies that are not presented in HTML format, leaving 5,473 documents for the completeness and readability assessment (as QuPer relies on the HTML structure, e.g., HTML tags, to automate the analysis, as detailed in Sections 4 and 7).

Compliance Assessment Methods
To address the challenge posed by the unformatted and heterogeneous nature of privacy policies (Challenge #1), we design our assessment methods by combining machine learning and NLP techniques.Considering that our dataset is relatively small-scale, we propose a lightweight and fine-grained approach so that it can be more precise than using a pure multi-label multi-class classification.
Our approach involves analysis at both the section and sentence levels, aiming to capture precise information about the sentences in the privacy policies.
4.1.1 Section-level analysis. The privacy policy is usually structured into sections, and each section covers one of the components listed in Table 1. For example, the component of COLLECT is often presented in a section whose title is semantically similar to “Information we collect”. The vast majority of privacy policy documents (>90%) follow this format [81]. Therefore, we propose to use the section titles as input to train a classifier that categorizes the purpose of the section (according to Table 1) to assess whether the document includes the corresponding components.
Section Title Extraction. Section titles are typically represented as heading elements in HTML and can be identified by their surrounding tags (listed in Table 3). These tags cannot be used directly as filters though, as some irrelevant contents, such as advertisements and navigation bars, may also use HTML tags to highlight the text. We thus need to identify the particular tag type that is used to highlight the section titles in the policy texts. To this end, we first define five reference phrases, which are the top five most frequently appearing relevant phrases in privacy policy section titles (“information collect”, “information use”, “change data”, “data security”, “contact us”). We then extract the text embedded in heading tags from the privacy policy in HTML format and use the Jaccard similarity coefficient [36] between the extracted text and reference phrases to identify section titles, as defined below.
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \tag{1}$$

where $A$ denotes an extracted phrase and $B$ denotes a reference phrase. For example, in a privacy policy that contains both “h2” tags and “h3” tags, texts extracted from “h2” tags are “News” and “Blog”, both of which have a zero Jaccard score to the reference phrases. Texts extracted from “h3” tags in the document are “What information we collect”, “Security of data”, and “Cookies”, which have Jaccard scores of 0.50, 0.66, and 0.0 to the reference phrases, respectively. We treat the tag with the highest average Jaccard score as the section title tag. That means, in the above example, we identify the “h3” tag as the section title tag. We then extract the texts which are highlighted by the identified tag as section titles.
Section Title Classification. With the extracted data, we build a classifier to classify privacy policy sections into the components defined in Table 1.
Data labeling. We invite five researchers from our institution to conduct the data labeling. All of them have research experience in privacy policies and one has a law background. To ensure the accuracy of data annotation, we first provide them with a brief tutorial and some annotation samples. We ask them to label 48 section titles in five privacy policies and explain their labeling in a group discussion, ensuring that their criteria are calibrated. After that, we randomly select 140 privacy policies and ask them to annotate all included (1,503) section titles with 12 labels (i.e., the 11 components given in Table 1, plus the label “OTHER”).
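The section-title tag selection described above can be sketched with the standard library alone. This is a hedged illustration, not QuPer's code: the set of candidate tags (h1 to h6) stands in for Table 3, word-level Jaccard over lowercased tokens is an assumption that reproduces the 0.50 and 0.66 scores of the example, and each title is scored by its best match over the five reference phrases.

```python
import re
from html.parser import HTMLParser

REFERENCE_PHRASES = ["information collect", "information use", "change data",
                     "data security", "contact us"]
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}  # assumed stand-in for Table 3

def word_set(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Word-level Jaccard similarity |A ∩ B| / |A ∪ B|."""
    set_a, set_b = word_set(a), word_set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def best_reference_score(text):
    return max(jaccard(text, ref) for ref in REFERENCE_PHRASES)

class HeadingCollector(HTMLParser):
    """Collects the text inside each candidate heading tag, grouped by tag."""
    def __init__(self):
        super().__init__()
        self.by_tag, self._open, self._buffer = {}, None, []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS and self._open is None:
            self._open, self._buffer = tag, []

    def handle_data(self, data):
        if self._open is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._open:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.by_tag.setdefault(tag, []).append(text)
            self._open = None

def section_title_tag(html):
    """Return (tag, titles) for the tag with the highest average Jaccard score."""
    collector = HeadingCollector()
    collector.feed(html)
    if not collector.by_tag:
        return None, []
    averages = {tag: sum(map(best_reference_score, texts)) / len(texts)
                for tag, texts in collector.by_tag.items()}
    best = max(averages, key=averages.get)
    return best, collector.by_tag[best]
```

On the example from the text, the h2 headings average 0 while the h3 headings average about 0.39, so h3 is selected.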
Training. We use the support vector machine (SVM) [38] and the Naive Bayes classification to train our classifiers, given that both are known to have strong capacity in handling relevant tasks [17,65]. The performance of the classifiers is shown in Table 4, in which the Naive Bayes classifier (81.82% F1-score) achieves a higher average F1-score than SVM (79.75% F1-score).

Sentence-level analysis.
The section-level analysis can infer the purpose of each section, and we conduct sentence-level processing as a complement. This is to handle the cases in which a section contains multiple components or two section titles overlap. For example, a section named “Collection and use of information” contains both COLLECT and DATA_USE contents (“We collect your information such as name, email, and address. We use such information only for statistical purposes that help us design and administer the Site”), which causes the section-level classification to have a low recall on the COLLECT component, as shown in Table 4.
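To illustrate why sentence-level processing helps with such mixed sections, the sketch below assigns components to individual sentences by matching characteristic verbs. It is a simplification under loud assumptions: the verb lists are hypothetical, and plain token matching stands in for the dependency parsing a real implementation would use to find predicative verbs.

```python
import re

# Hypothetical verb lists per component, for illustration only.
COMPONENT_VERBS = {
    "COLLECT": {"collect", "collects", "gather", "gathers", "obtain", "obtains"},
    "DATA_USE": {"use", "uses", "process", "processes"},
    "SHARE": {"share", "shares", "disclose", "discloses", "sell", "sells"},
}

def components_in_section(section_text: str):
    """Map each component to the sentences of the section that signal it."""
    found = {}
    for sentence in re.split(r"(?<=[.!?])\s+", section_text.strip()):
        sentence_tokens = set(re.findall(r"[a-z]+", sentence.lower()))
        for component, verbs in COMPONENT_VERBS.items():
            if sentence_tokens & verbs:
                found.setdefault(component, []).append(sentence)
    return found
```

Applied to the example section above, the first sentence is attributed to COLLECT and the second to DATA_USE, so both components are recovered even though the section has a single title.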
Our sentence-level analysis begins with defining the categorization thresholds. To this end, we adopt spaCy [3] to retrieve the most frequently used predicative verbs for each component, as shown [...]

Privacy policy release time vs. skill release time. Figure 3(b) shows a comparison of privacy policies' release time and their corresponding skills' release time. We find that more than half of the privacy policies are published before the corresponding skills are released, among which 43 privacy policies are released 11 years earlier than the skills. This occurs mostly because the developer simply directs users to the privacy policy of their other services (e.g., the website), rather than specifically creating a precise one based on the data handling behaviors of the skill. This should raise an alert to the public because Alexa skills usually collect user information in a broader range of ways (such as recording the user's voice) than traditional web services.
In addition, we find that a number of privacy policies are released long after their skills are published, e.g., more than one hundred privacy policies are released three years after the release of their corresponding skills. During this period, the skill's behavior is not governed by the privacy policy, which puts users' privacy at risk.
The trend of privacy policies' release and update. We investigate the trend of privacy policies' release and update from 2018 to 2022. Figure 4(a) shows that the number of newly released privacy policies peaks in 2020, followed by a rapid decline in 2021 and 2022. This result is consistent with the skill release trend shown in the Voicebot report [44]. Although the number of skills is still increasing, its growth rate has decreased since the end of 2019.
Figure 4(c) shows that most privacy policies' updates took place in 2021 and 2022, while 197, 186, and 78 privacy policies have their last update time in 2020, 2019, and 2018, respectively.
Adaptability and agility to incidents. Figures 4(b) and 4(d) show the number of policy releases and updates per month from 2018 to 2022. As shown in Figure 4(b)-2018, the first peak of policy release is in August 2018. This may be because Amazon allowed skill providers to add in-skill purchases and Amazon Pay functions in Alexa skills in May 2018 [31], which stimulated the skills' profit growth and resulted in an influx of skill providers. Another peak of policy release appears in September 2020. This aligns with the rapid growth of skill users, which peaks in early 2020 [44].
In Figure 4(d), we observe that the first peak of policy updates occurs in July 2021. This is the time when the Luxembourg National Commission for Data Protection (CNPD) levied the largest GDPR violation fine of 746 million euros against Amazon [20]. We also find that, in 2022, the number of updated privacy policies is relatively high from January to August, a period during which more than 30 data leakage incidents occurred [22].
Finding 2: We observe that 16% (813/4,879) of privacy policies are never updated. We find that 813 Amazon skills have been released for more than two years and have never updated their privacy policies.

[...] the largest number of missing language versions of the privacy policy is in Spanish, followed by German.
Finding 6: We observe that 40% (2,602/6,430) skills do not provide required privacy policy language versions as they declared in the “Supported Language” sections.

READABILITY ASSESSMENT
In this section, we present our readability assessment methods (Section 7.1) and our results (Section 7.2).

Readability Assessment Methods
We examine grammatical and syntactic features, including double negatives, obscure language qualifiers, and the location of the main idea, as well as organizational features, including sentence length and the logical order of information. We detail each in turn.

Avoiding double negatives. A double negative occurs when a sentence contains two grammatical negation forms. To detect double negative sentences, we use keyword matching to count the negative words and contrast words in a sentence. We use negative and contrast vocabulary word lists as the reference for the identification [25,26,39].

Avoiding obscure language qualifiers. The Article 29 Data Protection Working Party [28] states that language qualifiers such as "may", "might", "some", "often" and "possible" should be avoided. Therefore, we use these keywords to conduct fuzzy semantic tests on the sentences of privacy policies.

Locating the main idea of the sentence at the beginning. We determine the main idea of a sentence by locating its subject and predicate [49]. We use the Python spaCy library to find the index positions of the subject (nsubj) and predicate (ROOT) in each sentence, and then use Equations 2 and 3 to determine the position of the main idea depending on the length of the sentence.

\[ I_c = \frac{I_s + I_p}{2} \tag{2} \]

where $I_c$ is the central index, $I_s$ is the index of the subject, and $I_p$ is the index of the predicate. That is, we add the index of the subject and the index of the predicate and divide the sum by two to get the central index.

\[ B = \begin{cases} I_c < n/2, & 5 < n < 20 \\ I_c < n/3, & 20 < n < 27 \\ I_c < n/4, & n > 27 \end{cases} \tag{3} \]

where $B$ is a Boolean variable used to determine whether a main idea is at the beginning, and $n$ is the sentence length. We do not include the cases in which the sentence length is less than 5, as they are rare according to the literature [69]. When the sentence length is greater than 5 and less than 20, we compare the central index $I_c$ with half the sentence length. When the sentence length is greater than 20 and less than 27, we compare the central index to one-third of the sentence length, and when the sentence length is greater than 27, we compare it to a quarter of the sentence length. We apply this method to 20 randomly selected sentences and achieve 90% accuracy in identifying the main idea location. Table 9 shows two examples in which the second sentence achieves a lower central index $I_c$ than the first one.

Text structure. We first adopt three readability metrics that have been widely used by other studies [29,64] for measuring document readability, namely the Automated Readability Index (ARI), the Flesch Reading Ease Score (FRES), and the Laesbarheds Index (LIX). The first metric, ARI, is used to assess the education level a reader needs to understand a document, and the other two, FRES and LIX, assess the difficulty level of the document based on its average number of syllables per word, total words, and total sentences. In addition, we select seven metrics from existing studies [15,29,66,68,78] that specifically focus on assessing the impact of document structure on readability. These metrics have been shown to have a significant impact on readability, and include letters per word (LPW), syllables per word (SPW), words per sentence (WPS), sentence count (SC), word count (WC), reading time (RT) and speaking time (ST). In Table 11 of Appendix B, we list details of all ten metrics QuPer considers.

Logical order of information. Recent studies show that a coherent organization of sections/paragraphs in a document can largely ease readers' comprehension [7,30,75]. From existing literature, we summarize the following guidelines for the presentation order:
• Put content in a time sequence.
• Present the general information before the specific one.
• Discuss things that affect many people before those that affect few.
• Present permanent provisions before temporary ones.
Based on these guidelines, we then investigate a desirable arrangement of the sections in a privacy policy through a user study. We recruit 23 volunteers, ten of whom major in law and have experience in writing legal documents. The other 13 major in computer science. All of them have experience in reading privacy policies. We prepare a tutorial with the summarized guidelines for the volunteers, and ask them to sort the sections in the order they find logical for interpreting the policy. We use the majority vote among the volunteers to produce the final order of the sections. Our study has been guided by an ethics committee member in our university. We list the tutorial in Table 12 in Appendix C.
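The sentence-level checks described in this section can be illustrated with a minimal Python sketch. It is not QuPer's implementation: the function names are hypothetical, the negation list is a small illustrative stand-in for the vocabulary lists cited above (contrast words are left out), and the subject and predicate indices are assumed to come from a dependency parser such as spaCy. The text does not state how sentence lengths of exactly 5, 20, or 27 are handled, so the boundary handling below is an assumption.

```python
import re

# Illustrative negation list (an assumption); the paper relies on external
# negative and contrast vocabulary lists [25,26,39] not reproduced here.
NEGATION_WORDS = {"no", "not", "never", "none", "nobody", "nothing",
                  "neither", "nor", "cannot", "without"}
# Qualifiers named by the Article 29 Working Party in the text above.
QUALIFIERS = {"may", "might", "some", "often", "possible"}


def tokenize(sentence):
    """Lowercase word tokens, keeping apostrophes so "don't" stays whole."""
    return re.findall(r"[a-z']+", sentence.lower())


def has_double_negative(sentence):
    """True if the sentence contains at least two negation forms."""
    count = sum(1 for t in tokenize(sentence)
                if t in NEGATION_WORDS or t.endswith("n't"))
    return count >= 2


def find_qualifiers(sentence):
    """Sorted list of obscure language qualifiers found in the sentence."""
    return sorted(set(tokenize(sentence)) & QUALIFIERS)


def main_idea_at_beginning(subject_idx, predicate_idx, length):
    """Apply the central-index rule; returns None for excluded sentences.

    Assumed boundaries: lengths below 5 are skipped, 5..20 use one half,
    21..27 use one third, and longer sentences use one quarter.
    """
    if length < 5:
        return None
    central = (subject_idx + predicate_idx) / 2
    if length <= 20:
        divisor = 2
    elif length <= 27:
        divisor = 3
    else:
        divisor = 4
    return central < length / divisor
```

For example, `main_idea_at_beginning(0, 1, 10)` compares a central index of 0.5 with half the sentence length (5) and reports that the main idea is at the beginning.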

Readability Assessment Results
Results of grammatical and syntactic issues. Figure 8 shows, among 5,473 privacy policies, the distribution of policies that contain double negative sentences (3,086), policies that contain sentences with obscure language qualifiers (4,687), and policies that contain sentences whose main ideas are not at the beginning (4,139).
• Avoiding double negative. We observe that 56% (3,086/5,473) of skills' privacy policies contain double negative forms. Among the 3,086 privacy policies, 2,010 privacy policies contain no

Consistency between privacy policies and data practices. Another line of research conducts consistency checking between privacy policies and actual behaviors. Andow et al. [12] propose POLICHECK, based on PolicyLint [11] and AppCensus [4], to check the entity-sensitive consistency. Lentzsch et al. [46] conduct the first worldwide large-scale analysis of Alexa skills, focusing on the skill certification process. They also examine the consistency between skill privacy policies and their actual behaviors, and find that skills in the "kids" category exhibit the most severe violations, which necessitates the inclusion of the CHILDREN category in our compliance checking. Manandhar et al. [53] conduct an empirical large-scale analysis of smart home devices. They focus on examining the availability and coverage of privacy policies, aiming to gain insights into the current state of privacy disclosure within the smart home ecosystem. Xie et al. [81] develop Skipper to detect the noncompliance between skills' behaviors and their declared profile.
QuPer identifies emerging concerns such as children- and region-specific policies, and includes them in the taxonomy to enhance the completeness of its assessment. It also proposes corresponding assessment techniques within each quality metric, taking into consideration the challenges in the VPA context. For example, due to the lack of a large-scale corpus, it uses a two-level classification (see Section 4.1).

CONCLUSION
In this work, we conduct the first systematic study on the quality of privacy policies in the VPA app domain. We develop QuPer, which automatically extracts meta features and linguistic features, and assesses the privacy policy quality of VPA apps (i.e., Amazon Alexa skills) based on them. QuPer proposes four quality metrics to measure different aspects of VPA privacy policy quality and uniquely develops a two-step document processing method to analyze VPA privacy policy documents. Our work reveals a concerning state of current VPA privacy policy quality and raises an alert to VPA app developers. We therefore encourage store operators to set up regulatory mechanisms to ensure high standards for VPA privacy policies.
Proceedings on Privacy Enhancing Technologies 2024(1) Yan et al.

A STATUS CODE DETAILS
Table 10 provides the details of the status codes returned from the website mentioned in Section 6.1.

[62], and has also been widely used in various studies [24,29] to measure the readability of articles and paragraphs. It calculates a numerical score based on factors such as sentence length and word difficulty, providing valuable insights into the reading level required to comprehend a particular piece of text.

Flesch Reading Ease score (FRES) [63] is widely adopted by individuals and organizations seeking to ensure that their written content is accessible and easily understandable to their target audience [13,23]. It provides an indication of how easy or difficult a text is to understand by considering both sentence-level complexity and word-level complexity.

Laesbarheds Index (LIX) is another metric that has been utilized to determine the difficulty of documentation [29,83]. It measures a text's readability based on factors such as sentence length, complexity, unusual words, main words, and different words used. One of its advantages is that its reliability considers a wide range of age groups, from children's literature to adult reading materials.

It is 238 words per minute for English silent reading and 183 words per minute for speaking [14]. Privacy policies should be tailored to align with users' reading habits to determine the length and content of the document.

B OVERVIEW OF READABILITY METRICS
Word Count (WC) "Word count" refers to the total number of words present in a given text or context.
Reading time (RT) "Reading time" refers to the estimated time it takes for an individual to read a particular piece of text or content. It is a measurement used to provide readers with an estimate of how long it will take them to go through the material.

Speaking time (ST)
The time it would take for the average person to say this text aloud at a rate of 125 words per minute.
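As a rough illustration of how the count-based metrics in this appendix relate to one another, the following Python sketch computes WC, SC, WPS, LPW, ARI, LIX, RT, and ST for a piece of text. It is a sketch under stated assumptions, not QuPer's code: the function name and the regex-based sentence and word splitting are hypothetical simplifications, ARI and LIX use their standard published formulas, and FRES and SPW are omitted because they require syllable counting. The default rates are the 238 and 183 words per minute quoted in this appendix; Table 11 also mentions 125 words per minute for speaking, so both rates are left as parameters.

```python
import re


def readability_metrics(text, reading_wpm=238, speaking_wpm=183):
    """Compute simple count-based readability metrics for English text."""
    # Naive sentence split on terminal punctuation (an assumption).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of letters, digits, or apostrophes (an assumption).
    words = re.findall(r"[A-Za-z0-9']+", text)
    wc, sc = len(words), len(sentences)
    if wc == 0 or sc == 0:
        raise ValueError("text must contain at least one word and sentence")

    # Count only letters and digits so apostrophes do not inflate LPW.
    lengths = [sum(ch.isalnum() for ch in w) for w in words]
    long_words = sum(1 for n in lengths if n > 6)  # LIX: more than 6 letters

    wps = wc / sc
    lpw = sum(lengths) / wc
    return {
        "WC": wc,
        "SC": sc,
        "WPS": wps,
        "LPW": lpw,
        # Standard ARI formula: characters per word and words per sentence.
        "ARI": 4.71 * lpw + 0.5 * wps - 21.43,
        # Standard LIX formula: sentence length plus share of long words.
        "LIX": wps + 100 * long_words / wc,
        "RT_minutes": wc / reading_wpm,
        "ST_minutes": wc / speaking_wpm,
    }
```

Calling `readability_metrics(policy_text)` on the plain text of a policy returns a dictionary keyed by the metric abbreviations used in Table 11; `policy_text` is a placeholder for whatever text the caller supplies.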

Figure 1: A taxonomy of privacy policy quality metrics.

Figure 4: Trend of privacy policies' release/update from 2018 to 2022

0-29 = It is very easy to read, 30-39 = It is easy to read, 40-49 = It is a little hard to read, 50-59 = It is hard to read, 60 = It is very hard to read
Syllables per Word (SPW) "Syllables per word" is a measure that calculates the average number of syllables in each word. It indicates the complexity of words in a given context. 1.5 = Second Grade, 1.6 = Third to Eighth Grade, Adult reading average SPW = 1.77
Words per Sentence (WPS) "Words per sentence" refers to the average number of words in each sentence. It helps determine the length and complexity of sentences in a given text. 10.6 = Second Grade, 13.9 = Third to Fifth Grade, 14.7 = Sixth to Eighth Grade, Adult reading average WPS = 15.24
Letters per Word (LPW) "Letters per word" measures the average number of letters in each word and provides insights into the word complexity and length in a given context. Adult reading average LPW = 5.24
Sentence Count (SC) "Sentence count" refers to the number of sentences present in a given text or context.
and the California Consumer Privacy Act (CCPA) [59]. The GDPR is one of the pioneering comprehensive data protection laws that came into effect in 2018. It is designed to protect the personal data and privacy of individuals within the EU by imposing strict regulations on how organizations collect, store, process, and transfer their data. The CCPA is a state-level privacy law in California, United States that went into effect in 2020. It applies to businesses that collect, use, or disclose the personal information of California residents. While it has most principles in common with GDPR, it specifically grants California consumers the right to know what personal information businesses are collecting about them and to request that such information be deleted.

Table 1: List of the main components that a skill privacy policy should cover
data collected by the skill | We may collect: your name, birth date, gender, email address, zip code, and any other information you may voluntarily provide to us.
2 | COOKIE | Cookie from user's device collected by the skill | Some service providers use cookies or similar tracking technologies in order to provide you with promotions or other contents on the basis of your browser activities and interests.

Table 3: HTML heading tags captured in privacy policies

Table 5: Most frequently used predicative verbs of components and average pair-wise similarity

Table 6: QuPer's performance in identifying required components in privacy policies. POS stands for the number of positives and NEG stands for that of negatives.

Table 7: Component coverage among 5,473 privacy policies

Table 8: Component coverage of the privacy policies which include the COLLECT component

Table 9: Example of main idea sentences

Table 10: Status code details
The request is successful.

Table 11: Overview of Readability Metrics
Table 11 explains the readability metrics discussed in Section 7.1.
Metric Description Scope Mapping
Automated Readability Index (ARI) The Automated Readability Index (ARI) is used by the U.S. military to assess the grade level to read text