Automated Extraction and Presentation of Data Practices in Privacy Policies

Authors: Duc Bui (University of Michigan), Kang G. Shin (University of Michigan), Jong-Min Choi (Samsung Research), Junbum Shin (Samsung Research)

Volume: 2021
Issue: 2
Pages: 88–110


Download PDF

Abstract: Privacy policies are documents required by law and regulations that notify users of the collection, use, and sharing of their personal information on services or applications. While the extraction of personal data objects and their usage thereon is one of the fundamental steps in their automated analysis, it remains challenging due to the complex policy statements written in legal (vague) language. Prior work is limited by small/generated datasets and manually created rules. We formulate the extraction of fine-grained personal data phrases and the corresponding data collection or sharing practices as a sequence-labeling problem that can be solved by an entity-recognition model. We create a large dataset with 4.1k sentences (97k tokens) and 2.6k annotated fine-grained data practices from 30 real-world privacy policies to train and evaluate neural networks. We present a fully automated system, called PI-Extract, which accurately extracts privacy practices by a neural model and outperforms, by a large margin, strong rule-based baselines. We conduct a user study on the effects of data practice annotation which highlights and describes the data practices extracted by PI-Extract to help users better understand privacy-policy documents. Our experimental evaluation results show that the annotation significantly improves the users’ reading comprehension of policy texts, as indicated by a 26.6% increase in the average total reading score.

Keywords: privacy policy; dataset; presentation; annotation; user study; usability

Copyright in PoPETs articles are held by their authors. This article is published under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 license.