Secure and Scalable Document Similarity on Distributed Databases: Differential Privacy to the Rescue

Authors: Phillipp Schoppmann (Humboldt-Universität zu Berlin and Alexander von Humboldt Institute for Internet and Society, Berlin, Germany), Lennart Vogelsang (Humboldt-Universität zu Berlin and Alexander von Humboldt Institute for Internet and Society, Berlin, Germany), Adrià Gascón (Work done while at the Alan Turing Institute, London, UK. Now at Google, London, UK.), Borja Balle (Work done at Amazon Research, Cambridge, UK. Now at DeepMind, London, UK.)

Volume: 2020
Issue: 2
Pages: 209–229



Abstract: Privacy-preserving collaborative data analysis enables richer models than what each party can learn with their own data. Secure Multi-Party Computation (MPC) offers a robust cryptographic approach to this problem, and in fact several protocols have been proposed for various data analysis and machine learning tasks. In this work, we focus on secure similarity computation between text documents, and the application to k-nearest neighbors (k-NN) classification. Due to its non-parametric nature, k-NN presents scalability challenges in the MPC setting. Previous work addresses these by introducing non-standard assumptions about the abilities of an attacker, for example by relying on non-colluding servers. In this work, we tackle the scalability challenge from a different angle, and instead introduce a secure preprocessing phase that reveals differentially private (DP) statistics about the data. This allows us to exploit the inherent sparsity of text data and significantly speed up all subsequent classifications.
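As a rough plaintext illustration of the ingredients named in the abstract, the sketch below computes cosine similarity between sparse document vectors, classifies a query by k-NN majority vote, and releases a count with Laplace noise. It is not the authors' protocol: nothing here runs under MPC, and the abstract does not say which statistics the preprocessing phase reveals. The function names, the dict-based vector representation, the choice of cosine similarity, and the sensitivity of 1 are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector only (exploits sparsity)
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(query, labeled_docs, k):
    """Return the majority label among the k documents most similar to query.

    labeled_docs is a list of (sparse_vector, label) pairs.
    """
    ranked = sorted(
        labeled_docs,
        key=lambda doc: cosine_similarity(query, doc[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with Laplace noise of scale sensitivity / epsilon.

    Assumption: changing one record changes the count by at most `sensitivity`.
    """
    return true_count + laplace_noise(sensitivity / epsilon, rng)
```

In this sketch the work per similarity is proportional to the number of nonzero terms in the shorter vector, not to the vocabulary size. A noisy count such as the one returned by `dp_count` is one hypothetical example of a statistic that could tell the parties how sparse the data is without revealing exact values.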

Keywords: text analysis, document similarity, multiparty computation, differential privacy

Copyright in PoPETs articles is held by their authors. This article is published under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 license.