PRIVIC: A privacy-preserving method for incremental collection of location data

With recent advancements in technology, the threats of privacy violations of individuals' sensitive data are surging. Location data, in particular, have been shown to carry a substantial amount of sensitive information. A standard method to mitigate the privacy risks for location data consists in adding noise to the true values to achieve geo-indistinguishability (geo-ind). However, geo-ind alone is not sufficient to cover all privacy concerns. In particular, isolated locations are not sufficiently protected by the state-of-the-art Laplace mechanism (LAP) for geo-ind. In this paper, we focus on a mechanism based on the Blahut-Arimoto algorithm (BA) from rate-distortion theory. We show that BA, in addition to providing geo-ind, enforces an elastic metric that mitigates the problem of isolation. Furthermore, BA provides an optimal trade-off between information leakage and quality of service. We then proceed to study the utility of BA in terms of the statistics that can be derived from the reported data, focusing on the inference of the original distribution. To this purpose, we de-noise the reported data by applying the iterative Bayesian update (IBU), an instance of the expectation-maximization method. It turns out that BA and IBU are dual to each other and, as a result, they work well together, in the sense that the statistical utility of BA is quite good and better than LAP's for high privacy levels. Exploiting these properties of BA and IBU, we propose an iterative method, PRIVIC, for a privacy-friendly incremental collection of location data from users by service providers. We illustrate the soundness and functionality of our method both analytically and with experiments.


INTRODUCTION
As research and analysis based on personal data become increasingly widespread, the risk of privacy violations of the data owners' sensitive information grows accordingly. One of the most successful proposals to address the issue of privacy protection is differential privacy (DP) [23,24], a mathematical property that makes it difficult for an attacker to detect the presence of a record in a dataset. This is typically achieved by answering queries performed on the dataset in a (controlled) noisy fashion. Lately, the local variant of differential privacy (LDP) [22] has gained popularity due to the fact that the noise is applied at the data owner's end, without the need for a trusted curator. LDP is particularly suitable for situations where a data owner is a user who communicates her personal data in exchange for some service. One such scenario is the use of location-based services (LBS), where a user typically sends her location in order to obtain information like the shortest path to a destination, nearby points of interest, traffic information, etc. The security and convenience of implementing the local model directly on a user's device (tablets, smartphones, etc.) make LDP very appealing.
Typically, in exchange for their service, providers incrementally collect their users' data and then make them available to other parties, which process them to provide useful statistics to companies and institutions. Obviously, the statistical precision of the collected data is essential for the quality of the analytics performed (statistical utility). However, injecting noise locally into the data to protect the privacy of the users usually has a negative effect on the statistical utility. Additionally, the noise degrades the quality of service (QoS) as well, since the service results from the elaboration of the information received.
Substantial research has been done to address the privacy-utility trade-off in the context of DP. In LDP, the primary focus has been to optimize the utility from the data collector's perspective, i.e., devising mechanisms and post-processing methods that allow deriving the most accurate statistics from the collection of the noisy data [22,57]. In contrast, in domains such as location privacy, the focus has usually been on optimizing the QoS, i.e., the utility from the point of view of the users. In particular, this is the case for the framework proposed by Shokri et al. [47,49].
We argue that it is important to meet the interests of all parties involved, and hence to consider both kinds of utility at the same time. The first goal of this paper is thus to develop a location-privacy preserving mechanism (LPPM) that, in addition to providing formal location-privacy guarantees, preserves as much as possible both the statistical utility and the QoS. One may think that statistical utility and QoS are aligned, since they both benefit from preserving as much original information as possible under the privacy constraint. However, this is not true in general: optimizing statistical utility does not necessarily imply a significant improvement in the QoS, nor vice versa. A counterexample is provided by Example 5.1 in Section 5. Hence, preserving both statistical utility and QoS is trickier than it may appear at first sight.
One of the approaches proposed to protect location privacy is geo-indistinguishability (geo-ind) [4], which essentially obfuscates locations based on the distance between them. This idea works particularly well for protecting the precision of the location, as it ensures that an attacker cannot differentiate between points that are close on the map by observing the reported noisy location. At the same time, it does not inject the enormous amount of noise that would be necessary to make far-away locations indistinguishable. Moreover, geo-ind has been shown to formally satisfy the basic sequential compositionality theorem [29], just like DP and its local variant. Although this approach of distance-based obfuscation seems enticing at first glance, one of the issues it poses is that it may leave geo-spatially isolated locations vulnerable, i.e., identifiable despite being formally geo-indistinguishable [17]. To improve the situation, [17] introduced the notion of elastic distinguishability metrics, which essentially leads to injecting more noise when the location to protect is isolated.
The Blahut-Arimoto algorithm (BA) [6,9] from rate-distortion theory (a branch of information theory) Pareto-optimizes the trade-off between mutual information (MI) and average distortion. This property is appealing in the context of privacy, because MI is often considered a measure of information leakage, and average distortion is a commonly used metric for quantifying QoS. Moreover, BA was proven to satisfy geo-ind in [42], opening the door to studying it as a potential LPPM. In this paper, we start off by exploring the privacy-preserving properties of BA and comparing them with those of the Laplace mechanism (LAP) [4], which is considered the state-of-the-art mechanism for geo-ind. We show that, besides geo-ind, BA provides an elastic distinguishability metric and, hence, protects even the most isolated points on the map, unlike LAP. We then examine the statistical utility, focusing on the estimation of the most general statistical information, namely the distribution of the original location data (true distribution). The "best" estimation is known in statistics as the maximum likelihood estimate (MLE), and it can be computed using the iterative Bayesian update (IBU) [2], an instance of the expectation-maximization (EM) method. We discover a duality between BA and IBU, which in our opinion is quite intriguing, because BA and IBU were developed in different contexts, using different concepts and metrics, and for completely different purposes. We show experimentally that the statistical utility of BA is very good, i.e., the MLE is very close to the true distribution. We conjecture that this is due to the duality between the mechanism that injects the noise (BA) and the one that de-noises the noisy data (IBU). In any case, the experiments show that the statistical utility of BA outperforms that of LAP for high levels of privacy, eventually becoming comparable as the level of privacy decreases.
One important point to note is that BA requires knowledge of the original distribution to provide the optimal mechanism. When it is fed with only an approximation of the distribution, it provides only an approximated result. We acknowledge that the distribution of the original data is usually off-limits and, even when available, typically gets outdated over time. In any case, we can soundly assume that it is not available, because estimating it is essentially the reason for collecting the data. Hence we have a vicious circle: we want to collect data in a privacy-friendly fashion to estimate the original distribution, while wanting to use a privacy mechanism that requires knowing a good approximation of that very distribution. Motivated by this dilemma, we propose PRIVIC, an incremental data collection method providing extensive privacy protection for the users of LBS, while retaining a high utility for both them and the service providers, and ensuring that both parties, acting in their best interest, benefit from the end mechanism.
Finally, we formally prove the convergence of PRIVIC to the true distribution and illustrate empirically the privacy-utility trade-off of our method. The experiments also demonstrate the efficacy of combining BA and IBU, in that the estimation of the original distribution is very accurate, especially when measured using a notion of distance between distributions compatible with the ground distance used to measure the QoS (e.g., the Earth Mover's distance). All the experiments were performed using real location data from the Gowalla dataset for Paris and San Francisco.

Contributions. The key contributions of this paper are:
(1) We show, analytically and with experiments on real datasets, that the BA mechanism, in addition to geo-ind, provides an elastic distinguishability metric. As such, it protects the privacy of isolated locations, at which the standard LAP for geo-ind fails.
(2) We prove that BA produces an invertible mechanism, which means that the MLE is unique. This is crucial to prove that IBU always converges to the true distribution and that, therefore, we can obtain good statistical utility.
(3) We establish a duality between BA and IBU, thus demonstrating a connection between rate-distortion theory and the expectation-maximization method from statistics.
(4) We show experimentally that BA provides better statistical utility than LAP for high levels of privacy, eventually becoming comparable as the level of privacy decreases.
(5) Since the construction of the optimal BA requires precise knowledge of the true distribution, we propose an iterative method (PRIVIC) that alternates between BA and IBU, thus obtaining a better and better estimation of the true distribution as more (noisy) data get collected. We show, both formally and with experiments on real location datasets, that PRIVIC converges to the true distribution. In summary, PRIVIC produces a geo-indistinguishable LPPM with an elastic distinguishability metric, which optimizes the trade-off with the QoS and provides high statistical utility.
(6) We investigate the effect on the privacy guarantees of our method of adversarial users who report their locations falsely in order to compromise the privacy of the isolated locations on the map.
Related Work. The trade-off between privacy and utility has been widely studied in the literature [11,36]. Optimization techniques for DP and utility for statistical databases have been analyzed by the community from various perspectives [30,31,40]. Some works have focused on devising privacy mechanisms that are optimal for limiting the privacy risk against Bayesian inference attacks while maximizing the utility [47,49]. In [42], Oya et al. examine an optimal LPPM w.r.t. various privacy and utility metrics for the user.
In [43], Oya et al. consider the optimal LPPM proposed by Shokri et al. in [49], which maximizes a notion of privacy (the adversarial error) under some bound on the QoS. The construction of the optimal LPPM requires knowledge of the original distribution, and [43] uses the EM method to estimate it and to design blank-slate models, empirically shown to outperform the traditional hardwired models. However, a problem with their approach is that there may exist LPPMs that are optimal in the sense of [49] but with no statistical utility; see Example 5.1 in Section 5. Furthermore, for the mechanisms considered in [43], the EM method may fail to converge to the true distribution. Indeed, [25] points out various mistakes in the results of [2], on which [43] intrinsically relies to prove the convergence of their method.
In [44], the authors propose a method for generating privacy mechanisms that tend to minimize mutual information using an ML-based approach. However, this work assumes knowledge of the exact prior from the beginning, unlike ours. Moreover, [44] does not provide formal guarantees for location privacy (e.g., geo-ind), which is one of the main aspects captured by our work. In [55], Zhang et al. consider the Blahut-Arimoto algorithm in the context of location privacy. However, their proposed method also requires knowledge of the prior distribution to construct the LPPM. Additionally, [55] focuses on measuring privacy for the trace of a single user. In contrast, our notion of privacy assumes the collection of single check-ins (or check-ins separated in time) from a set of users.
The Laplace mechanism has been rigorously studied in the literature in various scenarios as the cutting-edge standard to achieve geo-ind [4,7,29] and has been proven to be optimal for one-dimensional data w.r.t. Bayesian utility [27]. Despite its wide popularity, it has recently been criticized for its limited ability to protect geo-spatially isolated points from being identified by adversaries [17]. The authors of [17] addressed this concern by proposing the idea of elastic distinguishability metrics.
Our paper also considers mutual information (MI) as an additional privacy guarantee. MI and its closely related variants (e.g., conditional entropy) have been shown to nurture a compatible relationship with DP [21]. MI measures the correlation between observations and secrets, and its use as a privacy metric is widespread in the literature. Some key examples are: gauging anonymity [16,56], estimating privacy in training ML models with a typical cross-entropy loss function [1,35,44,51], and assessing location privacy [42].
A popular choice of utility metric for the users is the average distortion, which quantifies the expected quality loss of the service due to the noise induced by the mechanism. Such a metric has gained the spotlight in the community [4,10,15,18,49] due to its intuitive and simple nature. On the other hand, a standard notion of statistical utility for the data consumer is the precision of the estimation of the distribution of the original data from that of the noisy data. The iterative Bayesian update [2,3] provides one of the most flexible and powerful estimation techniques and has recently come into the focus of the community [25,26].
Incremental and privacy-friendly data collection has been explored both in the context of k-anonymity [5,12,13] and DP [32,52]. However, to the best of our knowledge, the problem of providing a robust privacy guarantee while preserving utility for both data owners and data consumers has not been addressed by the community so far.
Plan of the paper. Section 2 introduces preliminary ideas from the literature relevant to this work. Section 3 highlights BA as an LPPM because of its extensive privacy-preserving properties. Section 4 establishes the duality between BA and IBU. Section 5 explains our proposed method (PRIVIC). Section 6 exhibits the working of PRIVIC with experiments using real locations from the Gowalla dataset, illustrating the convergence of our method. Section 7 discusses and illustrates with experiments the vulnerability of PRIVIC under adversarial data submission, and Section 8 concludes. Appendices A and B contain, respectively, the proofs of the theorems derived in the paper and the relevant tables supporting the experimental analysis of PRIVIC.
The code used for implementing our mechanism for experiments is available at https://anonymous.4open.science/r/PRIVIC.

PRELIMINARIES

Standards of privacy
Definition 2.1 (d-privacy, a.k.a. metric privacy [14]). For any space X equipped with a metric d : X × X → R≥0, a mechanism R : X → P(Y) satisfies εd-privacy if, for every x, x′ ∈ X and every y ∈ Y:

R(x)(y) ≤ e^(ε d(x, x′)) R(x′)(y).
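As a sanity check, the defining inequality can be verified numerically for a finite channel. The sketch below is our own illustrative code (function and variable names are ours, not from the paper): it tests a channel matrix against εd-privacy, and applies the test to an exponential mechanism on five collinear points, which satisfies εd-privacy because the normalization constants of two rows differ by at most a factor e^((ε/2) d(x, x′)).

```python
import numpy as np

def satisfies_d_privacy(C, D, eps, tol=1e-9):
    """Check eps*d-privacy for a finite channel:
    C[x, y] <= exp(eps * D[x, xp]) * C[xp, y] for all x, xp, y.
    C[x, y] = P(report y | true x); D is the ground-distance matrix."""
    n = C.shape[0]
    for x in range(n):
        for xp in range(n):
            if np.any(C[x] > np.exp(eps * D[x, xp]) * C[xp] + tol):
                return False
    return True

# Example: exponential mechanism C(y|x) proportional to exp(-(eps/2) d(x, y))
# on 5 colinear points; the factor eps/2 absorbs the normalisation cost.
idx = np.arange(5)
D = np.abs(idx[:, None] - idx[None, :]).astype(float)
eps = 1.0
C = np.exp(-(eps / 2) * D)
C /= C.sum(axis=1, keepdims=True)
```

Running the check confirms that this channel satisfies εd-privacy for ε = 1, but not for a tenth of that budget.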

Note that:
• Setting d as the discrete metric on any X, we obtain the definition of local differential privacy (LDP) [22].
• Setting X = Y = R² and d as the Euclidean metric, we get the definition of geo-ind [4].

Definition 2.2 (Mutual information [46]). Let (X, Y) be a pair of random variables defined over the discrete space X × Y, such that P_{X,Y} is the joint probability mass function (PMF) of X and Y, P_X and P_Y are the marginal PMFs of X and Y, respectively, and P_{X|Y} is the conditional probability of X given Y. Then the (Shannon) mutual information between X and Y is

I(X; Y) = Σ_{x∈X} Σ_{y∈Y} P_{X,Y}(x, y) log ( P_{X,Y}(x, y) / (P_X(x) P_Y(y)) ).

Remark 1. MI has often been used as a notion of privacy (and security) in the literature. In particular, [38] has provided an operational interpretation of MI in terms of an attacker model. On the other hand, other researchers have strongly criticized the use of Shannon entropy and MI as measures of privacy; see for example [50].
We do not take sides in this controversy: for us, MI is only a means to construct a mechanism that provides geo-ind under an elastic metric, which is our reference privacy notion.
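For concreteness, the MI of Definition 2.2 can be computed directly from a joint PMF; the following is a minimal sketch (the function name is ours).

```python
import numpy as np

def mutual_information(joint):
    """Shannon mutual information I(X;Y), in bits, from a joint PMF
    matrix joint[x, y]; terms with zero joint mass contribute nothing."""
    px = joint.sum(axis=1, keepdims=True)   # marginal of X
    py = joint.sum(axis=0, keepdims=True)   # marginal of Y
    prod = px @ py                          # independence baseline
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / prod[mask])))
```

For instance, a perfectly correlated pair with two equiprobable values yields 1 bit of MI, while an independent pair yields 0.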

Notions of utility
Definition 2.3 (Quality of service). For discrete spaces X and Y, let d : X × Y → R≥0 be any distortion metric (a generalization of the notion of distance). Let X be a random variable on X with PMF π_X, and let C be any randomizing mechanism, where C_xy is the probability of x being mapped by C into y. We define the quality of service (QoS) of X for C as the average distortion w.r.t. d:

QoS(X, C) = Σ_{x∈X} Σ_{y∈Y} π_X(x) C_xy d(x, y).

Definition 2.4 (Full-support probability distribution). Let μ be a probability distribution defined on the space X. μ is a full-support distribution on X if μ(x) > 0 for every x ∈ X.
Definition 2.5 (Iterative Bayesian update [2]). Let C be a privacy mechanism that locally obfuscates points from the discrete space X to Y, such that C_xy = P[y | x] for all x ∈ X, y ∈ Y. Let X_1, . . ., X_n be i.i.d. random variables on X following some PMF π_X, and let Y_i denote the random variable of the output when X_i is obfuscated with C.

Let y = {y_1, . . ., y_n} be a realisation of {Y_1, . . ., Y_n} and let q be the empirical distribution obtained by counting the frequencies of each y ∈ Y in y. The iterative Bayesian update (IBU) estimates π_X by converging to its maximum likelihood estimate (MLE) with the knowledge of q and C. IBU works as follows:
(1) Start with any full-support PMF p_0 on X.
(2) Iterate p_{t+1}(x) = Σ_{y∈Y} q(y) p_t(x) C_xy / Σ_{x′∈X} p_t(x′) C_x′y.
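The two steps above can be sketched in a few lines; this is our own illustrative code, not the authors' implementation.

```python
import numpy as np

def ibu(q, C, iters=200):
    """Iterative Bayesian update (sketch).

    q : empirical PMF of the noisy observations, shape (m,)
    C : obfuscation channel, C[x, y] = P(report y | true x), shape (n, m)
    Returns an estimate of the original distribution pi_X, shape (n,).
    """
    n = C.shape[0]
    p = np.full(n, 1.0 / n)                 # step (1): any full-support PMF
    for _ in range(iters):
        joint = p[:, None] * C              # p_t(x) * C[x, y]
        post = joint / joint.sum(axis=0)    # posterior P(x | y) under p_t
        p = post @ q                        # step (2): average posterior over q
    return p
```

When the channel is invertible, the iteration converges to the unique MLE; e.g., feeding it the exact noisy distribution of a 2 × 2 channel recovers the true prior.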
The convergence of IBU has been studied in [2,25]. For a given set of observed locations, the limiting estimate π̂_X = lim_{t→∞} p_t is well-defined by the privacy mechanism in use, C, and the empirical distribution of the noisy locations, q. We will functionally denote π̂_X as IBU(q, C).
Next, we recall a generalization of IBU from the literature that we use in this work. Generalized IBU (GIBU) [26] applies IBU in parallel to several empirical distributions derived from the application of (possibly different) obfuscation mechanisms to various sets of samples from the same distribution.

Definition 2.6 (Generalized iterative Bayesian update [26]). Let Y^(1), . . ., Y^(k) be datasets such that, for each j ∈ {1, . . ., k}, Y^(j) consists of i.i.d. samples from the discrete space X following the probability distribution π_X. Let C^(1), . . ., C^(k) be k privacy mechanisms that locally obfuscate points from X to Y, such that the mechanism C^(j) is applied to the dataset Y^(j), and let G = (C^(1), . . ., C^(k)) be referred to as the combined mechanism. GIBU estimates π_X by converging to the maximum likelihood estimate (MLE) of π_X with the knowledge of the noisy data and the obfuscating channels. GIBU works as follows:
(1) Start with any full-support PMF p_0 on X.
(2) At each step, apply the IBU update w.r.t. each pair (C^(j), q^(j)), where q^(j) is the empirical distribution of the obfuscated version of Y^(j), and combine the results. Letting π̂_X be the MLE of the prior obtained with GIBU, we denote it functionally by GIBU((C^(1), Y^(1)), . . ., (C^(k), Y^(k))).

Definition 2.7 (Earth mover's distance [37]). Let μ_1 and μ_2 be PMFs defined over a discrete space of locations X. For a metric d : X² → R≥0, the earth mover's distance (EMD) (a.k.a. the Kantorovich-Rubinstein metric) is defined as

EMD(μ_1, μ_2) = min_{μ ∈ Π(μ_1, μ_2)} Σ_{x,y ∈ X} μ(x, y) d(x, y),

where Π(μ_1, μ_2) is the set of all joint distributions over X² such that, for any μ ∈ Π(μ_1, μ_2), Σ_{y∈X} μ(x_0, y) = μ_1(x_0) and Σ_{x∈X} μ(x, y_0) = μ_2(y_0) for all x_0, y_0 ∈ X. EMD is considered a canonical way to lift a distance on a certain domain to a distance between distributions on the same domain.
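On a finite space, the minimization over couplings is a small linear program. The sketch below (our own illustrative code, using SciPy's LP solver rather than any dedicated optimal-transport library) computes EMD from two PMFs and a ground-distance matrix.

```python
import numpy as np
from scipy.optimize import linprog

def emd(mu1, mu2, D):
    """Earth mover's distance between two PMFs over n points, as the
    transport LP: minimise sum_{x,y} mu(x, y) * D[x, y] subject to the
    row marginals of mu being mu1 and the column marginals being mu2."""
    n = len(mu1)
    c = D.reshape(-1)                       # cost of moving mass x -> y
    A_eq = np.zeros((2 * n, n * n))
    for x in range(n):
        A_eq[x, x * n:(x + 1) * n] = 1.0    # sum_y mu(x, y) = mu1(x)
    for y in range(n):
        A_eq[n + y, y::n] = 1.0             # sum_x mu(x, y) = mu2(y)
    b_eq = np.concatenate([mu1, mu2])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun
```

For two point masses at distance 1, the EMD is 1; for identical distributions, it is 0.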
Definition 2.8 (Statistical utility). Let C be a privacy mechanism that obfuscates data on the discrete space X. Let π_X be the PMF of the original locations and let π̂_X be its estimate by IBU. Then we define the statistical utility of the mechanism C as EMD(π̂_X, π_X).

Optimization of MI and QoS
Definition 2.9 (Blahut-Arimoto algorithm [6,9]). Let X be a random variable on the discrete space X with PMF π_X, and let C(X, Y) be the space of all mechanisms encoding X to Y. For a distortion d : X × Y → R≥0 and a fixed Δ* ∈ R+, we wish to find the mechanism Ĉ ∈ C(X, Y) that minimizes MI given the bound Δ* on the distortion:

Ĉ = argmin_{C ∈ C(X,Y)} { I(X; Y_{X,C}) : E[d(X, Y_{X,C})] ≤ Δ* },

where, for any C ∈ C(X, Y), Y_{X,C} is the random variable on Y denoting the output of the encoding of X. The Blahut-Arimoto algorithm (BA) provides an iterative method to construct Ĉ as follows. Start with any full-support PMF q_0 on Y and any full-support C^(0), and iterate, for a loss parameter β determined by Δ*:

C^(t+1)_xy = q_t(y) e^(−β d(x,y)) / Σ_{y′∈Y} q_t(y′) e^(−β d(x,y′))    (1)

q_{t+1}(y) = Σ_{x∈X} π_X(x) C^(t+1)_xy    (2)
Remark 2. Equations (1) and (2) above define two transformations F : P(X) → C(X, Y) and G : C(X, Y) → P(X), where P(X) is the space of distributions on X, so that C^(t+1) = F(q_t) and q_{t+1} = G(C^(t+1)).

Remark 3. In [20], Csiszár proved the convergence of BA when X is finite. The limit lim_{t→∞} (F ∘ G)^t(C^(0)) is the optimal mechanism Ĉ (parametrized by β), and it is uniquely determined by the prior π_X and by β, for any full-support initialization.

Remark 4. In [42], Oya et al. proved that, when d is the Euclidean metric, the mechanism Ĉ obtained from BA with loss parameter β satisfies 2β-geo-ind.
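The alternation of (1) and (2) can be sketched as follows; this is our own illustrative code under the assumption of a finite grid of locations, not the paper's implementation.

```python
import numpy as np

def blahut_arimoto(pi, D, beta, iters=200):
    """Blahut-Arimoto iteration (sketch).

    pi   : prior PMF on the inputs, shape (n,)
    D    : distortion matrix d(x, y), shape (n, m)
    beta : loss parameter (larger beta = less noise, weaker privacy)
    Returns the channel C[x, y] = P(report y | true x).
    """
    n, m = D.shape
    K = np.exp(-beta * D)                   # exponential kernel e^{-beta d}
    C = np.full((n, m), 1.0 / m)            # any full-support initial channel
    for _ in range(iters):
        q = pi @ C                          # Eq. (2): output marginal q_t
        C = q[None, :] * K                  # Eq. (1): C prop. to q_t(y) e^{-beta d}
        C /= C.sum(axis=1, keepdims=True)
    return C
```

Note that every iterate, and hence the limit, satisfies 2β-geo-ind (Remark 4): the normalization constants of two rows differ by at most a factor e^(β d(x, x′)), and the kernels by at most another such factor.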
In the context of location privacy, as addressed in this work, we obfuscate the original locations to points in the same space; hence, for the rest of the paper, we consider the spaces of the secrets and the noisy locations to be the same, i.e., X = Y.

LOCATION-PRIVACY WITH THE BLAHUT-ARIMOTO ALGORITHM
Definition 2.9 shows that the BA mechanism optimizes between MI and average distortion, which is a standard choice for measuring QoS. Furthermore, Remark 4 formally links the mechanism produced by BA with geo-ind, which is our reference privacy notion.
In this section, we investigate the privacy protection offered by BA beyond geo-ind, study the statistical utility it renders, and compare it with LAP, the canonical mechanism for geo-ind.

Elastic location-privacy with BA
One of the concerns harboured by geo-ind is that it treats the space in a uniform way, thus making isolated locations vulnerable to an attacker who knows the prior distribution. This issue was raised and addressed by Chatzikokolakis et al. in [17], where the authors introduce a variant of LAP based on elastic distinguishability metrics, which they refer to as elastic mechanisms. Such mechanisms obfuscate locations not only by considering the Euclidean distance between them but also by taking into account an abstract attribute of the reported location, called mass, which is a parameter of the definition.
Formally, if R_elas is an elastic mechanism with privacy parameter ε defined on X, then R_elas must satisfy two conditions, (3) and (4), for all x, x′ ∈ X, where q is the probability distribution of the reported locations. Note that Equations 3 and 4 characterize the properties of an elastic mechanism R_elas, but they do not define what R_elas exactly is, as a function. In fact, as a definition, Equation 4 would be circular, since it uses the probability mass q generated by R_elas without knowing what R_elas is. As we will see, BA solves this problem by constructing the mechanism R_elas as a fixpoint of a recursive process starting from a uniform output distribution q. (To be precise, the process is mutually recursive, alternating the generation of a new mechanism and a new output distribution that, in turn, is fed into BA to generate the mechanism at the next step.) R_elas, unlike LAP, protects a point in a densely populated area (e.g., a city) and a geo-spatially isolated point (e.g., an island) differently, by considering not only the ground distance between the true and the reported locations but also the mass of the reported location. The exact mechanism depends, of course, on how we define the notion of mass. A natural way, and the most meaningful from the privacy point of view, is to set the mass of y to be its probability of being reported (from any true location x). Under this definition, the interpretation of (4) is in the spirit of obtaining privacy by ensuring that the set of possible true locations (given the reported one) is large. In other words, given a true location x, we tend to report with higher probability those locations y that are reported with high probability from other locations as well, so that it becomes harder to re-identify x as the original one. Note that this property is not incompatible with the geo-ind guarantee. However, LAP does not provide it.
Obviously, the definition of mass as the probability of being reported would be circular, because it would depend on the mechanism, which in turn is defined in terms of the mass. The authors of [17] do not explain how such a mechanism could be constructed. Fortunately, the following theorem shows that an elastic mechanism of this kind can be constructed using BA. The proof is provided in Appendix A.
Theorem 3.1. The mechanism generated by BA is an elastic location-privacy mechanism.
Note also that there can be many mechanisms satisfying (3) and (4) (also with the mass interpreted as probability). The one produced by BA is the mechanism that offers the best trade-off between QoS and MI among these. Finally, a consequence of the connection with BA is that it provides an understanding of the elastic mechanism in terms of information theory and of the attacker model illustrated in the previous section.
Experimental validation. Having furnished the theoretical foundation, we now empirically validate that BA indeed satisfies the properties of the elastic mechanism, unlike LAP, its state-of-the-art geo-indistinguishable counterpart. We perform experiments using real location data from the Gowalla dataset [19,39]. We consider 10,078 Gowalla check-ins from a central part of Paris bounded by latitudes (48.8286, 48.8798) and longitudes (2.2855, 2.3909), covering an area of 8 km × 6 km discretized with a 16 × 12 grid.
In order to demonstrate the property of an elastic mechanism, we artificially introduced an "island" amidst the locations in Paris by choosing a cell v in a low-density area of the dataset (in the southwest region), setting the probability mass of the cells around v to 0, and dumping this cumulative mass from the surrounding region onto v, ensuring that the sum of the probability masses of all the cells remains 1. We call v a vulnerable location on the map, as it is isolated from the crowded area. To visualize the elastic behaviour of the mechanisms for locations in crowded regions, we consider another cell s in the central part of the map which has a high probability mass and highly populated surroundings; we refer to such a cell s as a strong location on the map. Figure 1 illustrates the selection of the vulnerable and strong locations in the Paris dataset. For the mechanism derived from BA with a loss parameter β, we know, by Remark 4, that the geo-ind parameter ε is 2β, which we use to tune the privacy level of LAP in order to compare the two mechanisms under the same level of geo-ind. Figure 2 illustrates the probability distribution of reporting a privatized point on the map by obfuscating the vulnerable and the strong locations with different levels of geo-ind; we vary the value of ε over 0.4, 1.2, 1.6, and 2.
By comparing with the distribution of the true locations in Paris given by Figure 1, we observe that when the value of ε is low (privacy is high), the reported location under BA is likely to be mapped to a nearby densely populated place. For example, with β = 0.2 (i.e., ε = 0.4), the highest level of privacy considered in the experiments, the location reported by BA will most probably be around the most crowded region of Paris. As ε increases, the location most likely to be reported by BA systematically moves to a densely populated region closer and closer to the true vulnerable location. LAP, on the other hand, always obfuscates every location around its true position on the map; varying the value of ε changes the spread of the distribution around the true location. As explained in the introduction, this can be problematic, as the vulnerable location is known to be isolated and, hence, even being reported somewhere nearby could potentially result in its re-identification.
For example, we would like to highlight the setting ε = 1.6 for the vulnerable location, to show that the distribution of the location reported by LAP lies almost completely around the true vulnerable point, covering an area that is deserted, i.e., where there is no realistic chance of someone being located. Thus, despite providing formal 1.6-geo-ind, LAP fails to protect such a vulnerable location from being potentially identified. BA, on the other hand, does the job quite well, adhering to the principles of the elastic mechanism: it distributes the reported location over the nearby crowded areas, providing a sense of camouflage amidst the many possibilities, in addition to 1.6-geo-ind.
In the case of privatizing the strong location, Figure 2b shows that both BA and LAP behave similarly, concealing the point around its true position. This does not give rise to the same issue as for the vulnerable location because, by definition, the strong location s is already positioned in a highly dense region of the map and, hence, after being privatized, it will still remain among the crowd with high probability. Focusing on the utility of individual users, we note that, due to results on Nash equilibria [41] and Hotelling's spatial competition [28], a huge fraction of the typical points of interest (POIs) like cinemas, theatres, restaurants, retail, etc. lie in crowded areas, syncing with the distribution of the population. Therefore, for an isolated point on the map located in some extremely unpopulated area (e.g., a forest or an island far from the city), the closest POI is usually going to be in the nearest urban region, i.e., a region on the map with a high density of population. Suppose x is one such isolated location and let y_BA and y_LAP be the reported locations for x obfuscated with BA and LAP, respectively. Due to the elastic property of BA, y_BA is likely to be at a crowded location near x, while y_LAP is likely to be around the true location x. Let z_BA and z_LAP be the nearest POIs from the reported locations y_BA and y_LAP, respectively. The most likely scenario is that z_BA and z_LAP are at almost the same place, under the assumption that typical POIs follow the distribution of the crowd; therefore, a vulnerable user has to travel a similar distance from their true position in both cases, except that under LAP the privacy of x will be compromised much more than under BA.

Statistical utility: BA vs LAP
Now we proceed to empirically compare the statistical utility of BA and LAP by performing experiments on the locations obtained from the Gowalla dataset for two different cities: Paris and San Francisco. In addition to the same setting for the Gowalla check-ins in Paris as considered in the experiments of Section 3.1, here we also test 123,025 check-in locations from the Gowalla dataset in a northern part of San Francisco bounded by latitudes (37.7228, 37.7946) and longitudes (-122.5153, -122.3789), covering an area of 12 km × 8 km discretized with a 24 × 17 grid. The locations were privatized with BA and LAP under varying levels of privacy: the loss parameter β for BA ranged from 0.2 to 5.0, which implies that the value of the geo-ind parameter ε ranged from 0.4 (very high level of privacy) to 10.0 (almost no privacy). To account for the randomness in the process of generating the sanitized locations, 5 simulations were run for each value of the privacy parameter for obfuscating every location in both datasets.
Figure 3 reveals that BA possesses a significantly better statistical utility than LAP for high levels of privacy (for β ∈ (0.4, 1.4] and β ∈ (0, 1), i.e., ε up to 2.8 and 2, in the Paris and San Francisco datasets, respectively). As the level of privacy decreases, the EMD of BA becomes worse than that of LAP. We conjecture that this is the price to pay for the added privacy provided by the elasticity of the mechanism. Eventually, the EMD between the true and the estimated PMFs converges to 0 for both mechanisms, as we would expect, fostering the maximum possible statistical utility with practically no privacy guarantee.
Summarizing the results from Sections 3.1 and 3.2, we can establish that:
• in addition to providing a formal geo-ind guarantee, BA gives an LPPM with an elastic distinguishability metric that enhances the privacy of vulnerable locations;
• BA optimizes the trade-off between QoS and MI;
• the statistical utility for high levels of privacy is significantly better for BA than LAP.
Therefore, we conclude that BA is a key contender for providing a comprehensive notion of location privacy while preserving the utility of the data for both the users and the service providers.

DUALITY BETWEEN IBU AND BA
We now explore a relationship between BA and IBU that we find rather intriguing. For a metric space (X, d), let X be a random variable on X with PMF π_X. Recalling the iteration of BA from (1) and (2), and comparing it with the iteration of IBU as in Definition 2.5, we observe that BA is dual to IBU. Indeed, consider an exponential mechanism of the form C(y|x) = c(x) exp{−β d(x, y)}, where c(x) is a normalizing factor (5). Flipping the roles of x and y in (5), and replacing the input distribution π_X with the empirical distribution of the outputs of C, we obtain the iterative step of IBU.
Due to this duality between BA and IBU (illustrated in Figure 4), and taking advantage of the fact that BA converges [20], i.e., the limit of its iterates as k → ∞ exists, we obtain that IBU also converges.

PRIVIC: A PRIVACY-PRESERVING METHOD FOR INCREMENTAL DATA COLLECTION
To ensure that the produced mechanism is truly optimal, BA needs a good approximation of the prior distribution. In the beginning, we cannot assume to have such knowledge, but as the service providers incrementally collect data from their users, we can use these data to refine the estimation of the prior and obtain a better mechanism. These data, however, are obfuscated by the privacy mechanism and, hence, it is not obvious that the estimation of the prior really improves in the process. We show that this is the case and, summarizing all the results obtained for BA so far, we propose a method that enables service providers to incrementally collect data and gradually achieve a high statistical utility with respect to the QoS. We shall refer to our proposed method for PRIVacy-preserving Incremental Collection of location data as PRIVIC.
The goal of PRIVIC is to construct an obfuscation mechanism that guarantees formal geo-ind, acts as an elastic mechanism, and eventually optimizes the trade-off between MI and QoS, while producing, at the same time, a good estimation of the distribution of the data.
We shall consider locations sampled from a finite space X = {x_1, . . ., x_n}. Let the true distribution, or true PMF, on X (from which the users' locations are sampled) be π_X. Note that our method does not assume knowledge of π_X. We assume that the new locations are sampled independently from the previous ones. This hypothesis is reasonable if the collection of the new data is sufficiently separated in time from the previous one; otherwise, we would have a potential correlation between samplings due to the possibility that a user sends repeated check-ins from spatially close locations. In any case, geo-ind, like DP, satisfies the property of sequential compositionality [29], which means that the privacy degradation is under control.
In this work, to achieve geo-ind, we adhere to the Euclidean metric d_E to measure the ground distance between locations.
L ← new noisy locations reported by the users after obfuscating their newly sampled true locations with Ĉ^(t);
q^(t) ← {q(y) : y ∈ X}: empirical PMF obtained from L by the service provider;
π̂_X ← GIBU(Ĉ^(1), q^(1), . . ., Ĉ^(t), q^(t), δ_GIBU);
Ĉ ← BA(π̂_X, q_0, β, δ_BA);
Return: Ĉ, π̂_X
The initial distribution q_0 does not need to be a uniform distribution: any fully-supported distribution suffices for the process to eventually converge to an optimal mechanism. However, starting with a uniform distribution allows us to avoid any bias in the mechanisms produced in the intermediate steps.
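A compact end-to-end sketch of these cycles, under our own simplified reading of the method on a toy one-dimensional grid (`run_ba`, `run_ibu`, the cycle count, sample size, and the fixed iteration budgets in place of the precision thresholds are all illustrative assumptions):

```python
import numpy as np

def run_ba(prior, D, beta, iters=60):
    """Blahut-Arimoto: alternate channel / output-marginal updates."""
    n = len(prior)
    q = np.full(n, 1 / n)
    for _ in range(iters):
        C = q * np.exp(-beta * D)          # C(y|x) proportional to q(y) exp{-beta d(x,y)}
        C /= C.sum(axis=1, keepdims=True)
        q = prior @ C
    return C

def run_ibu(C, r, iters=300):
    """IBU: expectation-maximization against the observed output PMF r."""
    theta = np.full(C.shape[0], 1 / C.shape[0])
    for _ in range(iters):
        post = theta[:, None] * C
        post /= post.sum(axis=0)           # posterior P(x|y)
        theta = post @ r
    return theta

rng = np.random.default_rng(1)
n = 5
D = np.abs(np.subtract.outer(np.arange(n, dtype=float), np.arange(n, dtype=float)))
true_pmf = np.array([0.05, 0.1, 0.5, 0.25, 0.1])
estimate = np.full(n, 1 / n)               # unbiased uniform starting guess
for cycle in range(4):                     # each cycle: build mechanism, collect, de-noise
    C = run_ba(estimate, D, beta=2.0)
    xs = rng.choice(n, size=30000, p=true_pmf)            # users' true locations
    u = rng.random(len(xs))                               # sample reports from the rows of C
    ys = (u[:, None] > np.cumsum(C, axis=1)[xs]).sum(axis=1)
    ys = np.minimum(ys, n - 1)                            # guard against rounding at 1.0
    r = np.bincount(ys, minlength=n) / len(ys)            # empirical PMF of noisy reports
    estimate = run_ibu(C, r)
```

After a few cycles the estimate tracks the true PMF, and the mechanism for the next cycle is built from an increasingly accurate prior.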
Remark 6. We believe that the last step (3) is not really necessary: the combination of all the estimations should already be the MLE of the true distribution, and this is also what we witnessed in the experiments. However, applying this last step allows us to formally prove the convergence to the MLE, using the results for GIBU in [26].
In the practical implementation of PRIVIC, we use the precision parameters δ_BA, δ_IBU, and δ_GIBU to set the thresholds of empirical convergence of BA, IBU, and GIBU, respectively. (Algorithm 3, the iterative Bayesian update (IBU), takes as input a privacy mechanism C, a full-support PMF θ_0, the empirical PMF q from the observed data, and the precision δ_IBU, and outputs the MLE of the true PMF.) Let the privacy mechanism generated this way after N iterations, for fixed parameters q_0, β, δ_BA, δ_IBU, and δ_GIBU, be functionally represented as Ĉ_BA(q_0, β).
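A minimal version of IBU with the empirical-convergence stopping rule can be sketched as follows (our own rendering; the function name, the L1 stopping criterion, and the iteration cap are assumptions, with the paper's Algorithm 3 as the reference):

```python
import numpy as np

def ibu(C, theta0, r, delta_ibu=1e-9, max_iters=100000):
    """Iterate the Bayesian update until the L1 change between successive
    estimates drops below the empirical-convergence precision delta_ibu."""
    theta = theta0.astype(float).copy()
    for _ in range(max_iters):
        post = theta[:, None] * C         # theta(x) C(y|x)
        post /= post.sum(axis=0)          # posterior P(x|y)
        new = post @ r
        if np.abs(new - theta).sum() < delta_ibu:
            break
        theta = new
    return theta

# Sanity check: with a noiseless (identity) channel, the MLE of the true
# PMF is just the empirical PMF of the reports.
r = np.array([0.2, 0.5, 0.3])
theta_hat = ibu(np.eye(3), np.full(3, 1 / 3), r)
```

The same routine, applied with the channels and empirical PMFs of all past cycles, is the building block of GIBU.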
Concerning statistical utility, it is important to ensure that IBU converges to the true distribution. As a matter of fact, IBU always converges to an MLE, but the MLE may not be unique [26]. More precisely, there can be more than one distribution that is the most likely input to the obfuscation mechanism for a given empirical distribution on the noisy data. Thus, even though IBU converges, it may converge to a distribution different from the true one. This is a problem in the method of Oya et al. [43], which computes the obfuscation mechanism via the algorithm of Shokri et al. [48]. The resulting mechanism optimizes the trade-off between distortion and a Bayesian notion of privacy, but may not have a unique MLE, as illustrated in the example below. They probably did not realize the problem because they relied on the flawed results of [2], according to which every mechanism would have a unique MLE.
The following example is a simplified version of the example given in [26] (Sections 3.1 and 3.2), which was aimed at showing the non-uniqueness of the MLE, and the consequent convergence to a wrong distribution, in a more general setting. However, for the scope of our work, a simpler variant suffices.
Example 5.1. Consider three collinear locations a, b, and c, where b lies between a and c at a unit distance from each of them. Assume that the prior distribution on these three locations is uniform and that the constraint on the utility is that the expected distortion should not exceed 2/3. Then a mechanism that optimizes the QoS in the sense of [48] is the one that maps all locations to b. However, this mechanism has no statistical utility, as the reported b's do not provide any information about the original distribution. Indeed, given n obfuscated locations (i.e., n b's), all distributions on a, b, and c of the form (k_a/n, k_b/n, k_c/n) with k_a + k_b + k_c = n have the same likelihood of being the original one.
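This can be checked numerically (our own toy rendering of the example): the channel mapping each of a, b, c deterministically to the middle point b gives every candidate prior exactly the same likelihood for the observations, so the MLE is not unique.

```python
import numpy as np

# Rows correspond to a, b, c; every location deterministically reports b.
C = np.array([[0.0, 1.0, 0.0]] * 3)

def log_likelihood(prior, counts):
    """Log-likelihood of the observed output counts under a candidate prior."""
    out = prior @ C                     # induced distribution on reports
    support = out > 0
    return float(np.sum(counts[support] * np.log(out[support])))

counts = np.array([0, 100, 0])          # 100 reports, all equal to b
lls = [log_likelihood(np.array(p), counts)
       for p in ([1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [1 / 3, 1 / 3, 1 / 3])]
# All three candidate priors attain the same (maximum) likelihood.
```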
Fortunately, our method does not have this problem, because BA produces an invertible mechanism, and invertibility implies the uniqueness of the MLE [26]. In particular, we are now able to show the convergence of PRIVIC as a whole using the results of [26].
Theorem 5.1. For any t ≥ 1, the mechanism generated by BA over X at the t'th iteration, seen as a stochastic matrix, is invertible.
Proof. In Appendix A. □

Theorem 5.2. PRIVIC converges to the unique MLE of the true distribution.

Proof. In Appendix A. □
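Theorem 5.1 can also be checked numerically on toy instances; the sketch below (our own, with an assumed `ba_channel` helper and an illustrative one-dimensional grid) builds the BA matrix and verifies that it has full rank:

```python
import numpy as np

def ba_channel(prior, D, beta, iters=100):
    """Channel produced by iterating BA to (empirical) convergence."""
    n = len(prior)
    q = np.full(n, 1 / n)
    for _ in range(iters):
        C = q * np.exp(-beta * D)         # C(y|x) proportional to q(y) exp{-beta d(x,y)}
        C /= C.sum(axis=1, keepdims=True)
        q = prior @ C
    return C

n = 6
D = np.abs(np.subtract.outer(np.arange(n, dtype=float), np.arange(n, dtype=float)))
C = ba_channel(np.full(n, 1 / n), D, beta=2.0)
rank = np.linalg.matrix_rank(C)           # full rank: invertible, hence a unique MLE
```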
To evaluate the statistical utility of Ĉ_BA(q_0, β) (cf. Section 6), we measure the EMD between the true and the estimated PMFs at the end of N iterations of PRIVIC. Thus, the quantity EMD(π̂_X, π_X) parameterizes the utility of Ĉ_BA(q_0, β) for the service providers. We use the same Euclidean distance as the underlying metric for computing both the EMD and the average distortion; this consistency ties together the service providers' notion of utility and the users' notion of QoS.
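The EMD with Euclidean ground distance can be computed as the classical transportation linear program; the sketch below (our own illustration, assuming `scipy` is available, on a tiny point set) makes the utility metric concrete:

```python
import numpy as np
from scipy.optimize import linprog

def emd(p, q, coords):
    """Earth mover's distance between PMFs p and q on points `coords`,
    with Euclidean ground distance, via the transportation LP."""
    n = len(p)
    D = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    A_eq = []
    for i in range(n):                     # row-marginal constraints: sum_j f_ij = p_i
        row = np.zeros((n, n))
        row[i, :] = 1.0
        A_eq.append(row.ravel())
    for j in range(n):                     # column-marginal constraints: sum_i f_ij = q_j
        col = np.zeros((n, n))
        col[:, j] = 1.0
        A_eq.append(col.ravel())
    res = linprog(D.ravel(), A_eq=np.array(A_eq),
                  b_eq=np.concatenate([p, q]), bounds=(0, None), method="highs")
    return res.fun

# Moving all mass from the first to the last of three collinear unit-spaced
# points costs a distance of 2.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
val = emd(np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0]), coords)
```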

EXPERIMENTAL ANALYSIS OF PRIVIC
In this section, we describe the empirical results obtained by carrying out experiments to illustrate and validate the working of our proposed method. Standard Python packages were used to run the experiments in a MacOS Ventura 13.2.1 environment with an Intel Core i9 processor and 32 GB of RAM. As in the previous experiments comparing the statistical utilities of BA and LAP (Section 3.2), we use real locations from the same regions in Paris and San Francisco from the Gowalla dataset [19, 39]; in particular, we consider Gowalla check-ins from a northern part of San Francisco bounded by latitudes (37.7228, 37.7946) and longitudes (-122.5153, -122.3789). We implemented PRIVIC on the locations from Paris and San Francisco separately to judge its performance on real data with very different priors. In both cases, we ran our mechanism until it empirically converged. 15 cycles of PRIVIC were required for the Paris dataset, where each cycle comprised 8 iterations of BA until it converged to generate the privacy mechanism and 10 iterations of IBU until it converged to the MLE of the prior. For the San Francisco dataset, PRIVIC needed 8 cycles to converge, with 5 iterations each of BA and IBU in every cycle. The complexities and the run-times of BA and IBU are summarised in Table 2. In both cases, we assigned the loss parameter β, signifying the QoS of the users, the values 0.5 and 1. This was done to test the performance of PRIVIC in estimating the true PMF under two different levels of privacy. Each experiment was run for 5 rounds of simulation to account for the randomness of the sampling and obfuscation. In each cycle of PRIVIC, across all settings, BA was initiated with the uniform marginal q_0 and a uniform distribution over the space of locations as the "starting guess" of the true distribution.
With β = 1, BA produces a geo-indistinguishable mechanism that injects less local noise than the one obtained with β = 0.5. As a result, PRIVIC obtains a more accurate estimate of the true PMF for β = 1 than for β = 0.5. However, in both cases, the EMD between the true and the estimated distributions is very low, indicating that the PRIVIC mechanism preserves a good level of statistical utility. Moreover, for both Paris and San Francisco, PRIVIC significantly improves its estimation of the true PMF with every iteration until it converges to the MLE. Comparing Figures 7 and 6b, we see that the estimations by IBU under PRIVIC of the true distributions of the locations in Paris and San Francisco are fairly accurate for both settings of the loss parameter. However, as we would anticipate, the statistical utility for β = 1 is better than that for β = 0.5. We now shift our attention to the performance of PRIVIC in preserving statistical utility and to its long-term behaviour on the two datasets. Figure 8 shows the EMD between the true distribution of the locations in Paris and its estimate by IBU under PRIVIC in each of its 15 cycles, under the two privacy settings (β = 0.5, 1). A crucial observation here is that the EMD between the true and the estimated PMFs decreases with the number of iterations and finally converges, implying that the estimation given by PRIVIC improves at the end of each cycle and eventually converges to the MLE of the prior of the noisy locations, giving the estimate of the true PMF. This, empirically, suggests the convergence of the entire method. This is a major difference from the work of [43] which, as we pointed out before, may produce an LPPM that is optimal according to the standards set by Shokri et al.
in [48], but for which the EM method used to estimate the true distribution fails to converge, as illustrated in Example 5.1. We observe a very similar trend for the San Francisco dataset. Figure 9 shows the statistical utility of the mechanism generated by PRIVIC in each of its 8 cycles for β = 0.5 and β = 1. The explicit values of the EMD between the true and the estimated PMFs on the location data from Paris and San Francisco, for both settings of the loss parameter, can be found in Tables 3 and 4 in Appendix B.
In the next part of the experiments, we set out to dissect the trend of the statistical utility of PRIVIC w.r.t. the level of geo-ind it guarantees. We recall that the higher the value of β, the less local noise is injected into the data and, hence, the better the statistical utility, consistently with our observations in Figure 7. We continue working with the location data from Paris and San Francisco obtained from the Gowalla dataset, in the same framework as described before. We let β take the values 0.1, 0.3, 0.5, 0.7, 0.9, 1, and for each value of the loss parameter we run PRIVIC on both datasets using the same number of iterations as in the previous experiments. We adhere to 5 rounds of simulation for each β to account for the randomness generated in the obfuscation process. Figure 10 shows that the difference between the true and the estimated PMFs under PRIVIC starts by sharply decreasing and then eventually stabilizes as the value of the loss parameter increases. In other words, as the intensity of the local noise decreases, we end up estimating the unique MLE of the original distribution while optimizing MI and the users' QoS. Both location datasets result in a Pareto curve showing a similar trend, depicting an improvement of the estimated PMF until it converges to the true distribution. This observation complements the Pareto-optimality of MI with the maximum average distortion as studied in rate-distortion theory [46]; thus, we empirically weave together the two ends of utility with the information-theoretic notion of privacy under PRIVIC.
Discussion. As a justification for the applicability of our method, in a setting where service providers periodically collect location data from clients, it is reasonable to assume that, over time, they would like to maximize their utility by accurately approximating the true distribution of the population, in order to improve their service in various aspects (crowd management, security enhancement, WLAN hotspot positioning, etc.). BA, in addition to guaranteeing geo-ind, acts as an elastic location-privacy mechanism and optimizes the trade-off between MI and the data owners' QoS when initiated with the true prior. Therefore, as every iteration of PRIVIC improves the estimation of the original distribution (as seen in Figures 3a and 3b), which is then used as the starting distribution in the next cycle, the overall privacy protection and its trade-off with the users' QoS also improve, motivating both the users and the service providers to comply with PRIVIC in their best interests and, in turn, engaging them in a positive feedback loop that maximizes the corresponding privacy and utility goals.

VULNERABILITY OF PRIVIC
In this section, we illustrate a potential vulnerability of PRIVIC when a subset of colluding users (adversarial users) intentionally deviates from the correct use of the protocol.
The attack consists in falsely reporting locations in order to alter the estimation of the true distribution and, consequently, the obfuscation mechanism produced by BA. Specifically, we study two cases: (i) adversaries reporting a crowded location (strong location), and (ii) adversaries reporting an isolated location (vulnerable location).
We used the real locations from the Paris dataset with the geo-spatially isolated "island" (as illustrated in Figure 1 in Section 3), with the strong and the vulnerable locations on the map denoted by points B and A in the figure, respectively. We performed the experiments with two different levels of formal geo-indistinguishability (ε = 0.8 and ε = 1.2), considering different fractions of "adversarial data submissions" (i.e., adversarial users falsely reporting their locations to compromise the privacy of other users) in each case.
In both cases (i) and (ii), we observed that as the fraction of adversaries increases, the probability mass assigned by BA (used to obfuscate the corresponding points locally) becomes higher in and around the corresponding reported points; however, the impact on privacy differs across the two settings. For (i), the obfuscation distribution is weighted heavily around the true crowded location (point B) by both BA and LAP (as illustrated by Figures 2a and 2b) even without any adversarial users, and this trend continues under the different levels of adversaries we considered.
Case (ii), however, represents a much more serious attack. As the number of adversarial users increases, BA and, in turn, PRIVIC become less able to protect the privacy of honest users who are genuinely located in an isolated location on the map; BA and PRIVIC start behaving more like LAP. In particular, Figures 11a and 11b illustrate that the obfuscation distribution generated by BA satisfying geo-ind with ε = 0.8 and ε = 1.2, respectively, for the (non-adversarial) users located at point A assigns more and more weight to and around point A, which, as a result, makes them more and more identifiable. This evaluation of the vulnerability of PRIVIC under adversarial data submission essentially exposes a weakness of the elastic distinguishability metric. We plan to address this aspect and make PRIVIC more robust against adversarial users in future work.
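The attack in case (ii) can be reproduced in miniature; in the sketch below (our own toy setting: the far-away cell, the `ba_channel` helper, and the priors are illustrative assumptions), injecting false reports at the isolated cell inflates the probability that BA lets the isolated cell report itself:

```python
import numpy as np

def ba_channel(prior, D, beta, iters=100):
    """Channel produced by iterating BA to (empirical) convergence."""
    n = len(prior)
    q = np.full(n, 1 / n)
    for _ in range(iters):
        C = q * np.exp(-beta * D)
        C /= C.sum(axis=1, keepdims=True)
        q = prior @ C
    return C

n, isolated = 6, 5
D = np.abs(np.subtract.outer(np.arange(n, dtype=float), np.arange(n, dtype=float)))
D[isolated, :] += 3.0                      # push the last cell far from the rest
D[:, isolated] += 3.0
np.fill_diagonal(D, 0.0)
honest = np.array([0.24, 0.24, 0.24, 0.24, 0.02, 0.02])

self_report = {}
for frac in (0.0, 0.5):                    # fraction of adversarial reports
    poisoned = (1 - frac) * honest + frac * np.eye(n)[isolated]
    C = ba_channel(poisoned, D, beta=1.0)
    self_report[frac] = C[isolated, isolated]
# With more adversarial mass at the isolated cell, BA concentrates there,
# eroding the elastic protection of honest users at that location.
```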
for the data consumers. Specifically, we have proposed the Blahut-Arimoto algorithm as a location-privacy mechanism, showing its extensive privacy-preserving properties and its advantages over the state-of-the-art Laplace mechanism for geo-ind. Further, we have exhibited its duality with the iterative Bayesian update and exploited this connection to present an iterative method (PRIVIC) for the incremental collection of location data with formal guarantees of geo-ind and an elastic distinguishability metric, while optimizing the QoS of the users and their privacy from an information-theoretic perspective. Moreover, PRIVIC efficiently estimates the MLE of the distribution of the original data and, thus, yields a high statistical utility for the service providers. Finally, we have illustrated the convergence and the general functioning of PRIVIC with experiments on real location datasets. We believe that our results can be easily extended to other kinds of data, including high-dimensional data, and to other notions of distortion measures, since the analysis carried out in this paper does not depend on the notion of distance used.

The residual entropy of X given Y is defined as H(X|Y) = Σ_{y∈Y} p_Y(y) H(X|Y = y) = −Σ_{y∈Y} p_Y(y) Σ_{x∈X} p_{X|Y}(x|y) log p_{X|Y}(x|y), and, finally, the mutual information (MI) is given by I(X; Y) = H(X) − H(X|Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) log [p(x, y) / (p_X(x) p_Y(y))]. We write C^(t)(y|x) = P[y|x] for all x ∈ X, y ∈ Y, and t ∈ {1, . . ., N}, and denote by Y^(t) the random variable of the output when X^(t) is obfuscated with C^(t), for every t ∈ {1, . . ., N}.
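The mutual information appearing in these definitions can be computed directly from a prior and a channel; the following sketch (our own helper, illustrated on two extreme channels) is a minimal example:

```python
import numpy as np

def mutual_information(p_x, C):
    """I(X;Y) in bits, from the prior p_x and the channel C(y|x)."""
    joint = p_x[:, None] * C                  # p(x, y) = p(x) C(y|x)
    p_y = joint.sum(axis=0)
    mask = joint > 0                          # 0 log 0 = 0 convention
    return float(np.sum(joint[mask] *
                        np.log2(joint[mask] / (p_x[:, None] * p_y[None, :])[mask])))

p = np.array([0.5, 0.5])
mi_id = mutual_information(p, np.eye(2))          # noiseless channel: leaks 1 bit
mi_noise = mutual_information(p, np.full((2, 2), 0.5))  # pure noise: leaks nothing
```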

Figure 2: Distribution of privatizing the vulnerable and the strong locations for different levels of privacy. Top-down, the rows illustrate the results for ε = 0.4, 1.2, 1.6, 2, respectively.

Figure 1: Gowalla check-in locations in Paris with an artificially planted vulnerable point, A, in isolation, and a strong point, B, in a crowded area.

Figure 3: Statistical utility in terms of earth mover's distance (EMD) between the true and the estimated distributions for locations in Paris and San Francisco, under BA and LAP.

Figure 4: Illustration of the duality between BA and IBU.

Figure 5: Illustration of the iterative process of PRIVIC

Figure 6: (a) visualizes the original locations of the Gowalla dataset in Paris and San Francisco. (b) illustrates a heatmap representation of the locations in the two cities to capture the distribution of the data.

Figure 7: Visualization of the estimated true distribution of the locations in Paris ((a) and (b)) and San Francisco ((c) and (d)) by PRIVIC after its convergence; the first column is for β = 0.5 and the second column is for β = 1.

Figure 8: (a) and (b) show the EMD between the true PMF of the Paris locations and its estimation by PRIVIC in each of its cycles for β = 0.5 and β = 1, respectively.

Figure 9: (a) and (b) show the EMD between the true PMF of the San Francisco locations and its estimation by PRIVIC in each of its cycles for β = 0.5 and β = 1, respectively.

Figure 10: (a) and (b) illustrate the EMD between the true and the estimated distributions of the locations in Paris and San Francisco, respectively, after the empirical convergence of PRIVIC for the different values of the loss parameter β.

Figure 11: Effect on the privacy provided by BA when obfuscating the geo-spatially isolated location (location A in Figure 1) for different fractions of adversarial users who intentionally report their locations falsely, under two different levels of formal geo-ind guarantees by BA.

Table 2: Run-time and complexity of BA and IBU in each cycle of PRIVIC.