Differential Privacy for Black-Box Statistical Analyses

We formalize a notion of a privacy wrapper, defined as an algorithm that can take an arbitrary and untrusted script and produce an output with differential privacy guarantees. Our novel privacy wrapper, named TAHOE, incorporates two design ideas: a type of stability under subsetting, and randomization over subset size. We show that TAHOE imposes differential privacy for every possible script. When the data alphabet is finite and small enough, TAHOE can be practically run on a single computer. Performance simulations show that TAHOE has greater accuracy than a benchmark algorithm based on a subsample-and-aggregate approach for certain scenarios and parameter values.


INTRODUCTION
Consider a scenario involving a holder of personal data and a researcher. The researcher has written an analysis script and would like to apply it to the data in order to gain insight. The data holder is concerned with the privacy of individuals in the data, and may not necessarily trust the researcher. Furthermore, the data holder may have contractual obligations that prevent them from sharing information about individuals in the data. The researcher gives the data holder their analysis script, but the data holder does not have the resources or expertise to analyze the code for privacy threats. Instead, the data holder desires a simple way to add privacy guarantees to the researcher's script, without looking inside it, so that the output can be safely returned. This motivates the central question of this study: What privacy guarantees can be added to a statistical analysis script from an untrusted source, treating it in a black-box fashion? Although many organizations are interested in increasing researcher access to data, the costs given current technology can be considerable. For example, the US Census Bureau has a program to allow researchers to interact with raw data. To maintain privacy, however, the Bureau maintains a set of secure locations around the country and puts all applicants through a background check. In another example, Facebook and Social Science One teamed up to make data about shared URLs available to researchers using differential privacy [18]. However, to work with the most sensitive data, researchers must be pre-approved by a university review board, pass an application process, and undergo monitoring, even at the level of individual keystrokes. In contrast with existing approaches, the goal of this study is to enable a researcher to work with private data in an automatic fashion, without the need for screening procedures to establish trust. We will refer to a system that achieves this as a privacy wrapper. 
Akin to a function wrapper in programming, the idea is to write an algorithm that mediates all interaction with the researcher's script, producing output that is based on the behavior of the script, while also yielding strong privacy guarantees.

Considerations for Untrusted Code
Running code from untrusted sources introduces special challenges to the data holder. To borrow terms from the security literature, some researchers that submit scripts may be honest, meaning they abide by the intended use of the system. Others may be considered malicious, meaning that their true goal is to extract private information from the data. A lack of trust suggests that all researchers must be considered potentially malicious.
A malicious researcher has a variety of techniques to extract private information from a dataset. Suppose a researcher is looking for information about a target individual, t. The researcher could write a script that hides t's income in the trailing digits of the output. They could code the script so the sum of the digits in the output is even if t is in the data, and odd otherwise. They could make the script deliberately fail if a certain medical diagnosis appears in t's records. Providing further challenges, a malicious researcher may obfuscate their code, making it difficult to determine its true purpose. Obfuscation may be performed by hand, or a variety of automated obfuscation tools may be used by researchers without special expertise [2,20].
Following common practice, the data holder might hope to apply differential privacy by adding noise related to the researcher script's sensitivity [11]. Informally, sensitivity refers to how much the output of a script can change, if one were to change a single row of the data. Unfortunately, determining the sensitivity of a researcher script may be difficult, or even impossible.
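As background, the sensitivity-based approach referenced here can be sketched in a few lines. The following is a minimal, illustrative Laplace mechanism, assuming the sensitivity is known in advance, which is precisely the assumption that fails for a black-box script:

```python
import random

def laplace_mechanism(value, sensitivity, epsilon):
    """Add Laplace(sensitivity / epsilon) noise to a statistic whose
    sensitivity is known. The difference of two exponential draws with
    mean `scale` is Laplace-distributed with that scale."""
    scale = sensitivity / epsilon
    return value + random.expovariate(1 / scale) - random.expovariate(1 / scale)

# Example: a counting query has sensitivity 1, since changing one row
# changes the count by at most 1.
noisy_count = laplace_mechanism(42, sensitivity=1.0, epsilon=0.5)
```

With ε = 0.5 the noise has scale 2, so the reported count is typically within a few units of the true value. The entire guarantee hinges on the `sensitivity` argument being a correct bound for the script's behavior.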
For an instructive example, again suppose that an adversary is interested in a specific target, t, who may or may not be in the data. The adversary writes a script that outputs 0 unless t is in the data, in which case the script outputs some large value M ≫ 1. If t is not in the data, there may be no way to detect that the script is capable of outputting any value other than 0, especially if the script is obfuscated. The adversary may thus hope to trick the data holder into assuming the script has low sensitivity, leading them to add a small amount of noise.
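To make the attack concrete, here is a hypothetical sketch of such a script; the target record and the constant M are illustrative assumptions, not details from the paper:

```python
# Hypothetical adversarial script: outputs 0 unless a specific target record
# is present, in which case it outputs a large value M. On any dataset
# lacking the target, the script appears to have zero sensitivity.
TARGET = "alice,1984-03-07"  # illustrative identifying record
M = 10 ** 6

def adversarial_script(dataset):
    return float(M) if TARGET in dataset else 0.0

print(adversarial_script(["bob,1990-01-01", "carol,1975-12-31"]))  # 0.0
print(adversarial_script(["bob,1990-01-01", TARGET]))              # 1000000.0
```

Testing the script on data without the target reveals nothing unusual, which is why a data holder probing the code empirically can be misled.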
There is also a theoretical reason to believe that determining the sensitivity of a script is difficult. In the language of the theory of computation, the property of meeting a sensitivity bound is semantic and non-trivial. Rice's theorem tells us that the problem of deciding if arbitrary code meets such properties is undecidable [27]. This suggests that a traditional strategy of adding noise based on sensitivity cannot work in all cases.

Motivating a Novel Privacy Wrapper
To overcome the challenges of untrusted code, this study introduces a novel privacy wrapper, called TAHOE. TAHOE treats the researcher script as a black box and uses subsets of data to ensure a type of sensitivity bound. It has the following properties: (1) Privacy: Given any researcher script, the mechanism generated by the wrapper meets approximate differential privacy.
(2) Accuracy: The output of the wrapper is related to the researcher script in the sense that it is found by adding (predetermined) noise to the output of the script on some subset of data. (3) Flexibility: TAHOE places no limitation on the code written by a researcher. Any script that returns a real vector, including those that may terminate without returning an output, can be used.
A disadvantage of our algorithm is that the runtime can be prohibitively large, depending on the privacy parameters chosen. As we will show in Section 6, TAHOE can be optimized to run efficiently for the special case of a finite data alphabet.
We will describe TAHOE formally in Section 4. Its design is motivated by two central ideas, which we present at an intuitive level.
1.2.1 Remove influential individuals. An individual may be called influential, roughly speaking, if removing them causes a large change in the output of a script. For example, suppose that a researcher submits a script to compute the mean age of participants in a medical trial. If a target individual happens to be very young, an adversary may be able to distinguish whether they are in the data or not by observing the mean. On the other hand, if a privacy wrapper removes the influential individual before running the script, the presence of the individual can no longer be determined in such a direct manner.
Based on this observation, we consider looking for a subset of data such that removing further individuals does not change the output very much. This intuition will be formalized in our notion of stable subsets.

Definition 1. (Informal Version of Definition 7) Given dataset x, researcher script f, and threshold τ ∈ R, a subset S ⊆ x is stable if, for any S_1, S_2 ⊆ S of some minimum size, f returns a real vector for both S_1 and S_2, and ||f(S_1) − f(S_2)||_1 ≤ τ.
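For intuition, the informal definition can be checked by brute force on toy one-dimensional data. This is a sketch for illustration only, not part of TAHOE; `None` stands in for the non-response symbol ⊥, and the minimum size and threshold are illustrative:

```python
from itertools import combinations

def is_stable(subset, f, tau, min_size):
    """Brute-force check of the informal stability definition: every
    sub-multiset of `subset` with at least `min_size` elements must map to
    a real output, and all such outputs must lie within distance tau of
    each other. Exponential cost; for illustration only."""
    outputs = []
    for r in range(min_size, len(subset) + 1):
        for s in combinations(subset, r):
            y = f(list(s))
            if y is None:  # None plays the role of non-response ⊥
                return False
            outputs.append(y)
    return all(abs(a - b) <= tau for a in outputs for b in outputs)

# A mean script is stable on tightly clustered data, but not around an outlier.
mean = lambda s: sum(s) / len(s)
print(is_stable([10, 11, 12], mean, tau=1.0, min_size=2))  # True
print(is_stable([10, 11, 90], mean, tau=1.0, min_size=2))  # False
```

The second call fails because removing the outlier 90 shifts the mean far more than the threshold allows, matching the "influential individual" intuition above.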
A formal definition of this property is given in Section 4. Based on our intuition, we search for algorithms that run the script on a stable subset and add noise to protect privacy.

1.2.2 Randomize the number of individuals removed.
A further problem arises when basing a privacy wrapper on stable subsets: the existence of a stable subset of a given size can itself reveal information about individuals in the data. Suppose that a wrapper was programmed to find a stable subset of some fixed size m, and release some output if it exists. If a stable subset of size m does not exist, the system terminates without returning a value, denoted by the non-response symbol ⊥. A malicious researcher could exploit this wrapper design to learn whether a target individual, t, is inside the data. An example of a script that does this is given in Algorithm 1.

Algorithm 1: Exploiting fixed m through non-response
Input: • Dataset S ⊆ x

If t ∈ x, the script will output 1 for any subset of size m that contains t. Moreover, any such subset is stable, because if any rows are removed, the size will be less than m, so the script will again return 1. On the other hand, if t ∉ x, the script will return ⊥ for any subset of size m, so no stable subsets of size m exist. Thus, a stable subset of size m exists if and only if t ∈ x. Therefore the adversary will know whether the target is in the data, based on whether or not the wrapper returns an output.
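The attack script can be sketched as follows. This is a hedged reconstruction based on the surrounding explanation, with `None` standing in for ⊥ and the names chosen for illustration:

```python
# Sketch of the Algorithm 1 attack: return 1 on any input smaller than the
# fixed threshold m, or on any input containing the target; otherwise
# signal non-response.
def exploit_script(subset, target, m):
    if target in subset or len(subset) < m:
        return 1.0
    return None  # non-response ⊥

# If the target is present, a size-m subset containing it returns 1, as does
# every smaller subset, so the subset is stable. If the target is absent,
# every size-m subset returns non-response, so no stable size-m subset exists.
print(exploit_script(["t", "a", "b"], "t", m=3))     # 1.0
print(exploit_script(["a", "b", "c"], "t", m=3))     # None
```

The wrapper's response/non-response behavior then leaks exactly one bit: whether the target is in the data.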
To defend against attacks like this, our second idea is to randomize over the size of subsets we consider, m. Intuitively, by making the number of individuals we remove random, we may conceal how many individuals are removed because they are influential.
In Section 4, we define a distribution P, used to select a subset size m. The size ranges from n − k to n, where n is the size of the dataset and k is a parameter representing the maximum number of individuals removed. The specific form of P is chosen to optimize privacy parameters. It is necessary that the internal parameter m is kept secret in order for our privacy guarantees to hold.

Summary of Results
Our algorithm TAHOE (Trim And withHold Or Execute), which is formally defined in Section 4, combines the two previous ideas. At a high level, TAHOE begins by selecting a random size m from distribution P. TAHOE then checks if any stable subsets of size m exist for a dataset x, which can be viewed as trimming influential individuals from the data. If no stable subset of size m exists, TAHOE withholds from responding, represented by non-response ⊥. Otherwise, TAHOE selects a stable subset of size m uniformly at random and executes the researcher's script on it, with the addition of Laplace noise.
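The trim/withhold/execute control flow can be sketched as below. This is an illustration of the high-level steps only, not Algorithm 2; the distribution P and the stable-subset search are abstracted behind hypothetical helper callables:

```python
import random

def tahoe_sketch(dataset, script, sample_size, find_stable_subsets, b):
    """Illustrative control flow (hypothetical helpers, not Algorithm 2):
    sample_size() draws m from the distribution P and must stay secret;
    find_stable_subsets(dataset, m) returns the stable subsets of size m."""
    m = sample_size()
    candidates = find_stable_subsets(dataset, m)
    if not candidates:
        return None                      # withhold: non-response ⊥
    subset = random.choice(candidates)   # uniform over stable subsets
    # Laplace(0, b) noise as a difference of two exponential draws.
    noise = random.expovariate(1 / b) - random.expovariate(1 / b)
    return script(subset) + noise

# With no stable subsets of the chosen size, the sketch withholds:
mean = lambda s: sum(s) / len(s)
print(tahoe_sketch([1, 2, 3], mean, lambda: 2, lambda d, m: [], b=1.0))  # None
```

All of the real work, and the computational cost discussed in Section 6, lives in the stable-subset search that this sketch leaves abstract.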
We prove in Theorem 1 that TAHOE imposes (ε, δ)-differential privacy. More precisely, fixing any researcher script f, the resulting behavior of TAHOE can be viewed as a function from the set of possible datasets to distributions over possible outputs. The differential privacy bound applies to this function, and it holds no matter what script is chosen. Even if every individual is highly influential with respect to script f, differential privacy continues to hold, though the lack of stable subsets implies that TAHOE will always return ⊥ in this case.
Without restrictions on the set of possible data entries D, TAHOE is too computationally expensive to be practical. We provide a big-O bound in Proposition 5, and argue that (when maintaining δ < 1/n), the runtime is superpolynomial in the size of the dataset.
By contrast, we consider scenarios in which the data alphabet has finite size, |D| = d. In this case, we describe optimizations that allow TAHOE to run efficiently. Proposition 6 bounds the resulting runtime as ε, δ → 0 in terms of n, d, ln(δ^{-1}), and a bound r on the runtime of f. In Section 7 we describe the results of experiments involving finite data alphabets, in which TAHOE is able to run on a single computer.
In general, measuring the accuracy of TAHOE's output is difficult, because its behavior depends crucially on what is in the script f. Nevertheless, some guarantees can be derived for special cases. Theorem 2 shows that when f returns a statistically consistent estimate for a population parameter, and TAHOE is configured to never return ⊥, TAHOE's output will be consistent as well.
To better understand the accuracy of TAHOE's output, we perform a set of simulation exercises in Section 7. Each exercise is based around an honest researcher that wishes to perform a stylized analysis task involving a finite data alphabet. As a benchmark for comparison, we also simulate the same tasks under the GUPT system, which is another algorithm that can fill the role of a privacy wrapper [23]. While these systems are very different from each other, we establish guidelines to make our comparison as fair as possible.
The first task we consider is generating a histogram showing the proportion of each possible data entry. In our experiments, we find that GUPT outperforms TAHOE for small datasets. However, above a certain data size, TAHOE's output becomes more accurate, reflecting favorable scaling properties. TAHOE also performs relatively well as the size of the alphabet increases, but less well as ε approaches 0.
The second task we consider is computing the sample entropy of the data. We once again find that GUPT outperforms TAHOE for small datasets. TAHOE does gain an advantage as the dataset grows large, but it takes a much larger dataset (above 10^6 rows) before TAHOE becomes more accurate than GUPT. We argue that this is related to the fact that entropy is a one-dimensional quantity, and to the shape of the entropy function.

RELATED WORKS
Differential privacy was introduced by Dwork et al. as a rigorous standard for mechanisms that compute real-valued statistics from personal data [8,9,11]. The authors pioneered privacy analysis based on global sensitivity, which is defined as the maximal change in a statistic resulting from a one-row change to any dataset. Subsequent papers developed variants of the definition of differential privacy [3,4,6,13,22] whose analyses leverage global sensitivity. A number of studies have since developed differentially private mechanisms leveraging local sensitivity, which is often much smaller than global sensitivity [15,24,31]. Such approaches cannot immediately be applied to black-box researcher scripts, as neither local nor global sensitivity can be estimated.
Within the differential privacy literature, three high-level strategies can be applied to black boxes. We will refer to these as subsample and aggregate, Lipschitz extension, and restricted language.
In the subsample and aggregate approach, a dataset is divided into (small) subsets and the researcher script is run on each piece. The results are then aggregated together in some way to yield a final answer. Privacy protection emerges from the fact that a single individual can affect only the output resulting from a single subset (or a small fraction of subsets). Nissim, Raskhodnikova, and Smith were the first to outline this strategy, providing an example using the median as the aggregation function with noise calibrated to smooth sensitivity [24]. GUPT is a similar system, in which the mean is used as an aggregation function [23]. A feature of both of these systems is that a bound is required on the script output. As an alternative, a script can be run on subsamples, then the median can be released using the Propose-Test-Release framework of Dwork and Lei [10]. Such a system does not require any bounds on the script output, but there is always some probability that the script returns a non-response character ⊥ instead of returning an output. The privacy wrapper we propose similarly requires no bounds on script output, and returns ⊥ in some cases. In cases when the output of a script is categorical instead of metric, aggregation can be performed through noisy voting [16,26]. Compared to this lineage, our privacy wrapper also examines subsets of data, but does not require an aggregation step. Instead, our work is focused on finding subsets of data with favorable privacy characteristics prior to computing statistics.
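A minimal sketch of the subsample-and-aggregate strategy with mean aggregation, in the spirit of GUPT [23] but not its exact algorithm, looks like this; the output range and block count are caller-supplied assumptions:

```python
import random

def subsample_and_aggregate(dataset, script, num_blocks, output_range, epsilon):
    """Partition the data, run the script on each block, average, and add
    Laplace noise. A single individual affects only one block, so the mean
    shifts by at most (hi - lo) / num_blocks, which calibrates the noise."""
    random.shuffle(dataset)
    blocks = [dataset[i::num_blocks] for i in range(num_blocks)]
    lo, hi = output_range                                    # required bound
    answers = [min(max(script(blk), lo), hi) for blk in blocks]  # clamp
    scale = (hi - lo) / num_blocks / epsilon
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return sum(answers) / num_blocks + noise

data = [5.0] * 100
print(subsample_and_aggregate(data, lambda blk: sum(blk) / len(blk),
                              10, (0.0, 10.0), epsilon=1.0))
```

Note the required output bound `(lo, hi)`, which is the feature of this lineage the text highlights; TAHOE dispenses with it at the cost of sometimes returning ⊥.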
In the Lipschitz extension approach, a researcher script is replaced by an approximating function that has low global sensitivity but matches the output of the script on at least one reference dataset. The resulting function is known as a Lipschitz extension, and may be useful if its output matches the original script on typical datasets. Furthermore, the low global sensitivity means that a small amount of noise is sufficient to impose differential privacy. This approach was pioneered by Jha and Raskhodnikova, who developed algorithms for constructing Lipschitz extensions when each data entry takes on a finite set of values [14]. The original script is run on possible databases in a predetermined order, and the output is used to calibrate the extension. This approach was extended by Cummings and Durfee, who presented algorithms for infinite database domains [5]. In their system, a Lipschitz extension is constructed by first running the source script on the empty database. The output for any other database is found recursively, based on the output of its subsets. A theme that emerges from this work is that a Lipschitz extension is accurate when the output of the original script on the starting reference dataset is close to the output on the actual data. Another feature is that Lipschitz extensions take exponential time to compute, unless the researcher script is a white box with additional structure that can be exploited.
Similarly to the above work, our privacy wrapper leverages subsetting to avoid searching through the entire space of possible datasets. However, we only consider subsets above a minimum size. As long as the researcher script has low sensitivity over this local domain, our privacy wrapper will output accurate results, even if the sensitivity bound is broken on other typical datasets. Additionally, the number of relevant subsets may be far fewer than the number considered under a Lipschitz extension approach, though it is still too large for practical computation when the data alphabet is infinite.
In the restricted language approach, researchers are permitted to design their own script, but must use a limited language that restricts their access to data. An example of such a system is PINQ, which presents programmers with a SQL-like interface with privacy guarantees [21]. Other differentially private SQL systems have been subsequently developed [15,19,32]. In a similar fashion, Kifer et al. propose an architecture in which access to data is mediated by a privacy layer that implements differentially private mechanisms [17].

PRELIMINARIES
Let D be the set of possible entries that represent one individual in a dataset. A dataset is represented as a multiset of finite size with entries in D. Let D★ be the set of all possible datasets. We say two datasets x_1, x_2 ∈ D★ are neighbors if one can be obtained from the other by switching exactly one element. That is, x_1 and x_2 are neighbors if there exist a ∈ x_1 and b ∈ x_2 such that removing a from x_1 and removing b from x_2 yield the same multiset. Given a nonempty set Ω, let Δ(Ω) be the set of all probability distributions over Ω. We define an algorithm as a function from D★ to Δ(R^j ∪ {⊥}). Here, ⊥ is used to represent non-response, meaning an algorithm fails to return a real value. For ease of exposition, we define a researcher's script as a deterministic function f : D★ → R^j ∪ {⊥}. All results in this paper can be extended to researcher scripts that are randomized by adding a sampling step to the wrapper. Let R be the set of all deterministic algorithms.
A wrapper is a function W : D★ × R → Δ(R^j ∪ {⊥}) that takes a dataset x and an algorithm f as input and outputs a noisy response based on x and f. We extend the notion of differential privacy to wrappers as follows.
Our privacy wrapper will leverage the j-dimensional Laplace distribution.
A necessary and sufficient condition to determine whether two Laplace distributions with the same scale are (ε, 0)-indistinguishable is given below.
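The condition in question is presumably the standard density-ratio bound for Laplace distributions: in one dimension, Lap(μ1, b) and Lap(μ2, b) are (ε, 0)-indistinguishable exactly when |μ1 − μ2| ≤ εb, since the log density ratio peaks at |μ1 − μ2|/b. A quick numerical check of that peak, offered as an illustration rather than the paper's own statement:

```python
import math

def laplace_pdf(y, mu, b):
    """Density of the Laplace distribution with location mu and scale b."""
    return math.exp(-abs(y - mu) / b) / (2 * b)

# The log density ratio of Lap(mu1, b) vs Lap(mu2, b) is
# (|y - mu2| - |y - mu1|) / b, whose absolute value peaks at |mu1 - mu2| / b.
mu1, mu2, b = 0.0, 0.5, 1.0
max_log_ratio = max(
    abs(math.log(laplace_pdf(y, mu1, b) / laplace_pdf(y, mu2, b)))
    for y in [x / 100 for x in range(-1000, 1001)]
)
print(round(max_log_ratio, 6))  # 0.5, i.e. |mu1 - mu2| / b
```

Thus a pointwise multiplicative bound of exp(ε) on all events holds iff the means are within εb of each other, which is what the stability threshold is designed to guarantee for script outputs on stable subsets.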

ALGORITHM DESCRIPTION: TAHOE
As noted in Section 1.2, our privacy wrapper TAHOE leverages stable subsets of random sizes to impose (ε, δ)-differential privacy.
In this section, we formalize each of these notions and specify our algorithm.

Stable Subsets
Before defining stability, a preliminary concept is that of responsive datasets. Essentially, this requires that a researcher script does not return ⊥ when applied to the dataset or any subset of a minimum size.
Note that any multiset of size less than the minimum size is trivially responsive by definition. Our definition of stable subsets refines responsiveness to convey the intuition that removing further individuals does not greatly affect script output. Because the script and the remaining parameters will not change in this study, we will omit them and simply refer to the set as stable. Additionally, any multiset of size less than the minimum size is trivially stable by definition.
It will be useful to rewrite stability in terms of the following definitions. Let V be the set of all vectors of the form (v_1, v_2, ..., v_j), where each component v_i ∈ {−1, +1}. For v ∈ V and x ∈ D★ of sufficient size, we define two auxiliary quantities, and stability can then be expressed in terms of them. The proof of this proposition can be found in the appendix. From a computational perspective, the two quantities satisfy a recurrence relationship which will enable us to check the stability of subsets more efficiently.
Proposition 2. For x ∈ D★ of sufficient size, the following hold: Stable subsets obey certain algebraic properties that facilitate algorithmic analysis. As the following lemma shows, subsetting a stable subset creates another stable subset.
Proof. Apply Lemma 3 with S as is and S′ = S ∩ x′. □ Given data x ∈ D★ and researcher script f, let s(x, f) be the size of the largest stable subset of x. The following lemma relates s to the idea of neighboring datasets. Proof. Suppose S_1 is a maximal stable subset of x_1. Then S_1 ∩ x_2 is a stable subset of x_2 by Corollary 2. Further,

Randomized Subset Sizes
Our algorithm randomly chooses how many individuals to exclude when searching for a stable subset, making it harder for an adversary to leverage the size of stable subsets to extract information from the data. Throughout this paper, we will use n to represent the size of a dataset, and k to represent the maximum number of entries that a privacy wrapper may exclude. In our algorithm, k will be set to the output of the following helper function h. Note that h decreases in both ε and δ.
To select a size between n − k and n, we define the distribution P(n, k, ε, δ) as follows: P(n, k, ε, δ)(m) assigns positive probability, proportional to an exponential in m scaled by a constant δ′, to each m ∈ {n − k, ..., n} and 0 otherwise, where δ′ is selected so that the total probability sums to 1. As we will see in the following section, the normalization constant δ′ of P is closely related to the privacy parameter δ in the (ε, δ)-differential privacy guarantee of TAHOE. We will normally omit the arguments to P for readability purposes. A specific example of the shape of P when n = 100, k = 42, ε = 0.1, and δ = 0.01 is provided in Figure 1. For these values, δ′ ≈ 0.0098. P is designed to rise and fall in an exponential manner. In the privacy analysis of Section 5, it will be helpful to relate the probability at m to the total probability below m and the total probability above m. This is done by the following two propositions.
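Sampling from a rise-and-fall distribution over {n − k, ..., n} can be sketched as follows. This is an illustrative stand-in only; the paper's exact form of P (which determines δ′ and the privacy accounting) is not reproduced here, and the `rate` and mode below are hypothetical shape choices:

```python
import math
import random

def sample_subset_size(n, k, rate=0.1):
    """Stand-in for P: weights that rise and fall exponentially over
    {n - k, ..., n}, normalized to a probability distribution.
    `rate` and the mode are illustrative, not the paper's choices."""
    sizes = list(range(n - k, n + 1))
    mode = n - k // 2  # hypothetical peak location
    weights = [math.exp(-rate * abs(m - mode)) for m in sizes]
    total = sum(weights)
    return random.choices(sizes, weights=[w / total for w in weights])[0]

m = sample_subset_size(100, 42)
print(58 <= m <= 100)  # True
```

Whatever its exact shape, the sampled size m must be kept secret, since revealing it would expose how many individuals were trimmed as influential.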

Our General Algorithm
Our specific privacy wrapper TAHOE (Trim And withHold Or Execute), described in Algorithm 2, takes as input a researcher script f : D★ → Δ(R^j ∪ {⊥}) and a dataset x ∈ D★ of size n. Additionally, TAHOE uses the following auxiliary parameters: ε > 0, δ > 0, γ < ε/4, and b > 0. Configuring TAHOE is complicated by the large number of parameters, and by the different ways one might want to assess its behavior. In general, a multi-way trade-off exists among how much noise TAHOE adds, how likely TAHOE is to return ⊥, the privacy parameters, and how much sampling variation is introduced by subsetting. The relationship among these quantities depends on what is in the researcher script, and therefore cannot be known precisely. In the following table, we provide some guidance by discussing the effects of each parameter in general terms.

Table 1: Description of Parameters

ε and δ: These are the differential privacy parameters, which are chosen by the data holder. Decreasing ε and δ corresponds to a stronger privacy guarantee. However, setting these parameters too small presents two disadvantages. One disadvantage is that smaller ε and δ result in a higher k, requiring the algorithm to consider smaller subsets of data. At least in typical cases, we expect that increasing k will make it harder to find stable subsets, resulting in a higher probability of returning ⊥. Moreover, the requirement k = h(ε, δ, γ) < (n − 1)/2 places a lower bound on ε and δ. A second disadvantage is that a small ε also requires γ to be smaller, which may cause TAHOE to return ⊥ with high probability, or require more noise.
b: This parameter is chosen by the researcher and represents the scale of added Laplace noise. Decreasing b results in more accurate output; however, this also makes the definition of stable subsets stricter, which increases the probability that TAHOE halts without returning an output. We propose two heuristics for setting b. One is to decide how much noise can be added to the result while preserving its utility. In our experiments, we found that the greatest source of error was the Laplace noise added in Step 5 of the algorithm; the standard deviation of this noise is determined by b. The other heuristic we propose is to compute a value for b such that TAHOE is guaranteed never (or only rarely) to output ⊥. We give examples of such computations in Section 7.
γ: This parameter is chosen by the data holder, and controls how indistinguishable the noised outputs Lap(f(S_1), b) and Lap(f(S_2), b) must be when S_1 and S_2 are subsets of a stable set. If γ is too small, TAHOE will tend to return ⊥ with high probability. If γ is too close to ε/4, k will become large, increasing the probability of returning ⊥. Additionally, setting γ too high will violate the requirement that k = h(ε, δ, γ) < (n − 1)/2. In our simulation studies, we have found that γ = ε/5 is a reasonable rule of thumb for all scenarios we considered.

PRIVACY ANALYSIS OF TAHOE
In this section, we show that our privacy wrapper TAHOE imposes (ε, δ)-differential privacy.
Throughout this section, we will use T_x ∈ Δ(R^j ∪ {⊥}) to denote the probability distribution induced by TAHOE on dataset x, researcher script f, and parameters ε, δ, γ, and b. Since x is the only input that will change in the upcoming proofs, we omit writing the other inputs for readability purposes. For any x ∈ D★, write T_x(·|m) to represent the conditional probability distribution of the wrapper when m has been chosen in Step 1 of Algorithm 2. Theorem 1. TAHOE imposes (ε, δ)-differential privacy.
Before providing the proof, we sketch an outline of our argument. In order to show that TAHOE imposes (ε, δ)-differential privacy, we first show in Lemma 5 that the value of k set in TAHOE produces δ′ < δ. Since (ε, δ′)-differential privacy implies (ε, δ)-differential privacy, it is sufficient to show that TAHOE imposes (ε, δ′)-differential privacy.
Next, we proceed by considering the behavior of TAHOE on two neighboring datasets x_1 and x_2. If there are no stable subsets of size n − k for either of these datasets, then TAHOE returns ⊥ with probability 1 on both datasets. Hence, the behavior of TAHOE is the same on both datasets, so TAHOE trivially imposes (ε, δ′)-differential privacy.
On the other hand, if there is a stable subset of size n − k of at least one of the datasets, we can pick any stable subset S and define a reference distribution L = Lap(f(S), b). For any value of m for which TAHOE does not return ⊥, we then show in Lemma 6 that both conditional distributions T_{x_1}(·|m) and T_{x_2}(·|m) are (2γ, 0)-indistinguishable from L. Using this result, we finish the proof by showing that, for any measurable set E ⊆ R^j ∪ {⊥}, T_{x_1}(E) ≤ exp(ε) T_{x_2}(E) + δ′, based on two cases: when ⊥ ∉ E and when ⊥ ∈ E. Proof. We will show that TAHOE imposes (ε, δ′)-differential privacy, where δ′ is the normalization constant in the definition of P. This is sufficient because of the following lemma, which is proven in the Appendix. Let E ⊆ R^j ∪ {⊥} be measurable, and let x_1 and x_2 be neighboring datasets. We will prove the bound T_{x_1}(E) ≤ exp(ε) T_{x_2}(E) + δ′. If there are no stable subsets of size n − k of either x_1 or x_2, then T_{x_1} and T_{x_2} are the same distribution (giving probability 1 to ⊥), so the bound follows immediately. If there are any stable subsets of size n − k of either x_1 or x_2, choose one and call it S. Define the reference distribution L = Lap(f(S), b). For i ∈ {1, 2}, let s_i = s(x_i, f) denote the size of the largest stable subset of x_i. This means that TAHOE will return ⊥ for any m > s_i and will not return ⊥ for any m ≤ s_i.
For i ∈ {1, 2}, the law of total probability implies T_{x_i}(E) = Σ_m P(m) T_{x_i}(E|m). We consider two cases: Case ⊥ ∉ E: For m > s_i, TAHOE returns ⊥, so T_{x_i}(E|m) = 0. Hence, where the inequality follows from Lemma 6. Additionally, where the last inequality results from Lemma 6. Combining these two observations, where the last inequality follows since probabilities are at most 1. Plugging in P(m) from Proposition 3 and simplifying bounds the ratio by exp(ε). Case ⊥ ∈ E: By Lemma 6, T_{x_i}(·|m) ∼_{2γ,0} L(·) when T_{x_i} does not return ⊥. Since ε > 4γ, the fraction is greater than 1, so we can subtract the first terms from the numerator and denominator and maintain the inequality:

RUNTIME ANALYSIS OF TAHOE
To analyze the execution time of TAHOE, we let r represent the worst-case runtime of f. Without any restrictions, TAHOE takes prohibitively long to execute. The reason is that, in the worst case, Step 2 of the algorithm requires executing the researcher script on every possible subset of data with size ranging from n − 2k − 1 to n. The following proposition places a bound on the execution time without any restrictions on |D|. To attain this result, we must specify how the input parameter γ changes as the other parameters change. We suppose that the data holder sets γ proportional to ε, which is consistent with the rule of thumb (γ = ε/5) we use in our simulations in Section 7.
A proof of this proposition can be found in the Appendix. For fixed privacy parameters ε and δ, and fixed r, the runtime of TAHOE is polynomial in n. However, a best practice first suggested by Dwork et al. is to maintain δ < 1/n [11]. This causes the runtime to be superpolynomial in n.

Runtime Analysis for Finite Alphabets
While the runtime bound of Proposition 5 is impractical for all but small datasets with weak privacy parameters, there is one situation in which TAHOE can be modified to run much faster. In particular, when the set of possible data entries D is finite of size d, the number of unique subsets of x ∈ D★ that TAHOE must consider is greatly reduced. Intuitively, when n ≥ d, by the pigeonhole principle some of the rows in the dataset will necessarily hold the same value; removing any one of them results in the same subset as removing any other.
More formally, when D = {a_1, ..., a_d}, any dataset x ∈ D★ can be written in histogram form as x = (c_1, ..., c_d), where c_i denotes the number of times a_i appears in the dataset and c_1 + ... + c_d = n [12]. Any subset S ⊆ x of size at least n − k is likewise represented by a histogram. To optimize TAHOE for this setting, we index subsets of x according to their histograms, in order to avoid duplicating computations on equivalent subsets. Thus, in Step 2(a), TAHOE must run f on every histogram of size n − 2k − 1. By Lemma 7(a), there are at most polynomially many such histograms in n for fixed d. When |D| = d, the runtime of TAHOE is given in the following proposition, with the proof deferred to the appendix. Notice that the superpolynomial dependency on n has disappeared, and a dependency on d has been introduced. At an intuitive level, the histogram representation provides a lower dimensional representation of subsets, reducing the number of subsets that need to be examined. In upcoming Section 7, we run experiments involving data alphabet sizes from 2 to 4 and find that the algorithm can readily run on a single machine.
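The histogram indexing can be sketched as follows; the helper names are illustrative, and the point is that many distinct subsets collapse to a single histogram:

```python
from collections import Counter
from itertools import product

def to_histogram(dataset, alphabet):
    """Histogram form of a dataset over a finite alphabet: counts per symbol."""
    counts = Counter(dataset)
    return tuple(counts[a] for a in alphabet)

def sub_histograms(hist, min_total):
    """Enumerate histograms of all subsets with at least min_total rows.
    Many distinct subsets map to one histogram, which is the saving."""
    ranges = [range(c + 1) for c in hist]
    return [h for h in product(*ranges) if sum(h) >= min_total]

alphabet = ["a", "b"]
hist = to_histogram(["a", "a", "b", "a"], alphabet)
print(hist)                          # (3, 1)
print(len(sub_histograms(hist, 3)))  # 3: (3,0), (2,1), (3,1)
```

Here the four-row dataset has 4 subsets of size 3 but only 2 distinct size-3 histograms; the gap widens rapidly as n grows relative to d, which is the source of the speedup.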
For finite data alphabets that are too large for TAHOE to run on a single machine, parallel computation can expand the applicability of the algorithm. In particular, the most computationally expensive steps are 2(a) and 2(c)(I). Both of these can be distributed over a parallel cluster.

ACCURACY ANALYSIS OF TAHOE
The performance of privacy-protecting systems is often assessed via a statement that relates the privacy level to accuracy, which may be measured in terms of mean squared error (MSE), standard error, or some other metric. For researcher scripts in general, there is no way to know what accuracy TAHOE provides for given parameter values, because that depends crucially on what is inside the script. However, we are able to provide some results related to accuracy for specific scripts, or classes of scripts.
Our first result motivates the use of TAHOE for inferential statistics. When performing inference tasks, it is common to assume that the data entries are drawn independently from a probability distribution [29, 30]. Let Fθ be a distribution over the set of data values D, described by a parameter θ ∈ R^d. Further, let X represent a random dataset of n rows drawn independently and identically from Fθ. A useful set of scripts are those that estimate θ consistently.
When f has bounded global sensitivity, it is possible to configure TAHOE so that it preserves consistency. For this discussion, we allow that the parameters ε, δ, λ, m may be chosen as functions of n. Three conditions are required. First, λ and m must be chosen so that TAHOE never returns ⊥.7 Second, the noise scale λ must approach zero as n → ∞. Third, m must be fixed or grow slowly enough that lim n→∞ (n − m) = ∞.

Theorem 2. If f is consistent for parameter θ, and TAHOE is configured to never return ⊥, with lim n→∞ λ = 0 and lim n→∞ (n − m) = ∞, then the random variable formed by composing TAHOE with f is consistent for parameter θ.
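The argument behind Theorem 2 can be summarized in symbols. This is a sketch only, with λn and mn denoting the parameter choices at sample size n and Sn the subset chosen by TAHOE:

```latex
\widehat{\theta}_n \;=\; f\!\left(X_{S_n}\right) + Z_n,
\qquad Z_n \sim \operatorname{Lap}(\lambda_n).
% Since |S_n| \ge n - m_n \to \infty, consistency of f gives
%   f(X_{S_n}) \xrightarrow{\;p\;} \theta,
% and \lambda_n \to 0 gives Z_n \xrightarrow{\;p\;} 0.
% By Slutsky's theorem, \widehat{\theta}_n \xrightarrow{\;p\;} \theta.
```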
A proof is given in the appendix. A similar result holds for researcher scripts that implement unbiased estimators of θ.

Accuracy Simulations
To gain insight into additional accuracy characteristics of our privacy wrapper, we consider a set of example scripts that an (honest) researcher may want to run. To make the runtime practical, we chose stylized analysis tasks involving finite data alphabets. For each script, our goal is to understand how well our privacy wrapper performs, and compare it to a benchmark.
For this exercise, the benchmark that we select is GUPT, which follows the sample-and-aggregate strategy [23]. To summarize the algorithm, GUPT partitions the dataset into n^0.4 pieces, runs the script on each piece, then releases the average output with Laplace noise. Unlike TAHOE, GUPT has the advantage of using pure differential privacy. As an additional benchmark, we consider the Laplace mechanism with global sensitivity as a white-box privacy-preserving algorithm [11]. This provides a measure of the cost arising from treating a script in a black-box fashion.
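The sample-and-aggregate pattern that GUPT follows can be sketched as below for a one-dimensional script. This is our own simplified illustration, not GUPT's actual interface: the function name, the clamping step, and the sensitivity argument in the comments are assumptions of the sketch.

```python
import numpy as np

def gupt_release(data, script, eps, lo, hi, rng=None):
    """Sample-and-aggregate sketch in the spirit of GUPT [23]: split the
    data into roughly n**0.4 blocks, run the script on each block, average
    the clamped outputs, and add Laplace noise."""
    rng = rng or np.random.default_rng()
    n = len(data)
    n_blocks = max(1, int(n ** 0.4))
    blocks = np.array_split(np.asarray(data), n_blocks)
    outs = np.clip([script(b) for b in blocks], lo, hi)
    # Changing one row changes at most one block's (clamped) output by at
    # most (hi - lo), so the average has sensitivity (hi - lo) / n_blocks.
    scale = (hi - lo) / (n_blocks * eps)
    return float(np.mean(outs) + rng.laplace(0.0, scale))
```

With a very weak privacy setting the added noise is negligible, so the release is close to the plain average of the block outputs.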
A comparison of TAHOE and GUPT is complicated by fundamental differences between the algorithms. First, the parameters of each algorithm are different, and setting them requires judgement on the part of the researcher. Second, GUPT guarantees pure differential privacy while TAHOE uses approximate differential privacy. Finally, GUPT always provides an output, while TAHOE sometimes returns ⊥, making it difficult to compare accuracy on the same scale. These differences between the systems prevent us from comparing them "straight out of the box." To configure TAHOE and GUPT and make the comparison as fair as possible, we adopt a set of guidelines. First, we follow simple heuristics to set parameter values where possible. This is aligned with our view that a privacy wrapper should not require specialized knowledge on the part of a researcher. Second, we set parameters in a way that makes the privacy wrappers as comparable as possible.

6 We define a topology over R ∪ {⊥} using the Alexandroff extension [7] of the standard topology of R with the singleton {⊥}.
7 By Lemma 2, a necessary and sufficient condition to ensure TAHOE does not return ⊥ is to set λ and m such that ||f(S) − f(T)||1 ≤ λε for all sets S, T, w such that S, T ⊆ w ⊆ x, where |w| ∈ {n − m, ..., n} and |S|, |T| ≥ |w| − m.
In the case of GUPT, the main parameter we must set is the bounding rectangle for the script output. For the sake of simplicity, we use the most extreme values that are mathematically possible, meaning that the output is never censored. In the case of TAHOE, we must set the noise scale λ. For each script, we compute a value of λ large enough to guarantee that the algorithm never returns ⊥. This is done so that both systems always output real vectors, and accuracy can be measured using a common metric (standard error).
An advantage of GUPT that we do not capture in our results is that it uses pure differential privacy. We nevertheless use a common ε for all systems. We allow TAHOE the extra advantage of approximate differential privacy, but require that δ < 1/n [1, 25].

Normalized Histogram
The first script we consider is one that returns a normalized histogram representing the data. Given alphabet D = (v1, ..., vk) and dataset x ∈ D^n, define the (un-normalized) histogram H(x) = (n1, ..., nk), where ni denotes the number of times vi appears in x. A script to return a normalized histogram is then specified by P(x) = H(x)/|x|.
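As a concrete instance, the researcher script P can be written directly from the definition. The names below are illustrative, not taken from our implementation:

```python
from collections import Counter

def normalized_histogram(data, alphabet):
    """The researcher script P(x) = H(x)/|x|: the proportion of each
    alphabet value in the dataset."""
    counts = Counter(data)
    n = len(data)
    return [counts[v] / n for v in alphabet]
```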
At a high level, the normalized histogram is well matched with a subsample-and-aggregate strategy, because the proportion of each vi on a small subset of the data tends to represent the overall proportion of vi well [24]. In fact, as long as the subsets have equal size, GUPT introduces no sampling variation, so the only error comes from added Laplace noise. To configure GUPT, we set the bound on the script output in each dimension to be [0, 1].
Unlike GUPT, TAHOE introduces both sampling variation and added noise. Several parameters must be configured to run TAHOE. We assume that the end user follows our recommendation of setting m = n/5. We follow the practice of setting δ = 1/(n + 1).
To guarantee that TAHOE never returns ⊥, it is necessary and sufficient to make sure it does not return ⊥ when w = x is chosen in Step 1. This is true whenever the entire dataset is stable. Equivalently, for any subsets S, T ⊆ x of size at least n − 2m̄ − 1, ||P(S) − P(T)||1 ≤ 2(2m̄ + 1)/(n − 2m̄ − 1), where the last inequality follows as |S|, |T| ≥ n − 2m̄ − 1. Thus setting λ = 2(2m̄ + 1)/((n − 2m̄ − 1)ε), where m̄ = h(n, ε, δ), is sufficient to guarantee that TAHOE never returns ⊥.

Given these configurations, we perform experiments in which GUPT and TAHOE are each executed on a randomly generated dataset. Each row of data is an independent draw from a discrete uniform distribution. The parameter ε is set to values ranging from 0.25 to 2 and n is set to values in {10^2, 10^3, 10^4, 10^5}. The alphabet size k is set to values in {2, 3, 4}. To estimate error, each experiment was replicated 50 times. All experiments were completed on a single computer with TAHOE and GUPT implemented in Python.
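The noise-scale formula above can be sketched as a small helper. This is an illustrative sketch: since the internal quantity m̄ = h(n, ε, δ) is defined elsewhere in the paper, the function below takes it as an input rather than computing it.

```python
def tahoe_noise_scale(n, eps, m_bar):
    """lambda = 2(2*m_bar + 1) / ((n - 2*m_bar - 1) * eps), the value derived
    above for the normalized-histogram script. m_bar = h(n, eps, delta) is
    TAHOE's internal parameter, supplied by the caller in this sketch."""
    if n <= 2 * m_bar + 1:
        raise ValueError("dataset too small for this m_bar")
    return 2 * (2 * m_bar + 1) / ((n - 2 * m_bar - 1) * eps)
```

Note that both a larger m̄ and a smaller ε increase the required noise scale, which matches the qualitative behavior discussed below.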
Each panel of Figure 2 displays the total standard error (root-mean-square error) for different parameter values. The error for each histogram is measured using the L1 norm, and includes both sampling error and added noise as a function of dataset size. In practice, we found that sampling variation accounted for less than 1% of the total error in typical simulations.
First, as can be seen in Figure 2 (a) and (b), the error of every system decreases as the data size grows along the horizontal axis. For small datasets, GUPT outperforms TAHOE, featuring an order of magnitude less error in some cases. In this range, our internal parameter m̄ is large relative to n, meaning that TAHOE must consider relatively small subsets. These subsets can have significantly different proportions, so a lot of noise is required to make them indistinguishable. By contrast, the number of subsets in GUPT is large as a fraction of n, reducing the sensitivity of the output to any one subset.
For larger datasets, TAHOE's advantages grow. The internal parameter m̄ grows slowly compared to n, meaning that the subsets under consideration have more similar proportions. Beyond a certain dataset size, TAHOE features less total error than GUPT.
Next, as also seen in Figure 2 (a) and (b), increasing the alphabet size from 2 to 3, and then to 4, leads to more error for both privacy wrappers. As the dimensionality increases, TAHOE shows a relative advantage, in the sense that it takes fewer rows of data before TAHOE's error drops below that of GUPT. In the case of TAHOE, the noise scale λ does not depend on k. Since the same amount of Laplace noise is added to each dimension of the histogram, the total added noise increases linearly with k. By contrast, the noise scale used in GUPT is tied to the L1 diameter of the output space, which is proportional to k. Summing over the histogram dimensions, the total added noise is therefore proportional to k^2.
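The k-versus-k^2 scaling can be checked with a toy calculation, using the fact that a Laplace variable with scale b has expected absolute value b. The function names and the block count are illustrative assumptions, not quantities from our implementation:

```python
def tahoe_total_l1_noise(k, lam):
    """TAHOE adds Laplace(lam) noise per histogram dimension, and
    E|Lap(b)| = b, so expected total L1 noise is linear in k."""
    return k * lam

def gupt_total_l1_noise(k, eps, n_blocks):
    """GUPT's per-dimension scale is tied to the L1 diameter of the output
    space, which is proportional to k, so the total grows like k**2."""
    per_dim = k / (n_blocks * eps)
    return k * per_dim
```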
Finally, as seen in Figure 2(c), decreasing ε results in greater error for every system. In the case of GUPT, the curve is approximately hyperbolic, since the added noise scale increases with 1/ε. On the other hand, TAHOE's total error increases faster as ε approaches zero, crossing the curve for GUPT. Examining the expression for TAHOE's noise scale λ, there is an ε in the denominator, so λ is set proportionally to 1/ε, which would suggest a hyperbolic relationship. However, there is an extra dependency on ε, since m̄ = h(n, ε, δ) increases as ε decreases, driving λ higher for small values of ε.

Sample Entropy
We now turn our attention to statistics other than the normalized histogram. Before we begin, we note that a histogram script could be used as a building block to compute other statistics. A researcher could first submit the normalized histogram script from the previous section to TAHOE, then compute a statistic of interest directly from the privatized histogram. While this approach is general, there are situations in which a researcher can achieve more accurate results by supplying a script that directly computes the statistic of interest.
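The building-block approach amounts to post-processing a privatized histogram, which preserves differential privacy and consumes no extra budget. A minimal sketch, with an illustrative function name, is below; noisy proportions may be slightly negative or fail to sum to one, so the sketch clips and renormalizes first.

```python
import math

def entropy_from_histogram(p):
    """Post-process a (possibly privatized) normalized histogram into an
    entropy estimate. Clip negative noisy proportions to zero and
    renormalize before applying the entropy formula."""
    q = [max(x, 0.0) for x in p]
    total = sum(q)
    if total == 0.0:
        return 0.0
    q = [x / total for x in q]
    return -sum(x * math.log(x) for x in q if x > 0.0)
```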
In this section, we consider a researcher who supplies a script to compute sample entropy. As a general rule, we have found that for roughly balanced datasets, a researcher can achieve greater accuracy by first computing a histogram using TAHOE, then computing the entropy of that histogram. For unbalanced datasets with low entropy, a researcher can achieve greater accuracy by submitting a script that computes entropy directly.
Again letting k represent the size of the data alphabet D, the sample entropy of a database x with normalized histogram P(x) = (p1, ..., pk) is defined as E(x) = −Σ_{i=1}^k pi log pi.

To configure TAHOE, we set m = n/5 and δ = 1/(n + 1) as before. To make the privacy wrappers as comparable as possible, we again set λ so that TAHOE never returns ⊥. Applying the same argument we used for the histogram script, we must ensure that for any subsets S, T ⊆ x of size at least n − 2m̄ − 1, ||E(S) − E(T)||1 ≤ λε. While the left-hand side can be bounded mathematically, we performed a brute-force computational search to find the smallest such value of λ.

With our systems configured this way, we use simulation to measure the error of each system on randomly generated datasets. Each datapoint is drawn from a discrete uniform distribution over D. We found experimentally that varying the size of the alphabet does not significantly alter the results, so we fix the alphabet size at 3. Parameter ε is set to values in {1, 2} and n is set to values in {10^4, 10^5, 10^6, 10^7}. Each experiment was replicated 50 times and all experiments were run on a single computer.
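The entropy script and the spirit of the brute-force stability search can be sketched as follows. The helper names are ours, and the sketch only measures the worst-case entropy gap over subset histograms; the search used in our experiments additionally accounts for TAHOE's other configuration details.

```python
import math
from itertools import product

def entropy(hist):
    """Sample entropy E(x) = -sum p_i log p_i of a dataset given as a
    histogram of counts."""
    n = sum(hist)
    return -sum((c / n) * math.log(c / n) for c in hist if c > 0)

def max_entropy_gap(full_hist, min_size):
    """Brute-force the largest |E(S) - E(T)| over all subset histograms of
    size at least min_size (min_size >= 1), mirroring the computational
    search used to calibrate the noise scale for the entropy script."""
    vals = [entropy(h)
            for h in product(*(range(c + 1) for c in full_hist))
            if sum(h) >= min_size]
    return max(vals) - min(vals)
```

The gap is largest when some admissible subset consists of a single letter (entropy zero), which is exactly the unbalanced regime discussed below.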
In Figure 3, we plot the total standard error (root-mean-square error) as a function of dataset size, for different parameter values. As in the histogram example, we found that GUPT outperforms TAHOE for small datasets. TAHOE does overtake GUPT's performance for large enough datasets, but the number of datapoints required is considerably greater than we observed for the normalized histogram. Several factors help explain this result. First, the slope of the entropy function is greatest for unbalanced datasets with entropy near zero. To set the noise scale λ, we must consider subsets containing all of one letter, for which changing 2m̄ + 1 entries can result in a large change in entropy. TAHOE therefore requires a relatively large amount of noise. Second, entropy is a one-dimensional quantity, so GUPT does not suffer a penalty for each dimension, as we saw in the previous experiment. Finally, because the entropy function has slope close to zero for relatively balanced datasets, and subsets are likely to have proportions close to those of the full dataset, GUPT introduces very little sampling variation.

DISCUSSION
This study formalizes the notion of a privacy wrapper: an algorithm that can pass data to a researcher script and observe the return values, before returning an output to the user. We believe that this is a useful framework for reasoning about untrusted code. Differential privacy extends naturally to this setting, with the standard probability bound required to hold for every possible script. Moreover, the script is treated as a black box, avoiding the limitations inherent in analyzing code.
We present our own design for a privacy wrapper, which we call TAHOE. Our algorithm operationalizes two core ideas: find subsets fulfilling a certain stability primitive, and randomize over the subset size. Due to the large number of subsets involved, TAHOE is impractically slow in most scenarios. On the other hand, we consider the special case of a finite data alphabet and describe a set of optimizations that allow TAHOE to run efficiently.
Performance simulations show that TAHOE's performance is comparable to GUPT, a benchmark algorithm from the subsample-and-aggregate lineage, with TAHOE displaying better accuracy for some scenarios and parameter values. TAHOE performs relatively well when the dataset is large, when the dimensionality of the output is high, and when ε is not too small.
Since ε and δ are fixed by the proposition, the runtime bound stated above follows. In Step 3, TAHOE computes f on at most n − 2m̄ − 1 subsets and evaluates the test on two candidate vectors, which is dominated by the cost of Step 2. It can be checked that Steps 1, 4, and 5 require fewer operations than Step 2, and so they do not alter the big-O bound. Plugging in the upper bound for m̄ yields the stated bound. □

In Step 3, TAHOE computes f on the subset histograms enumerated in Step 2 and on two candidate vectors, which requires fewer operations than Step 2. It can be checked that Steps 1, 4, and 5 also require fewer operations than Step 2, and so they do not alter the big-O bound. Plugging in the upper bound for m̄, the stated runtime bound follows. □

Proof of Theorem 2
Proof. Let random variable S be the subset chosen by TAHOE, and let random variable Z ∼ Lap(0, λ) be the Laplace noise added by TAHOE. Since TAHOE is configured to never return ⊥, the output of TAHOE is the random variable f(X_S) + Z.
For each subset size s, let G_s represent the distribution of f applied to s independent draws from Fθ. Since the data is drawn independently from D with distribution Fθ, conditional on |S| = s, X_S has the distribution Fθ^s, and so f(X_S) has the distribution G_s. Therefore, the distribution of f(X_S) is