How to Combine Membership-Inference Attacks on Multiple Updated Machine Learning Models

A large body of research has shown that machine learning models are vulnerable to membership inference (MI) attacks that violate the privacy of the participants in the training data. Most MI research focuses on the case of a single standalone model, while production machine-learning platforms often update models over time on data whose distribution may shift, giving the attacker more information. This paper proposes new attacks that take advantage of one or more model updates to improve MI. A key part of our approach is to leverage rich information from standalone MI attacks mounted separately against the original and updated models, and to combine this information in specific ways to improve attack effectiveness. We propose a set of combination functions and tuning methods for each, and present both analytical and quantitative justification for various options. Our results on four public datasets show that our attacks are effective at using update information to give the adversary a significant advantage not only over attacks on standalone models, but also over a prior MI attack that takes advantage of model updates in the related machine-unlearning setting. We perform the first measurements of the impact of distribution shift on MI attacks with model updates, and show that a more drastic distribution shift results in significantly higher MI risk than a gradual shift. We also show that our attacks are effective at auditing differentially private fine-tuning. We make our code publicly available on GitHub.

INTRODUCTION

Machine learning models are commonly trained on the sensitive data of individuals who contribute their data. However, there is now a robust body of literature demonstrating that unless explicit steps are taken to ensure privacy, these models will leak sensitive information. In particular, as first shown more than a decade ago by Homer et al. [18], even the simplest statistical models, when trained via standard techniques, will allow for membership-inference (MI) attacks, in which an attacker can detect the presence of individuals in the training data, or in certain subsets of the training data. MI attacks can be a concerning privacy violation on their own, if membership in the training data, or a specific subset of the training data, indicates something sensitive about the user or can be used as a step towards reconstructing training examples. MI attacks have since become an active area of research in statistics, machine learning, and security, and we now have a rich toolkit of MI attacks [4,14,36], which notably includes black-box attacks on modern large models in supervised learning [5,20,33,39,43].
Most MI research has focused on the case of a single standalone model. However, in real machine learning workloads, models are typically updated as new training data arrives, and attackers have the ability to observe some aspects of the model both before and after updates. Intuitively, giving the attacker the ability to see the model before and after the update should reveal information about the specific training examples in the update set. To illustrate, consider the effect of model updates on MI for the simple statistical task of mean estimation.

Example 1.1. We are given a set of training examples x_1, . . . , x_n ∈ R^d and want to release its mean. The theory of MI attacks [14,36] tells us that, under natural conditions on the data, we will be able to accurately infer membership if and only if d ≫ n. However, suppose that data arrives over time and we initially release the mean θ_0 of the first n − k points for some k ≪ n, followed by an update θ_1 with the mean of all n points. It is not hard to see that we can combine θ_0 and θ_1 to obtain the mean of just the last k points, and we can perform accurate membership inference on this even when n ≫ d ≫ k, which is impossible given only θ_1.
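The arithmetic behind Example 1.1 can be checked directly. The following sketch (with hypothetical data and arbitrary sizes n, k, d) recovers the mean of just the k update points from the two released means:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, d = 10_000, 10, 50            # n total points, k update points, d dimensions

data = rng.normal(size=(n, d))      # hypothetical training examples x_1, ..., x_n

theta0 = data[: n - k].mean(axis=0)   # released mean of the first n - k points
theta1 = data.mean(axis=0)            # released mean of all n points

# Combining the two releases isolates the mean of just the last k points:
# n * theta1 = (n - k) * theta0 + k * mean_of_update, so
recovered = (n * theta1 - (n - k) * theta0) / k
true_update_mean = data[n - k:].mean(axis=0)
```

An MI attack on `recovered` then faces an effective sample size of only k, which is far below the dimension d.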
While the statistical models trained over user data by modern machine-learning workloads are far more complex than a simple mean, the same kind of effect is expected from repeated releases of updated models. Such repeated releases are common. Production training platforms, such as Amazon SageMaker [38], Azure Machine Learning [2], and Tensorflow-Extended [17,30], all support automatic model updating on newly collected data to keep up with changing distributions or improve models over time. The frequency and method for model updates differs, but often an update is triggered by the arrival of a sizeable data batch, such as a day's or a week's worth of data, and involves fine-tuning an already deployed model on data from the new batch and potentially samples from previous batches [30]. For example, a news-recommendation model may be updated daily to keep up with events in the news, and a product-recommendation model may be updated weekly to capture evolving trends. In each case, updated models are repeatedly released for serving or are pushed to users' mobile devices or servers all around the world for faster predictions.
This paper investigates the threat of repeated model updates for an attacker who monitors their releases and wishes to infer membership of specific samples in each update dataset. We formalize the problem of MI under repeated model updates in a way that supports a wide range of model update procedures, sizes of update batches, and distribution shift in the new data (Section 3). Geared toward this problem, we develop the first black-box MI attack algorithms that combine information from previously known standalone MI attacks, such as the state-of-the-art LiRA attack [5], to let the adversary take advantage of access to both the original model and one or more updated models to improve MI on the update set (Section 4). Our algorithms compute the standalone attack's confidence scores separately against the original model and against the updated model(s), and combine them to obtain a confidence score for membership in the update set. We justify the need to use detailed confidence-score information by proving that combining only the binary membership decisions does not increase the attacker's power. We consider two different methods for combining scores, each motivated analytically by the study of a simple example. Our analysis and experiments demonstrate that the best choice of score will depend on the specific learning algorithm being attacked.
Some previous works have studied MI attacks involving multiple models trained on overlapping datasets [35,44], for example arising from intermediate computations revealed by federated learning systems [29,32,42] or from model unlearning [9]. Our work, however, is the first to study a number of aspects specific to the repeated model-update setting, including updates with sizeable batches of data, multiple updates, and the effect of distribution shift.
We evaluate our algorithms on four datasets (FMNIST, CIFAR-10, Purchase100, and IMDb) using suitable linear and neural network models. We highlight several key contributions and conclusions:
(1) We show that access to one or more updated models makes an attacker significantly more effective at inferring membership in the training data, compared to having only a single standalone model (e.g., on FMNIST, MI accuracy increases from 52% without updates to 79% with updates). We show this effect is more dramatic the more the distribution shifts between the original training data and the update.
(2) We consider a variety of attack algorithms and tuning methods, and demonstrate both analytically and empirically that no single method is best in all situations, highlighting the need for a varied testing strategy. We also demonstrate that our attacks are more efficient and effective than attacks designed for the related, but distinct, setting of machine unlearning [9]. By basing our strategy on existing MI attacks, we expect our attacks will improve as MI attacks do.
(3) We consider multiple algorithms for updating models, and show that for small update sets, using the whole dataset to update the model is less vulnerable to MI attacks than training on just the update set, while the opposite is true for larger update sets. Our findings offer some guidance for practitioners employing basic defenses.
(4) We audit differentially private defenses, and show that they offer strong protection with a small privacy parameter, but only modest protection with a large one. In particular, the worst-case bounds offered by differential privacy can be close to tight in practice (within 2.0-3.6x in some settings).
Overall, our work demonstrates that the model updates arising from production machine-learning systems significantly increase the risk of MI attacks, and highlights the many subtleties that arise in constructing both attacks and defenses.

RELATED WORK
A long line of work has recognized the privacy risks of machine learning. Early work considered very simple statistical tasks [4,14,18,36], and black-box attacks on machine learning algorithms were developed later [39,40,43]. Other types of privacy attacks have also been considered, such as training data reconstruction/extraction [7,35], and attribute inference [43]. MI attacks have also been considered in white-box [26] and label-only settings [10,27], and have been evaluated in unbalanced scenarios, where training points are much less common than test points [21].
Membership-Inference with Model Updates. Our work considers an adversary seeking to run MI attacks in the setting where a model is updated repeatedly over its lifetime. A few privacy attacks on model updates have been considered in prior work, but in distinct settings compared to ours.
First, in federated learning, a decentralized model update procedure, attacks have been shown to leak label composition [42] and attributes present in local datasets [29], and to enable MI attacks [32].
Second, Salem et al. [35] show that an attacker with query access to an initial and an updated model can perform reconstruction attacks to recover the labels and feature values of the update points. The reconstruction attack is designed for small updates to the model, and works only in the online learning setting where only new points are considered when generating the update. The differences from our work are: (1) we focus on membership inference attacks on models supporting multiple updates; (2) we develop new attacks for this setting that combine existing standalone-model MI attacks without the heavyweight construction of many shadow models; and (3) we consider a range of model update regimes, by varying the size of the update samples, the distribution shift in the new data, and the retraining procedure, including attacks for multiple updates.
Third, Chen et al. [9] construct MI attacks for machine unlearning, in a setting where an adversary has access to an initial model and an updated model after a set of examples are removed from training. We observe that, when updating or unlearning are performed by retraining from scratch, an adversary with access to the models before and after unlearning is related to a model update adversary having access to the models before and after updating. We perform an experimental comparison with their attacks in Section 5.8 and find that our attacks, geared towards our specific setting, are significantly more powerful compared to theirs, which were not designed for this setting. In addition, the model updates setting considered by our work is more general in supporting multiple updates, data distribution shift under updates, and different training regimes (online learning and full retraining).
Memorization Attacks. Memorization attacks against generative language models demonstrate that training data can be extracted by an adversary with black-box query access to a model [6,7]. [7] generate and rank text samples from GPT-2 and use MI attacks to test that a generated sample belongs to the training data. [45] show that model updates in generative language models improve memorization attacks.
Differentially Private ML. Privacy attacks have inspired privacy-preserving training algorithms, including defenses specifically designed to prevent MI attacks [22,31], as well as the adoption of differential privacy [13]. Differentially private machine learning algorithms [1,3,8,41] are a defense against MI attacks, and while deployments of them exist, they are still relatively rare compared to the scale of machine learning workloads at large companies. Still, we experimentally evaluate differential privacy in the context of our attacks and show that differential privacy is an effective protection at low privacy parameters, and that our attacks can be an effective empirical audit of a differential privacy deployment.

MI WITH MODEL UPDATES

Background
Many supervised learning algorithms exist to train machine learning models from labeled data. In this work, we consider classification problems, where samples are drawn from a data domain X and the output space is a discrete set of classes Y = [k]. Given a training algorithm, the learner learns a set of parameters θ that define the model function f_θ and are chosen to minimize a loss function.
In our empirical evaluation, we consider logistic regression and various neural network models. A neural network is a model f_θ(x) computed as a chain of layers f_L ∘ f_{L−1} ∘ · · · ∘ f_1(x), where each layer function f_i takes its input z and computes h(W_i z + b_i), where W_i, b_i are trainable weights (the parameters θ are the weights from every layer) and h is a nonlinear activation function. A common activation function is the ReLU, h(z) = max(0, z) [16]. For tabular data, it is common to use simple models, such as logistic regression (where L = 1) or small neural networks (small L). Image data typically uses deeper networks with convolutions [25], a constraint on the weight matrices which exploits the structure of images to significantly improve performance. For classification, networks typically use a softmax output, which produces probability values for each of the classes.
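As a concrete (and heavily simplified) illustration of this model family, the following sketch implements the forward pass of a small two-layer network with a ReLU activation and a softmax output; the sizes and random weights here are arbitrary placeholders, not values used in the paper:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, layers):
    """Chain of layers f_L ∘ ... ∘ f_1; each layer computes h(W z + b)."""
    for i, (W, b) in enumerate(layers):
        z = W @ x + b
        x = z if i == len(layers) - 1 else relu(z)  # last layer feeds softmax directly
    return softmax(x)

rng = np.random.default_rng(1)
d_in, d_hidden, n_classes = 8, 16, 3
layers = [
    (rng.normal(scale=0.1, size=(d_hidden, d_in)), np.zeros(d_hidden)),
    (rng.normal(scale=0.1, size=(n_classes, d_hidden)), np.zeros(n_classes)),
]
probs = forward(rng.normal(size=d_in), layers)   # class probabilities for one input
```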
Training models typically proceeds by using gradient descent on a given dataset S. This requires defining a loss function ℓ(θ; x, y), which measures the model's performance on a given data point (x, y).

Figure 1: An adversary with query access to both models θ_0 and θ_1 can more effectively distinguish whether a query point (x, y) is in the update set S_1 or is an independent sample from the same distribution as S_1.

After initializing θ, a batch of samples {(x_j, y_j)}_{j=1}^B is selected from the training dataset, and the model parameters are updated in the inverse gradient direction of the loss, averaged over the batch, as

θ ← θ − (η/B) Σ_{j=1}^B ∇_θ ℓ(θ; x_j, y_j),

where η is a learning rate. Batches are sampled from S until all points are used, and this process is repeated multiple times; each full pass over the data is called an epoch.
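The minibatch update described above can be sketched for logistic regression on synthetic data; the learning rate, batch size, and epoch count below are illustrative choices, not values from the paper:

```python
import numpy as np

def sgd_epoch(theta, X, y, lr=0.1, batch_size=32, rng=None):
    """One epoch of minibatch SGD on the logistic loss (binary labels in {0, 1})."""
    rng = rng or np.random.default_rng(0)
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        p = 1.0 / (1.0 + np.exp(-Xb @ theta))   # predicted probabilities
        grad = Xb.T @ (p - yb) / len(idx)       # gradient of the loss, batch-averaged
        theta = theta - lr * grad               # step against the gradient
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = (X @ w_true + 0.1 * rng.normal(size=500) > 0).astype(float)

theta = np.zeros(5)
for epoch in range(20):
    theta = sgd_epoch(theta, X, y, rng=rng)
acc = ((X @ theta > 0) == (y == 1)).mean()
```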

Threat Model
We consider supervised machine learning problems in which the model parameters are updated over time by retraining with new data, which is common in applications for a variety of reasons. In general, as more labeled data becomes available, models should be updated to correct for potential errors and improve their generalization.
Another key reason for model updates is the potential shift in data distribution over time. For example, a sentiment analysis model trained on social media may need to be adapted to take into account recent events, models used for financial market forecasts or loan prediction need to be updated as the market evolves continuously, and medical datasets may need to be updated to incorporate larger and more diverse populations. The frequency of model updates depends on the application requirements, the data distribution shift, and the envisioned deployment scenarios. For many industrial applications, model retraining has become part of machine learning deployment pipelines. Our goal is to study the privacy implications of model updates over time, considering factors such as the size of the update, the number of updates, the data distribution shift, and the training regime. In our setting, a learner is given r + 1 datasets S_0, S_1, · · · , S_r over time, where each dataset S_i consists of n_i samples selected from a distribution D_i. The learner runs a training algorithm A_train on S_0 to produce an initial model θ_0. The learner is then provided with each new dataset S_i, and produces a new model θ_i by running an update algorithm A_up using only datasets S_0, S_1, . . . , S_i. This process is described in Algorithm 1.
In our work, we consider a fixed value for the update size at each iteration, n_i = n_up for all i > 0, and vary this value n_up. In standard membership inference (MI) attacks [35,39,43], the adversary can interact with the machine learning model in a black-box manner, with the goal of distinguishing if a data sample was part of the training set or not. In our model update setting, we consider a black-box adversary A, who is capable of observing the output of each model on multiple query points, but does not have knowledge of the specific model's architecture or parameters. We assume the adversary knows when the model is updated, potentially by querying the model and noticing when it changes its predictions. As models are retrained with new data, the adversary's goal is to infer if a data sample was part of the update set or not. In the setting with multiple model updates, the adversary is also interested in inferring at which time epoch the data sample was used to update the model. Figure 1 visualizes this threat model. We remark that the adversary here only makes inferences about the update set, not the initial training set. The update set will generally be smaller than the initial training set, so our attacks are designed to cause maximum leakage on a small subset of the data.

Figure 2: Exp-MI-Updates(A, A_up, S_0, θ_0, n_up, D_1, D_2, · · · , D_r):
(1) For i ∈ [r], draw n_up samples from D_i: S_i ∼ D_i^{n_up}.
(2) For i ∈ [r], produce the updated model θ_i by running A_up on θ_{i−1} and the datasets S_0, S_1, . . . , S_i.
(3) Sample i* ∈ [r] and b ∈ {0, 1} uniformly at random. b represents whether the test sample is in training or not, while i* represents which update it belongs to.
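A minimal simulation of this experiment might look as follows; the harness API, the toy running-mean update rule, and the distance-thresholding adversary are all assumptions made purely for illustration (a real instantiation would plug in an actual training pipeline and attack):

```python
import numpy as np

def exp_mi_updates(adversary, update_fn, theta0, n_up, dists, rng):
    """Schematic harness for the Exp-MI-Updates experiment (hypothetical API).

    dists[i](n) draws n samples from D_{i+1}; adversary(models, z) returns a
    membership guess b_hat and an update-index guess i_hat.
    """
    updates, models, theta = [], [theta0], theta0
    for dist in dists:
        S_i = dist(n_up)                  # (1) draw the update set S_i
        theta = update_fn(theta, S_i)     # (2) produce the updated model theta_i
        updates.append(S_i)
        models.append(theta)
    i_star = rng.integers(len(dists))     # (3) challenge update step ...
    b = rng.integers(2)                   # ... and membership bit
    z = updates[i_star][rng.integers(n_up)] if b else dists[i_star](1)[0]
    b_hat, i_hat = adversary(models, z)
    return int(b_hat == b), int(b == 1 and i_hat == i_star)

# Toy instantiation: the "model" is a running mean, and a naive adversary
# thresholds the distance between the query point and the final model.
rng = np.random.default_rng(0)
dist = lambda n: rng.normal(size=(n, 20))
update_fn = lambda theta, S: 0.5 * (theta + S.mean(axis=0))
adversary = lambda models, z: (int(np.linalg.norm(z - models[-1]) < 4.2), 0)

wins = [exp_mi_updates(adversary, update_fn, np.zeros(20), 5, [dist] * 3, rng)[0]
        for _ in range(200)]
```

As expected, this naive adversary hovers near chance; the attacks in Section 4 exploit per-point score changes instead.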

Formalization of MI under Model Updates
We formalize the problem of membership inference with model updates by adapting the membership inference experiment of [43] to the model update setting. We present the experiment in full generality, and introduce specific contexts for which we subsequently develop specific attack algorithms. Using our experiment, we also suggest a new form of leakage that can occur in the multiple update setting, called entry inference.
The membership-inference experiment. Let A_up be a model update algorithm, S_0 an initial training set, θ_0 an initial model, n_up an update set size, and each D_i with i ∈ [r] a distribution over samples (x, y). We define the MI experiment in Figure 2. The experiment allows the adversary access to all updated models, and requires it to distinguish between update and test data.
We instantiate this experiment in three settings, for attack algorithm development (Section 4) and evaluation (Section 5): single update, multiple updates, and single update distribution shift.
Single model update. In the single update setting, we consider r = 1 and D_1 = D, so the initial training distribution does not change when sampling updates. This setting lets us understand the difference between having access to the models before and after an update compared to only the final model. To evaluate the performance of our membership inference attacks, we can measure accuracy, Pr[b̂ = b], and precision, Pr[b = 1 | b̂ = 1]. An attack maximizing precision may differ from one maximizing accuracy. We also measure recall, Pr[b̂ = 1 | b = 1], but, as noted in [26] and [21], a MI attack achieving high precision is likely to be more harmful than one achieving high recall. An attack which classifies every sample as appearing in the training set is not harmful, but obtains high recall; meanwhile, an attack correctly identifying a single sample as appearing in the training set has tiny recall, but harms that sample.
Multiple model updates. The multiple update setting considers r > 1 and D_i = D for all i, so the training distribution remains constant. This lets us understand the difference between access to multiple models and access to only the last model. In this setting there is also a richer set of attacks that can be run.
• Membership Inference: As before, we measure the membership inference accuracy. This is the success rate at inferring whether a data sample is part of any of the update datasets. The harm here is identical to the standard harms suggested for membership inference: membership in a medical study may indicate that somebody has a disease, or membership in a loan dataset might indicate that someone applied for a loan.
• Entry Inference: We also introduce this novel threat, where the adversary guesses precisely at which update step a user enters the dataset, with success rate Pr[î = i*]. This measures how well the adversary can determine when a user started being used in training, which itself may be sensitive. For example, the time at which a patient enters a medical study may leak when they contracted a disease, or knowing when a loan was issued may leak sensitive information about someone's financial status. There are also cases where membership inference is more of a threat than entry inference; for example, if high membership inference success leads to social security numbers being extractable from a large language model [6], it is much more important that the social security numbers are leaked than when they entered the dataset, making membership inference the more serious threat. However, in other cases, membership inference may be less of a concern than entry inference. For example, membership in a commerce dataset may indicate that someone has been to a certain coffee shop, which itself may not be risky. However, knowing when the person entered the dataset may allow an adversary to link this time with other people's entry into the dataset, and learn about meetings between people. Ultimately, membership inference and entry inference are both risks and can lead to harm in meaningfully different ways.
Finally, we note that any attack which achieves a membership inference accuracy of c can be converted to an entry inference attack by randomly selecting one of the r update sets, achieving an entry inference accuracy of c/r. A standard random-guessing baseline therefore reaches a membership inference accuracy of 1/2 and an entry inference accuracy of 1/(2r).
Distribution shift. In this setting, we consider r = 1 and D_1 ≠ D.
While distribution shift may happen over several updates, we elect to isolate the impact that distribution shift has on our attacks. To measure attack performance, we use the same metrics as in the single update setting: membership inference accuracy, precision, and recall.
Retraining methods. There are multiple ways to use the datasets S_0, S_1, · · · , S_i to update model θ_{i−1} at epoch i, which we call:
• SGD-New. This strategy updates the model θ_{i−1} using only the new training set S_i. To prevent forgetting earlier datasets, one must use a small learning rate and few epochs.
• SGD-Full. This strategy updates the model θ_{i−1} using the entire training set S_0 ∪ S_1 ∪ · · · ∪ S_i available at epoch i. With a larger dataset, one can increase the learning rate and number of epochs, at the cost of using less recent data. This setting is practical for models with modestly sized training sets, but may be impractical for some models, like large language models.
These training strategies have also been used by previous work considering model updates. Zanella-Béguelin et al. [45] compare both strategies for extracting training data from generative language models, while Salem et al. [35] use a variant of continual training for mounting reconstruction attacks.
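The two retraining strategies can be contrasted in a toy setting where "training" simply pulls a parameter vector toward the mean of its training data via gradient steps on a squared loss; the learning rates and epoch counts are illustrative assumptions, not values from the paper:

```python
import numpy as np

def fit_mean(theta, S, lr, epochs):
    """Toy 'training': gradient steps on a squared loss pull theta toward mean(S)."""
    for _ in range(epochs):
        theta = theta - lr * (theta - S.mean(axis=0))  # grad of avg 0.5*||theta - x||^2
    return theta

def sgd_new(theta_prev, batches, lr=0.05, epochs=3):
    # SGD-New: update on the newest batch only; small lr and few epochs limit forgetting.
    return fit_mean(theta_prev, batches[-1], lr, epochs)

def sgd_full(theta_prev, batches, lr=0.5, epochs=20):
    # SGD-Full: update on all data seen so far; can afford a larger lr and more epochs.
    return fit_mean(theta_prev, np.concatenate(batches), lr, epochs)

rng = np.random.default_rng(0)
batches = [rng.normal(size=(100, 4)) for _ in range(3)]
theta_new, theta_full = np.zeros(4), np.zeros(4)
for i in range(1, len(batches) + 1):       # one model release per arriving batch
    theta_new = sgd_new(theta_new, batches[:i])
    theta_full = sgd_full(theta_full, batches[:i])
```

In this toy setting SGD-Full converges to the mean of all data seen so far, while SGD-New drifts toward each new batch and only partially retains earlier ones.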

ATTACK ALGORITHMS
We develop attacks for the single and multiple update instantiations of the MI under model update problem introduced in the preceding section. Our attacks are generic with respect to distribution shift and retraining methods, so we discuss those topics directly as part of our evaluation of the proposed attacks (Section 5). We first focus on attacks for the single update setting (Section 4.1). We propose multiple options and subsequently justify analytically both their designs and the need for the options (Section 4.2). Finally, we propose several attack options for the multiple update setting (Section 4.3). Table 1, placed at the end of this section, summarizes the various attacks and their options for easy access.

Single Update Attacks
Given a single model θ_0 trained on a dataset S_0, and an individual target example (x, y), the standard black-box way to test membership of (x, y) in S_0 is to compute an appropriate score function ℓ(x, y; θ_0) and apply some threshold to this score. Now suppose we are given two models θ_0, θ_1 trained on datasets S_0 and S_0 ∪ S_1 respectively, and a target example (x, y), whose membership in S_1 we want to infer. Intuitively, being a member of S_1 is equivalent to being a member of S_0 ∪ S_1 and a non-member of S_0. So, a first attempt is simply to infer membership in S_0 ∪ S_1 and membership in S_0 and decide membership in S_1 appropriately.
However, as we show in Section 4.2, if the score information is binary (e.g. it is the output of some membership-inference attack for each standalone model), then model updates do not increase the accuracy of membership-inference attacks.
However, score-based membership-inference attacks give more information than just the binary outcome, and a key contribution of our work is to show how to strictly outperform the preceding baseline by combining the two scores, ℓ(x, y; θ_0) and ℓ(x, y; θ_1). Given these two scores, there are multiple logical ways we could combine them to produce a single score for membership in S_1. In this work, we introduce two main strategies: ScoreDiff and ScoreRatio. As their names suggest, we define

ScoreDiff(x, y) = ℓ(x, y; θ_0) − ℓ(x, y; θ_1)  and  ScoreRatio(x, y) = ℓ(x, y; θ_0) / (ℓ(x, y; θ_1) + γ),

where γ > 0 is a damping constant to avoid instability when the denominator is close to 0. As we will show empirically in Section 5, neither of these two methods for combining scores dominates the other, and in Section 4.2 we give analytical justification for why each score can sometimes be superior.
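A minimal sketch of the two combiners might look as follows; the orientation of the scores (loss-like, with update-set members scoring low under θ_1) and the default damping constant are assumptions of this sketch:

```python
def score_diff(loss0, loss1):
    """Large when the loss dropped after the update, suggesting membership in S_1."""
    return loss0 - loss1

def score_ratio(loss0, loss1, gamma=1e-3):
    """Ratio variant; gamma damps instability when loss1 is close to zero."""
    return loss0 / (loss1 + gamma)

# Toy scores: update-set members tend to have a much lower loss under theta_1.
member = dict(loss0=2.3, loss1=0.05)
non_member = dict(loss0=2.2, loss1=2.1)
```

Both combiners rank the member above the non-member here, but they weight large relative drops very differently, which is one intuition for why neither dominates.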
While we could take the approach of Chen et al. [9] and learn how best to combine scores from scratch, our evaluation will show that choosing a fixed combiner, such as ScoreDiff or ScoreRatio, is both more efficient and more effective.
To use these methods for combining scores, we need to do two things: (1) instantiate these strategies by choosing the score function, and (2) determine a threshold to apply to convert the real-valued scores into a binary membership decision. As score functions, in this paper we use the standard cross-entropy loss [43] and the state-of-the-art LiRA score function [5]. These are both computable with only class probabilities, but LiRA requires training shadow models. Future single-model MI attacks might provide even better score functions that our strategies can incorporate.
In this work we consider multiple ways to set the threshold. To motivate these methods, we return to the standard interpretation of membership inference as a hypothesis-testing problem. Once the score function is fixed, any query point gets mapped to some value s. Typically, s is compared to a threshold τ: membership is declared IN if s ≥ τ, and OUT otherwise. This performs well because the distribution P_IN of IN scores and the distribution P_OUT of OUT scores will differ. In practice, however, the attacker does not know the IN and OUT distributions, and so needs some side information to find a good threshold. All membership-inference attacks give the attacker some side information for this goal. In this work we consider a few different types of side information the attacker can have, corresponding to different ways of setting the threshold: • Batch Strategy. The adversary has access to a dataset containing both update points (IN) and test points (OUT), but does not know which are which. Given these points, the adversary can compute scores and thereby see samples from the distributions P_IN and P_OUT, and can then find an optimal threshold for distinguishing these points. For example, if the attacker has k update points and k test points for large k, then the quantiles of these 2k scores will give a good threshold. In our work we choose the median to maximize accuracy and the 10th percentile to maximize precision. The assumption that the attacker has access to many update points is strong, but a useful thought experiment, since the attacker can try to approximate these samples on their own using other forms of side information, as our next approach does.
• Transfer Strategy ("Shadow Models"). The adversary has access to some set of test points. Using these test points, the attacker trains shadow models [26,35,39] and then updates these models using a random half of the test points. Provided that the test points are drawn from the same (or a similar) distribution and the attacker can train the models in the same (or a similar) way as the real models, this should allow the attacker to approximate P_IN and P_OUT. Since the attacker knows which points were included in the update set, the attacker now has a batch of update points and test points, so the attacker can use the batch strategy to obtain a good threshold for these points and then transfer that threshold to the models they are trying to attack. This only uses one shadow model, as we identify a threshold on a scalar value, rather than train a classifier, as many shadow-model attacks do [9,35]. • Rank Strategy. The adversary again has access to some set of test points. The attacker can use these to generate samples from P_OUT, and given these samples, can use the α-quantile of this distribution as a threshold. This strategy will achieve a false-negative rate of α. This strategy was employed by Leino et al. [26].
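The batch and rank strategies can be sketched as follows, assuming the convention that higher scores indicate membership (for loss-like scores, where members score low, the comparison direction and the percentiles flip); the synthetic score distributions are purely illustrative:

```python
import numpy as np

def batch_threshold(mixed_scores, target="accuracy"):
    """Batch strategy: the adversary holds k IN + k OUT scores but not their labels.
    With higher-is-IN scores, the median targets accuracy, while a high percentile
    declares fewer points IN, trading recall for precision."""
    q = 50 if target == "accuracy" else 90
    return np.percentile(mixed_scores, q)

def rank_threshold(out_scores, alpha=0.05):
    """Rank strategy: given samples from P_OUT only, thresholding at the
    (1 - alpha)-quantile misclassifies about an alpha fraction of OUT points."""
    return np.quantile(out_scores, 1 - alpha)

# Synthetic, well-separated score distributions for illustration.
rng = np.random.default_rng(0)
in_scores = rng.normal(2.0, 0.5, size=1000)    # scores of update (IN) points
out_scores = rng.normal(0.0, 0.5, size=1000)   # scores of test (OUT) points
mixed = np.concatenate([in_scores, out_scores])

tau = batch_threshold(mixed)
acc = 0.5 * ((in_scores >= tau).mean() + (out_scores < tau).mean())
```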

Analytical Justification
The preceding single-update attack algorithms rely on two key insights that our work contributes: (1) that taking advantage of model updates requires combining rich score information from individual-model MI attacks against the model and its update; and (2) that the two specific methods, ScoreDiff and ScoreRatio, that we propose for combining two scores into a single MI attack in the model-update setting are both needed and justified. This section provides analytical justification for these two insights. Section 5 provides empirical evidence in support of these claims.
The need for rich score information. We justify our score combination approach by showing that model updates can only increase the accuracy of membership inference when the attacker can obtain rich (non-binary) score information. We prove that, under reasonable assumptions, an adversary with access to only the binary scores does not improve when given access to the initial model. Here, we consider a threat model where a MI attack A returns only binary IN/OUT predictions on a point. Any MI attack eventually returns such a prediction, and some MI attacks (e.g., the "gap attack" [10]) use only binary information. Given access to two models, there are four possible outcomes for the MI attack: (A(θ_0), A(θ_1)) ∈ {IN, OUT} × {IN, OUT}. The adversary can make a decision for each of these outcomes, the frequencies of which we write below (as is standard, we assume any point has a .5 probability of being included in the update or test set, so these frequencies are conditional frequencies for an update or a test example).

Writing p_ab for the frequency of the outcome (A(θ_0), A(θ_1)) = (a, b) on update points and q_ab for the same frequency on test points, with 1 denoting IN and 0 denoting OUT, we make two assumptions:
(1) The attack on θ_0 returns IN equally often on update and test points: p_11 + p_10 = q_11 + q_10. This is realistic, as both sets of points are not in θ_0's training set.
(2) The attack on θ_1 returns IN more frequently on update than test points, for both points classified as IN and OUT by the attack on θ_0: p_11 > q_11 and p_01 > q_01. This is realistic, as the attack should return IN frequently on training data.

Theorem 4.1. For any single-model membership inference attack A returning only a binary IN/OUT prediction satisfying the above assumptions, there exists an adversary receiving only the output of A on θ_1 that obtains at least as high accuracy as any adversary with access to the output of A on both θ_0 and θ_1.
Proof Sketch. The proof identifies optimal decisions for the adversary with and without updates. The optimal strategy without updates follows the guesses of A. We show that, because the update leads to more IN guesses on update points, the optimal strategy with updates is the same as without. The full proof is in the Appendix. □

The main takeaway of Theorem 4.1 is that for an attack to successfully use model updates, it must exploit some rich score information, such as loss values that reflect the model's generalization gap, rather than binary IN/OUT information. We note that richer score information alone cannot be proven to yield successful update-based attacks; as a simple counterexample, the richer score could take one of three values, but only two are observed in practice. In that case, Theorem 4.1 would apply, and updates would not help. As a result, it is important to design setting-dependent score-combination approaches, as we do in Section 4.1.
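To make the theorem concrete, the following sketch brute-forces the optimal decision rule for a hypothetical set of outcome frequencies satisfying assumptions (1) and (2); the frequencies are illustrative, not taken from our experiments.

```python
# Numerical check of Theorem 4.1 with hypothetical outcome frequencies.
# Outcome (i, j): attack A predicts i on M0 and j on M1 (1 = IN, 0 = OUT).
from itertools import product

# Hypothetical frequencies satisfying the two assumptions:
# (1) u11 + u10 == t11 + t10  (A on M0 is uninformative)
# (2) u11 > t11 and u01 > t01 (A on M1 fires more often on update points)
u = {(1, 1): 0.4, (1, 0): 0.1, (0, 1): 0.3, (0, 0): 0.2}  # update points
t = {(1, 1): 0.3, (1, 0): 0.2, (0, 1): 0.1, (0, 0): 0.4}  # test points

def accuracy(rule):
    """rule maps each outcome to a final IN (1) / OUT (0) guess."""
    correct_in = sum(u[o] for o in u if rule[o] == 1)
    correct_out = sum(t[o] for o in t if rule[o] == 0)
    return 0.5 * correct_in + 0.5 * correct_out  # points are IN w.p. 1/2

# Brute-force all 16 decision rules over the four outcomes.
outcomes = list(u)
best = max((dict(zip(outcomes, bits)) for bits in product([0, 1], repeat=4)),
           key=accuracy)
# The optimal rule ignores A's output on M0: it guesses IN iff A says IN on M1.
print(best)  # {(1, 1): 1, (1, 0): 0, (0, 1): 1, (0, 0): 0}
```

As the theorem predicts, the exhaustively optimal rule depends only on A's prediction on the updated model.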
Justifying ScoreDiff and ScoreRatio. We justify our choice of score-combination functions by studying the examples of computing the mean and the median of the training data. These two examples show that our choices are well motivated, and also that the right approach depends on the specific learning problem being solved, so there is likely no single best method. Section 5.5 confirms these claims empirically.
The earliest work in membership inference [19,36], in the single-model/no-update setting, justified specific membership-inference attacks by exploiting a connection to hypothesis testing, using the Neyman-Pearson Lemma to devise optimal attacks. Proving exact optimality typically requires strong distributional assumptions and the ability to reason explicitly about the exact distribution of the learning algorithm's outputs. In our work, we mostly consider learning algorithms that are too complex for this sort of precise analysis (such as neural networks), so we settle for a more heuristic justification instead. In particular, we consider a learning algorithm that outputs the exact minimizer of the loss function as the initial model $M_0$, then performs an update on a single point $z$ by taking a single gradient step from $M_0$ with a fixed step size to obtain the updated model $M_1$. This update strategy corresponds to what we call SGD-New, but with a single step of training. We then analyze how the loss on the point $z$ changes as a result of the update, and see that the change is best reflected by either ScoreDiff or ScoreRatio, depending on the loss function.
In the Appendix, we show that ScoreRatio is a perfect membership-inference attack for the $\ell_2$ loss, and ScoreDiff is a perfect attack for the $\ell_1$ loss. Also in the Appendix, we study mean estimation under updates, showing that model updates provably improve accuracy, and lower bounding the accuracy of ScoreDiff.
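As a concrete illustration, ScoreDiff and ScoreRatio can be computed directly from per-example scores (e.g., losses) under the two models. This is a minimal sketch with hypothetical function names, assuming that a larger combined score indicates membership in the update set.

```python
import numpy as np

def score_diff(s0, s1):
    """ScoreDiff: additive drop in a per-example score (e.g. loss) from
    the initial model M0 to the updated model M1. Update points should
    see a larger drop than test points."""
    return np.asarray(s0) - np.asarray(s1)

def score_ratio(s0, s1, eps=1e-12):
    """ScoreRatio: multiplicative drop; eps guards against zero losses."""
    return np.asarray(s0) / (np.asarray(s1) + eps)

def predict_membership(s0, s1, threshold, combine=score_diff):
    """Predict IN (True) when the combined score exceeds the threshold."""
    return combine(s0, s1) > threshold
```

The threshold would then be set by one of the Batch, Transfer, or Rank strategies described in Section 4.1.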

Multiple Update Attacks
Multiple updates can allow leakage in two ways: an adversary can learn not only whether a user is contained in a training set (membership inference), but also when they began participating in that dataset (entry inference). The former is the standard membership-inference task, but entry inference may also be sensitive in several settings, such as medical datasets, where someone could learn when a patient contracted a disease. We construct an attack for each goal: the Back-Front attack for membership inference and the Delta attack for entry inference, presented in Algorithm 2.
Back-Front attack. This attack ignores all information except for the first and last model, and is designed for membership inference. It is the natural adaptation of the single-model attacks to the multiple-update setting, as it treats the sequence of updates as a single, large update algorithm. The attack uses either the score difference or ratio between the original model $M_0$ and the final model $M_k$ after all updates.
Delta attack. This attack is designed to perform entry inference. In this attack, we attribute samples to a specific update when they have a large loss difference (or ratio) on the two consecutive models $M_{t-1}, M_t$ produced by that update. Notice that we can also adapt this attack (or any entry-inference attack) to a membership-inference attack: a sample which is predicted to be in any update is predicted to be IN. As a result, we measure this attack's performance on both membership and entry inference.
Threshold setting. The three thresholding strategies we defined for the single-update setting, Batch, Transfer, and Rank, can be adapted to configure the threshold for the multi-update attacks. However, for simplicity, we focus on the Batch strategy. With Batch, the adversary is given a dataset $D$ which contains each update set $\{S_i\}_{i=1}^{k}$ of size $n$, as well as a test set of size $kn$. For the Back-Front attack, we set the threshold to the median combined score over the entire dataset $D$. For the Delta attack, we set each threshold so that $n$ points are classified into each update index.
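The two multi-update attacks can be sketched on a matrix of per-example scores (rows are models $M_0, \ldots, M_k$). This is a simplified, hypothetical implementation: in particular, the Delta variant below attributes each point to the update with the largest score drop, rather than using the per-update Batch thresholds described above.

```python
import numpy as np

def back_front(scores, threshold=None):
    """Back-Front: membership inference using only the first and last
    models' per-example scores (e.g. losses). Batch thresholding uses
    the median combined score over the pooled update + test data."""
    diff = scores[0] - scores[-1]      # ScoreDiff; a ratio also works
    if threshold is None:
        threshold = np.median(diff)    # Batch-style threshold
    return diff > threshold            # True -> predicted IN

def delta(scores):
    """Delta (simplified): entry inference. Attribute each point to the
    update t whose consecutive pair (M_{t-1}, M_t) shows the largest
    score drop."""
    drops = scores[:-1] - scores[1:]   # shape (k, n_examples)
    return drops.argmax(axis=0) + 1    # predicted update index in 1..k
```

For instance, a point whose loss drops sharply between $M_0$ and $M_1$ would be attributed to update 1.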

EVALUATION
We next evaluate our proposed algorithms for the single- and multi-update settings by answering seven key questions in the context of the datasets described in Section 5.1:

Q1: Does access to one model update give the attack an advantage over models with no access to updates? How does the update set size impact this advantage? (Section 5.2)
Q2: Does attack success improve with more updates? (Section 5.3)
Q3: How does the training strategy, SGD-New or SGD-Full, impact attack performance? (Section 5.4)
Q4: How do the various attacks and thresholding choices impact attack performance? (Section 5.5)
Q5: How does distribution shift impact attack performance? (Section 5.6)
Q6: How would adoption of differential privacy impact attack performance? (Section 5.7)
Q7: How do our attacks compare to those developed for unlearning in Chen et al. [9]? (Section 5.8)

Since we propose multiple attacks for two different settings, single-update and multiple updates, and each attack can be instantiated with multiple score functions and thresholding methods, the space of experimentation is quite large. Moreover, the performance of an attack can be measured in different ways, such as with accuracy, precision, and recall for the single-update setting, and membership-inference accuracy and entry-inference accuracy for the multiple-update setting. To tame this large experimentation space, we answer each question in the context of the algorithms, methods, settings, and metrics that are most relevant for it. Q1-Q3 choose the best-performing attack relevant to the considered setting, single or multiple updates, and focus on membership-inference accuracy and entry-inference accuracy, respectively. Q4 compares key pairs of algorithms and mechanisms under multiple performance metrics. And Q5-Q7 focus on the single-update setting and best-performing algorithms.

Datasets
In this section we describe the datasets used in our evaluation. We use the Adam optimizer for transfer learning datasets (CIFAR-10 and IMDb) and SGD otherwise, as we find this works best.
FMNIST. FMNIST is a 10-class dataset consisting of 28x28-pixel grayscale images of different clothing items. On this dataset, the initial model is a logistic regression model, trained on an initial dataset of 1000 data points for 50 epochs at a learning rate of 0.01. We use 1000 data points to be able to train many of these models, including running LiRA calibration. On average, this achieves 82.5% training accuracy and 79.5% test accuracy. SGD-New trains for 10 epochs at a learning rate of 0.001, and SGD-Full trains for 10 epochs at a learning rate of 0.01.

CIFAR-10. CIFAR-10 is a 10-class dataset of 32x32-pixel RGB images of various animals and vehicles. This dataset is harder than FMNIST and requires more complex models to achieve reasonable accuracy. We fine-tune a VGG-16 network pretrained on the ImageNet dataset. The initial model is trained for 12 epochs over a training set of 25000 points at a learning rate of $10^{-4}$. On average, this achieves 98.2% training accuracy and 82.6% test accuracy. SGD-New trains for 4 epochs at a learning rate of $10^{-5}$, and SGD-Full trains for 2 epochs at a learning rate of $10^{-4}$.
Purchase100. Purchase100 is a 100-class purchase history dataset. The task is to classify a shopper into one of 100 clusters. Here, our initial model is a single layer neural network trained on an initial dataset of 25000 samples for 95 epochs at a learning rate of 0.01. On average, this achieves 94.1% training accuracy and 84.1% test accuracy. SGD-New trains for 5 epochs at a learning rate of 0.01, and SGD-Full trains for 10 epochs at a learning rate of 0.1.
IMDb. IMDb is a text dataset of movie reviews, where the task is to classify a movie review as either positive or negative. Here, we fine-tune the BERT base model (uncased), which was pretrained on a large collection of English text. The initial model is trained for 4 epochs over a training set of 25000 points at a learning rate of $10^{-5}$. On average, this achieves 99.4% training accuracy and 93.8% test accuracy. SGD-New trains for 6 epochs at a learning rate of $10^{-6}$, and SGD-Full trains for 3 epochs at a learning rate of $10^{-5}$.

Advantage from a Single Update
Q1: Does access to one model update give the attack an advantage over models with no access to updates? How does the update set size impact this advantage?
To evaluate the MI advantage from a single update, we compare with three baseline attacks that use only the updated model $M_1$. The first baseline, called Loss, uses the approach of Yeom et al. [43], which compares the loss on a point to the average training loss. The second baseline, called Gap, uses the gap attack of [10], which classifies correctly classified points as training and incorrectly classified points as test. These strategies capture the role of label memorization as a cause of vulnerability [15], so attacks which perform better than these cannot rely solely on such memorization. For all datasets except IMDb, we also evaluate the LiRA attack [5] as a baseline, which trains shadow models to compute sample-specific baseline loss values for comparison. Figure 3 shows the accuracy of the best of our single-update attacks, compared to the best baseline without access to updates. We show these accuracies for update sizes varying from 1% to 32% of the original training set (except IMDb, where we use a fixed 10-320 points for acceptable running time). The best attack differs in each setting. For updates, fixing the Batch thresholding strategy, we choose the best of {ScoreDiff, ScoreRatio} × {loss score, LiRA score}. For no updates, we choose the best of the {Gap, Loss, LiRA} baselines. For both the update and no-update cases, we show accuracy for both training strategies, SGD-Full and SGD-New.
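The Loss and Gap baselines described above admit simple implementations. This sketch assumes access to per-example losses, hard predictions, and true labels (names hypothetical).

```python
import numpy as np

def gap_attack(predictions, labels):
    """Gap attack [10]: predict IN iff the model classifies the point
    correctly, exploiting only binary (mis)classification information."""
    return np.asarray(predictions) == np.asarray(labels)

def loss_attack(losses, avg_train_loss):
    """Loss baseline (Yeom et al. [43]): predict IN iff a point's loss
    is below the model's average training loss."""
    return np.asarray(losses) < avg_train_loss
```

Both baselines use only the updated model, so any advantage of our attacks over them must come from exploiting the update itself.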
For all datasets, update sizes, and training strategies, attacks with model updates outperform the no-update attacks. As expected, the gap between updates and no updates decreases as the update set gets larger. On FMNIST, for example, the Batch attack achieves 79% accuracy at 10 update points but 70% accuracy at 320 update points; the gap between the update and no-update attacks decreases from 27% to 18%. As an additional exploration of this effect, in Appendix C we compare the accuracy of our attacks with the success rate of single-model attacks when the training set is the same size as the update set, and find that our attacks are often more effective in this comparison as well. Also in Appendix C, we evaluate disparate impact, finding that some classes are somewhat more vulnerable to our attacks.
Q1 answer: Overall, our results show that updates give the adversary significant advantage to identify training set members.

Advantage from Multiple Updates
Q2: Does advantage improve with the number of updates?
To evaluate the threat of MI and entry inference as the number of updates grows, we run our multi-update attacks on a sequence of 1 to 10 updates and observe how each attack's performance changes. To isolate the role of multiple updates, we fix the update set size $n_{up}$ to 250 for CIFAR-10, 100 for Purchase100, and 10 for FMNIST and IMDb (1%, 0.4%, 1%, and 0.04% of the initial dataset, respectively). We measure both membership- and entry-inference accuracy, which are the metrics relevant for the multi-update attacks. In this section we focus on entry-inference accuracy and evaluate membership-inference accuracy in a subsequent section. However, we remind the reader that a membership-inference attack can be used to construct an entry-inference attack, which we will use as a baseline.
We compare against two baselines. The first, called Random, is random guessing: it randomly selects IN or OUT with probability 1/2 and selects the update index uniformly at random. This results in a success probability of $1/(2k)$ when there are $k$ updates. We also define a stronger baseline, called Generic, which runs a membership-inference attack to answer IN or OUT, and selects the update index uniformly at random. We ignore the details of the membership-inference attack for this section. If the membership-inference accuracy of the attack is $a$, then it obtains entry-inference accuracy $a/k$. When membership-inference accuracy is better than random chance ($a > 1/2$), Generic outperforms Random.

Figure 4 shows the multiplicative improvement of entry-inference accuracy over random guessing as a function of the number of updates. As before, we show the best attack for each case, fixing the thresholding strategy to Batch. Our best attacks significantly outperform the baselines for entry inference, with the gap increasing with the number of updates. For example, for SGD-New on CIFAR-10, at 2 updates, our best attack (in this case, ScoreRatio with cross-entropy loss as the score) outperforms the baseline by a factor of 2.03×, while at 8 updates, it outperforms the baseline by 6.10×. Still, for both the baselines and our attacks, the absolute value of entry-inference accuracy decays with the number of updates. This is natural, as increasing the number of updates makes it more difficult to guess which of the many updates a point was used in; this is why the Random and Generic baselines' success probabilities, $1/(2k)$ and $a/k$ respectively, decay with $k$. For our attack, the absolute entry-inference accuracy for CIFAR-10 is 50.8% at 2 updates and 38.1% at 8 updates. Yet, our attacks fare much better compared to the baselines, whose entry-inference accuracy for CIFAR-10 is 25% at 2 updates and 6.25% at 8 updates. For a real adversary, it is unclear whether the absolute accuracy level or the improvement over baselines will be more relevant, and future work may be required to increase the absolute attack success rate further, especially for very long training processes.

Figure 4: For $k$ updates, Random guessing achieves $1/(2k)$ accuracy. We present the best attack strategy for each dataset: for FMNIST and Purchase100, this is loss difference, but loss ratio for CIFAR-10. As the number of updates increases, the attack performs significantly better than random guessing, although the absolute accuracy decays.
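The two baselines admit closed-form entry-inference accuracies; the helpers below (hypothetical names) reproduce the figures quoted above.

```python
def random_baseline(k):
    """Random baseline: guess IN with probability 1/2, then one of the
    k update indices uniformly; entry-inference accuracy is 1/(2k)."""
    return 1 / (2 * k)

def generic_baseline(mi_accuracy, k):
    """Generic baseline: an MI attack with accuracy a followed by a
    uniform update-index guess achieves entry-inference accuracy a/k."""
    return mi_accuracy / k
```

For CIFAR-10 this gives the 25% (k = 2) and 6.25% (k = 8) baseline accuracies reported above.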
Q2 answer: Thus, as the number of updates increases, the absolute accuracy fundamentally decreases for all cases, but our attack significantly outperforms appropriate baselines.

Impact of Training Strategy
Q3: How does the training strategy, SGD-New or SGD-Full, impact attack performance?
The success of the MI attack depends on the training strategy, which is a choice the learner makes that the adversary cannot influence. If one training strategy were consistently less vulnerable to attack than the other, then the learner could choose the former, as a heuristic defense. We find that such a heuristic exists, but it depends on the update size and dataset, making it difficult for the learner to configure without experimentation with our attacks.
We revisit Figures 3 and 4, which already illustrate the effect of SGD-New and SGD-Full on attack performance with one and multiple updates, respectively. For a single update, Figure 3 shows the attack is more effective with SGD-New when the update set is small. However, the attack's performance on SGD-New degrades faster than on SGD-Full as the update set size increases. For example, on CIFAR, the gap between 250 and 8000 update points is 10% for SGD-New but only 2% for SGD-Full. This faster drop in performance with SGD-New on larger update sets makes SGD-Full, in fact, more vulnerable to attack for large update sets on CIFAR-10 compared to SGD-New. The reason for this inversion is that SGD-Full would naturally use a higher learning rate for the update set compared to SGD-New. Interestingly, the inversion already happens on Purchase100 at the smallest update set size.
For multiple updates, we observe a similar and even more consistent effect. Figure 4, which uses a very small update set of 1% of the original training set, shows the attack as most effective with SGD-New. This is true even more consistently across datasets than with a single update. In the multiple update setting, SGD-New only uses each update point once, so an update point will observe a larger loss decrease in the update in which it appears compared to the loss in updates in which it does not appear. SGD-Full, by contrast, should observe a loss decrease in each update. We include results in Appendix C (Figure 10a) showing that for a larger update set size of 10% of the original training set, attacks on SGD-Full are more effective relative to attacks on SGD-New.
Q3 answer: Thus, as a rule of thumb, when update sets are small, training on the whole dataset is less vulnerable to MI attack; but when update sets are larger, training on only the update set is less vulnerable. The point at which the inversion happens depends on the dataset, so a learner should tune this heuristic, for example, by running our attack. Section 5.7 evaluates differential privacy as a far more principled defense against MI attacks, yet still one that will likely require some tuning or auditing in practice, for which our attacks could prove useful.

Impact of Attack Strategy

Q4: How do the various attacks and thresholding choices impact attack performance?
In the preceding questions, we picked the best performing attack algorithm and thresholding strategy from the suite we are proposing. But are all these attacks/thresholds really needed, or do some of them outperform others consistently? The answer is, indeed, that different algorithms and thresholds perform better in different settings, so they are all relevant in an attacker's toolkit. For example, the LiRA score function is more difficult to compute than loss, so should be used when it can be reliably estimated. Recall that the adversary's knowledge determines their choice of threshold. We show here a few main comparisons.
ScoreDiff vs. ScoreRatio in terms of accuracy. Focusing first on the single update setting, Table 2 shows attack accuracy with an update size of 1% of the original training set. We compare ScoreDiff and ScoreRatio when using the standard cross-entropy loss and either the Batch or Transfer thresholding strategy (Batch and Transfer are the only ones relevant for accuracy). We show results for both SGD-New and SGD-Full training strategies. The ScoreRatio strategy typically outperforms ScoreDiff. However, there are exceptions, especially at large update set sizes. The two threshold selection strategies, Batch and Transfer, have comparable performance in the cases we show. We also ran the same evaluation with LiRA as the score in ScoreDiff and ScoreRatio. LiRA consistently performs better than traditional loss, except on CIFAR-10, where there is not enough data to train sufficient shadow models for it to perform well. We find LiRA tends to be strongest with ScoreRatio, but ScoreDiff performs comparably.
For the multiple update setting, there is similarly no single best choice for the attacker. For example, loss ratio outperforms loss difference on CIFAR-10, while loss difference outperforms loss ratio on FMNIST and Purchase100, and there is little difference on IMDb.
For a specific use case, an adversary might experiment on their own shadow updated model to determine whether to use ScoreDiff or ScoreRatio, and which privacy score to use. Thresholding strategies are determined by what the adversary has access to.
ScoreDiff vs. ScoreRatio in terms of precision/recall. The broad takeaways for precision are similar to those for accuracy. SGD-New outperforms SGD-Full, as we can see from comparing Table 3 and a table included in the Appendix, Table 7. At $n_{up} = 0.01\,n_0$, Batch achieves 90% precision on FMNIST for SGD-New, while the best attack achieves 86% precision for SGD-Full. This difference is more pronounced for other datasets. Precision is higher when fewer update points are used: Transfer achieves 86% precision with $n_{up} = 0.01\,n_0$ on FMNIST, but only 78% precision with $n_{up} = 0.08\,n_0$. For precision, we have three strategies for selecting the threshold: Batch, Transfer, and also Rank. While Rank often achieves a higher recall than the other strategies, Batch and Transfer typically achieve higher precision. Thus, once again, the choice of algorithm and its configuration requires experimentation.
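To illustrate how the Rank strategy trades precision for recall, here is a minimal sketch (hypothetical names) that predicts IN for the top-ranked combined scores and computes precision and recall.

```python
import numpy as np

def rank_threshold(combined_scores, n_in):
    """Rank strategy: predict IN for the n_in highest combined scores,
    e.g. when the adversary knows or guesses the update-set size."""
    pred = np.zeros(len(combined_scores), dtype=bool)
    pred[np.argsort(combined_scores)[::-1][:n_in]] = True
    return pred

def precision_recall(pred, truth):
    """Precision and recall of binary IN/OUT predictions."""
    tp = np.sum(pred & truth)
    return tp / max(pred.sum(), 1), tp / max(truth.sum(), 1)
```

Shrinking n_in keeps only the most confident IN guesses, which tends to raise precision at the cost of recall.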
Delta vs. Back-Front in terms of membership inference accuracy. For the multiple update setting, we proposed two membership inference strategies for the attacker: Delta, which compares each adjacent pair of models; and Back-Front, which only compares the first and last models. Figure 5 compares these two strategies for SGD-Full in terms of membership inference accuracy. As before, no strategy strictly dominates. The Delta attack performs the best for CIFAR-10. On Purchase100, the Back-Front attack is best. On FMNIST, the Back-Front and Delta attack perform comparably. As before, an adversary could decide between one strategy versus another based on experiments with a shadow model that they train themselves.
Q4 answer: These results show that the variety of algorithms and configurations that we proposed is truly needed, because no single algorithm and configuration will work best in all situations.

Impact of Distribution Shift
Q5: How does distribution shift impact attack performance?

An important characteristic of the model-retraining setting we tackle in this paper is that models are updated to keep up with a shifting distribution. How distribution shift impacts MI attack performance has received very little evaluation in prior literature, hence its investigation, in the context of our proposed attacks, is a significant contribution of our work. The contribution consists of two components: (1) a new methodology for evaluating the impact of distribution shift on MI attacks and (2) the evaluation of our proposed attack algorithms with this methodology.
Methodology. We consider two types of distribution shift: subpopulation shift and covariate shift.
We consider subpopulation shift using the BREEDS framework [37]. The BREEDS framework generates subpopulation shift by generating a hierarchy of classes and shifting between classes which are close in the hierarchy. For example, a "dog" class trained on images of dalmatians may struggle to recognize poodles as dogs. We adapt BREEDS to CIFAR-10 with an "animal vs. vehicle" binary task, and vary the animals and vehicles to simulate distribution shift. "Animals" are {bird, cat, deer, dog, frog, horse}; "vehicles" are {airplane, automobile, ship, truck}. For the "animal" class, we consider a source class $A_s$ and a target class $A_t$ (both class 0 in the binary task). Likewise, the "vehicle" class has a source class $V_s$ and target class $V_t$ (both class 1). We write the distribution of a given class $c$ as $\mathcal{D}_c$. We consider balanced classes, so that the source distribution is $\mathcal{D}_S = \frac{1}{2}(\mathcal{D}_{A_s} + \mathcal{D}_{V_s})$ and the target distribution is $\mathcal{D}_T = \frac{1}{2}(\mathcal{D}_{A_t} + \mathcal{D}_{V_t})$.

For covariate shift, we use the CINIC-10 dataset [11]. CINIC-10 is a combination of the CIFAR-10 dataset and the ImageNet dataset, filtered to the 10 classes of the CIFAR-10 dataset. The covariate shift we consider is from a source distribution consisting of purely CIFAR-10 data (or purely ImageNet data) to a mixture of an $\alpha$ fraction of CIFAR-10 and a $1 - \alpha$ fraction of ImageNet data. This is not a form of subpopulation shift, as CIFAR-10 and ImageNet do not represent different subpopulations of each class, but rather a different distribution of examples for each class (with ImageNet consisting of harder data). We also describe in Appendix C an experiment using CINIC-10 to confirm our subpopulation-shift experiments on a larger dataset.
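The covariate-shifted distribution can be sampled as a simple mixture; this sketch (hypothetical names and pools) draws an $\alpha$ fraction of examples from a CIFAR-10 pool and the rest from an ImageNet pool, in the spirit of our CINIC-10 setup.

```python
import random

def sample_shifted(cifar_pool, imagenet_pool, alpha, n, rng=None):
    """Draw n examples from the shifted distribution: an alpha fraction
    from the CIFAR-10 pool and a (1 - alpha) fraction from the ImageNet
    pool. alpha = 1.0 recovers a pure CIFAR-10 distribution."""
    rng = rng or random.Random(0)
    n_cifar = round(alpha * n)
    return (rng.sample(cifar_pool, n_cifar)
            + rng.sample(imagenet_pool, n - n_cifar))
```

Varying alpha then controls how far the update distribution has moved from the source.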
We run our single-update attacks from Section 4.1 on the preceding distribution-shift methodologies. Our goal is to isolate the role of distribution shift on attack performance. For both settings, we vary the parameter $\alpha$ and the "hardness" of the shift, keeping everything else constant. For subpopulation shift, we consider two settings of $(V_s \to V_t), (A_s \to A_t)$. The first, which we call Hard, is (Airplane → Automobile), (Cat → Bird). The second, which we call Easy, is (Automobile → Truck), (Cat → Dog). Hard is a distribution shift where the original model will not perform well on the new data, due to the dissimilarity between the original and update classes. Easy is a distribution shift where the original model will perform well. For covariate shift, we call starting with CIFAR-10 the Hard setting, as this model will perform poorly on ImageNet data, and starting with ImageNet the Easy setting, as this model will perform better on CIFAR-10 data. Importantly, our methodology does not measure the ability of an adversary to distinguish the old and new distributions: our MI game formulation from Section 3 samples test points and update points identically, from the same distribution. Instead, we measure the ability to distinguish shifted training points from shifted testing points. Intuitively, a large distribution shift requires the model to fit to the specific update points to accommodate the new distribution, thereby making the update points vulnerable to MI. It is worth noting that the only prior work investigating distribution shift's impact on privacy, namely Zanella-Béguelin et al. [45], shows that some outputs are more likely after a distribution shift, but their experiments cannot isolate whether this is a privacy violation or just the model adapting to the new distribution.

Table 5: (Q5) Accuracy and precision/recall of attacks after covariate shift on SGD-New. Ratio and Diff stand for ScoreRatio and ScoreDiff, respectively, with the loss score. Here, Ratio performs best for accuracy, and a Hard distribution shift results in more accurate attacks than an Easy shift, but no longer has as significant an impact as with subpopulation shift.
We thus believe that our methodology can constitute a better platform for future measurements of MI under distribution shift.
Evaluation. The second component of our distribution-shift contribution is the evaluation of our proposed algorithms using the preceding methodologies. We focus on the single-update attacks and use the Batch thresholding strategy. For the subpopulation-shift setup, we use $n_{up} = 50$ and $n_0 = 5000$. The covariate shift is a more complex task, so we set $n_{up} = 1000$ and $n_0 = 10000$. We evaluate accuracy and precision/recall for both shifts with $\alpha \in \{0.2, 0.6, 1.0\}$, with results for subpopulation shift shown in Table 4 and results for covariate shift in Table 5. We show results for SGD-New, using both ScoreDiff and ScoreRatio with the loss score (SGD-Full results for subpopulation shift can be found in Table 10 in the Appendix).

Accuracy: ScoreRatio is the most effective strategy in all cases. For the Hard subpopulation shift, the faster the change in distribution (larger $\alpha$), the more effective MI is. As $\alpha$ increases from 0.2 to 1.0, accuracy on SGD-Full increases by 0.07, and accuracy on SGD-New increases by 0.13. Interestingly, there is no strong trend with $\alpha$ for covariate shift, likely because the task is complex enough that it leads to high risk regardless (our attacks with covariate shift tend to have high accuracy and precision). Similar to how SGD-New is more influenced by $n_{up}$ in Section 5.4, SGD-New is also more heavily influenced by $\alpha$. Interestingly, accuracy decreases as $\alpha$ grows for the Easy shift. This shows that a drastic shift requires significant changes to the model, leading it to overfit to the specific update points and make them more vulnerable.
Precision/Recall: The lessons for precision/recall are similar to those for accuracy, but precision tends to be much higher than accuracy, often reaching >90% (and as high as 98%), especially at more drastic subpopulation shifts. For subpopulation shift, we still find that ScoreRatio is generally more effective, and that Hard shifts and large $\alpha$ result in higher precision. For covariate shift, neither attack clearly outperforms the other, and precision is always high, even with small $\alpha$ and Easy shifts. The key difference from accuracy is that even the Easy shift results in high precision at large $\alpha$ (and even at small $\alpha$ for covariate shift); this is likely because, even when the distributions are similar, there are some samples which still require the model to change significantly to learn them.
Disparate Impact: Finally, we consider whether our attacks have disparate impact [23,46]. That is, we ask whether some subpopulations are more impacted by our attacks. We focus on this question here due to the explicit distinctions made between source and target distributions. For covariate shift, we find no disparate impact. This is likely due to the lack of distinction between subpopulations. For subpopulation shift with CIFAR-10, we do find disparate impact, for example at $\alpha = 0.2$.

Q5 answer: A drastic subpopulation shift can result in higher MI risk than a gradual shift, but not if MI risk is already high.

Differential Privacy
Q6: How does differential privacy impact our attacks?
A rigorous strategy for preventing our attacks is training with differential privacy. Indeed, training with $(\epsilon, \delta)$-differential privacy imposes an upper bound on the accuracy of any MI attack, and also on the precision at a fixed level of recall. We evaluate the effectiveness of this strategy by training with DP-SGD [1,3,41], the standard algorithm for training differentially private models. This algorithm modifies standard SGD by clipping gradient norms and adding noise. We use the implementation provided by TensorFlow Privacy [28]. We fix $\delta = 10^{-4}$ and vary the noise multiplier to reach fixed values of $\epsilon$, computed with the accounting provided by the repository. We focus on the Fashion-MNIST dataset, with an update size of 100 and a clipping norm of 0.5, and we fix all other parameters to be the same as in previous sections. We focus on the single-update setting for simplicity. The private model retains accuracy similar to the non-private model (around 75% test accuracy) when $\epsilon$ is at or above 0.2; however, test accuracy decreases below 65% at $\epsilon = 0.06$.
Protection wanes with $\epsilon$. We present the results of this experiment in Figure 6a. As expected, as $\epsilon$ increases, the protection from our attacks decreases. For example, at $\epsilon = 0.26$, our best attack reaches a precision of only 0.59, but at $\epsilon = 1.1$, it reaches a precision of 0.63. Precision levels off as $\epsilon$ increases further. The gap between the attacks with and without access to updates is also largest at moderate $\epsilon$ values. At $\epsilon < 1$, there is little difference. The no-update attacks also catch up at extremely large $\epsilon$, where the noise addition is minor and gradient clipping is the major difference between DP-SGD and the standard implementation of SGD.
Auditing differentially private deployments with our attacks. Following [20,33], we can use our results to provide empirical lower bounds on the privacy of each update algorithm, as a means of understanding how worst-case upper bounds on $\epsilon$ correspond to practical privacy against state-of-the-art attacks. We can convert the bound used by either [20] or [33] to a bound on precision, giving that precision $p$ should be bounded by $e^{\epsilon}/(1 + e^{\epsilon})$, or, identically, that $\epsilon$ is lower bounded by $\ln(p/(1 - p))$. Since we cannot measure $p$ directly, we follow [20] and compute conservative estimates via Clopper-Pearson confidence intervals. We have 400 trials for each $\epsilon$, as we train 20 models with 20 points in each trial. We report in Figure 6b these computed lower bounds for both SGD-New and SGD-Full, enforcing a confidence of 98% for each reported value. We highlight two key takeaways. First, as in non-private training, SGD-New empirically offers less privacy protection than SGD-Full for the points in the update. In fact, our attack does not refute the possibility that SGD-Full satisfies differential privacy with $\epsilon = 0$, although we stress that this is not robust evidence that privacy is not a concern when retraining with SGD-Full. Second, the provable upper bounds on the privacy of SGD-New are nearly tight for moderate values of $\epsilon$. With provable $\epsilon$ in 0.12-1.09, our lower bounds are within a 2.0-3.6× factor of the theoretical upper bound. A gap smaller than 3.6× is perhaps remarkable, since state-of-the-art attacks on standalone models trained with DP-SGD have gaps of 5-10× [20,33], suggesting that model updates represent an especially risky scenario for private model training.
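The conversion from observed attack precision to an empirical $\epsilon$ lower bound can be sketched as follows. This is our reading of the [20]-style audit, with hypothetical function names, plugging a one-sided Clopper-Pearson lower confidence bound on precision into the bound above.

```python
import math
from scipy.stats import beta

def empirical_eps_lower_bound(true_positives, positive_guesses,
                              confidence=0.98):
    """Audit sketch: an (eps, ~0)-DP mechanism bounds MI precision p by
    e^eps / (1 + e^eps), so eps >= ln(p / (1 - p)). We plug in a
    one-sided Clopper-Pearson lower confidence bound on p to be
    conservative."""
    if true_positives == 0:
        return 0.0
    # Lower confidence bound for a binomial proportion (Clopper-Pearson).
    p_low = beta.ppf(1 - confidence, true_positives,
                     positive_guesses - true_positives + 1)
    if p_low <= 0.5:
        return 0.0  # consistent with random guessing: no nontrivial bound
    return math.log(p_low / (1 - p_low))
```

Higher observed precision over more trials tightens the confidence bound and thus raises the refutable $\epsilon$.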
Q6 answer: Our results show that differential privacy is an effective protection at lower ε, and also that our attacks can be an effective method of empirically auditing a differential privacy deployment.

Comparison with Chen et al. [9]. Q7: How do our attacks compare to those developed for unlearning in Chen et al. [9]?
We observe that machine unlearning can be viewed as the "reverse operation" of our model update setting. Then we can adapt the MI attacks designed for machine unlearning in Chen et al. [9], and compare them to ours. In their strategy, the adversary trains shadow models to learn an "attack model". This attack model takes as input some combination of the probability vectors returned by the two models, and outputs a prediction for whether the point was unlearned. They experiment with different instantiations of the attack, and we reproduce their SortedDiff attack in our setting, which takes the difference between sorted probability vectors before and after deletion. They note SortedDiff is their best attack on well-generalized models, as our models are. We run their attack with up to 30 shadow models, with each of the attack model architectures tested, and report the best attack from these. We focus on our loss score attacks with Transfer thresholds, as this fits the threat model they consider (LiRA requires training different types of shadow models, so we avoid this comparison for simplicity). Our attacks here will therefore only use a single shadow model (to set the threshold), while their attacks will be allowed up to 30.
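To make the contrast between the two styles of test statistic concrete, here is a minimal sketch (the helper names are our own, and plain lists stand in for model outputs):

```python
def sorted_diff_features(p_before, p_after):
    """SortedDiff-style features (after Chen et al. [9]): elementwise
    difference of the two sorted probability vectors -- one feature per
    class, fed to a learned attack model."""
    return [a - b for a, b in zip(sorted(p_after), sorted(p_before))]

def score_diff(loss_before, loss_after):
    """ScoreDiff: a single scalar, the change in loss across the update."""
    return loss_after - loss_before

def score_ratio(loss_before, loss_after, eps=1e-12):
    """ScoreRatio: a single scalar, the ratio of losses across the update."""
    return loss_after / (loss_before + eps)

def predict_in_update(loss_before, loss_after, threshold):
    """Thresholding a single scalar: guess IN when the loss drops enough
    across the update. The threshold can be set with one shadow model."""
    return score_diff(loss_before, loss_after) < threshold
```

The learned attack model must fit a decision function over the full SortedDiff feature vector, while our statistics reduce the problem to choosing one threshold on one scalar.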
We compare ScoreDiff and ScoreRatio with the Chen et al. [9] attack in Figure 7, for both SGD-Full and SGD-New, on the FMNIST dataset. We observe that both ScoreDiff and ScoreRatio always outperform their attack, although the gap can be somewhat small, depending on the update size. We are able to do this with a single shadow model because it is easier to identify a good threshold on a single feature than to learn a good function over 10 features. We corroborate this on Purchase100.
To further demonstrate the strength of the test statistics we compute, we run an experiment allowing the Chen et al. attack model to access our ScoreDiff and ScoreRatio features, in addition to the features they use (this results in 12 features: two of ours, and 10 SortedDiff features from their paper). We allow this improved Chen et al. attack 30 shadow models, and have it learn a logistic regression attack model on these 12 features. We inspect the weights the model learns, as a way to measure how useful the features are, and find that the average weight assigned to our features is 7.5x higher than the weight assigned to a SortedDiff feature! This speaks to the value of carefully constructing a useful test statistic, rather than attempting to learn one from a high-dimensional space. Shadow models are more useful when used to improve a simple test statistic, as our results with LiRA show.

Figure 7: Performance of the attack of Chen et al. [9] relative to our attacks. We consistently outperform their attack, despite only using a single shadow model to set our attack threshold.
Q7 answer: ScoreDiff and ScoreRatio are more efficient (fewer shadow models) and more effective (higher accuracy) than the attack of [9].

CONCLUSIONS
We have presented and evaluated MI attacks which leverage model updates. Our attacks apply to a variety of settings, including when models are repeatedly updated and when the distribution shifts over time. Our strategies are theoretically justified and empirically achieve both high accuracy and high precision. Empirically, we find the update set size, the training algorithm, and any distribution shift to be key factors impacting our attacks' performance. As a general rule, the smaller the update set size, the more effective our attacks are. This holds true for attacks on single updates as well as multiple updates, and for both SGD-New and SGD-Full. It is well known that MI attacks are more successful when training sets are smaller [14,36], and our results confirm this in the model update setting. Since model accuracy also improves more with larger updates, this is a nice win-win for privacy and utility! A drastic distribution shift also improves the performance of an attack, as learning the new distribution requires fitting heavily to the update points. Zanella-Béguelin et al. [45] also notice that distribution shift results in memorization in the generative language model setting, but our results are the first to identify this as privacy leakage, rather than the model adapting to a new distribution. Finally, the specific training setup used by the learner can impact the accuracy of MI. Models updated repeatedly can be used to improve MI, and can also leak the time that a data point appeared in the training set. Learners training with SGD-New are more vulnerable than SGD-Full at small update set sizes and when distributions shift significantly, but this trend tends to reverse as the update set gets larger and the distribution is more stable.

We now prove Theorem 4.1, which shows that rich score information is required to take advantage of updates. We begin by restating the theorem and assumptions.
Theorem (Theorem 4.1, restated). For any single-model membership-inference attack A returning only a binary IN/OUT prediction and satisfying the above assumptions, there exists an adversary receiving the output of A on M_1 alone that obtains at least as high accuracy as any adversary with access to the output of A on both M_0 and M_1.
Proof. Given only knowledge of M_1, the adversary may only make a decision based on A(M_1). A uniform decision must therefore be made for all points with A(M_1) = IN, representing a u_11 + u_01 fraction of update points and a t_11 + t_01 fraction of test points. (Here u_ij and t_ij denote the fraction of update and test points, respectively, on which A outputs i on M_0 and j on M_1, with 1 denoting IN.) Because there is a balance between update and test points, there will be more update points in this set when u_11 + u_01 > t_11 + t_01, so the optimal attack should guess IN when this inequality holds and OUT otherwise. The fraction of overall points correctly classified by this rule is (1/2) max(u_11 + u_01, t_11 + t_01). Similarly, points for which A(M_1) = OUT should be guessed as IN if u_10 + u_00 > t_10 + t_00 and OUT otherwise. The optimal attack therefore achieves accuracy (1/2)(max(u_11 + u_01, t_11 + t_01) + max(u_10 + u_00, t_10 + t_00)). Now, Assumption 2 states that the attack guesses IN on update data more than on test data, so max(u_11 + u_01, t_11 + t_01) = u_11 + u_01; the optimal attack should guess IN when A does. Similarly, max(u_10 + u_00, t_10 + t_00) = t_10 + t_00, so the optimal attack should guess OUT when A does. The optimal attack here is to simply use the predictions returned by A, reaching an accuracy of (1/2)(u_11 + u_01 + t_10 + t_00).

Using updates. Now, an attack with access to both M_0 and M_1 may make decisions based on both attack results. Following the same argument as before, a u_11 fraction of update points are guessed as IN on both models, while a t_11 fraction of test points are, so the optimal attack should classify these points as update points if u_11 > t_11 and as test points otherwise. Applying the same logic to all four possible attack results, the optimal attack achieves accuracy (1/2)(max(u_11, t_11) + max(u_01, t_01) + max(u_10, t_10) + max(u_00, t_00)).

Again, Assumption 2 gives max(u_11, t_11) = u_11 and max(u_01, t_01) = u_01. Assumption 1 states that the attack returns IN equally often on update and test points on M_0, so combining Assumption 1 and Assumption 2 gives u_10 < t_10, so that max(u_10, t_10) = t_10. Finally, Assumption 1 implies that u_01 + u_00 = t_01 + t_00. Combining this with Assumption 2 gives u_00 < t_00, so max(u_00, t_00) = t_00. Then the accuracy of the optimal attack given both M_0 and M_1 is (1/2)(u_11 + u_01 + t_10 + t_00), identical to the attack which did not use updates. In both cases the attack is the same: follow the guesses made by A when run on M_1. □
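The counting argument can be checked numerically. Below is a small sketch with hypothetical confusion fractions chosen to satisfy both assumptions: u[i, j] and t[i, j] denote the fraction of update and test points, respectively, on which the attack outputs i on the original model and j on the updated model (1 = IN).

```python
# Hypothetical fractions; u sums to 1 over update points, t over test points.
u = {(1, 1): 0.40, (0, 1): 0.20, (1, 0): 0.10, (0, 0): 0.30}
t = {(1, 1): 0.25, (0, 1): 0.10, (1, 0): 0.25, (0, 0): 0.40}

# Assumption 1: the attack outputs IN equally often on both kinds of points on M0.
assert abs((u[1, 1] + u[1, 0]) - (t[1, 1] + t[1, 0])) < 1e-12
# Assumption 2: the attack outputs IN more often on update points on M1.
assert u[1, 1] > t[1, 1] and u[0, 1] > t[0, 1]

# Best accuracy using only the output on M1 (bucket points by that output).
acc_no_update = 0.5 * (max(u[1, 1] + u[0, 1], t[1, 1] + t[0, 1])
                       + max(u[1, 0] + u[0, 0], t[1, 0] + t[0, 0]))

# Best accuracy using the outputs on both models (bucket by the output pair).
acc_with_update = 0.5 * sum(max(u[i, j], t[i, j])
                            for i in (0, 1) for j in (0, 1))

# Under the assumptions, the two optima coincide (both are 0.625 here).
assert abs(acc_no_update - acc_with_update) < 1e-12
```

Perturbing the fractions so that Assumption 1 fails makes the with-update optimum strictly larger, matching the theorem's message that binary outputs only help once the attack behaves differently on the two models.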
For the ℓ_2^2 loss, the update rule is θ_1 = θ_0 − 2η(θ_0 − x), so that ∥θ_1 − x∥_2^2 = (1 − 2η)^2 ∥θ_0 − x∥_2^2: when recomputing the mean, the loss after the update is a fixed ratio of the loss before the update. Meanwhile, the probability that a randomly drawn test point will have this same loss ratio is 0. Then, with a single update point and a known learning rate, the loss ratio is a perfect membership test.
For the ℓ_2 loss, the update rule is θ_1 = θ_0 − η(θ_0 − x)/∥θ_0 − x∥_2, which shows that recomputing the geometric median results in a fixed constant decrease from the loss before the update: ∥θ_1 − x∥_2 = ∥θ_0 − x∥_2 − η. This makes the loss difference also a perfect membership test in this setting.
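Both update rules can be checked numerically. This is a sketch with an arbitrary point and an assumed learning rate η = 0.1:

```python
eta = 0.1                       # assumed learning rate
theta0 = [0.3, -0.2, 0.5]       # "model" before the update (a point estimate)
x = [1.0, 0.4, -0.7]            # the single update point

def l2(u, v):
    """Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

# l2^2 loss: gradient step theta1 = theta0 - 2*eta*(theta0 - x).
theta1 = [t - 2 * eta * (t - xi) for t, xi in zip(theta0, x)]
# The squared loss shrinks by exactly (1 - 2*eta)^2, independent of x.
assert abs(l2(theta1, x) ** 2 - (1 - 2 * eta) ** 2 * l2(theta0, x) ** 2) < 1e-9

# l2 loss: gradient step along the unit direction toward x.
d0 = l2(theta0, x)
theta1b = [t - eta * (t - xi) / d0 for t, xi in zip(theta0, x)]
# The loss drops by exactly eta (whenever eta < d0), independent of x.
assert abs((d0 - l2(theta1b, x)) - eta) < 1e-9
```

A test point drawn from a continuous distribution hits either of these exact values with probability 0, which is what makes the ratio and difference perfect membership tests here.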

B.2 Mean Estimation
In this section we give a more detailed treatment of Example 1.1 from the Introduction. Namely, we analyze the effect of updates in a very simple case: a single update of a rudimentary "model" that estimates the mean over a multi-dimensional dataset.
We consider the task of estimating the mean of samples drawn from a d-dimensional spherical Gaussian distribution D = N(μ, σ^2 I_{d×d}). We consider a learner train = up which simply outputs the sample mean of its training data, and which produces two models: the first, μ̂_0, is computed on a dataset S_0 of n_0 samples, and the second, μ̂_1, is computed with an additional n_1 samples. The total dataset S = [S_0; S_1] contains n = n_0 + n_1 samples. We consider an adversary who seeks to identify whether a given z was contained in S_1.
In this setting, we can upper bound the performance of any attack when the adversary has no access to model updates (that is, it only has access to μ̂_1). Next, we show that, when given access to model updates (both μ̂_0 and μ̂_1), the adversary can outperform this upper bound.
Theorem B.1. Consider membership inference using mean estimation in R^d with an initial dataset of size n_0 and a single model update with a set of size n_1. If condition (1) holds, then there is an attacker with access to both μ̂_0 and μ̂_1 that outperforms every adversary with access to only μ̂_1. Here Φ(x) = Pr[N(0, 1) ≤ x] is the standard Gaussian CDF.
The condition (1) is always satisfied when n ≫ d ≫ n_1, in which case the left-hand side is close to 1/2 and the right-hand side is close to 1. We now prove this statement by breaking it into two lemmas. Lemma B.2 upper bounds all adversaries with access to only μ̂_1. Lemma B.3 analyzes a specific attack using μ̂_0 and μ̂_1 together. Combining these lemmas proves Theorem B.1.
Lemma B.2. For the task of membership inference of a sample z on mean estimation, with probability > 1 − exp(−d) over the selection of z, all adversaries A have success rate bounded above by 1/2 + (1/4)√(5d/n).

Proof. Notice that, in the OUT case, μ̂_1 is distributed as D_OUT = N(μ, (σ^2/n) I), while in the IN case it is distributed as D_IN, whose mean is shifted by (z − μ)/n. We can upper bound the success of this adversary by a function of the total variation (TV) distance between the two distributions: any adversary's success rate is at most 1/2 + TV(D_OUT, D_IN)/2. Recall that the TV distance between N(μ_0, Σ_0) and N(μ_1, Σ_1) can be upper bounded [12] in terms of the difference in means and covariances; we use this to bound TV(D_OUT, D_IN) by ∥z − μ∥_2 / (2σ√n). Because z − μ ∼ N(0, σ^2 I), we have ∥z − μ∥_2 < σ√(5d) [24] except with probability at most exp(−d), which gives TV(D_OUT, D_IN) ≤ (1/2)√(5d/n). This completes the proof. □

Lemma B.3. There is an adversary with access to both μ̂_0 and μ̂_1 that achieves the success rate on the right-hand side of condition (1). This holds with probability > 1 − 2 exp(−d/16) over the choice of z, and when n_1 > 1.
Proof. We consider an adversary with access to model updates, receiving two quantities. The first is the mean of S_0, μ̂_0 ∼ N(μ, (σ^2/n_0) I). The second is the overall mean μ̂_1. When z is not contained in S_1, μ̂_1 is distributed as (n_0/n) μ̂_0 + (n_1/n) N(μ, (σ^2/n_1) I). When z is found in S_1, μ̂_1 is distributed as (n_0/n) μ̂_0 + (1/n) z + ((n_1 − 1)/n) N(μ, (σ^2/(n_1 − 1)) I).

With both of these quantities, the adversary computes the mean of only S_1: μ̂_Δ = (n/n_1) μ̂_1 − (n_0/n_1) μ̂_0. The task of determining whether z is contained in S_1 can now be written as the task of distinguishing between the distribution of μ̂_Δ when z is not included, D_{Δ,OUT} = N(μ, (σ^2/n_1) I), and the distribution when it is included, D_{Δ,IN} = N(μ + (z − μ)/n_1, ((n_1 − 1) σ^2 / n_1^2) I). In the OUT case, ⟨μ̂_Δ − μ, z − μ⟩ is distributed as N(0, σ^2 ∥z − μ∥_2^2 / n_1).

The adversary guesses OUT if ⟨μ̂_Δ − μ, z − μ⟩ is below τ = (1/(2n_1)) ∥z − μ∥_2^2 and IN otherwise. If D_{Δ,IN} and D_{Δ,OUT} had equal variance, this would be the optimal Neyman-Pearson distinguisher [34]; because the variances are similar, the test will still be effective. For convenience, we write ρ = ∥z − μ∥_2^2 / σ^2.
The probability the adversary succeeds when z is OUT is Φ(τ / (σ ∥z − μ∥_2 / √n_1)) = Φ((1/2)√(ρ/n_1)), and the probability of success when z is IN is analogous, with the mean ∥z − μ∥_2^2 / n_1 and the slightly smaller variance of D_{Δ,IN} in place of those of D_{Δ,OUT}. Averaging the two success probabilities gives the accuracy claimed in the lemma. This completes the proof. □
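The threshold test in this proof can be simulated directly. The sketch below uses our own toy parameters; for simplicity the adversary is assumed to know μ = 0, and μ̂_0 is drawn from its N(μ, (σ^2/n_0) I) distribution rather than from an explicit S_0.

```python
import math
import random

random.seed(1)
d, n0, n1, sigma = 50, 200, 10, 1.0   # toy parameters (not the paper's)
n = n0 + n1
trials, correct = 2000, 0

for _ in range(trials):
    member = random.random() < 0.5
    z = [random.gauss(0, sigma) for _ in range(d)]                  # candidate point
    S1 = [[random.gauss(0, sigma) for _ in range(d)] for _ in range(n1)]
    if member:
        S1[0] = z                                                   # z joins the update set
    mu0 = [random.gauss(0, sigma / math.sqrt(n0)) for _ in range(d)]  # mean of S0
    mu1 = [(n0 * m0 + sum(s[j] for s in S1)) / n for j, m0 in enumerate(mu0)]
    # Recover the update-set mean from the two released models.
    mu_delta = [(n * m1 - n0 * m0) / n1 for m0, m1 in zip(mu0, mu1)]
    # Threshold test: guess IN when <mu_delta, z> exceeds ||z||^2 / (2 * n1).
    stat = sum(md * zi for md, zi in zip(mu_delta, z))
    tau = sum(zi * zi for zi in z) / (2 * n1)
    correct += ((stat > tau) == member)

accuracy = correct / trials
print(accuracy)   # comfortably above chance
```

Note that the contribution of μ̂_0 cancels exactly when forming μ̂_Δ, which is why the attack's power depends only on d and n_1.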

B.3 ScoreDiff Achieves High Accuracy
We now show that ScoreDiff with the loss score on mean estimation achieves high accuracy when d ≫ n_1.
Theorem B.4. Suppose θ_0 is the mean of S_0, and θ_1 is produced by taking a single gradient step from θ_0 with a learning rate of η on the ℓ_2^2 loss. Then there is some constant c and threshold τ such that, if d > c n_1, running ScoreDiff with a threshold of τ reaches a membership-inference accuracy of >90% for both SGD-Full and SGD-New.
In the following, we write the mean of S_1 as μ_up. When excluding a sample z, we write the mean of S_1 \ z as μ_rest. We begin by proving Lemma B.5.
Lemma B.5. For the task of mean estimation with an update S_1, when θ_0 is the mean of the original dataset S_0, a gradient step with learning rate η using SGD-Full is equal to a gradient step using SGD-New with a learning rate of η′ = η n_1 / (n_0 + n_1).
Proof. In SGD-New, the gradient step performed on S_1 is θ_0 − 2η′(θ_0 − μ_up). In SGD-Full, the gradient step on S_0 ∪ S_1 is θ_0 − 2η(n_1/n)(θ_0 − μ_up), as the gradient on S_0 sums to 0 because θ_0 is the minimizer of the ℓ_2^2 loss on S_0. We see that these gradient steps are rescalings of each other, as we wanted to show. □

Having proven Lemma B.5, we can now prove Theorem B.4, by fixing an η and analyzing the loss difference with SGD-New.
Proof. We write the loss difference for a fixed θ_0 and z, and will later consider the cases where z is IN S_1 and where z is a test point, OUT of S_1. The loss difference is ℓ(z; θ_1) − ℓ(z; θ_0) = ∥2η′(μ_up − θ_0)∥_2^2 − 4η′⟨μ_up − θ_0, z − θ_0⟩.

Now, the norm ∥2η′(μ_up − θ_0)∥_2^2 can be computed from θ_0 and θ_1, so we consider only the distribution of the rightmost term T = −4η′⟨μ_up − θ_0, z − θ_0⟩, showing that this is smaller for update points than for test points. In the OUT case, we write T as T_OUT, which can be expanded into a combination of independent χ^2 random variables; its mean is −4η′∥μ − θ_0∥_2^2, and its variance scales as η′^2 σ^4 d / n_1. In the IN case, we write T as T_IN, which admits a similar χ^2 decomposition, but whose mean is smaller by an additional 4η′σ^2 d / n_1, since z itself contributes a 1/n_1 fraction of μ_up. Standard concentration of the χ^2 terms then separates T_IN from T_OUT whenever d is sufficiently large relative to n_1, which allows us to bound our success probability from below as claimed. □
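Lemma B.5's rescaling can be verified numerically for the mean-estimation loss. This is a sketch with arbitrary data; `grad` is our hypothetical helper for the gradient of the average ℓ_2^2 loss.

```python
import random

random.seed(2)
d, n0, n1, eta = 5, 20, 4, 0.05
S0 = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n0)]
S1 = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n1)]

def mean(S):
    return [sum(x[j] for x in S) / len(S) for j in range(d)]

def grad(theta, S):
    """Gradient of (1/|S|) * sum_x ||theta - x||^2, i.e. 2 * (theta - mean(S))."""
    m = mean(S)
    return [2 * (t - mj) for t, mj in zip(theta, m)]

theta0 = mean(S0)                      # minimizer of the loss on S0

# SGD-Full: one step on S0 ∪ S1 with rate eta.
theta_full = [t - eta * g for t, g in zip(theta0, grad(theta0, S0 + S1))]

# SGD-New: one step on S1 alone with the rescaled rate eta' = eta * n1 / (n0 + n1).
eta_new = eta * n1 / (n0 + n1)
theta_new = [t - eta_new * g for t, g in zip(theta0, grad(theta0, S1))]

# The two updated models coincide, as the lemma states.
assert all(abs(a - b) < 1e-9 for a, b in zip(theta_full, theta_new))
```

The cancellation relies on θ_0 being the exact minimizer on S_0; replacing `theta0` with any other point breaks the equality.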

Mean Estimation
To validate the improved performance of model updates for mean estimation, we test attacks on the mean-estimation setup described earlier in the Appendix. For these experiments, we set μ = 0, n_0 = 200, d = 250, σ = 0.1, and we vary n_1. We compare an attack which leverages updates against one which does not.
In the no-update case, we run the Neyman-Pearson optimal attack [34], as described in Algorithm 3. This attack, when provided μ̂_1, computes the PDF values of μ̂_1 under D_OUT and D_IN and guesses the hypothesis with the higher likelihood.

Figure 9: Membership inference on non-updated models trained on small datasets (SGD-Full and SGD-New) vs. models updated on small datasets (LiRA). Attacks on models updated on small datasets appear to be more resilient to dataset size than models trained on small datasets.
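A sketch of this no-update Neyman-Pearson attack on the mean-estimation setup (our own simulation, not the paper's code): we use the parameters above with n_1 = 20, and draw the mean of the remaining n − 1 points directly from its Gaussian distribution rather than materializing the dataset.

```python
import math
import random

random.seed(3)
mu, sigma, n0, n1, d = 0.0, 0.1, 200, 20, 250
n = n0 + n1
trials, correct = 1000, 0

def log_gauss(x, mean, var):
    """Log-density of a spherical Gaussian N(mean, var * I) at x."""
    return sum(-0.5 * math.log(2 * math.pi * var) - (xi - mi) ** 2 / (2 * var)
               for xi, mi in zip(x, mean))

for _ in range(trials):
    member = random.random() < 0.5
    z = [random.gauss(mu, sigma) for _ in range(d)]
    if member:
        # mu1 = ((n - 1) * mean-of-the-rest + z) / n
        rest = [random.gauss(mu, sigma / math.sqrt(n - 1)) for _ in range(d)]
        mu1 = [((n - 1) * r + zi) / n for r, zi in zip(rest, z)]
    else:
        mu1 = [random.gauss(mu, sigma / math.sqrt(n)) for _ in range(d)]
    # PDF of mu1 under D_OUT and D_IN; guess the likelier hypothesis.
    log_out = log_gauss(mu1, [mu] * d, sigma ** 2 / n)
    log_in = log_gauss(mu1, [((n - 1) * mu + zi) / n for zi in z],
                       (n - 1) * sigma ** 2 / n ** 2)
    correct += ((log_in > log_out) == member)

print(correct / trials)   # above chance, but far from the update attack
```

Even this optimal test is only moderately accurate without updates, since z shifts μ̂_1 by only ∥z − μ∥/n.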
In the update case, we run the corresponding Neyman-Pearson optimal attack, which is equivalent to the no-update Neyman-Pearson optimal attack after replacing μ̂_1 with μ̂_Δ. When provided μ̂_1 and μ̂_0, the attack computes the mean of S_1 as μ̂_Δ = (n/n_1) μ̂_1 − (n_0/n_1) μ̂_0. In Figure 8, we empirically evaluate these attacks. The immediate takeaway is that using model updates never results in worse performance than not using updates. Notice that this message holds in a variety of scenarios, even those not covered by our theoretical analysis. Because d = 250 and n_0 = 200, when n_1 < 50 we have d > n, and even in this setting, which our analysis handles poorly, the model-update attack outperforms the no-update attack. Also, even when n_1 is much larger than n_0, the attack which uses updates does not perform poorly relative to the attack which does not. The improved performance from using model updates is robust to a wide range of parameter settings.

Section 5.2 Experiments - Single Update. We report the precision and recall for SGD-New attacks in Table 7. These precision values are even higher than the precisions reached by SGD-Full presented in Table 3.
Section 5.2 Experiments - Comparison to Single Update Attacks with Small Datasets. A model trained on a small dataset is more likely to overfit than one trained on a larger dataset, naturally leading to increased membership-inference risk. As a result, it is perhaps unsurprising that we find that larger update set sizes are less vulnerable to our attacks. Here, we experiment on FMNIST to see if the increased vulnerability on small datasets differs between two settings: a model trained on a small dataset, and a model updated with a small dataset. The results of our experiment can be found in Figure 9. We find that, while the standard LiRA attack is as strong as our adaptation on SGD-New, its performance drops much more quickly than our update attacks, so that the update setting leaks (according to our attacks) more than the fixed-dataset setting. Furthermore, in practice, updating a model with a small dataset is likely to be tolerable, as it will not worsen an already accurate model, while training a model entirely on a small dataset will likely lead to an inaccurate model that is unlikely to be deployed, making our finding here more worrisome.

Table 7: Precision/recall results for single-update models on SGD-New. Loss, Batch, Transfer, and Rank are defined in Table 2. We report results for two n_up values.

Figure 10: Entry inference (a) and membership inference (b) accuracy of multiple update attacks on FMNIST with n_up = 100, using loss difference. As in Section 4.1, SGD-New attacks perform worse relative to SGD-Full.

In the absence of distribution shift, we measure per-class attack accuracy on FMNIST in Figures 12a and 12b. We report single-update SGD-Full and SGD-New accuracies with n_up ∈ {10, 160}. While there are no strong differences, Class 1 (trouser) is reliably easy to attack, with perfect accuracy at n_up = 10 for both SGD-Full and SGD-New. Class 7 (sneaker) also has somewhat higher vulnerability to our attacks.
Section 5.5 Experiments - Multiple Updates. We present in Figures 10a and 10b the results for attacks on FMNIST when n_up = 100, to compare against the results from Section 4.3, which use n_up = 10.
In Figure 10a, we show that, at larger update set sizes, attacks on SGD-Full can outperform attacks on SGD-New, in line with experiments from Section 4.1. The entry-inference accuracy of attacks with n_up = 100 is also smaller than for attacks with n_up = 10, which further corroborates our experiments in Section 4.1.
In Figure 10b, we make similar observations to those in Figure 10a. Attacks are less powerful at large update set sizes, and attacks on SGD-Full perform better relative to attacks on SGD-New.
Section 5.6 Experiments - Distribution Shift. We now reproduce our subpopulation-shift findings on the CINIC-10 dataset. Here, we use CINIC-10 to construct a dataset of ImageNet images corresponding to 9 classes from CIFAR-10 (all but bird). Again using the BREEDS framework, we select source and target subclasses which should have "Easy" transfer based on how close they are in the ImageNet synset tree (we select leaf classes which share a parent node), and we select "Hard" subclasses by selecting leaf classes which share only the CIFAR-10 label, making them far apart in the tree. We detail these classes in Table 8. We also need to make sure these classes are well represented, so we only select subclasses with more than 200 images in CINIC-10 (due to the large number of bird classes, we could not find bird subclasses fitting this requirement, so we omit it). We then use an update set size of 900, and vary the shift ratio between 0.2, 0.6, and 1.0, as before. We report the accuracy and precision of our attacks in Table 9. Interestingly, the correlations we observed in our CIFAR-2 dataset for BREEDS do not hold up, similar to the covariate-shift case with CINIC-10. This is likely because the attacks are already highly successful without much shift, making it harder for these shifts to improve upon them.

Table 10: (Q5) Accuracy and precision/recall of attacks after subpopulation shift. Ratio and Diff stand for ScoreRatio and ScoreDiff, respectively, with the loss score. Ratio performs best, a Hard distribution shift results in better attacks as the shift ratio increases, and SGD-New is typically more vulnerable than SGD-Full.