Losing Less: A Loss for Differentially Private Deep Learning

Differentially Private Stochastic Gradient Descent, DP-SGD, is the canonical approach to training deep neural networks with guarantees of Differential Privacy (DP). However, the modifications DP-SGD introduces to vanilla gradient descent negatively impact the accuracy of deep neural networks. In this paper, we are the first to observe that some of this performance can be recovered when training with a loss tailored to DP-SGD; we challenge cross-entropy as the de facto loss for deep learning with DP. Specifically, we introduce a loss combining three terms: the summed squared error, the focal loss, and a regularisation penalty. The first term encourages learning with faster convergence. The second term emphasises hard-to-learn examples in the later stages of training. Both are beneficial because the privacy cost of learning increases with every step of DP-SGD. The third term helps control the sensitivity of learning, decreasing the bias introduced by gradient clipping in DP-SGD. Using our loss function, we achieve new state-of-the-art tradeoffs between privacy and accuracy on MNIST, FashionMNIST, and CIFAR10. Most importantly, we improve the accuracy of DP-SGD on CIFAR10 by 4% for a DP guarantee of ε = 3.


INTRODUCTION
Releasing machine learning (ML) models may risk the privacy of training data, as ML models unintentionally leak information about the training data; this is due to limitations of learning, including overfitting [22] and data memorisation [5,10]. The framework of Differential Privacy (DP) by Dwork et al. [8] is the gold standard for formalising privacy guarantees. When training ML models, a DP training algorithm provides guarantees bounding the leakage of private data that may occur during training. This requires that one can bound the influence of individual training examples on the output of the algorithm (e.g. parameters or gradients of the model). In addition to these strong theoretical guarantees, DP models are empirically resistant to various attacks: membership inference [20,21], training data extraction [5] and data poisoning [15].
Differentially Private Stochastic Gradient Descent, DP-SGD, trains an ML model under the framework of DP by (1) clipping per-example gradients to a fixed norm to bound their sensitivity and (2) adding noise to these clipped gradients before they are applied to update the model parameters [1]. When done separately and at the level of a minibatch, clipping [16,29] or noising [9] effectively regularise learning because they respectively control the dynamics of iterates and smoothen the loss landscape. However, gradient clipping and noising are detrimental to model performance in DP-SGD because they are applied in combination and at the granularity of individual training examples. In addition to this, the accuracy of DP-SGD is negatively affected by the limited number of iterations that can be computed: each iteration of DP-SGD increases the privacy budget expended by learning.
Recent works have explored bounded activation functions [19], publicly available feature extractors and data [23], randomised smoothing [24] and gradient embedding [27] as a means to improve the tradeoffs between privacy and accuracy of DP-SGD. However, training networks with strong DP guarantees in DP-SGD still comes at a significant cost in accuracy for datasets like CIFAR10.
In this paper, we show that one key aspect of formulating the optimisation problem solved in learning is left out of studies seeking to improve deep learning with DP: the loss function. All existing implementations of deep learning with DP we are aware of optimise cross-entropy (which performs well in the non-privacy-preserving setting with SGD). However, we first observe that optimising cross-entropy with DP-SGD leads to exploding model weights, layer pre-activations and logit values. This is true even if the activation functions themselves are bounded [19]. This phenomenon makes it difficult to control the sensitivity of the learning algorithm with minimal impact on its correctness. Indeed, clipping and noising large gradients in DP-SGD introduce information loss and additional biases. More precisely, the direction of a minibatch's clipped per-example gradients in DP-SGD is not necessarily aligned with the direction of the original gradients in SGD. Furthermore, we also note that the slow training convergence of cross-entropy [2] negatively impacts the performance of DP-SGD, for which a limited number of iterations is required to achieve strong privacy guarantees.
In this paper, to tackle these two limitations of the cross-entropy loss, we propose to tailor the loss function to the specifics of DP-SGD. We design a novel loss function that takes into account sensitivity, training convergence and the order in which examples are learned with the overarching goal of improving the tradeoffs between privacy and accuracy of DP-SGD. We achieve these objectives with a novel loss function that combines the Sum Squared Error (SSE) between the model logits and ground-truth labels, the focal loss [13], and a regularisation penalty on pre-activations.
The first two penalties impose a curriculum structure on training to improve the privacy-accuracy tradeoffs of DP-SGD by exploiting the generalisation and fast convergence speed [11,25] of curriculum learning [4]. We start training with SSE, which has faster convergence [2] and smaller gradients than cross-entropy. Then, we gradually shift toward emphasising hard-to-learn examples using the focal loss: this down-weights the losses assigned to examples that are already learned correctly, to focus more on hard-to-learn and misclassified examples. Finally, our regularisation penalty prevents the weights from exploding, to further reduce the magnitude of per-example gradients prior to clipping. Implementation-wise, our new loss function can be easily integrated within existing implementations of deep learning with DP; it requires a single-line change to edit the loss function being optimised.
In summary, our main contributions are as follows:
• We are the first to investigate the loss function in the context of deep learning with DP. We analyse the effect of the loss function on the sensitivity of DP models by providing extensive analysis of the norm of activations, weights and gradients in Section 4.
• We propose a novel loss function tailored to deep learning with DP. In Section 3 and Section 4, we analytically and experimentally connect the superior performance of our proposed loss function to the gradient norm and direction, smoothness of the loss surface, and convergence of DP-SGD. Our loss function better controls the sensitivity of the learning algorithm by better preconditioning gradients for clipping and noising.
• We show in Section 4 that our proposed loss function is compatible with other DP-SGD improvement strategies and, more importantly, allows us to establish new state-of-the-art tradeoffs between privacy and accuracy on several key benchmarks for deep learning with DP, including CIFAR10. Finally, we perform an ablation study to analyse the effect of each component of our proposed loss function on DP learning.

DIFFERENTIAL PRIVACY, DP-SGD AND STATE-OF-THE-ART APPROACHES
Trained machine learning models memorise and leak information about their training data [10, 20-22]. Differential privacy [8] is the established gold standard for reasoning about the privacy guarantees of learning algorithms. An algorithm is differentially private if its outputs are statistically indistinguishable on neighbouring datasets. More formally, a randomised learning algorithm M is (ε, δ)-differentially private if the following holds for any two neighbouring datasets D and D′ that differ in only one record, and for all S ∈ Range(M):

Pr[M(D) ∈ S] ≤ e^ε Pr[M(D′) ∈ S] + δ    (1)

The privacy budget ε upper bounds the privacy leakage in the worst possible case: the smaller the ε, the tighter the upper bound (or the stronger the privacy guarantee). Note that the factor δ is very small (generally it is chosen to be on the order of the inverse of the dataset size).
To train ML models with DP, we can add randomness to the learning algorithm in three ways: output perturbation [26,30], objective perturbation [6,12] or gradient perturbation [1,3]. Among these approaches, Differentially Private Stochastic Gradient Descent, DP-SGD, of Abadi et al. [1] has established itself as the de facto strategy because of its versatility. DP-SGD perturbs the gradients computed at each step of stochastic gradient descent using the Gaussian mechanism to achieve competitive tradeoffs between privacy and accuracy [23,28]. Next, we describe the process of training a classifier using DP-SGD.
In each DP-SGD iteration, similarly to a non-private SGD iteration, the classifier receives each training example x from a minibatch of B data points sampled randomly from X. The activation of each intermediate layer l of size d_l, a_l = φ(h_l) ∈ R^{1×d_l}, is computed by applying a non-linear function φ(·) to the pre-activation h_l = W_l a_{l−1}. The pre-activation is a linear combination of the weights of that layer, W_l, and the activation of the previous layer, a_{l−1}. The last layer outputs the logit values associated with each of the K classes without any activation function, a_L = h_L = W_L a_{L−1} ∈ R^{1×K}. The cross-entropy loss L_CE for each training example x is

L_CE(x) = − Σ_{k=1}^K y_{(x,k)} log p_{(x,k)}    (2)

The probability that x belongs to the k-th class, p_{(x,k)} ∈ [0, 1], is computed by applying a Softmax function to the logit value of its corresponding class, a_{(x,k)}. The DP-SGD optimiser computes the per-example gradient of L_CE with respect to W, as opposed to the per-minibatch gradient in the SGD optimiser, to bound the influence of each individual training example on the output of the learning algorithm (i.e. the weight gradients). The weight gradients G_l for each training example x are computed from the last layer and back-propagated towards the first layer as:

G_l = δ_l a_{l−1}^⊤,  with δ_L = ∂L/∂a_L and δ_l = (W_{l+1}^⊤ δ_{l+1}) ⊙ φ′(h_l)    (3)

One can continue this gradient computation until the first layer. All the per-example gradients are concatenated as

G_x = [G_1; G_2; …; G_L]    (4)

where ; denotes the concatenation operation. As there is no a priori bound on G_x, the ℓ2 norm ∥·∥_2 of each G_x is artificially clipped by a threshold C to bound the influence of each x on the final gradients G:

Ḡ_x = G_x / max(1, ∥G_x∥_2 / C)    (5)

Finally, the classifier weights are updated through the average of the clipped and noisy (Gaussian noise scaled by the noise multiplier σ and the clipping norm C) per-example gradients of a minibatch:

G = (1/B) (Σ_x Ḡ_x + N(0, σ² C² I))    (6)

The accuracy of classifiers trained with DP-SGD is lower than that of classifiers trained by non-private SGD. Recently, researchers have investigated the choice of φ(·) and of the data representation to improve the accuracy of classifiers trained by DP-SGD. Papernot et al.
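The per-example clipping and noising steps above can be sketched in plain Python. This is a minimal illustration with hypothetical gradient vectors, not the authors' implementation; production systems use vectorised per-example gradients (e.g. in libraries such as Opacus).

```python
import math
import random

def clip_gradient(grad, C):
    """Scale a per-example gradient so its L2 norm is at most C (cf. Equation 5)."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = max(1.0, norm / C)
    return [g / scale for g in grad]

def dp_sgd_minibatch_gradient(per_example_grads, C, sigma, rng=random):
    """Average clipped per-example gradients and add Gaussian noise (cf. Equation 6)."""
    B = len(per_example_grads)
    dim = len(per_example_grads[0])
    clipped = [clip_gradient(g, C) for g in per_example_grads]
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # The noise standard deviation sigma * C is calibrated to the clipping norm.
    noisy = [s + rng.gauss(0.0, sigma * C) for s in summed]
    return [x / B for x in noisy]

grads = [[3.0, 4.0], [0.3, 0.4]]          # per-example gradients with norms 5.0 and 0.5
clipped = clip_gradient(grads[0], C=1.0)  # the large gradient is scaled down to norm 1.0
```

Note that the second gradient, whose norm is already below C, passes through unchanged; only outliers are rescaled.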
[19] demonstrated that exploiting a bounded family of activation functions, instead of the more commonly employed unbounded ReLU activation, for the choice of φ(·) can decrease the bias introduced by DP-SGD and improve its tradeoffs between privacy and accuracy. Their approach, though, does not fully bridge the gap between non-private and private learning for datasets like CIFAR10. Tramèr and Boneh [23] proposed to train the classifier on a representation of the data output by a public ScatterNet feature extractor, as opposed to using the raw pixels x. This, however, requires access to public data in addition to the private dataset.

PROPOSED APPROACH
We propose to design a new loss function tailored to DP-SGD. Our goal is to converge faster with smaller gradient norms, better control the sensitivity of the learning algorithm, increase tolerance to noise, and effectively learn both easy and hard examples. Indeed, these goals address several crucial differences between DP-SGD and its non-private counterpart SGD:
• Due to per-example gradient clipping (recall Equation 5), information contained in gradients whose magnitude is too large is discarded (see Figure 1.c). This cannot be compensated for by tuning the training algorithm, e.g., by increasing the model's learning rate, because clipping is done at the granularity of individual training examples. Together with the injected noise (recall Equation 6), this biases learning and implies that DP-SGD takes a different optimisation path than SGD.
• As shown in Equation 3, the model weights contribute to the computation of all gradients. In practice, this leads to model weights exploding in DP-SGD. Therefore, one way to decrease the magnitude of gradients is to prevent the model weights from exploding, especially the weights of the last layer, which contribute to the gradients of all layers.
• The number of training iterations is limited in DP-SGD, as each iteration increases the risk of privacy leakage. Hence, faster convergence benefits DP-SGD, as does ensuring that both easy and hard examples are attended to during the limited training run.
Next, we describe our choice of loss function. We analyse how our new loss function improves the tradeoffs between privacy and accuracy of DP-SGD. We uncover two main effects: our loss limits the information loss of gradient clipping and improves learning's tolerance to noise.

Our loss
We propose to learn the easy examples at the beginning of training using the per-example Sum Squared Error (SSE) between the logit values and the one-hot label vector:

L_SSE(x) = Σ_{k=1}^K (a_{(x,k)} − y_{(x,k)})²    (7)

Later in training, we exploit the focal loss L_Focal to learn hard-to-learn examples. L_Focal modifies the cross-entropy loss to reduce the loss of easy, well-classified examples, letting the model focus more on hard, misclassified examples. L_Focal multiplies the cross-entropy of each training example x by a factor (1 − p_{(x,y)})^γ based on the probability of the ground-truth class, p_{(x,y)}:

L_Focal(x) = −(1 − p_{(x,y)})^γ log p_{(x,y)}    (8)

where the tunable focusing parameter γ adjusts the down-weighting rate: the higher the γ, the higher the down-weighting rate of easy, well-classified examples. This factor multiplies only the loss, and it does so prior to gradient clipping, so that the contribution of each example to the gradient is still bounded by C. Note that the focal loss is equivalent to the cross-entropy loss when γ = 0.
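The focal re-weighting can be sketched as follows. This is a pure-Python illustration of Equation 8 with a hypothetical logit vector; setting γ = 0 recovers the cross-entropy loss, as noted above.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def focal_loss(logits, true_class, gamma):
    """Per-example focal loss: down-weights already well-classified examples."""
    p = softmax(logits)[true_class]
    return -((1.0 - p) ** gamma) * math.log(p)

def cross_entropy(logits, true_class):
    # Focal loss with gamma = 0 is exactly cross-entropy.
    return focal_loss(logits, true_class, gamma=0.0)

logits = [2.0, 0.5, -1.0]               # hypothetical logits for a 3-class example
easy = focal_loss(logits, 0, gamma=2.0)  # confident prediction: loss heavily down-weighted
hard = focal_loss(logits, 2, gamma=2.0)  # misclassified example: nearly full CE loss
```

The easy example's loss shrinks by the factor (1 − p)², while the misclassified example keeps almost all of its cross-entropy loss, which is exactly the emphasis on hard examples described in the text.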
To further decrease the gradient magnitudes and prevent the explosion of intermediate weights, we impose a regularisation penalty on the intermediate pre-activations:

L_Reg(x) = Σ_{l=1}^{L−1} ∥h_l∥²_2    (9)

Finally, our proposed loss function L combines L_Focal, L_SSE and L_Reg as:

L = α L_Focal + (1 − α) L_SSE + λ L_Reg    (10)

where we set the hyper-parameter α = Sigmoid(e − E) (with current epoch e and threshold epoch E) to enable curriculum learning: easy examples are learned in the early training iterations using L_SSE, which eases the learning of hard examples in the later training iterations using L_Focal. L_Reg acts as a regulariser, so we scale its effect by a small coefficient λ relative to the other two losses. We perform a hyper-parameter search to set the best values for γ, E and λ.
Privacy analysis. Note that we still follow the privacy guarantees of DP-SGD (Theorem 1 in [1]) as we i) do not modify the data sampling requirement of DP-SGD; ii) compute gradients on a per-example basis (∇L) by computing a per-example loss in Equation 10; iii) bound the contribution of each example to the gradient by clipping the ℓ2 norm of these individual per-example gradients (see Equation 5); and iv) add noise to these gradients (see Equation 6) that is calibrated such that Theorem 1 in Abadi et al. [1] holds.
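Putting the three terms together, the curriculum weighting of Equation 10 can be sketched as below. This is a hedged illustration: `sse`, `focal` and `reg` stand for per-example values of L_SSE, L_Focal and L_Reg, and E and λ are the hyperparameters described in the text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def combined_loss(sse, focal, reg, epoch, E, lam):
    """Curriculum mix of SSE and focal loss, plus a small regularisation term.

    Early in training (epoch << E) alpha is near 0 and SSE dominates;
    later (epoch >> E) alpha approaches 1 and the focal loss dominates.
    """
    alpha = sigmoid(epoch - E)
    return alpha * focal + (1.0 - alpha) * sse + lam * reg

# With focal = 0 the loss value tracks the SSE weight (1 - alpha):
early = combined_loss(sse=1.0, focal=0.0, reg=0.0, epoch=0, E=10, lam=0.01)
late = combined_loss(sse=1.0, focal=0.0, reg=0.0, epoch=20, E=10, lam=0.01)
```

At epoch 0 nearly all the weight is on SSE (`early` is close to 1), while at epoch 20 it has shifted almost entirely to the focal loss (`late` is close to 0), which is the curriculum transition the text describes.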

An analysis of our loss
Our loss limits information loss from clipping. Figure 1 shows that under the supervision of L, we are able to control the sensitivity of the learning algorithm. In particular, L_Reg on the intermediate pre-activations and L_SSE on the logit values prevent pre-activations, logit values, and consequently weights from exploding (per-example logit values and pre-activations are computed from the weights), which decreases the magnitude and variance of per-example gradients.

[Figure 1: On CIFAR10, minimising the cross-entropy loss with DP-SGD leads to logit exploding (a) and weight exploding (b), whereas our proposed loss function prevents both. As a result, the per-example gradient magnitudes (c) obtained with our loss are smaller and more condensed than those obtained with cross-entropy, and the direction of the clipped gradients (d) is more aligned with the unclipped ones.]

Using the sub-multiplicativity of norms (and |φ′| ≤ 1 for tanh) for the gradient of, for example, layer L−1 in Equation 3, we have:

∥G_{L−1}∥_2 ≤ ∥a_{L−2}∥_2 ∥W_L∥_2 ∥∂L/∂a_L∥_2

where our pre-activation penalty on h_{L−2} and the SSE loss on the logit values a_L = W_L a_{L−1} decrease ∥a_{L−2}∥_2 = ∥φ(h_{L−2})∥_2 and ∥W_L∥_2, resulting in a smaller upper bound on ∥G_{L−1}∥_2. Controlling W_L can also limit the changes of weights in other layers, as their gradients are a function of W_L and the per-layer gradient function is locally Lipschitz. Therefore, the gradients computed when optimising our loss function with DP-SGD are smaller in magnitude and more condensed than the gradients computed when optimising cross-entropy, as shown in Figure 1.c. Smaller and near-symmetric gradients decrease the negative impact of gradient clipping [7] on the trajectory of the gradient descent, as shown in Figure 1.d.
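The upper bound above rests on the sub-multiplicativity of matrix norms; the effect can be checked numerically on a tiny hypothetical layer (Frobenius norms, plain Python, with made-up values standing in for a_{L−2} and W_L).

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def fro_norm(A):
    """Frobenius norm of a matrix given as a list of rows."""
    return sum(x * x for row in A for x in row) ** 0.5

# Hypothetical activation (row vector) and weight matrix.
a = [[0.5, -0.2]]                  # stand-in for a_{L-2}, shape 1x2
W = [[1.0, 0.3], [-0.4, 2.0]]      # stand-in for W_L, shape 2x2
G = matmul(a, W)                   # stand-in for one factor of the gradient

# Sub-multiplicativity: ||a W|| <= ||a|| * ||W||, so shrinking ||a|| or ||W||
# (as L_Reg and L_SSE do) shrinks the bound on the gradient norm.
assert fro_norm(G) <= fro_norm(a) * fro_norm(W) + 1e-12
```

Because the product is linear in each factor, halving the activation norm halves the gradient-norm bound, which is why keeping activations and weights small preconditions gradients for clipping.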
Our loss improves tolerance to noise. Wang et al. [24] demonstrated that smoother loss functions favour DP-SGD, as injecting noise into weight gradients near each sharp local minimum can significantly increase the loss and thus lower prediction accuracy. Their theoretical and empirical analyses suggest that making the loss surface more tolerant to noise can improve the generalisation bounds and accuracy of DP learning. The most common smoothness notion is based on the Lipschitz constant of the gradient of a function.

Definition 1 (Smoothness [17]). A loss function L is β-smooth if, for all weights w and w′,

∥∇L_w − ∇L_{w′}∥_2 ≤ β ∥w − w′∥_2

where ∇L is the gradient of the loss function with respect to the weights and β is the smoothness constant. A smaller β indicates a smoother function.
We now derive a lemma illustrating how our loss also favours smoothness of the loss being optimised by DP-SGD.

Lemma 1. When the singular values of the Hessian are bounded by σ_max, we have for all w and w′ in a simply connected set that

∥∇L_w − ∇L_{w′}∥_2 ≤ σ_max ∥w − w′∥_2.

Proof. Let H be the Hessian of the loss function with respect to the weights w. Integrating the Hessian along the line from w′ to w,

∥∇L_w − ∇L_{w′}∥_2 = ∥H(ξ)(w − w′)∥_2 ≤ σ_max ∥w − w′∥_2,

where σ_max is the maximum of the first singular value of H along the line from w to w′.
Therefore, the existence of local bounds on the singular values results in local bounds on the smoothness. We empirically observe in Figure 2 that the singular values of our loss function are smaller than those of the cross-entropy loss. This suggests that our proposed loss function is smoother and more noise-tolerant than cross-entropy.
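Definition 1 and Lemma 1 can be illustrated on a one-dimensional quadratic loss L(w) = c·w², whose Hessian (and hence only singular value) is the constant 2c, so gradient differences are exactly 2c-Lipschitz. This is a toy check, not the networks studied here.

```python
def grad(w, c):
    """Gradient of the quadratic loss L(w) = c * w**2."""
    return 2.0 * c * w

def smoothness_ratio(w1, w2, c):
    """|gradient difference| / |weight difference|; bounded by the Hessian value 2c."""
    return abs(grad(w1, c) - grad(w2, c)) / abs(w1 - w2)

# For a quadratic the ratio equals 2c everywhere: the loss is (2c)-smooth.
# A smaller c (a flatter loss) gives a smaller smoothness constant,
# which is the property Figure 2 measures via singular values.
ratio_flat = smoothness_ratio(0.3, -1.7, c=0.5)   # 1.0
ratio_sharp = smoothness_ratio(0.3, -1.7, c=2.0)  # 4.0
```

The flatter loss attains a smaller smoothness constant, mirroring the claim that smaller singular values imply a smoother, more noise-tolerant loss surface.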

VALIDATION
We validate the benefit of deploying our proposed loss function in improving how DP-SGD trades off privacy and accuracy. To do so, we show that our loss further improves the state-of-the-art results of the two approaches for deep learning with DP described in Section 2: (1) CE-T-CNN [19]: using the cross-entropy loss function together with an end-to-end CNN classifier equipped with tanh activation functions; (2) CE-T-PFE_CNN [23]: using the cross-entropy loss function, this time combining the ScatterNet public feature extractor (PFE) with a private CNN classifier whose internal activations are also tanh.
To ensure a fair evaluation, we use the same datasets, MNIST, FashionMNIST and CIFAR10, and the same classifiers as these two baseline approaches. Table 1 and Table 2 show the architecture of end-to-end CNNs for MNIST, FashionMNIST and CIFAR10, suggested by Papernot et al. [19]. Table 3 and Table 4 show the architecture of CNNs trained on ScatterNet features for MNIST, FashionMNIST and CIFAR10, suggested by Tramèr and Boneh [23].
In addition to this, we fix the DP-SGD configuration across all approaches to the best hyperparameter values reported in the hyperparameter search done by Tramèr and Boneh [23] (see Table 5). Note that our results would only improve with an additional hyperparameter search on the DP-SGD configuration. For the hyperparameters introduced by our own loss function, we search over the values of γ, E and λ (see Table 5). Note that E = 0 does not mean that the SSE term is disabled in our loss function; for example, in the first epoch (e = 0) the weights of the Focal and SSE losses are the same, as α = Sigmoid(e − E) (see Equation 10), but the weight of the Focal loss increases as e increases. Note that none of the prior work we compare against captures the privacy cost of the hyperparameter search itself when reporting the DP guarantees achieved. The cost of this search should however be moderate, as analysed by Liu and Talwar [14] and Papernot and Steinke [18].
When learning without privacy, end-to-end classifiers achieve a test accuracy on CIFAR10, FashionMNIST and MNIST of 76.6%, 89.4% and 99.0%, respectively. Instead, when learning with DP-SGD, the maximum test accuracy (across 5 runs) of our first baseline CE-T-CNN for ε < 3 is 59.8%, 86.9% and 98.2% on CIFAR10, FashionMNIST and MNIST, respectively. This illustrates how CIFAR10 is the most challenging of these datasets for DP-SGD training: there is about a 20% gap in accuracy between the privacy-preserving and non-privacy-preserving settings. Our proposed loss function reduces this gap significantly, even more so for CIFAR10 than for simpler datasets like FashionMNIST and MNIST. For example, training an end-to-end classifier using our loss function on CIFAR10 achieves 63.2% test accuracy, instead of 59.8% with the cross-entropy loss. We also conducted statistical tests to corroborate that our proposed loss function leads to statistically significant improvements.
In particular, we conducted statistical tests on the final accuracy (i.e., accuracy of the model at the end of training, on test data) achieved with our loss versus the final accuracy achieved with the CE loss function, using 5 different initialisation seeds, as follows: (1) We performed a Shapiro-Wilk test to determine the normality of the distribution of final accuracies across different initialisations, with the null hypothesis that the data come from a Normal distribution. We obtained a p-value of 0.24, which is above the significance level (i.e., 0.05), failing to reject the null hypothesis; (2) We performed Levene's test to determine the homogeneity of variances across different initialisations. We obtained a p-value of 0.86, which is above 0.05, suggesting that the data satisfy the assumption of equal variance; (3) We conducted a t-test with the null hypothesis that our mean accuracy is equal to CE's mean accuracy. We obtained a p-value of 10^{-6}, which is less than the significance level (i.e., 0.05), rejecting the null hypothesis. Therefore, the average accuracy of our loss function is statistically different from the average accuracy of CE.
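The third step can be reproduced with a standard two-sample t-test. Below is a stdlib-only sketch computing the pooled t statistic; the accuracy values are made up for illustration (the paper's p-values come from its actual runs), and the critical value 2.306 is the two-sided 0.05-level threshold for 8 degrees of freedom.

```python
import math

def pooled_t_statistic(xs, ys):
    """Two-sample t statistic assuming equal variances (as Levene's test supported)."""
    n, m = len(xs), len(ys)
    mx, my = sum(xs) / n, sum(ys) / m
    vx = sum((x - mx) ** 2 for x in xs) / (n - 1)   # sample variances
    vy = sum((y - my) ** 2 for y in ys) / (m - 1)
    sp = math.sqrt(((n - 1) * vx + (m - 1) * vy) / (n + m - 2))  # pooled std dev
    return (mx - my) / (sp * math.sqrt(1.0 / n + 1.0 / m))

# Hypothetical final accuracies over 5 seeds (illustrative numbers only).
ours = [63.1, 63.4, 62.9, 63.2, 63.0]
ce = [59.6, 59.9, 59.8, 59.7, 60.0]

t = pooled_t_statistic(ours, ce)
# With n + m - 2 = 8 degrees of freedom, the two-sided critical value at
# the 0.05 level is about 2.306; |t| far above it rejects the null hypothesis.
significant = abs(t) > 2.306
```

In practice one would use `scipy.stats.ttest_ind` (together with `shapiro` and `levene` for the two preliminary checks), which also returns the exact p-value.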
While our second baseline CE-T-PFE_CNN achieves stronger accuracy than the first baseline CE-T-CNN because the learner additionally has access to public data, our proposed loss function also improves its privacy-accuracy tradeoffs across the entire regime. The improvement our proposed loss function brings to privacy-accuracy tradeoffs is strongest in the initial epochs. These epochs correspond to stronger privacy guarantees (i.e., smaller values of ε) because the privacy guarantee of each individual step of DP-SGD needs to be composed across the entire training run; hence each step further increases the privacy budget expended. For example, our proposed O-T-CNN trains an end-to-end classifier with 55.4% test accuracy for ε < 1.5, compared to only 45.9% with our first baseline CE-T-CNN. In another example, O-T-PFE_CNN improves the test accuracy of our second baseline CE-T-PFE_CNN from 64.5% to 66.8% for ε < 2.

Analysis of Gradients computed by DP-SGD
We first analyse the impact of the loss function on the model weights, pre-activations, and logit values. Figure 5 shows the impact our loss function has on model weights: we visualise the average ℓ2 norm of model weights at each epoch. Across the three datasets, optimising CE-T-CNN with DP-SGD leads to model weight explosion. This is not the case when learning with our proposed loss. In particular, note how O-T-CNN prevents the last-layer weights from exploding, which is important because they contribute to the gradients of all layers. Next, Figure 6 shows the impact of our loss function on pre-activation and logit norms. Although CE-T-CNN uses bounded tanh activation functions to prevent activations from exploding, the pre-activations and logits still explode because the model weights are unbounded and explode themselves (as visualised above). Instead, replacing the cross-entropy loss function with our loss function mitigates this phenomenon for both pre-activations and logits. This is explained by our pre-activation regulariser, but also by the fact that the sum-squared-error is defined over logits.
Reducing the bias of gradient clipping. We provide the histogram of per-example gradient ℓ2 norms in Figure 7. The gradient ℓ2 norms of O-T-CNN are smaller than their counterparts for CE-T-CNN. This is a consequence of our observations above: our proposed loss function prevents several of the terms involved in the gradient computation from exploding. To further demonstrate the importance of better preconditioning gradients for the clipping performed by DP-SGD, and its negative impact on the trajectory of the gradient descent, we visualise the divergence between the path taken by aggregated clipped per-example gradients and aggregated unclipped per-example gradients in Figure 8. For each minibatch, we compute the cosine similarity between the average of per-example clipped gradients and the average of per-example unclipped gradients. The direction of clipped gradients in O-T-CNN is more aligned with the direction of their unclipped counterparts than is the case for the baseline approach. This suggests that our loss function decreases the information loss and bias introduced by DP-SGD.
Learning hard vs. easy examples. Finally, we visualise the confusion matrices of per-class training accuracy for non-private learning with SGD, private learning with the CE-T-CNN baseline, and private learning with our approach O-T-CNN in Figure 9. Qualitatively speaking, the per-class training accuracy of O-T-CNN is closer to the non-private one than is the case for the baseline CE-T-CNN.
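The alignment metric used in Figure 8 can be sketched as follows, in plain Python with made-up per-example gradient vectors (the real measurement averages over a model's full parameter vector).

```python
def clip(g, C):
    """Clip a gradient to L2 norm at most C."""
    norm = sum(x * x for x in g) ** 0.5
    s = max(1.0, norm / C)
    return [x / s for x in g]

def average(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

# Made-up per-example gradients: one large outlier dominates the unclipped average.
grads = [[10.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
alignment = cosine(average(grads), average([clip(g, 1.0) for g in grads]))
```

With the outlier present, `alignment` falls well below 1: clipping changes the direction of the aggregated update. When all per-example gradients already fit under the clipping norm, the two averages coincide and the cosine similarity is exactly 1, which is why a loss that keeps gradients small reduces clipping bias.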
During the initial epochs of learning, O-T-CNN converges faster, especially on easy-to-learn examples, thanks to the sum-squared-error loss, which converges to 0 loss at an exponential rate O(e^{−t}), where t is the number of steps, while the cross-entropy loss only converges at a rate O(1/t) (Allen-Zhu et al. [2]).³ For example, the training accuracy of O-T-CNN for "car" and "truck" is 51% and 48%, while CE-T-CNN only achieves 34% and 37%, respectively, on these classes. In addition to this, O-T-CNN deals better with hard-to-learn examples (e.g. the "cat", "deer" or "dog" classes) in later epochs thanks to the focal loss. For example, the training accuracy of the "cat" class is 49% at epoch 20 for O-T-CNN, but only 35% for CE-T-CNN.
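The contrast between these two convergence rates, exponential O(e^{−t}) versus O(1/t), is easy to see numerically. The snippet below treats the two rates as toy sequences and counts the steps each needs to fall below a tolerance, which is why a fixed DP-SGD iteration budget favours the exponential rate.

```python
import math

def steps_below(rate, threshold, max_steps=10_000):
    """Return the first step t at which rate(t) drops below threshold."""
    for t in range(1, max_steps + 1):
        if rate(t) < threshold:
            return t
    return max_steps

exp_steps = steps_below(lambda t: math.exp(-t), 1e-3)   # exponential decay, like SSE
inv_steps = steps_below(lambda t: 1.0 / t, 1e-3)        # 1/t decay, like cross-entropy

# Under DP-SGD's limited iteration budget, the exponentially decaying
# rate reaches a given loss level in far fewer steps.
```

Here the exponential sequence crosses the 10⁻³ threshold in 7 steps, while the 1/t sequence needs 1001, illustrating why fast convergence matters when every step spends privacy budget.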

Ablation study
In this section, we present an ablation study to
• tease out the contribution of each of the three components of our proposed loss, namely the sum-squared-error, the focal loss and the regularisation penalty;
• show the low sensitivity of the performance of our loss function to its hyperparameter values when it comes to improving privacy-accuracy tradeoffs in deep learning with DP.

³ Appendix A discusses the importance of designing a loss function tailored to DP-SGD for a better convergence rate, rather than relying on other choices of optimiser or learning rate.
Contribution of each of our loss's components. We first consider four combinations of these penalties: only the sum-squared-error, a combination of sum-squared-error and cross-entropy, a combination of sum-squared-error and focal loss, and finally our proposed loss function that combines all three: sum-squared-error, focal loss and regulariser. We do not consider other combinations because they are not meaningful. To ensure a fair evaluation, we perform a hyperparameter search for the combination of sum-squared-error and cross-entropy, as well as for the combination of sum-squared-error and focal loss, similar to the search performed for our proposed loss function. The privacy-accuracy tradeoffs of these four combinations are shown in Figure 10. While the sum-squared-error improves the privacy-accuracy tradeoff in the initial epochs, the focal loss improves it in later epochs. Finally, imposing a regularisation penalty on pre-activations further improves these privacy-accuracy tradeoffs. Our proposed loss thus achieves the best performance across each of these settings, and none of its penalties can be omitted. In addition, we compare the privacy-accuracy tradeoffs of CE, SSE and Focal loss in DP-SGD training. Figure 11 shows that SSE outperforms CE in terms of test accuracy in the initial training iterations, and the Focal loss outperforms CE in later iterations. We attribute these findings mainly to the smaller gradients and faster convergence of SSE, and to the better learning of hard examples by the Focal loss.

Next, we replace our pre-activation regulariser with a weight decay regulariser, L_WD-Reg, defined as:

L_WD-Reg = Σ_{l=1}^{L} ∥W_l∥²_2

We use the best coefficient for each regulariser to ensure a fair comparison. Figure 13 shows the tradeoffs between privacy and accuracy of DP-SGD training using SSE+Focal+PreAct_Reg and SSE+Focal+WD_Reg as the final proposed loss function on all three datasets. Both the pre-activation regulariser (PreAct_Reg, a regularisation penalty on the intermediate pre-activations as defined in Equation 9) and the weight decay regulariser (WD_Reg) help to improve the tradeoffs between privacy and accuracy of DP-SGD training. However, the improvement of the pre-activation regulariser, both in the initial iterations and in the maximum accuracy, is higher than that introduced by the weight decay regulariser.
Sensitivity to the hyperparameter values. We evaluate the performance of our loss function by varying its hyperparameter values on all datasets in Figure 14. We observe that the impact of the hyperparameter values on the performance of our loss function is negligible. Furthermore, we reused DP-SGD hyperparameters that had previously been shown to perform well with CE. Therefore, the improvement of our loss function is not only robust to the hyperparameter values it introduces, but also compatible with existing hyperparameter tuning for DP-SGD.
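The two regularisers compared here differ only in what they penalise; a minimal sketch below contrasts them, using hypothetical pre-activations and weight matrices represented as plain lists. Note that the pre-activation penalty is a per-example quantity (it depends on the input x), while weight decay depends only on the parameters.

```python
def pre_activation_reg(pre_activations):
    """Equation 9: sum of squared L2 norms of intermediate pre-activations h_l."""
    return sum(sum(h * h for h in layer) for layer in pre_activations)

def weight_decay_reg(weights):
    """Weight decay: sum of squared L2 norms of the weight matrices W_l."""
    return sum(sum(w * w for row in W for w in row) for W in weights)

h = [[0.5, -1.0], [2.0]]             # hypothetical pre-activations, one list per layer
W = [[[0.1, 0.2]], [[0.5], [-0.5]]]  # hypothetical weight matrices, one per layer
penalty_h = pre_activation_reg(h)    # 0.25 + 1.0 + 4.0 = 5.25
penalty_w = weight_decay_reg(W)      # 0.01 + 0.04 + 0.25 + 0.25 = 0.55
```

Because pre-activations fold in both the weights and the incoming activations, penalising them constrains the very quantities that enter Equation 3's gradient bound, which is consistent with the stronger improvement observed for PreAct_Reg.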

CONCLUSION
In this paper, we showed that designing a loss function tailored to the specificities of DP-SGD, namely per-example gradient clipping and a limited number of iterations, significantly improves the tradeoffs between privacy and accuracy on all datasets used by existing DP-SGD improvement strategies. We reduced the bias introduced by gradient clipping using a pre-activation regularisation term. In addition to this, we improved the convergence speed by imposing a curriculum structure on learning with the sum-squared-error and focal loss. As future work, we will evaluate our loss function for DP-SGD in other domains, such as text.

A EFFECT OF OPTIMISER AND LEARNING RATE ON DP-SGD CONVERGENCE RATE
Prior work found that the use of adaptive optimisers such as Adam does not provide benefits for private learning. For instance, SGD was compared to Adam in Table 3 of Papernot et al. [19] and in the discussion in Section C.5 of Tramèr and Boneh [23]. This result is due to the fact that adaptive optimisers in DP-SGD carry noise from one gradient descent step to the next to adapt learning rates, thereby unduly slowing down training. The choice of learning rate can affect the tradeoffs between privacy and accuracy in DP-SGD training. Therefore, we choose the best learning rate values reported in the hyperparameter searches done by Papernot et al. [19] and Tramèr and Boneh [23], also to ensure a fair comparison. Further tuning the learning rate would only strengthen our results. Note that even the best learning rate values cannot compensate for the information that is discarded from the gradients due to per-example gradient clipping. This is because clipping is done at the granularity of individual training examples. Together with the injected noise, this biases learning and implies that DP-SGD takes a different optimisation path than SGD.