Privacy Preserving Feature Selection for Sparse Linear Regression

Privacy-Preserving Machine Learning (PPML) provides protocols for learning and statistical analysis of data that may be distributed amongst multiple data owners (e.g., hospitals that own proprietary healthcare data), while preserving data privacy. The PPML literature includes protocols for various learning methods, including ridge regression. Ridge regression controls the L2 norm of the model, but does not aim to strictly reduce the number of non-zero coefficients, namely the L0 norm of the model. Reducing the number of non-zero coefficients (a form of feature selection) is important for avoiding overfitting, and for reducing the cost of using learnt models in practice. In this work, we develop the first privacy-preserving protocol for sparse linear regression under L0 constraints. The protocol addresses data contributed by several data owners (e.g., hospitals). Our protocol outsources the bulk of the computation to two non-colluding servers, using homomorphic encryption as a central tool. We provide a rigorous security proof for our protocol, where security is against semi-honest adversaries controlling any number of data owners and at most one server. We implemented our protocol, and evaluated performance with nearly a million samples and up to 40 features.


INTRODUCTION
The Machine Learning (ML) revolution critically relies on large volumes of data to attain high confidence predictions. However, the massive amounts of data collected on individuals and organizations incur serious threats to security and privacy. From a legal perspective, privacy regulations such as the European General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) aim at controlling these threats. From a technological perspective, Privacy Preserving Machine Learning (PPML) [44] attains ML utility without exposing the raw data, e.g., by leveraging tools for secure computation [27,32,68]. Many PPML solutions focus on the inference phase, e.g., [11,30,33]. A growing body of literature also addresses training, covering a range of learning tasks and techniques, including training decision tree models, e.g., [44], as well as linear regression, logistic regression and neural network models, e.g., [48]. Recently, attention has been drawn also to the task of privacy preserving feature selection [43], which, in the context of linear regression, is the focus of our work.
Feature selection [35] is the process of selecting a subset of informative features to be used for model training, while removing non-informative or redundant ones. Feature selection is important for avoiding overfitting, that is, for producing models that support high quality prediction on new data rather than models that essentially "memorize" the training data set while failing to generalize. Moreover, feature selection supports producing sparse models whose use for prediction in practice requires measuring only the few selected features. Sparsity is often a desired property, leading to significant cost savings when the model is repeatedly used for prediction, especially when feature extraction is expensive. For example, sparse models are highly desired in medical applications where feature extraction might involve medical interventions with associated financial and morbidity costs (see Section 3.1). The focus of this work is on feature selection under L0 constraints, specifically for wrapper methods. In contrast, previous work on privacy preserving feature selection considered only filter [15,43,52] and embedded methods [1,3,4,19,26,28,40,44,45,49,61-63,65,67,70], but not wrapper methods. We next elaborate on the differences between these methods.
Selecting the subset of features of highest predictive power is computationally intractable [14]; nonetheless, heuristic methods, which can be categorized into filter, embedded and wrapper methods, are widely employed in practice with great success. Filter methods assign a score to each feature in isolation, outputting the highest scoring features; they are typically very fast, but often fail to select the best features because they ignore relationships and dependencies between features. Embedded methods intrinsically incorporate feature selection into the model training process, e.g., in decision tree regression, as well as in ridge and Lasso regression; the latter two techniques penalize models according to their norms (L2 and L1, respectively), favoring models with fewer high-weight features. However, they do not necessarily return a sparse model: the model often includes a tail of low-weight features (particularly in ridge regression). Wrapper methods select features in an iterative process, with a target ML algorithm in mind (e.g., linear regression). In each iteration: (1) a model is trained on the current subset of features, then (2) features are ranked according to an evaluation metric measuring their usefulness for that model, and (3) this ranking is used to specify the subset of features for the next iteration. Wrapper methods are considerably slower than filter and embedded methods, due to the multiple rounds of model training, but often lead to superior outcomes for the target ML algorithm.
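To make the contrast concrete, here is a minimal sketch (ours, for illustration; the scoring rule and all names are our own choices, not from this paper's implementation) of a filter method that scores each feature in isolation, by its absolute Pearson correlation with the target:

```python
import numpy as np

def filter_select(X, y, k):
    """Toy filter method: score each feature by |Pearson correlation|
    with the target y, in isolation, and keep the top-k scorers."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    scores = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return list(np.argsort(scores)[::-1][:k])

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = X[:, 2] - 0.5 * X[:, 4]   # target depends on features 2 and 4 only
selected = filter_select(X, y, 2)
```

A wrapper method would instead re-train a model on each candidate subset, capturing interactions between features at the cost of many more training rounds.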
In summary, none of the aforementioned methods strictly dominates the others (see, e.g., a comparative study in [39]). Consequently, common practice often examines several methods in search of the best performing one (using cross-validation to avoid overfitting), or uses a hybrid approach to combine the strengths of several methods. Examples of hybrid approaches include using a filter method to quickly remove many features and then selecting from the remaining ones using an embedded or wrapper method (see, e.g., [18,58]); or executing wrapper and embedded methods in parallel, each scoring the features, where the output includes the features with the highest total score (see, e.g., [16,17,34]).
Thus, providing privacy preserving solutions for a wide variety of feature selection methods is essential to supporting versatile use cases.

Our Contribution
In this work we present the first privacy-preserving feature selection solution in a wrapper method, producing a sparse regression model under an L0 constraint (i.e., with exactly the specified number of features). Concretely:

Our protocol. We design the Secure Iterative Ridge regression (SIR) protocol: a new protocol that produces a sparse ridge regression model, over input data that is horizontally partitioned among distrusting data owners, and where the bulk of the computation is outsourced to two non-colluding servers (aka, the two-server model).
Horizontal partitioning means that each data owner contributes a subset of the data samples. Our solution utilizes homomorphic encryption [27,53] as a central tool. The wrapper method that we realize in a privacy preserving fashion is the commonly used Recursive Feature Elimination (RFE) algorithm, introduced in the highly influential paper [36]. RFE starts by considering all features, iteratively training a model on the current set of "surviving" features and removing the features of lowest weight (the number of removed features is a tunable parameter), until reaching the L0 constraint. The ML model that we train at each iteration is a ridge regression model, so that at the termination of all iterations we obtain a sparse regression model. As a central new component we introduce a scaled ridge regression protocol, which may be of independent interest.
Privacy & Threat Model. Privacy holds against all semi-honest computationally-bounded adversaries controlling any number of data owners and at most one of the two servers. The security guarantee is that such adversaries cannot infer any information on the inputs of data owners that they do not control, except for what can be efficiently computed directly from the public parameters, and the inputs and outputs of corrupted parties. Our security proof for SIR covers the case of overdetermined linear regression (more data samples than features, and full rank). To demonstrate the necessity of the security measures implemented in SIR, we devise explicit attacks showing that simplified variants without these measures are insecure.

Complexity. The complexity of each data owner is proportional only to her input and output size (and polynomial in the security parameter). The two servers engage in a two-party protocol with round complexity logarithmic in the number of input features d, and complexity that is cubic in d (up to poly-logarithmic factors) and logarithmic in the number of input records n (and polynomial in the security parameter); the protocol includes homomorphic computations of multiplicative depth at most O(log d + log log n).

System and empirical evaluation. We implemented our protocol SIR and empirically evaluated its performance. Our system is generic and can be applied to any dataset with numerical features. As a concrete example, we ran experiments on a gene-expression dataset derived from TCGA [59]. The full data matrix taken from TCGA breast cancer data consists of gene expression profiles for 781 samples. Each profile, representing a human subject, consists of over 10^4 values. Our proof of concept data represents regressing the expression pattern of a target gene (a vector with 781 entries) on 40 arbitrarily selected other genes. Furthermore, we artificially randomly partition the 781 instances amongst 10 data owners. We note that horizontal partition, as in our experiments, has interesting use cases in analyzing gene expression related data, e.g., in [6,8,29,57]. In our experiments, each iteration removes 10% of the features, and the L0 constraint is to produce a model consisting of 8 features.
We point out that the initial selection of 40 out of the 10^4 features would typically be executed using filter methods (due to their fast runtime), which can be done using the prior art on privacy preserving filter methods [15,43,52]. We therefore focus on selecting the 8 out of the 40 remaining features using SIR, our privacy preserving RFE, which is the contribution of our work.
Our experiments demonstrate that our system produces the desired model, i.e., the same model as produced in the clear. Running in the clear takes seconds, while our privacy preserving system terminates in just under a day, using 134GB RAM.

Scalability. Our protocol scales favorably with the number of input records and data owners, as demonstrated by our empirical evaluation on up to 802,816 records and 1000 data owners, where a 512× growth in the number of records (respectively, a 10× growth in the number of data owners) led to a runtime increase of only 10% (resp., 1%).

Comparison to Prior Work
Prior privacy preserving feature selection solutions were in filter or embedded methods, whereas ours is the first in a wrapper method (concretely, a privacy preserving RFE). Although none of these methods dominates the others, RFE was developed in [36] for the use case of gene expression data, where it excelled. Indeed, the mean square error (MSE) of the model produced by our system significantly outperforms that of the regression models produced by filter, ridge and Lasso (the baselines); see the full version [2] for details.
In terms of runtime, our system is expected to be slow, since, even in cleartext, wrapper methods are considerably slower than filter and embedded ones. This is also evident in our experiments, where the runtime of our system (on 40 features) is roughly 1 day, compared to runtimes between slightly under a minute and a few hours, on various dataset sizes (with 20-120 features), reported in previous works [4,15,26,28,40,43,49,52,62,63,70].
Producing better models (i.e., with smaller MSE), as in SIR, is typically desired, even if it incurs a slower (but reasonable) training runtime. This is because training a model occurs once, whereas the fruits of a better MSE provide recurring benefits in each use of the model. Furthermore, a runtime of 1 day (and even more) seems reasonable in the context of gene-expression data, where the trained model is used to guide the manufacturing of a medical testing device (DNA chip), and our training time is insignificant when compared to the manufacturing pipeline.
Furthermore, our runtime can be significantly lowered by tuning the system parameters to remove a larger fraction of the features in each iteration, thus reducing the total number of iterations. In particular, when tuning parameters so that our system returns a model satisfying the L0 constraint in a single iteration, our system terminates in only 1 hour (when executed on 20 features and 1000 data samples). This single-iteration parameter setting may be of independent interest, because it provides a privacy preserving truncated ridge model, i.e., a ridge regression model whose low weight features are truncated so as to satisfy the L0 constraint. This should be used with caution though, as our experiments indicate that there is a tradeoff between runtime and MSE: performing more iterations typically yields a better MSE.
In summary, SIR expands over prior work in offering the first privacy preserving feature selection in a wrapper method. This widens the PPML toolset to include the commonly used RFE, which leads to favorable learning outcomes in some use cases of interest.

Paper Organization
We give an overview of our techniques, along with a comparison to techniques used in prior work, in Section 2. Preliminary definitions appear in Section 3; the problem statement in Section 4. The SIR protocol appears in Section 5. Our system and empirical evaluation appear in Section 6. Conclusions appear in Section 7.

OVERVIEW OF OUR TECHNIQUES
In this section we give an overview of our techniques. We present the high level overview of SIR in Section 2.1, elaborate on its key components in Sections 2.2-2.3, discuss our attacks in Section 2.4, and compare our techniques to prior work in Section 2.6.

IR and SIR
We first describe the (insecure) feature selection algorithm, called Iterated Ridge (IR), and then present our Secure Iterative Ridge regression (SIR) protocol.
Iterated Ridge (IR, cf. Figure 1) is an RFE algorithm for sparse ridge regression, analogous to the sparse logistic regression algorithm used in [25]. IR starts with all features, removing 10% of the features in each iteration until reaching 2k features, where k is the user-determined target number of features, and then removes features one-by-one. (This follows the paradigm of adjusting the learning rate to a smaller value when approaching the solution.) Selecting which features to remove in each iteration is done by solving ridge regression (see Section 3.1) on the surviving features to obtain an intermediate model, and removing the features whose weights (in absolute value) are in the bottom 10%.
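The IR loop can be sketched in a few lines. The following is our own illustrative cleartext version (ridge solved in closed form; names such as drop_frac are our invention), not the paper's Figure 1 verbatim:

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam*I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def iterated_ridge(X, y, k, lam=1.0, drop_frac=0.10):
    """RFE for sparse ridge regression: while more than 2k features survive,
    drop the lowest-|weight| drop_frac fraction per iteration; afterwards
    drop one feature at a time, until exactly k features remain."""
    surviving = list(range(X.shape[1]))
    while len(surviving) > k:
        w = ridge(X[:, surviving], y, lam)
        n_drop = max(1, int(drop_frac * len(surviving))) if len(surviving) > 2 * k else 1
        n_drop = min(n_drop, len(surviving) - k)
        order = np.argsort(np.abs(w))            # ascending by |weight|
        drop = set(order[:n_drop].tolist())
        surviving = [f for i, f in enumerate(surviving) if i not in drop]
    return surviving

# toy data whose target depends only on features 0 and 1
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = 5 * X[:, 0] - 4 * X[:, 1]
selected = iterated_ridge(X, y, k=2, lam=0.01)
```

On this toy dataset, the loop recovers exactly the two informative features.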
Secure Iterative Ridge (SIR, cf. Figure 2) is executed between m data owners DO_1, ..., DO_m and two non-colluding servers S_1, S_2. The data owners' inputs are a horizontal partition of the data matrix X ∈ R^(n×d) and the target vector y ∈ R^n; i.e., there is a partition I_1, ..., I_m of [n] = {1, ..., n} so that each data owner DO_i holds the restriction of X and y to rows with indices in I_i, denoted X_i and y_i. Let N denote a sufficiently large integer (as determined by Equation 4). In SIR we use homomorphic computation over encrypted data, which translates into computing on the underlying cleartext values with arithmetic modulo N, i.e., in the ring Z_N (unless explicitly stated otherwise). The values in X, y are assumed to be normalized to [−1, 1] (which is common in ML), and each data owner scales her data to integral values in [−10^ℓ, 10^ℓ] for a common precision parameter ℓ. Moreover, N is set to be sufficiently large so that, despite computing modulo N, we are able to produce the same final output as if computing over the integers.
SIR is composed of four phases, as detailed next.

Phase I. Setup & Upload: S_2 generates a key pair (pk, sk) for a fully homomorphic encryption scheme, and publishes pk. Each data owner DO_i encrypts (entry-by-entry) A_i = (X_i)^T · X_i and b_i = (X_i)^T · y_i under pk, and sends the ciphertexts to S_1, who homomorphically combines them to obtain ciphertexts for A = Σ_i A_i and b = Σ_i b_i.

Phase II. Obliviously Permute Features: S_1 and S_2 engage in a 1-message protocol that concludes with S_1 holding ciphertexts for the system with the features' order permuted by P, where P is a random d × d permutation matrix.

Phase III. Iteratively rank and remove features, where each iteration consists of two steps:

(1) Scaled ridge (cf. Figure 3): First, S_1 and S_2 engage in a two-party protocol, at the conclusion of which S_1 holds a ciphertext encrypting the scaled ridge regression model: S_1 homomorphically masks the current system and sends the ciphertexts to S_2, who decrypts, computes the adjugate and determinant of the masked system in the clear, encrypts them, and sends the ciphertexts to S_1. Finally, S_1 homomorphically unmasks to obtain ciphertexts for the scaled (un-masked) model, using the algebraic identities for the adjugate and determinant of a matrix product. We refer the reader to the full version [2] for the proof of correctness.
(2) Ranking: Next, S_1 and S_2 engage in a two-party protocol, at the conclusion of which S_1 holds a ranking, in cleartext, of the surviving features according to their weight in the scaled model w_scaled; see details in Section 2.3. Importantly, S_1 does not know the actual weights, only their ordering. This does not violate privacy, because the features were randomly permuted; details are provided in the full version [2].
We elaborate on the scaled ridge regression and ranking subprotocols in the subsections below.
Phase IV. Obliviously un-permute and rationally reconstruct: At the onset of this phase the servers hold a subset F ⊆ [d] of feature indices which satisfies the L0 constraint, i.e., |F| = k. They now engage in a two-party protocol to produce a (cleartext) sparse ridge regression model whose non-zero weights are only on features in F. First, S_1 and S_2 engage in a two-party sub-protocol, at the conclusion of which S_1 holds ciphertexts for the ridge regression model w_F for the final set of features F. Second, S_1 homomorphically un-permutes the features' order (where features that did not survive have weight zero), and sends this un-permuted encrypted model to S_2, who decrypts it and sends it in the clear to S_1. That is, S_1 now has w_F = (A_F)^{-1} · b_F, where A_F, b_F denote the restrictions of A, b to the features in F, and the inverse and product are computed in Z_N. Finally, S_1 maps w_F to the solution to ridge regression over the surviving features when computed over the rationals (rather than in Z_N) via rational reconstruction [22,64], and outputs the resulting model.
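For intuition on the Phase I aggregation: both the Gram matrix and the right-hand side decompose additively over a horizontal partition, so the server only needs to add the owners' contributions. A small numpy check of this identity (our illustration; the encryption layer is omitted):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((12, 4))   # full data matrix: n=12 records, d=4 features
y = rng.standard_normal(12)        # target vector

# horizontal partition of the 12 records among 3 data owners
parts = [range(0, 4), range(4, 9), range(9, 12)]

# each owner contributes A_i = X_i^T X_i and b_i = X_i^T y_i (ciphertexts in SIR);
# S_1 only needs to add them, since summation commutes with the partition
A = sum(X[p].T @ X[p] for p in parts)
b = sum(X[p].T @ y[p] for p in parts)
```

Addition is exactly the operation that additively homomorphic encryption supports over ciphertexts, which is why this phase is cheap.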

Our Scaled Ridge Protocol
A central new component in our solution is a sub-protocol that we call scaled ridge regression, which may be of independent interest. Our scaled ridge builds on the following two observations.

Observation 1: Ranking is invariant to scaling. Ranking the indices of a vector v according to the magnitude of their associated values v[i] (in absolute value) is invariant to scaling the vector by a positive factor. Therefore, instead of ranking features according to their magnitude in the ridge regression model, we can rank them according to any (non-zero) scaling of this model.
Observation 2: Homomorphic ranking is considerably faster with our scaling. We show that homomorphic ranking is considerably faster when we scale the ridge regression model as we do. To explain why this is so, we first provide some background and point out a key complexity bottleneck in homomorphic ranking of the ridge regression model. We then explain how to eliminate this bottleneck using scaling.
Let (X, y) be a data matrix and target vector in a ridge regression problem, and set A = X^T · X + λI and b = X^T · y. The ridge regression model is w = A^{-1} · b, where the arithmetic in computing A, b and w is over the reals. In our privacy-preserving solution we compute an encrypted model w via homomorphic computation over encrypted A and b, where the underlying plaintext computation is conducted in a finite ring Z_N of integers modulo N (rather than over the reals). To ensure that the same model is obtained despite performing computations modulo N, two measures are required. First, N must be sufficiently large (cf. Equation 4) so that, e.g., A mod N and b mod N are identical to A and b. Second, rational reconstruction must be computed on each entry of the reduced model A^{-1} · b mod N to map it to the model w computed over the reals. Unfortunately, homomorphically computing rational reconstruction is a complexity bottleneck.
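For intuition, here is a cleartext sketch of rational reconstruction via the extended Euclidean algorithm (our own minimal version of the standard algorithm cited as [22,64]; the sqrt(N/2) bound is the usual uniqueness condition). Each iteration depends on the previous remainder, and this inherently sequential structure is what makes the step so expensive to evaluate homomorphically:

```python
import math
from fractions import Fraction

def rational_reconstruct(x, N):
    """Recover a rational a/b with |a|, |b| <= sqrt(N/2) from x = a * b^{-1} mod N.
    Runs the extended Euclidean algorithm on (N, x), stopping once the remainder
    drops below sqrt(N/2); the invariant r_i = s_i * x (mod N) makes (r_i, s_i)
    the sought numerator/denominator pair."""
    bound = math.isqrt(N // 2)
    r0, r1 = N, x % N
    s0, s1 = 0, 1
    while r1 > bound:
        q = r0 // r1
        r0, r1 = r1, r0 - q * r1
        s0, s1 = s1, s0 - q * s1
    return r1, s1  # numerator, denominator (sign may be flipped on both)

N = 2**61 - 1                       # a large prime modulus (Mersenne prime)
x = 3 * pow(7, -1, N) % N           # the ring element encoding the rational 3/7
num, den = rational_reconstruct(x, N)
```

Here Fraction(num, den) recovers 3/7 from a single ring element, but only via a long chain of data-dependent divisions.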
Next, we show that scaling the model by the factor det(A) allows us to eliminate the need for rational reconstruction. Relying on the algebraic identity A^{-1} = adj(A)/det(A) (where adj(A) is the adjugate matrix of A), we see that the scaled model is w_scaled = det(A) · w = adj(A) · b. Since computing adj(A) involves only multiplications and additions, then, when using a sufficiently large N as we do, no wrap-around occurs when computing w_scaled modulo N. Namely, computing over Z_N produces a scaled model identical to the one computed over the reals. This implies that we can directly rank the entries of w_scaled mod N without executing rational reconstruction. Namely, we are able to avoid performing rational reconstruction in each iteration, thus eliminating a key complexity bottleneck.
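A numeric sanity check (our own, in exact integer arithmetic, with a naive cofactor-expansion adjugate) of why ranking by adj(A)·b matches ranking by A^{-1}·b: the two vectors differ only by the positive factor det(A):

```python
from fractions import Fraction

def minor(M, i, j):
    return [row[:j] + row[j + 1:] for r, row in enumerate(M) if r != i]

def det(M):
    """Integer determinant via Laplace expansion along the first row."""
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det(minor(M, 0, j)) for j in range(len(M)))

def adjugate(M):
    """adj(M)[i][j] = (-1)^{i+j} * det of the (j,i) minor (transposed cofactors)."""
    n = len(M)
    return [[(-1) ** (i + j) * det(minor(M, j, i)) for j in range(n)] for i in range(n)]

# small integer ridge system A = X^T X + lam*I, b = X^T y (data pre-scaled to ints)
X = [[2, 1, 0], [1, 3, 1], [0, 1, 2], [1, 0, 1]]
y = [1, 2, 3, 1]
lam, d = 1, 3
A = [[sum(X[r][i] * X[r][j] for r in range(4)) + (lam if i == j else 0)
      for j in range(d)] for i in range(d)]
b = [sum(X[r][i] * y[r] for r in range(4)) for i in range(d)]

adjA, detA = adjugate(A), det(A)
w_scaled = [sum(adjA[i][j] * b[j] for j in range(d)) for i in range(d)]  # det(A) * w
w = [Fraction(wi, detA) for wi in w_scaled]                              # actual model

# ranking by |weight| is invariant to the positive scaling by det(A) > 0
rank_scaled = sorted(range(d), key=lambda i: abs(w_scaled[i]))
rank_real = sorted(range(d), key=lambda i: abs(w[i]))
```

Since A = X^T X + λI is positive definite, det(A) > 0, so the scaling never flips the order.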

Our Ranking Protocol
We rank the features in the scaled ridge model of the current iteration by their absolute value. This is done by ordering features according to their "size", measured as their distance from the nearest multiple of N. This measurement of the "size" of a feature can be thought of as treating values larger than (N − 1)/2 as negative, and comparing the absolute values of the features. Denote the current scaled model by w, with entries w_i for the surviving features i ∈ F. The ranking is computed as follows. First, S_1 masks each pairwise difference w_i^2 − w_j^2 mod N, with i, j ∈ F, by homomorphically adding to it a uniformly random integer r_{i,j} (modulo N), and sends the masked differences to S_2. Second, S_2 decrypts each masked difference w_i^2 − w_j^2 + r_{i,j} mod N, converts it to a binary representation (in the clear), encrypts this bit-by-bit, and sends the ciphertexts to S_1. The encryption of this binary representation is under a second key pair, generated by S_2 during setup, whose plaintext modulus is a small integer p larger than d. Third, S_1 homomorphically compares each encrypted masked pairwise difference w_i^2 − w_j^2 + r_{i,j} with his cleartext r_{i,j} (where both are in binary representation, and while accounting for all possible overflows), producing an encrypted outcome bit b_{i,j} which is equal to 1 if-and-only-if w_i^2 > w_j^2, and zero otherwise. Then, for each i, S_1 homomorphically sums up the bits b_{i,j} over all j to obtain a ciphertext encrypting Ord_i, which is the number of indices j with magnitude w_j^2 smaller than w_i^2 (we assume all weights are distinct), and sends all these ciphertexts to S_2. Fourth, S_2 decrypts all ordinals, computes the set of surviving features according to the Ord_i's (i.e., the set of highest-ranking features), and sends the indicator vector of this set (in the clear) to S_1.
Importantly, in order to support fast homomorphic summation of these encrypted comparison results, we instructed S_2, when encrypting the weights in binary representation, to use a small plaintext modulus p, where p > d is larger than the number of comparison results to be summed up. Since p > d, there is no overflow when computing Ord_1, ..., Ord_d, and so they are a permutation of {1, ..., d} ranking the entries of w by their size. The Boolean operations computed during the aforementioned homomorphic comparison are then emulated in Z_p (defining, for a, b ∈ {0, 1}, AND(a, b) = a · b mod p and XOR(a, b) = (a − b)^2 mod p). This at most doubles the multiplicative depth of the comparison, for the benefit of making the summation a linear function that requires no multiplications.
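In the clear, the ordinal computation amounts to counting pairwise wins between squared weights. The following toy sketch (ours; the additive masking is shown, the encryption and bit-wise comparison are omitted, and the ordinals are 0-based) mirrors the mask-and-compare structure:

```python
import secrets

N = 2**127                        # large modulus (toy stand-in for the real N)
w = [-9, 4, -2, 7, 1]             # cleartext stand-in for the scaled model weights
d = len(w)

# S_1 additively masks each pairwise difference w_i^2 - w_j^2 mod N
masks = {(i, j): secrets.randbelow(N) for i in range(d) for j in range(d) if i != j}
masked = {(i, j): (w[i] ** 2 - w[j] ** 2 + masks[i, j]) % N for (i, j) in masks}

# S_2 "decrypts" the masked differences; S_1 compares against its own mask
# (done homomorphically over the bit representations in SIR; in the clear here)
def greater(i, j):
    diff = (masked[i, j] - masks[i, j]) % N
    if diff >= N // 2:            # treat large residues as negative
        diff -= N
    return diff > 0               # w_i^2 > w_j^2 ?

# Ord_i = number of indices j whose squared weight is smaller than w_i^2
ord_ = [sum(greater(i, j) for j in range(d) if j != i) for i in range(d)]
```

Because the weights are distinct, the ordinals form a permutation, so S_2 can pick the top-ranked features without ever seeing the weights themselves.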
The above (simplified) description of the ranking protocol overlooked the following subtle point: in each iteration, the servers learn the ordering between the features in the current iteration's model. If the same permutation on features were used in all iterations, this knowledge from the execution of multiple iterations might reveal non-trivial information (similar to the issues which arise by performing scaled ridge on unpermuted features). To overcome this, we have S_1 apply a fresh permutation on the Ord_i's before sending them to S_2. The permutation is inverted at the end of this step, and does not affect the rest of the computation. Privacy is preserved because each iteration uses a fresh permutation for ranking (which is applied on top of the long-term permutation applied at the onset of the protocol). See the full version [2] for details.

Security Challenges and Attacks
The iterative training of intermediate models is inherent to wrapper methods (it does not occur in filter or embedded methods), and introduces several security challenges. Indeed, we show that revealing any of the following (all of which are revealed in standard IR) would violate privacy (see the full version [2] for further details):
• The order by which features are removed.
• The intermediate models.
• The scaling factor det(A) in a scaled ridge model.
This demonstrates the necessity of the security measures implemented in SIR. Hiding this information while maintaining efficiency is a key challenge in designing SIR.

Discussion: Model Inversion Attacks
Our protocol (SIR) guarantees that only the output model (and whatever can be efficiently inferred from it) is revealed. This is the standard security goal in secure computation; however, it leaves open the question of how much information is revealed by the output model, and how to reveal even less.
Prior works [23,24,69] have shown that learning the model may indeed reveal non-trivial information on the private inputs of honest parties. Most relevantly, [24] showed, in the context of genomic data for personalized medicine using a (non-sparse) linear regression model, that access to the trained model can be abused to infer private genomic attributes about individuals in the training dataset. Moreover, [24] showed that differential privacy is not effective for guaranteeing privacy in this setting because it is at odds with utility: when setting the privacy parameters high enough to prevent their attack, the produced model does not retain sufficient clinical efficacy.
SIR is likewise susceptible to model inversion attacks. This is because sparsity of the model does not in itself guarantee security against inversion attacks, since even a single bit of information can compromise privacy. For example, in the context of gene-expression data, certain genetic variants are more common in some ethnicities than in others, and so revealing whether a feature corresponding to such a genetic variant is selected in the model may reveal the ethnicity of the population represented in the training dataset. Concretely, we present a model inversion attack showing that corrupt data owners who inject maliciously crafted inputs (but otherwise follow the protocol) can infer from the output model non-trivial information on the inputs of the honest data owners. The attack holds even when the output is a 1-sparse model, i.e., it consists of one selected feature. At a high level, corrupted data owners use "balanced" inputs, in which all features have the same correlation with the response vector. Thus, inputs provided by corrupted parties do not affect the output model, and any correlations in the inputs of the honest parties will determine which features are selected in the sparse output model. See the full version [2] for further details.
To reduce the privacy risk associated with revealing the model, our solution SIR offers fine grained control on how, and to whom, the model is revealed. This can be obtained by applying the following minor changes in Phase IV ("Obliviously un-permute and rationally reconstruct").
Alternative 1: Revealing the output model in cleartext to all participants of the protocol (but nothing more). This is the outcome when executing the protocol as specified in Section 2.1.
Alternative 2: Keeping the model encrypted, while supporting oblivious inference. To run inference on a new sample, the servers homomorphically compute the inner product of the model and the sample, homomorphically mask the resulting value in Z_N (using an additive mask), and send the ciphertext to S_2 for decryption; then unmask the returned plaintext and apply rational reconstruction to obtain the inference result as a rational number.
See the full version [2] for further details. We note that in order to reduce the efficacy of model inversion attacks, it is advised to enforce a security policy restricting the entities authorized to make inference queries, and limiting their number of allowed queries.
Analyzing such security measures is beyond the scope of this work.

Comparison to Prior Techniques
The two-server approach for privacy-preserving ridge regression dates back to the work of Nikolenko et al. [49], who were also the first to propose using additive homomorphic encryption for merging the data from all data owners. We follow them in our Setup & Upload phase. Solving the linear system (A, b) was done in [49] using garbled circuits to guarantee security. Giacomelli et al. [28] proposed instead to solve a masked linear system (A_masked, b_masked), where S_1 masks the system using homomorphic addition, and S_2 decrypts, solves, and sends the solution (the model), in cleartext, to S_1, who unmasks and applies rational reconstruction in the clear. Importantly, in [28] the model is sent in the clear, and both the unmasking and the rational reconstruction are done in the clear. However, this is impossible in our setting: we cannot expose the intermediate models in the clear, because this compromises privacy, as we show in our attacks. Moreover, we cannot simply have S_2 send the model in encrypted form and have S_1 process it homomorphically, because the rational reconstruction step requires many sequential steps, making it computationally expensive to compute homomorphically.
Blom et al. [10] proposed avoiding rational reconstruction by having S_2 send adj(A_masked) · b_masked and det(A_masked) (in the clear), where S_1 unmasks to obtain adj(A) · b and det(A), and then computes the model A^{-1} · b = adj(A) · b / det(A) with division over the reals. However, we cannot follow their approach, because we cannot send these values in the clear (as this would violate privacy, as we show in our attacks). Moreover, we cannot send the values in encrypted form and have S_1 homomorphically compute the division, because computing division homomorphically (be it modulo N or over the reals) is computationally expensive, and thus not a viable alternative. In fact, even in the final iteration, where the model can be revealed, our attacks demonstrate that it still violates privacy to reveal the pair (adj(A) · b, det(A)) rather than only their ratio adj(A) · b / det(A). We therefore cannot employ the approach of [10]. Nonetheless, their approach inspired us in proposing our scaled ridge regression protocol.
Our proposed scaled ridge regression offers a new formulation, which eliminates the need for both rational reconstruction and division. In our protocol, S_2 sends, in encrypted form, the pair adj(A_masked) · b_masked and det(A_masked), which S_1 homomorphically unmasks to obtain ciphertexts for the scaled model w_scaled = adj(A) · b. Next, S_1 homomorphically ranks the features of this scaled model, without computing any division or rational reconstruction. This is novel to our work, and may be of independent interest, with potential usage in other privacy preserving solutions using ridge regression as a component in a larger computation.
The overall structure of our protocol is likewise novel to our work, including its components of obliviously permuting and un-permuting the features (Phases II and IV in SIR), as well as the iterative execution of privacy-preserving regression for the ranking and removal (Phase III in SIR). We note that this overall structure necessitates using a fully homomorphic encryption (FHE) scheme, e.g., BGV [13] or B/FV [12,21], to support both homomorphic addition and multiplication with respect to our plaintext modulus N (cf. only additive homomorphism in [10,28]). However, as observed in [4], existing FHE libraries (e.g., HElib [38] and SEAL [54]) only support a plaintext modulus of size up to 64 bits, whereas our protocol requires much larger integers (a 1260-bit integer in our implementation). To resolve this issue we follow [4], who suggested using the Chinese Remainder Theorem to represent each integer modulo N as a tuple of integers modulo small(ish) primes, as supported by these libraries.
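The CRT trick from [4] can be sketched as follows (our own toy version with three small primes; a real deployment would use many primes of up to 64 bits whose product exceeds the required 1260-bit modulus):

```python
from math import prod

primes = [10007, 10009, 10037]      # pairwise-coprime small prime moduli
N = prod(primes)                    # the large plaintext modulus they emulate

def to_crt(x):
    """Split x mod N into its residues modulo each small prime."""
    return [x % p for p in primes]

def crt_mul(xs, ys):
    """Multiply component-wise: each FHE instance works modulo one small prime."""
    return [a * b % p for a, b, p in zip(xs, ys, primes)]

def from_crt(residues):
    """Standard CRT reconstruction of the value modulo N."""
    x = 0
    for r, p in zip(residues, primes):
        q = N // p
        x = (x + r * q * pow(q, -1, p)) % N
    return x

x, y = 123456789, 987654321
product_mod_N = from_crt(crt_mul(to_crt(x), to_crt(y)))
```

Ring additions and multiplications commute with this residue representation, so the homomorphic computation can proceed independently per prime and be recombined only at the end.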
For a vector v, we denote the number of its non-zero entries by nnz(v), and call v k-sparse if nnz(v) ≤ k. For a vector v (similarly, a set S), we use |v| (|S|) to denote its length (size). For a matrix M, we use M_i to denote its i'th row, and M^T to denote its transpose. For a matrix M, we use adj(M), det(M) to denote the adjugate matrix and determinant of M, respectively. We note that for any pair M, M′ of square matrices it holds that adj(M·M′) = adj(M′)·adj(M), and det(M·M′) = det(M)·det(M′). For natural n, N ∈ N, we use GL(n, Z_N) to denote the group of all invertible n × n matrices with entries in Z_N.
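Note that the adjugate reverses the order of a product, adj(M·M′) = adj(M′)·adj(M), while the determinant is multiplicative; the snippet below (ours, for illustration only) checks both identities on 2×2 integer matrices, where adj([[a,b],[c,d]]) = [[d,−b],[−c,a]].

```python
# Sanity check of the adjugate/determinant identities for 2x2 integer matrices.
def adj2(M):
    (a, b), (c, d) = M
    return [[d, -b], [-c, a]]

def det2(M):
    (a, b), (c, d) = M
    return a * d - b * c

def matmul2(M, N):
    return [[sum(M[i][k] * N[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

M = [[2, 3], [1, 4]]
N = [[5, 1], [2, 7]]
# det is multiplicative, and the adjugate reverses the order of the product:
assert det2(matmul2(M, N)) == det2(M) * det2(N)
assert adj2(matmul2(M, N)) == matmul2(adj2(N), adj2(M))
```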
For a real value x, we use abs(x) to denote its absolute value, and ⌊x⌉ to denote its nearest integer. We extend these notations to apply to vectors and matrices entry-by-entry. We say that x ∈ R has precision ℓ if x is given as a real number with ℓ digits after the decimal point (which could be 0). If x has precision ℓ, then by scaling x to lie in Z we mean multiplying x by 10^ℓ.
For n ∈ N, [n] denotes the set {1, 2, …, n}, and Z_N denotes the ring of integers modulo N. In our protocols, elements of Z_N are represented using the integers 0, 1, …, N−1 (this is without loss of generality). We treat values in Z_N that are greater than or equal to N/2 as negative. In particular, this allows us to define the "size" (alternatively, the "absolute value") of a ring element as its distance from the nearest multiple of N. For example, 1 is considered to be smaller than N−2 ≡ −2 mod N.
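In code, this signed reading of Z_N looks as follows (a minimal sketch; the function names are ours):

```python
# Signed interpretation of Z_N: elements >= N/2 are read as negative, and
# the "size" of an element is its distance from the nearest multiple of N.
def lift(x, N):
    """Centered representative of x mod N, lying in (-N/2, N/2]."""
    x %= N
    return x - N if x >= N / 2 else x

def size(x, N):
    """Distance of x mod N from the nearest multiple of N."""
    return abs(lift(x, N))

N = 101
assert lift(N - 2, N) == -2
assert size(1, N) < size(N - 2, N)   # 1 is "smaller" than -2 mod N
```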
For random variables R, R′, we write R ≈ R′ to denote that R and R′ are computationally indistinguishable. negl(κ) denotes a function which is negligible in κ, and PPT is shorthand for Probabilistic Polynomial Time. We use the standard notion of computational indistinguishability (e.g., from [31]). Fully Homomorphic Encryption (FHE) [27,53] is an encryption scheme that allows computations to be performed over encrypted data ("homomorphic computation"), producing an encrypted version of the result. The computation is specified by an arithmetic circuit over a ring called the plaintext ring. Our protocol employs FHE as a black box. See a tutorial in [37].

Sparse Linear Regression & Applications
Linear regression is an important and widely-used statistical tool for modeling the relationship between properties of data instances x_i ∈ R^d (features) and an outcome y_i ∈ R (response) using a linear function ŷ = x^T w (the feature vectors are augmented with an additional first entry set to 1, as is standard). Training a regression model takes n data instances (x_i, y_i) ∈ R^{d+1} × R and returns a model w ∈ R^{d+1} that minimizes a loss function, e.g., the Mean-Square-Error (MSE):

  w* = argmin_w (1/n)·‖Xw − y‖₂²   (1)

where the rows of the matrix X are the n (augmented) vectors x_i ∈ R^{d+1}. As we will discuss below, regularizing the solution w to Equation 1 is often beneficial, leading to LASSO regression [9,60] and ridge regression, both of which are special cases of controlling the norm of w. Ridge regression [4,28,41] seeks to find

  w* = argmin_w ‖Xw − y‖₂² + λ‖w‖₂²   (2)

where notation is as above and λ ≥ 0 is the regularization (hyper-)parameter. LASSO seeks to find:

  w* = argmin_w ‖Xw − y‖₂² + λ‖w‖₁

In certain cases it is desired (or even required) that the output model w be sparse. That is, we seek a model w with many zero coefficients. Even stronger (due to hardware limitations, for example), we may seek a model with a fixed number of features. The latter is called L0 sparsity, and leads to the following optimization task:

  w* = argmin_{w : nnz(w) ≤ k} ‖Xw − y‖₂²   (3)

where nnz(w) denotes the number of non-zero entries in w, and k ∈ N is the sparsity (hyper-)parameter. This task is the one addressed in this work, and is referred to as sparse linear regression.
In typical datasets, learning sparse linear models is useful for two main reasons. First, simpler models are preferred during the training stage to avoid overfitting [9,55]. Lower complexity translates to lower-degree polynomial models and/or fewer features in the output model; the latter reduces to model sparsity. The second reason to prefer sparser models is practical. In some cases, hardware limitations restrict the number of features that can be measured when using the prediction model in the execution phase, i.e., when it is used to predict values ŷ for new instances. In other cases, using more features in the prediction model is more expensive. For example, if y represents tumor severity, it might be reasonable to assume that y can be expressed as a linear (or polynomial) combination of molecular genomic information, say gene expression levels, in X. However, we expect, from a biological perspective, most genes to minimally affect the prediction performance. That is, the biology will be driven by a small number of genes. Therefore, most components of w can be zero, so that an assay used in clinical practice, based on such a predictive model, can use less expensive hardware, quantifying the expression levels of fewer genes [7,20,25,50,66].

Feature Selection and Iterated Ridge
Feature selection is an essential component in computational modelling and in the practical application of models. It has therefore been an active and prolific field of research in various domains such as pattern recognition, machine learning, statistics and data mining [46,47]. Clever selection of the set of features to be used for data modelling, and as part of the execution models derived from learning, has been shown to improve the performance of supervised and unsupervised learning. Reasons are discussed above, as well as in the literature [7,20,36,51]. Feature selection methods can be classified into several types based on the employed techniques, as discussed in Section 1. In this work we focus on a variant of Recursive Feature Elimination (RFE) [36], a wrapper approach. A detailed description of the approach, in the clear, follows. Our approach is an iterative one that starts with all features and iteratively removes features. This is similar to [25], which developed a sparse logistic regression model using RFE. In each iteration we run ridge regression with λ ≥ 0 [4,55] to calculate the weights for all features considered. Then, we remove the features with low weights (in absolute value). The algorithm operates in two phases. In the first phase, we remove a p_rej-fraction of the features in each iteration (a 0.1-fraction in our experiments); when the current number of features decreases below a (user-defined) threshold thr, we move to the second phase, in which we remove a single feature in each iteration (this latter phase is analogous to Backward Subset Selection). The choice of the actual value of thr and the choice of the fraction removed in the early stages can affect the computational complexity of the process. Moreover, they are hyper-parameters of the model and can be tuned by cross-validation. The pseudo-code of this algorithm is given in Figure 1.
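The cleartext procedure described above can be sketched as follows (all names and the tiny Gaussian-elimination solver are ours, for illustration only; the paper's pseudo-code is in Figure 1):

```python
# Cleartext Iterated Ridge sketch: repeatedly fit ridge, drop lowest-|weight|
# features, until only k features remain.
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col and M[r][col]:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ridge(X, y, lam):
    """Solve (X^T X + lam*I) w = X^T y."""
    d = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) + (lam if i == j else 0)
          for j in range(d)] for i in range(d)]
    b = [sum(r[i] * yv for r, yv in zip(X, y)) for i in range(d)]
    return solve(A, b)

def iterated_ridge(X, y, k, lam=0.1, p_rej=0.1, thr=5):
    omega = list(range(len(X[0])))            # surviving features
    while len(omega) > k:
        Xo = [[row[j] for j in omega] for row in X]
        w = ridge(Xo, y, lam)
        # drop a p_rej-fraction while |omega| > thr, otherwise one at a time
        n_del = 1 if len(omega) <= thr else max(
            1, min(int(p_rej * len(omega)), len(omega) - k))
        order = sorted(range(len(omega)), key=lambda i: abs(w[i]))
        drop = set(order[:n_del])
        omega = [f for i, f in enumerate(omega) if i not in drop]
    Xo = [[row[j] for j in omega] for row in X]
    return omega, ridge(Xo, y, lam)

# y depends only on feature 2, so IR with k=1 should keep exactly that feature:
X = [[1, 0, 2], [0, 1, 4], [1, 1, 6], [0, 0, 8], [1, 0, 10]]
y = [row[2] for row in X]
omega, w = iterated_ridge(X, y, k=1)
assert omega == [2]
```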

PROBLEM STATEMENT
We follow the security and threat model of [4], and parts of this section are taken almost verbatim from [4]. SIR guarantees computational security, in the passive setting, against a single server colluding with a proper subset of the data owners. More specifically, we assume all parties, even corrupted ones, are PPT and follow the protocol (though corrupted parties may try to infer additional information). We guarantee correctness of the output, and privacy of the inputs, in this setting. Specifically, the only information revealed to the corrupted parties is the leakage profile, namely the information that is explicitly revealed by the protocol. In our protocols, the leakage profile consists of the output model w, as well as the following public parameters: the number n of data instances; the number d of features; the precision ℓ; a sparsity parameter k; and a regularization parameter λ ≥ 0. More formally, we consider t-privacy in the passive setting, for inputs X such that A = X^T X + λI is invertible in the ring Z_N (invertibility is needed for IR correctness; and in our case, as in previous works [4,28], also for privacy). We note that the input is horizontally partitioned between the data owners (i.e., data owners hold disjoint subsets of the rows of (X, y)).
Terminology. Let Π be an (m+2)-party protocol executed between PPT data owners DO_1, …, DO_m and PPT servers S1, S2. We assume that every pair of parties shares a secure point-to-point channel, and that all parties share a broadcast channel. We also restrict attention to protocols in which all parties obtain the same output, and only the data owners have inputs. For inputs x_1, …, x_m of DO_1, …, DO_m, we use Π(x_1, …, x_m) to denote the random variable describing the output in a random execution of Π (the probability is over the randomness of all participating parties, including the servers). For C ⊂ {DO_1, …, DO_m, S1, S2}, the (joint) view of C in Π, denoted V^Π_C(x_1, …, x_m), is the random variable consisting of the inputs and randomness of all parties in C, as well as the messages they received from the honest parties, in a random execution of Π with inputs x_1, …, x_m. We say a subset C ⊆ {DO_1, …, DO_m, S1, S2} is t-permissible if it contains at most t data owners, and at most one of the servers.
Security notion. We consider standard computational security against a passive adversary (see, e.g., [31]), adapted to the setting of non-colluding servers as in [49]. Since optimal feature selection under L0 constraints (Equation 3) is NP-hard in general [14], we focus on providing a secure variant of the Iterated Ridge heuristic approach (see Section 3.1). Specifically, we require correctness in the sense that the secure variant has the same output as the cleartext iterated ridge algorithm, and privacy in the sense that any t-permissible set C learns nothing except the leakage profile (which consists of the public parameters and the output model) and the inputs of the parties in C (and anything efficiently computable therefrom). We also offer the option of returning an encrypted model (cf. Section 2.5), in which case the output model is excluded from the leakage profile and the adversary learns nothing beyond the public parameters and the inputs of the parties in C (and anything efficiently computable therefrom). Following [28], we define correctness with respect to a subset T of inputs (there is no correctness guarantee for inputs not in T). Formally, Definition 4.1 (Secure Iterated Ridge Implementation). Let m, t ∈ N, let κ be a security parameter, let D, R be an arbitrary domain and range, let f: D^m → R be an iterated ridge algorithm (e.g., the algorithm of Figure 1), and let T ⊆ D^m. We say that an (m+2)-party protocol Π is a secure iterated ridge implementation of f with t-privacy for inputs in T with leakage profile L if: (1) Correctness: there exists a negligible function negl(κ) such that for all inputs (x_1, …, x_m) ∈ T, Pr[Π(x_1, …, x_m) ≠ f(x_1, …, x_m)] ≤ negl(κ), where the probability is over the randomness of the parties. (2) Privacy: for every t-permissible C there exists a PPT simulator Sim such that for every (x_1, …, x_m) ∈ T: V^Π_C(x_1, …, x_m) ≈ Sim({x_i : DO_i ∈ C}, L).

SIR PROTOCOL
Our privacy-preserving iterated ridge protocol SIR is specified in Figures 2-3 (with further details available in the full version). See also an overview in Section 2; remarks on input encoding, parameter choice, and useful observations appear in Section 5.1. Our security and complexity analysis of SIR is summarized below. Complexity is stated in terms of d and N, where log N = O(d log d) (by Equation 4).
Complexity. Let E = (Gen, Enc, Dec, Eval) be the homomorphic encryption scheme with which SIR is instantiated, and let Z_N be the plaintext ring used. Then:
• Each data owner's runtime is dominated by the time to compute d² encryptions, and her communication complexity is dominated by transmitting d² ciphertexts (in one round).
• S1's runtime is dominated by the time to rank features, which entails homomorphically evaluating O(log d) circuits, each with O(d² log N) = O(d³ log d) multiplication gates and of multiplicative depth O(log log N) = O(log d + log log d).
• S2's runtime is dominated by the time to solve (in the clear) O(log d) linear systems of size d × d.
• The communication of the two servers consists of O(log d) communication rounds, transmitting O(d² log N) = O(d³ log d) ciphertexts in each round.
See the full version [2] for the proof.

Input, Parameters and Observations
Remarks clarifying some implementation details follow.
Remark 5.2 (Input representation and encoding). We assume that the dataset entries are in the range [−1, 1], given with ℓ-digit precision. The inputs are scaled to lie in Z_N for a sufficiently large N (for the choice of N, see Remark 5.3). All subsequent computations in the protocol are performed in Z_N, or in Z_t for some t ≥ d; note that t is much smaller than N. All inputs (and intermediate values generated during the computation) are encoded as in [4]. (We refer the interested reader to [4] for a detailed description of the encoding and its efficiency benefits.)

Remark 5.3 (On the choice of N). The plaintext ring Z_N should be sufficiently large to guarantee that all computations during the (scaled) ridge regression step emulate the corresponding computations over the reals (i.e., no overflows occur), as well as to allow for rational reconstruction to be performed on the output model at the end of the protocol. This can be guaranteed by using the same plaintext ring Z_N as [28]. However, for our selection protocol we need the modulus N to be at least the square of that value (see Equation 4), where n, d, ℓ, λ are as specified in Section 4.
Remark 5.4 (On the choice of the plaintext modulus t in SIR). For efficiency reasons, we would like to avoid (when possible) performing computations in the ring Z_N, since such computations would be heavy due to the size of N (as described above). Instead, in SIR we are able to use a smaller modulus t for some of the computations. We can make do with any t ≥ d which can be used as a plaintext modulus in the underlying FHE scheme.
Remark 5.5 (Dimension reduction and projection). Our protocol iteratively reduces the set F of current features (i.e., the ones that will be part of the output model), which is done in two steps as follows.
(1) Resetting the entries of A, b that correspond to entries in [d] \ F, by setting to zero the rows and columns of A (the entries of b, respectively) that are indexed by i ∉ F, resulting in a matrix A′ and a vector b′. (2) Projecting A′, b′ to F by erasing the rows and columns of A′ (the entries of b′, respectively) indexed by i ∉ F. We denote this operation by pjct_F(·) (it can be applied to a matrix or a vector); namely, we compute pjct_F(A′), pjct_F(b′).
Step (1) is obtained by multiplying with a nullifier matrix N_F, which is defined as follows: N_F ∈ Z^{d×d} is obtained from the d × d identity matrix by resetting the diagonal entries in all rows indexed by i ∉ F (we omit F from the notation when it is clear from the context). More specifically, we set (N_F)_{i,i} = 1 if i ∈ F, and all other entries of N_F to 0. Notice that for any matrix M, multiplying by N_F from the left (right, respectively) resets the rows (columns, respectively) indexed by i ∉ F, and similarly when multiplying a vector v by N_F from the left. Another operation which will be used in our protocols is an expansion from dimension |F| to dimension d. Specifically, we define expd_F(·) such that, on input an |F| × |F| matrix M (a length-|F| vector v, respectively), it returns the d × d matrix M′ (length-d vector v′, respectively) such that for every i ∉ F the i'th row and column in M′ (the i'th entry in v′, respectively) is 0, and additionally M = pjct_F(M′), v = pjct_F(v′).
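The nullify, project and expand operations can be illustrated on plain Python lists (the names below are ours, following the remark):

```python
# Nullify (multiply by N_F on both sides), project (pjct_F) and expand
# (expd_F) for square matrices, on plain lists of lists.
def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def nullify(A, F, d):
    # N_F is the identity with diagonal entries outside F reset to zero.
    N = [[1 if i == j and i in F else 0 for j in range(d)] for i in range(d)]
    return matmul(N, matmul(A, N))   # zeroes rows and columns outside F

def pjct(A, F):
    idx = sorted(F)
    return [[A[i][j] for j in idx] for i in idx]

def expd(M, F, d):
    idx = sorted(F)
    pos = {f: r for r, f in enumerate(idx)}
    return [[M[pos[i]][pos[j]] if i in F and j in F else 0
             for j in range(d)] for i in range(d)]

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
F = {0, 2}
assert pjct(nullify(A, F, 3), F) == [[1, 3], [7, 9]]
assert pjct(expd(pjct(A, F), F, 3), F) == pjct(A, F)   # round-trip
```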
Remark 5.6 (Unique entries in intermediate models). Our security analysis relies on the assumption that for every intermediate model w_j computed in Step 3a of the SIR protocol (Figure 2), all entries are unique (i.e., if i ≠ i′ then w_{j,i} ≠ w_{j,i′}). This can easily be achieved as follows. First, when scaling the inputs in the setup phase, we incorporate ⌈log d⌉ additional "empty" least-significant bits; that is, instead of scaling an ℓ-precision real number by multiplying it by 10^ℓ, we multiply it by 10^ℓ · 2^{⌈log d⌉}. Then, at the end of each scaled ridge regression iteration (Figure 3) we replace the ⌈log d⌉ least-significant bits of w_{j,i} with the binary representation of i (this can be done because at this point the ciphertext is encrypted entry-by-entry).
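A toy sketch of this uniqueness trick (our code; we use 2^⌈log₂ d⌉ for the "empty" bits):

```python
# Leave ceil(log2(d)) empty low-order bits when scaling, then overwrite
# them with the feature index, so equal weights become distinct.
from math import ceil, log2

def scale_with_tag(value, ell, d, index):
    t = ceil(log2(d))
    scaled = round(value * 10**ell) << t   # scaling plus t empty bits
    return (scaled >> t << t) | index      # low bits := binary index

d, ell = 8, 3
entries = [scale_with_tag(0.5, ell, d, i) for i in range(d)]
assert len(set(entries)) == d              # identical weights become unique
assert all(e >> ceil(log2(d)) == round(0.5 * 10**ell) for e in entries)
```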
Remark 5.7 (Emulating Boolean circuits using arithmetic circuits). Our protocols embed Boolean values into a larger ring Z_t, and operate over these representations. This is done as follows: a ∧ b is implemented by computing a · b in Z_t; a ⊕ b is implemented by computing (a − b)² in Z_t. This perfectly emulates AND and XOR whenever a, b ∈ {0, 1} ⊆ Z_t. We negate a Boolean value a by computing 1 − a (where 1 is the identity of Z_t).
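The emulation can be checked exhaustively on Boolean inputs (t = 17 below is an arbitrary modulus, chosen only for illustration):

```python
# Arithmetic emulation of Boolean gates over Z_t, verified on all inputs.
t = 17
AND = lambda a, b: (a * b) % t          # a AND b  =  a*b
XOR = lambda a, b: ((a - b) ** 2) % t   # a XOR b  =  (a-b)^2
NOT = lambda a: (1 - a) % t             # NOT a    =  1-a

for a in (0, 1):
    for b in (0, 1):
        assert AND(a, b) == (a & b)
        assert XOR(a, b) == (a ^ b)
    assert NOT(a) == 1 - a
```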

SYSTEM AND EMPIRICAL EVALUATION
We implemented the SIR protocol as a system and ran experiments on real data to evaluate its concrete complexity and correctness (i.e., that output models match cleartext results). Furthermore, we compare the iterated ridge approach (as securely realized in SIR) to the filter and truncation approaches for feature selection, showing that it significantly outperforms them (see details in the full version).

Data
The Cancer Genome Atlas (TCGA), a landmark cancer genomics program, molecularly characterized over 20,000 primary cancer and matched normal human tissue samples spanning 33 cancer types. The program integrates contributions from many researchers coming from diverse disciplines and from multiple institutions. The data spans genomic, epigenomic, transcriptomic, and proteomic data measured on the aforementioned samples. We used a small portion of this data for our experiments. Concretely, we used randomly selected portions of one of the breast cancer transcriptomics data matrices. We start with a matrix with features normalized to lie in [−1, 1] with 3-digit precision. The matrix has 781 rows (samples/instances) and over 10^4 columns (features, representing genes) that support biological and clinical interpretation. TIL levels quantify tumor-infiltrating immune cells in the (tumor) samples.

Figure 2 (excerpt). Public parameters: an FHE scheme E = (KeyGen, Enc, Dec, Eval), a security parameter κ, a regularization parameter λ, the dimensions n × d of the input matrix, a plaintext ring size N satisfying the requirements of Remark 5.3, the number of data owners m, positive input sizes n_1, n_2, …, n_m > 0 such that Σ_{i=1}^m n_i = n, a sparsity parameter k < d, a precision parameter ℓ, a parameter p_rej which determines the fraction of features that are removed in each iteration, and a threshold parameter thr which determines when the protocol starts to remove a single feature in each iteration. Additionally, let t ≥ d (see Remark 5.4). Output: all parties obtain as output a k-sparse model w ∈ R^d. Steps: (1) Setup: The parties execute the setup protocol (described in the full version [2]) to obtain keys pk_N, sk_N and pk_t, sk_t. Then, the parties execute the data uploading, merging and permuting protocols (described in the full version), and S1 obtains encryptions of the aggregated input matrix and response vector A, b. … • Large-set case: if |F| > thr, set d′ = ⌈p_rej · |F|⌉. (Intuitively, when |F| > thr the set of current features is still sufficiently large that we can remove a subset of features in each iteration.) • Small-set case: otherwise (i.e., |F| ≤ thr), set d′ = 1. (In this case, the set of current features is small, so features should be removed one at a time.) • S1 and S2 execute the selection protocol (see the full version [2]) to find the d′ features of smallest weight. S1 has input z, S2 has input sk_t, and both parties have input pk_N, pk_t, F, d′. The output of both servers is the set F_del consisting of the d′ features to be removed. S1 sets z to be its output for the phase.
To run an experiment with d features that is adequate for our current running-time complexity, we chose to work with 4, 10, 40 and 100 features. To generate a dataset for a given d, we randomly (uniformly) selected d features (columns of D) to be the columns of the data matrix X. We then set the target vector y to be the vector of TIL levels. To support bias, we add an extra column of 1's to X. When relevant, we evenly distributed the 781 samples between the data owners. As described in the protocol, each data owner i computed A_i = ⌊10^ℓ · X_i^T X_i⌉ and b_i = ⌊10^ℓ · X_i^T y_i⌉, which also scaled and rounded the values. Note that in our experiments on d features, A_i and b_i are therefore of dimensions (d+1) × (d+1) and (d+1), respectively. Solving for w continues in the same way as in Protocol 2 (computing the adjugate and determinant of a (d+1) × (d+1) matrix), but computing the ranks of the features involves only d features, because the bias is not treated as a selectable feature.
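The data owners' local preprocessing can be sketched as follows (our code; function names are ours, and the scaling convention ⌊10^ℓ · X_i^T X_i⌉ follows the text above):

```python
# Each data owner appends the bias column, then forms the scaled, rounded
# local Gram matrix A_i = round(10^ell * X^T X) and vector b_i = round(10^ell * X^T y).
def local_shares(X, y, ell):
    Xa = [[1.0] + row for row in X]            # bias column of 1's
    d1 = len(Xa[0])
    A = [[round(10**ell * sum(r[i] * r[j] for r in Xa)) for j in range(d1)]
         for i in range(d1)]
    b = [round(10**ell * sum(r[i] * yv for r, yv in zip(Xa, y)))
         for i in range(d1)]
    return A, b

X = [[0.5, -0.25], [0.125, 1.0]]               # two samples, two features
y = [0.75, -0.5]
A, b = local_shares(X, y, ell=3)
assert len(A) == len(A[0]) == len(b) == 3      # (d+1) x (d+1) and (d+1)
assert A[0][0] == 2000                         # two samples: 10^3 * 2
```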

System Implementation Details
Ring size. To obtain correctness in SIR, the ring sizes needed for our experiments exceed the sizes currently supported by FHE libraries. For example, for inputs with 40 features, we require a ring of size 1,260 bits, while current libraries support rings of size less than 64 bits. To support such large ring sizes, we encoded the input using the Chinese Remainder Theorem (CRT) with several distinct 30-bit primes, as was done in [4].
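A minimal sketch of the CRT encoding (our code; the small primes below are illustrative, whereas the implementation uses several distinct 30-bit primes):

```python
# Emulating a large plaintext ring Z_N via residues modulo coprime primes;
# ring operations act component-wise on the residue tuples.
from math import prod

primes = [97, 101, 103, 107]
N = prod(primes)

def crt_encode(x):
    return [x % p for p in primes]

def crt_decode(residues):
    x = 0
    for p, r in zip(primes, residues):
        q = N // p
        x = (x + r * q * pow(q, -1, p)) % N   # standard CRT reconstruction
    return x

a, b = 123456, 7890
ab = [(x * y) % p for x, y, p in zip(crt_encode(a), crt_encode(b), primes)]
assert crt_decode(ab) == (a * b) % N          # multiplication mod N
assert crt_decode(crt_encode(a)) == a         # round-trip
```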
Permuting the input. In a preliminary step, S1 and S2 participate in a protocol to permute the features of the input, where the permutation is the composition of P_1 and P_2, random permutations chosen by S1 and S2, respectively. There are several ways to compute this step. One option is for S1 to perform all the computations using FHE (with multiplicative depth 2). Another option, the one used in our implementation, is a protocol between S1 and S2 that uses masking and requires only additive homomorphism. This step is executed only once at the onset of the protocol, and its running time is inconsequential compared to the total protocol runtime (e.g., 1,152 out of 84,690 seconds on 40 features, see Table 1).
Ranking weights. To decide which entries of w to remove in each iteration, we compute a full ranking over the entries. This involves comparing pairs of entries, and our experiments show that the majority of the running time is spent in this step. For example, the total real time of SIR when d = 40 was 84,690 seconds, of which computing the ranks took 76,184 seconds (see Table 1). To compare a single pair z_i, z_j of entries, S1 compares z_i² − z_j² + r to ranges received from S2, to determine whether z_i² − z_j² ∈ [0, N/2] (and hence |z_i| ≥ |z_j|). In our implementation, S1 performed all ~d²/2 pairwise comparisons. An alternative is to use an approach similar to Bitonic sort (or Batcher sort, see [5]). However, while a sorting approach performs only ~(d/2)⌈log₂ d⌉² comparisons, it is less amenable to parallelization. Specifically, the sorting approach is faster only for sufficiently large d, so for the number of features and CPUs in our experiments the naive all-pairs approach was faster. We stress that to utilize the Bitonic sort alternative (which would be faster for larger d's), one only needs to implement the comparison step using Bitonic sort, making our protocol flexible, efficiently supporting both regimes of d.

Cryptographic libraries. At a high level, the steps of our protocol can be categorized into two types with different characteristics:
• Ranking steps: computing the ranks of features in w, when the entries are given in binary. These steps involve a sub-circuit of large depth for sorting d large numbers in binary representation, e.g., 1,260-bit numbers when d = 40. This requires a key that supports large-depth circuits, and possibly also bootstrapping. We note that the plaintext modulus needed by these steps is relatively small.
• CRT steps consist of all other steps, and operate over numbers that are represented using CRT. The CRT steps of the protocol require only additive homomorphism, but necessitate multiple (co-prime) plaintext moduli to implement the CRT. Preferably, these plaintext moduli should be as large as possible, because larger moduli mean fewer elements in the CRT encoding.
We used two different FHE libraries (with different schemes) to implement these two types of steps. For ranking we use BGV in HElib [38] 2.1.0, because it supports bootstrapping (unlike SEAL). For CRT steps we use B/FV in SEAL [54] 4.0, in which it is easier to configure keys for multiple plaintext moduli. Switching between the two libraries and schemes is done by S2, who generated the keys (pk_t, sk_t) and (pk_N, sk_N) in HElib and SEAL, respectively. In detail, in HElib 2.1.0 we initialize the keys while setting the plaintext modulus to 17², and the cyclotomic polynomial degree to 78,881. This resulted in ciphertexts with 7,000 slots, supporting a multiplicative depth of 28, as well as bootstrapping. In SEAL we initialize the keys while setting the cyclotomic polynomial degree to 8,192, a chain length of 7, chain bits of 25 and prime bits of 30. This resulted in ciphertexts with 4,096 slots that support additive homomorphism.
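The squared-difference comparison underlying the ranking step can be illustrated in the clear (our code; in the protocol the comparison is performed obliviously on masked values):

```python
# For values encoded in Z_N with the signed convention, z_i^2 - z_j^2 mod N
# lies in [0, N/2] exactly when |z_i| >= |z_j|, assuming magnitudes are small
# enough that the squares do not wrap around N.
N = 2**31 - 1

def ge_abs(zi, zj, N):
    diff = (zi * zi - zj * zj) % N
    return 0 <= diff <= N // 2

vals = [5, -3, 0, 7, -7]            # signed values, encoded as v % N
enc = [v % N for v in vals]
for i in range(len(vals)):
    for j in range(len(vals)):
        assert ge_abs(enc[i], enc[j], N) == (abs(vals[i]) >= abs(vals[j]))
```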

Evaluation Setup
We executed SIR (Protocol 2) on the data generated from TCGA described in Section 6.1, and measured the performance of the data owners and servers. We executed two types of experiments: end-to-end and single-iteration experiments. In all experiments, p_rej was set to 10% and the ring size N was determined by Equation 4. End-to-end experiments measure performance when executing the entire SIR protocol, including the data encoding and all iterations, until producing a sparse model that selects k out of the d initial features from a dataset of n records distributed amongst m data owners. We ran end-to-end experiments on TCGA data with the number of features and data owners (d, m) varying between (4, 2), (10, 4) and (40, 8). Single-iteration experiments measure performance when executing a single iteration (Protocol 2, Step 3), i.e., computing scaled ridge, ranking the features, and removing the features with the smallest weights. The only step whose runtime is sensitive to the number of data owners is merging the data, which indeed increased by ten-fold (from 10 to 102 seconds) when the number of data owners grew ten-fold; however, this has only a minor influence on the overall runtime (which is dominated by the 8,048 seconds to rank features), and so a ten-fold increase in the number of data owners led to an increase in the total runtime of roughly 1%.
RAM. The RAM usage of our system was significantly lower than the allocated RAM, ranging from 21GB to 134GB. Furthermore, our experiments indicate that the RAM requirement grows sub-linearly in the measured values of d (e.g., from 43GB when d = 10 to 134GB when d = 40). This is because our ciphertexts had more slots than needed for our encodings (for small d's), so the total number of ciphertexts grew only mildly in d.
Parallelization. As d grows, most tasks become more parallelizable.
In particular, the ranking task, which dominates the bulk of the runtime, is "embarrassingly" parallelizable, since we make all d(d−1)/2 comparisons and these can be executed in parallel. We note that if we had an unlimited number of CPUs, the ranking runtime would essentially be the time of a single comparison. For example, when (d, N) = (40, ~2^1260) (cf. Table 2, bottom row), our ranking utilized an average of 91 CPUs (the system had 100 CPUs), whereas having access to d(d−1)/2 CPUs is expected to improve the ranking runtime by a factor of (40·39/2)/91 ≈ 8.6, thus improving the total runtime by 6×. We remark that an alternative way to compute the ranking would be to homomorphically evaluate an oblivious sorting algorithm (e.g., Bitonic sort). Bitonic sort would require fewer comparisons, O(d log² d); but this type of sorting admits a circuit structure having O(log² d) layers, which is less amenable to parallelization. For 40 features and 100 CPUs, this alternative would have a higher runtime.
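A back-of-the-envelope comparison of the two strategies (our arithmetic, for illustration):

```python
# All-pairs comparisons vs. a bitonic sorting network: the former does more
# comparisons but they are all independent; the latter does fewer comparisons
# spread over ~log^2(d)/2 inherently sequential layers.
from math import ceil, log2

def all_pairs(d):
    return d * (d - 1) // 2                  # one comparison per unordered pair

def bitonic_layers(d):
    k = ceil(log2(d))
    return k * (k + 1) // 2                  # sequential layers of the network

d = 40
assert all_pairs(d) == 780                   # all doable in one parallel layer
assert bitonic_layers(d) == 21               # sequential even with many CPUs
```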

CONCLUSIONS
We present a privacy-preserving multi-party protocol for running sparse linear regression in a federated learning setup, based on an iterated ridge framework. Our protocol enjoys rigorous security guarantees, and scales favorably with the number of records and data owners. Moreover, our protocol naturally yields a privacy-preserving ridge truncation protocol, which is less accurate (as we have shown), but simpler and faster, and may therefore be preferred in some cases.
The design of our protocol is based, amongst other considerations, on certain potential attacks that can be mounted when partial or intermediate information is leaked. In particular, we show that revealing the order in which features are removed can be used to infer non-trivial information about the input data. We also extend this attack to other leakages, such as revealing intermediate models or the determinants of the intermediate A matrices. SIR is susceptible to model inversion attacks (cf. Section 2.5); determining the exact extent to which such attacks are harmful, and devising measures to protect against them, are left for future work.
We mostly focus on the case d ≤ n. In particular, our security proof addresses this case, and furthermore requires the matrix A to be invertible. Our protocol can be adjusted for settings with d > n by combining SIR with a faster learning method (e.g., a filter) to first partially reduce the number of features. One can similarly perform a preparatory step to handle the case in which X is not of full rank.

Figure 1 :
Figure 1: Iterated Ridge (IR). Input: a dataset X ∈ R^{n×d} and a target vector y ∈ R^n (where entries are normalized to the same scale), and parameters λ, p_rej ∈ (0, 1), thr ∈ [d]. Output: a set Ω_k of the k′ ≤ k selected features, and a ridge regression model w_k on these features. Notations: the regression subroutine returns the solution of the linear regression system given by (X, y) when using only the features in Ω; nnz(w) denotes the number of non-zero elements in w; the function argsort sorts the indices of an array according to the values it contains; abs(w) returns the absolute value of each entry of w. The regularization parameter λ is implicit in this pseudo-code.
(c) Compacting data for the next iteration: let F* = F \ F_del. Then S1 locally executes the compacting algorithm (described in the full version [2]), with input F*, pk_N, A and b (encrypting A and b, respectively). The output of S1 consists of the updated (compacted) A_{F*}, b_{F*}. (d) Set F = F*. (4) Output: the parties execute the computing and unpermuting output protocol (described in the full version [2]) to obtain the k-sparse model w.

Phase III: At the onset of this phase, S1 initializes the set F of surviving features to be all features, i.e., F = [d]. Denote by A|_F and b|_F the restriction of A and b to the surviving features F (i.e., include only the rows of A and b whose indices are in F, and likewise for the columns of A). While the number of features in F exceeds the L0 constraint, features are removed as described above.

Alternative 2: revealing the model only to one designated party, denoted O, who may be an external party participating only in Phase IV, as follows. S1 sends the (encrypted and un-permuted) model to O (instead of S2); O homomorphically masks the model with a random additive mask, and sends the encrypted masked model to S2, who decrypts and returns the (cleartext) masked model to O; finally, O removes the masking and computes rational reconstruction to obtain the output model. This is motivated by scenarios where S1, S2 are powerful servers to which computation is outsourced (say, Amazon AWS and Google Cloud) and where O is some health organization authorized to receive the computed model (say, the World Health Organization). Alternative 3: outputting the model only in encrypted form. In this case S1 simply publishes the encrypted model, instead of sending it to S2 for decryption. The model in this case is specified by a vector of d ciphertexts whose plaintext values are in Z_N and where at most k of them are non-zero (in order to hide both the indices of the selected features and their weights). Outputting an encrypted model is motivated by scenarios where the encrypted model is employed for privacy-preserving inference using homomorphic computation, given the encrypted model and a (possibly encrypted) sample.

Inputs: for every i ∈ [m], the input of data owner DO_i consists of a data matrix X_i ∈ R^{n_i × d} and a response vector y_i ∈ R^{n_i}. We denote by (X | y) the combined input obtained from all X_i, y_i. That is, [n] is partitioned into m subsets I_1, …, I_m ⊆ {1, …, n}, and X_i, y_i is the restriction of X, y to the rows in I_i. (Here, X and y are scaled to lie in Z, and then embedded in Z_N for a sufficiently large N; see Remark 5.3.) The servers S1, S2 have no input.