Kanav Gupta*, Deepak Kumaraswamy, Nishanth Chandran, and Divya Gupta

LLAMA: A Low Latency Math Library for Secure Inference

Abstract: Secure machine learning (ML) inference can provide meaningful privacy guarantees to both the client (holding sensitive input) and the server (holding sensitive weights of the ML model) while realizing inference-as-a-service. Although many specialized protocols exist for this task, including those in the preprocessing model (where a majority of the overheads are moved to an input independent offline phase), they all still suffer from large online complexity. Specifically, the protocol phase that executes once the parties know their inputs, has high communication, round complexity, and latency. Function Secret Sharing (FSS) based techniques offer an attractive solution to this in the trusted dealer model (where a dealer provides input independent correlated randomness to both parties), and 2PC protocols obtained based on these techniques have a very lightweight online phase. Unfortunately, current FSS-based 2PC works (AriaNN, PoPETS 2022; Boyle et al. Eurocrypt 2021; Boyle et al. TCC 2019) fall short of providing a complete solution to secure inference. First, they lack support for math functions (e.g., sigmoid, and reciprocal square root) and hence, are insufficient for a large class of inference algorithms (e.g. recurrent neural networks). Second, they restrict all values in the computation to be of the same bitwidth and this prevents them from benefiting from efficient float-to-fixed converters such as Tensorflow Lite that crucially use low bitwidth representations and mixed bitwidth arithmetic. In this work, we present LLAMA – an end-to-end, FSS based, secure inference library supporting precise low bitwidth computations (required by converters) as well as provably precise math functions; thus, overcoming all the drawbacks listed above. We perform an extensive evaluation of LLAMA and show that when compared with non-FSS based libraries supporting mixed bitwidth arithmetic and math functions (SIRNN, IEEE S&P 2021), it has at least an order of magnitude lower communication, rounds, and runtimes. We integrate LLAMA with the EzPC framework (IEEE EuroS&P 2019) and demonstrate its robustness by evaluating it on large benchmarks (such as ResNet-50 on the ImageNet dataset) as well as on benchmarks considered in AriaNN – here too LLAMA outperforms prior work.

Keywords: Function Secret Sharing, Secure Inference, Secure Two-Party Computation

DOI 10.2478/popets-2022-0109

Received 2022-02-28; revised 2022-06-15; accepted 2022-06-16.

1 Introduction

One of the most important security problems in the area of machine learning is that of secure inference, wherein, a model owner $S$, holding a public model $M$ and its associated private weights $w$, offers inference-as-a-service to a client $C$, with private input data point $x$, with the security guarantee that $S$ learns nothing about $x$, while $C$ only learns $M(w,x)$, i.e. the output of the model on the input data, and nothing else. Being a special case of the generic problem of secure multi-party computation (MPC or 2PC in the 2-party case) [15, 41], this problem has received widespread attention – both within the cryptography community [19, 20, 28, 29, 31] as well as in several application domains such as healthcare [21, 36].

On the technical side, perhaps the biggest bottleneck towards the widespread adoption of cryptographically secure inference protocols, is their large communication and interaction cost. As an example, even if one were to execute secure inference on a simple 7-layer convolutional neural network (CNN) [27] on the CIFAR-10 dataset using a state-of-the-art system [31], one inference would cost approximately 340 MB of communication, 345 rounds of interaction, and would take about 10 seconds to execute on commodity hardware with good bandwidth. In light of this, works such as DELPHI [28] have crucially focused on reducing the online cost of such secure protocols – that is, where a bulk...
of the overheads can be moved to an input independent pre-processing phase. However, even here, the situation is far from satisfactory and the above benchmark would still require 196 MB of online communication and 2.7 seconds of online execution time. Further, inference algorithms that contain mathematical functions such as sigmoid and tanh are more expensive to compute securely; e.g. running a recurrent neural network (RNN) [26] on the standard Google-30 dataset [40] to identify commands, directions, and digits from speech costs 415 MB, \(\approx 60,000\) rounds, and \(\approx 37\) seconds [30].

1.1 Function Secret Sharing Based 2PC

A recently emerging trend to build secure computation protocols with low online cost is through the technique of function secret sharing (FSS) [5, 7]. In this paradigm, a trusted dealer provides input independent correlated randomness to the two parties executing the secure computation protocol. Here, round and communication optimal 2PC protocols are known for many functions such as addition, multiplication, comparison, and so on; thus, leading to 2PC protocols with low online cost for various algorithms [3, 7, 34]. This paradigm, which leads to dramatic improvements in online cost, is practically well motivated too [2, 4, 10, 21, 22] – typically the vendor providing or building the 2PC solution for the client and server in our above example is trusted to provide correct 2PC code and hence providing correlated randomness is not an additional burden.

While FSS-based 2PC protocols for secure computation [7], fixed-point arithmetic [3], and secure inference [34] are known, these works suffer from the following drawbacks.

Lack of Math Functions Support. First, they lack support for math functions (e.g. sigmoid, tanh, reciprocal square root) and thus are incapable of handling benchmarks such as the RNN described above\(^1\). Recent work [30] provides a secure computation library for precisely computing various math functions by designing approximate functionalities that simultaneously have low error and are also efficient to compute securely in 2PC. However, unfortunately, the protocols used to compute these functionalities require high online communication and rounds (due to the highly sequential nature of the functionalities themselves) and hence, as we note later, the online cost of computing them is still prohibitive. Furthermore, given this, even the functionalities themselves are not the optimal design choice to be computed securely using FSS techniques.

Lack of Support for Mixed Bitwidth Arithmetic.

Second, most existing MPC frameworks support only uniform bitwidth arithmetic – i.e., a single bitwidth and scale are used uniformly for all fixed-point values and all secure computation must execute within these limits. On the other hand, state-of-the-art converters such as Tensorflow Lite [16] and Shiftry [24], that quantize floating point models to corresponding fixed point ones, for efficiency, use low bitwidth representations (e.g. 8 or 16). To obtain precise outputs of operators such as simple matrix multiplication with low bitwidth inputs and outputs, these libraries rely on using higher bitwidths to compute intermediate results such as element-wise multiplications and accumulations in a dot product to avoid overflows. Later, these intermediate results are quantized to the right output format. However, without the support for such precise computation on low bitwidths (which implicitly require mixed bitwidth arithmetic), systems such as [9] were forced to perform secure computation over large bitwidths, resulting in high performance overhead. The work of [30] was the first to provide 2PC protocols for such precise low bitwidth arithmetic, but these protocols have high online interaction and communication.

Erroneous ReLU and Truncations. Finally, the only prior FSS-based secure inference library Arianna [34], uses protocols for ReLU\(^2\) and truncations of fixed-point values (required to maintain scale) that are only probabilistically correct. This has two drawbacks – first, since this error probability is inversely proportional to the bitwidth used, the protocols are forced to use larger bitwidths to ensure that this error probability does not affect the accuracy of the resulting inference algorithm; second, as the number of instances of ReLU/truncation in the algorithm increases, the probability of the overall computation being incorrect increases and hence, as noted in multiple works [9, 19, 25], on large benchmarks, such probabilistic computations would almost certainly provide incorrect outputs.

---

1 Recent work of [37], which focuses on application of FSS to recommendation systems, considered only reciprocal square root. We provide a comparison with this work in Section 1.3.

2 The ReLU activation is defined as \(\text{ReLU}(x) = \max(x, 0)\).
1.2 Our Contributions

In this work, we address the above drawbacks and present LLAMA – an end-to-end semi-honest secure 2PC inference system (in the trusted dealer model) that is based on FSS techniques. LLAMA has significantly lower online complexity compared to previous works on 2-party secure inference [28, 30, 31] and is much more expressive than AriaNN [34], the only prior secure inference system based on FSS. We discuss our technical contributions and features of LLAMA below.

Precise Low Bitwidth Computation. First, LLAMA supports precise secure computation over low bitwidth values. As discussed, this requires computing intermediary results over higher bitwidths and then appropriately reducing these results to lower bitwidths. Supporting this requires providing FSS-based protocols for two crucial functionalities – Zero/Signed-Extension (to increase bitwidth) as well as Truncate-Reduce (to decrease both bitwidth and scale). We design a single-round FSS protocol, also known as an FSS gate, for these operations that act as building blocks in our other protocols (Section 3). Next, we build on these and design protocols for element-wise multiplications and matrix multiplications (and convolutions) that implement the arithmetic logic used in fixed-point implementations of converters [16, 24] (Section 4).

Low Bitwidth Splines and Math Functions. Second, we provide “FSS-friendly” low bitwidth approximations to math functions (such as sigmoid, tanh and reciprocal square root) that are provably precise. We use ULPs as the measure of preciseness of math implementations that is also used by standard math libraries and ensure that our implementations have at most 4 ULPs error (similar to math libraries and 2PC work IRNN [30]). As already mentioned, approximate functionalities provided in IRNN are highly sequential and would lead to large number of rounds in the online phase even when implemented with FSS-based techniques. Hence, we deviate significantly from IRNN in our design choice and instead use low bitwidth piecewise polynomials or splines to approximate our math functions. However, standard tools for finding splines result in floating-point splines. We convert these to fixed-point splines keeping FSS costs in mind. Next, once we have such a spline, as discussed in Section 5, we need to evaluate it efficiently using FSS techniques. Here, we build upon the work of [3] that provided an FSS protocol for uniform bitwidth spline evaluation and extend it to protocols for low bitwidth splines. Further, LLAMA uses two novel optimizations over [3] that significantly reduce keysize as well as PRG invocations during the online phase of that protocol. As an example, for the spline approximating sigmoid, our techniques reduce keysize by 4× and PRG invocations by 40×.

ReLU, Truncation, Pooling. Third, unlike [34], LLAMA uses correct protocols for comparison and ReLU from [3]. We build on these to provide correct protocols for average pool, maxpool, and argmax (Appendix C). Next, unlike [34], in LLAMA all types of truncations (bitwidth preserving or bitwidth reducing) and comparisons are faithful, resulting in correct computation. With this, our FSS implementations are bitwise equivalent to the corresponding cleartext fixed-point implementations. This enables us to execute large benchmarks with no fear of incorrect outputs (even when using very small bitwidths in some cases).

E2E Inference System. Fourth, we integrate LLAMA as a cryptographic backend to the EzPC framework [8]. Together, all of the above, enable us to execute various benchmarks securely – large CNNs (such as ResNet-50 on ImageNet dataset) using the CrypTFlow toolchain [25] as well as RNNs (e.g. [26] on the Google-30 dataset) using the IRNN toolchain [30]. For almost all our benchmarks, we obtain at least an order of magnitude reduction in online communication, rounds and runtimes (Section 6), thus obtaining, a low latency framework for secure inference.

Let us revisit the two examples presented earlier in the introduction. Running LLAMA on the RNN [26] over the Google-30 dataset costs only roughly 8.6 MB of online communication, 2600 online rounds, and 1.9 seconds, resulting in roughly 48×, 22×, and 19× improvement in communication, rounds and performance over IRNN [30]. Similarly, executing the CNN model [27]

---

3 Informally, ULP (unit of least precision) is the number of representable values between our result and output over reals. It is widely accepted as the standard for measuring accuracy in numeric calculations [14]. See Section 5 for more details.

4 We note that although prior works such as SecureML [29] also used a spline to approximate sigmoid, the spline used had only 3 pieces and leads to very high ULP error and hence, a high degradation in classification accuracy as shown in [30].
on the CIFAR-10 dataset costs 8.25 MB of online communication, and 0.5 seconds resulting in approximately $24 \times 5 \times$ improvements in online communication and times over DELPHI [28]. We now proceed to introduce all background technical information in the next section.

1.3 Other Related Works

The work on FSS-based 2PC protocols for fixed-point arithmetic by Boyle et al. [3] provides FSS gates for various building blocks such as ReLU, arithmetic/logical right shift, and splines. However, [3] lacks support for FSS-gates of signed-extension, truncate-reduce and right shift, and splines. However, [3] lacks support for a given bitwidth of $n$.

However, [3] lacks support for FSS-gates of signed-extension, truncate-reduce and right shift, and splines. Hence, does not support mixed-bitwidth operations and precise math functions over small bitwidths.

A recent work by Vadapalli et al. [37] also use spline to compute reciprocal square root and realize it using DPF (Distributed Point Function) [6]. Our work and [37] present a trade-off between online computation and key size. The online compute in LLAMA grows proportional to the number of intervals of the spline. On the other hand, the online compute of [37] grows exponentially with the input bitwidth. (For reciprocal square root with 16-bit inputs, in the online phase, LLAMA makes 1448 AES evaluations, and [37] makes 131072 AES evaluations.) However, the key size in LLAMA is higher than that in [37]. (For reciprocal square root with 16-bit inputs, key size in LLAMA is nearly 5KB, compared to around 0.3KB in [37].) Since the primary benefit of FSS based techniques is to reduce online complexity, LLAMA would perform better than [37].

2 Preliminaries

Notation. Let $\lambda$ denote the computational security parameter. We use uppercase $L, M, N, \ell$ to denote the values $2^L, 2^M$ and $2^\ell$ respectively. $\mathbb{1}\{b\}$ denotes the indicator function that is $1$ when $b$ is true and $0$ otherwise. We use the natural one to one map between $\{0, 1\}^\ell$ and $\mathbb{Z}_L$.

We consider computations over finite bit unsigned and signed integers, denoted by $\mathbb{U}_N$ and $\mathbb{S}_N$, respectively, for a given bitwidth of $n$-bits. We note that $\mathbb{U}_N = \{0, \ldots , N - 1\}$ is isomorphic to $\mathbb{Z}_N$. Signed integers $\mathbb{S}_N$ range from $-N/2$ till $N/2 - 1$ and can be encoded into $\mathbb{Z}_N$ or $\mathbb{U}_N$ using 2’s complement representation. In this encoding, the MSB of the bit-representation of $x \in \mathbb{S}_N$ is 0 if $x \geq 0$ and 1 otherwise. We use the functions $\text{uint}_n : \mathbb{U}_N \rightarrow \mathbb{Z}$ and $\text{sint}_n : \mathbb{S}_N \rightarrow \mathbb{Z}$ to represent the conversion of a number to its unsigned and signed number in $\mathbb{Z}$ respectively. In 2’s complement notation, $\text{sint}_n(x) = \text{uint}_n(x) - \text{MSB}(x) \cdot N$ where $\text{MSB}(x) = \mathbb{1}\{x \geq 2^{n-1}\}$. We drop the subscript whenever the bitwidth can be inferred from the context. We also use the symbol $\ast_\ell : \mathbb{Z} \times \mathbb{Z} \rightarrow \mathbb{Z}_L$ to denote the binary operation $x \ast_\ell y = x \cdot y \mod L$ where $x, y \in \mathbb{Z}$. For $x \in \mathbb{S}_N$, we use the notation of $x[i]$ to represent the $i$-th bit from the LSB in the 2’s complement representation of $x$ such that $\text{MSB}(x) = x[0]$ and $\text{MSB}(x) = x[n - 1]$. We also use $x[i:j], j > i \in \mathbb{Z}_2^{j-i}$ (where $j > i$) to denote the number formed by the bitstring $x[j-1], x[j-2] \ldots x[i]$.

Fixed-point representation. Real numbers are encoded into $\mathbb{Z}_N$ using fixed-point notation. The fixed-point numbers are parameterized by two values, a bitwidth $n$ and a scale $s$. The first $n-s$ bits and last $s$ bits correspond to the integer part and the fractional part respectively. We have $y = \text{fix}_{n,s}(x) = \lfloor x \cdot 2^s \rfloor \mod N$. To convert a fixed-point integer $y$ to its real counterpart $x$, we have $x = \text{urts}_{n,s}(y) = \text{uint}(y)/2^s$ if $y \in \mathbb{U}_N$ and $x = \text{srt}_{n,s}(y) = \text{sint}(y)/2^s$ if $y \in \mathbb{S}_N$.

2.1 Function Secret Sharing (FSS)

An FSS scheme [5, 6] is a pair of algorithms, namely Gen and Eval. Gen splits a secret function $f : \mathbb{G}^\text{in} \rightarrow \mathbb{G}^\text{out}$ into a pair of functions $f_0$ and $f_1$. For the party identifier $\sigma \in \{0, 1\}$, Eval evaluates the function $f_\sigma$ on a given input $x \in \mathbb{G}^\text{in}$. While correctness of an FSS scheme requires that $f_0(x) + f_1(x) = f(x)$ for all $x \in \mathbb{G}^\text{in}$, the security requires that each $f_\sigma$ hides $f$.

Definition 1 (FSS: Syntax [5, 6]). A (2-party) function secret sharing scheme is a pair of algorithms (Gen, Eval) such that:

- Gen($1^\lambda, \hat{f}$) is a PPT key generation algorithm that given $1^\lambda$ and $\hat{f} \in \{0, 1\}^*$ (description of a function $f$) outputs a pair of keys $(k_0, k_1)$. We assume that $\hat{f}$ explicitly contains descriptions of input and output groups $\mathbb{G}^\text{in}, \mathbb{G}^\text{out}$.

- Eval($\sigma, k_\sigma, x$) is a polynomial-time evaluation algorithm that given $\sigma \in \{0, 1\}$ (party index), $k_\sigma$ (key defining $f_\sigma : \mathbb{G}^\text{in} \rightarrow \mathbb{G}^\text{out}$) and $x \in \mathbb{G}^\text{in}$ (input for $f_\sigma$) outputs $y_\sigma \in \mathbb{G}^\text{out}$ (the value of $f_\sigma(x)$).

Definition 2 (FSS: Correctness and Security [5, 6]). Let $\mathcal{F} = \{f\}$ be a function family and Leak be a function specifying the allowable leakage about $f$. When Leak is omitted, it is understood to output only $\mathbb{G}^\text{in}$ and $\mathbb{G}^\text{out}$. We say that (Gen, Eval) as in Definition 1 is an FSS...
scheme for $F$ (with respect to leakage $\text{Leak}$) if it satisfies the following requirements.

- **Correctness:** For all $\hat{f} \in \mathcal{F}$ describing $f : G^\text{in} \rightarrow G^\text{out}$, and every $x \in G^\text{in}$, if $(k_0, k_1) \leftarrow \text{Gen}(1^\lambda, \hat{f})$ then $\Pr[\text{Eval}(0, k_0, x) + \text{Eval}(1, k_1, x) = f(x)] = 1$.

- **Security:** For each $\sigma \in \{0, 1\}$ there is a PPT algorithm $\text{Sim}_\sigma$ (simulator), such that for every sequence $(\hat{f}_x)_{x \in \mathbb{N}}$ of polynomial-size function descriptions from $\mathcal{F}$ and polynomial-size input sequence $x$, for $f_x$, the outputs of the following experiments $\text{Real}$ and $\text{Ideal}$ are computationally indistinguishable:

  - $\text{Real}_x : (k_0, k_1) \leftarrow \text{Gen}(1^\lambda, \hat{f}_x)$; Output $k_\sigma$.
  - $\text{Ideal}_x : \text{Output} \, \text{Sim}_\sigma(1^\lambda, \text{Leak}(\hat{f}_x))$.

### 2.2 2PC with Preprocessing via FSS

**Threat Model.** We consider 2PC with a trusted dealer secure against a static PPT adversary that corrupts one of the parties. That is, the adversary is computationally bounded, corrupts one of the parties at the beginning of the protocol, and follows the protocol specification. The dealer gives out the input independent correlated randomness to both the parties in the offline phase. Given the correlated randomness, the parties engage in a 2PC protocol in the online phase. We consider standard simulation paradigm for semi-honest security.

Boyle et al. [7] construct 2PC protocols (in a trusted dealer model, where the dealer provides correlated randomness to the 2 parties\(^5\)) via FSS. The high level idea is as follows: For each wire $w_i$ in the circuit to be computed, the dealer picks a mask value $r_i$ uniformly at random. Denote the cleartext value at $w_i$ by $x_i$. The protocol maintains the invariant that the 2 parties hold masked values to input wire of a gate, and run FSS evaluation protocol to learn masked value of the output wire with one round of simultaneous message exchange. In more detail, to compute a gate $g_{ij}$ with input and output wires $w_i$ and $w_j$, parties start with $x_i + r_i$ and end up with $x_j + r_j$ after one round of interaction. Moreover, the size of the message exchanged is the same as the bitwidth of $x_j$. To enable this, the dealer gives out FSS keys for the offset function $g_{ij}^{[r_i, r_{\text{out}}]}(x) = g_{ij}(x - r_i) + r_j$ in the pre-processing phase. In the online phase, the parties compute their share of the function on the masked input $x_i + r_i$ to obtain the secret shares of the value $x_j + r_j$, which they reconstruct to obtain the masked output value. For the input wires, dealer simply sends the masks of the wire to its respective owner, who on receiving the mask, adds it to the input and sends it to the other party. For the output wire, the dealer sends the mask to both of the parties. For more details on the construction of 2PC using FSS, we refer the reader to [7]. Now we formally define FSS gates:

**Definition 3 (FSS Gates [3]).** Let $\mathcal{G} = \{g : G^\text{in} \rightarrow G^\text{out}\}$ be a computation gate (parameterized by input and output groups $G^\text{in}, G^\text{out}$). The family of offset functions $\hat{\mathcal{G}}$ of $\mathcal{G}$ is given by

$$\hat{\mathcal{G}} := \left\{ g^{[r_i, r_{\text{out}}]} : G^\text{in} \rightarrow G^\text{out} \mid g : G^\text{in} \rightarrow G^\text{out} \in \mathcal{G}, \ r_i \in G^\text{in}, \ r_{\text{out}} \in G^\text{out} \right\}$$

where $g^{[r_i, r_{\text{out}}]}(x) := g(x - r_i) + r_{\text{out}}$, and $g^{[r_i, r_{\text{out}}]}$ contains an explicit description of $r_i, r_{\text{out}}$. Finally, we use the term FSS gate for $\mathcal{G}$ to denote an FSS scheme for the corresponding offset family $\hat{\mathcal{G}}$.

### 2.3 Prior FSS Schemes and FSS Gates

In this section, we discuss the FSS schemes and FSS gates constructed in prior works [3, 6] that serve as building blocks for us. A distributed comparison function is an FSS scheme for special intervals as defined below.

**Definition 4 (DCF [3, 6]).** A special interval function $f_{\alpha, \beta}^{\leq}$, also referred to as a comparison function, outputs $\beta$ if $x < \alpha$ and 0 otherwise. We refer to an FSS schemes for comparison functions as distributed comparison function (DCF). Analogously, function $f_{\alpha, \beta}^{<}$ outputs $\beta$ if $x \leq \alpha$ and 0 otherwise. In all of these cases, we allow the default leakage $\text{Leak}(\hat{f}) = (G^\text{in}, G^\text{out})$.

For $\alpha \in \{0, 1\}^n$ and $\beta \in \mathbb{G}$, we use $\text{Gen}^\leq_n(1^\lambda, \alpha, \beta, \mathbb{G})$ and $\text{Eval}^\leq_n(b, k_0, x)$ to denote the keygen and evaluation algorithms for DCF.

**Theorem 1 (Concrete cost of DCF [3]).** Let $\lambda$ be the security parameter, $\mathbb{G}$ be an Abelian group, and $\ell = \lceil \log |\mathbb{G}| \rceil$. Given a PRG $G : \{0, 1\}^\lambda \rightarrow \{0, 1\}^{4\lambda + 2}$, there exists a DCF for $f_{\alpha, \beta}^{\leq} : \{0, 1\}^n \rightarrow \mathbb{G}$ with key size $n(\lambda + \ell + 2) + \lambda + \ell$ bits. For $\ell' = \left\lceil \frac{\ell}{4\lambda + 2} \right\rceil$, the key generation algorithm $\text{Gen}^\leq_n$ invokes $G$ at most $2n(1 + 2\ell') + 2\ell'$ times and the evaluation algorithm $\text{Eval}^\leq_n$ invokes $G$ at most $n(1 + \ell') + \ell'$ times. In the special case that $|\mathbb{G}| = 2^c$ for $c \leq \lambda$ the number of PRG invocations in $\text{Gen}^\leq_n$ is $2n$ and the number of PRG invocations in $\text{Eval}^\leq_n$ is $n$.

\(^5\) [3] discuss how the pre-processing FSS keys can be generated using standard 2PC, but in this work we focus on 2PC with a trusted dealer setting.
Dual Distributed Comparison Function (DDCF) is a generalization of DCF also introduced in [3]. It is defined as a class of comparison functions $f_{\alpha, \beta_1, \beta_2}^{DCF}: \{0, 1\}^n \rightarrow \mathbb{G}$ which returns $\beta_1$ if the input value is less than $\alpha$ and $\beta_2$ otherwise. DDCF can be described in terms of DCF by $f_{\alpha, \beta_1, \beta_2}^{DCF}(x) = \beta_2 + f_{\alpha, \beta_1, 1-\beta_2}^{DCF}(x)$. We use $\text{Gen}_{n}^{DCF}(1^\lambda, \alpha, \beta_1, \beta_2, \mathbb{G})$ and $\text{Eval}_{n}^{DCF}(b, k, x)$ to denote FSS algorithms for DDCF.

We abuse notation and use $\text{DCF}_{n, \mathbb{G}}$ and $\text{DDCF}_{n, \mathbb{G}}$ to denote cost (either keysize or evaluation cost) of FSS schemes for DCF and DDCF respectively.

FSS Gates. [3, 7] construct FSS gates for uniform bitwidth matrix multiplication, interval containment, signed comparison and splines. We summarize their costs in Table 1.

3 Bitwidth-Changing Gates

Mixed-bitwidth arithmetic (multiplication, spline evaluations and so on) fundamentally relies on two building blocks – Extension, that increases the bitwidth of an input from $m$ to $n$; and Truncate-Reduce, that truncates and reduces the bitwidth of the input from $n$ to $n-s$. In this section, we present FSS gates for Extension (in Section 3.1) and Truncate-Reduce (in Section 3.2).

3.1 Extension

Zero and Signed-Extension functions are used to extend the bitwidths of unsigned and signed numbers, respectively. More precisely, for an $m$-bit number $x \in \mathbb{U}_M$ (resp. $x \in \mathbb{S}_M$), Zero-Extension (resp. Signed-Extension) to $n$-bits ($n > m$) is defined by $y = \text{ZExt}(x, m, n) \in \mathbb{U}_N$ (resp. $y = \text{SExt}(x, m, n) \in \mathbb{S}_N$), such that $\text{uint}_n(y) = \text{uint}_m(x)$ (resp. $\text{int}_n(y) = \text{int}_m(x)$) holds. In the discussion that follows, we only consider the case of Signed-Extension and note that the protocol for Zero-Extension can be derived similarly.

The Signed-Extension gate $G_{\text{SExt}}$ is the family of functions $g_{\text{SExt}, m, n} : \mathbb{S}_M \rightarrow \mathbb{S}_N$ parameterized by input group $\mathbb{G}^n = \mathbb{S}_M$ and output group $\mathbb{G}^n = \mathbb{S}_N$, and defined as $g_{\text{SExt}, m, n}(z) := \text{sint}_m(z) \mod N$. For this gate (and further gates described in the coming sections), we use the following equation for $\text{sint}_m$ from [30]:

$$\text{sint}_m(z) = z' - 2^{m-1}, \text{ for } z' = z + 2^{m-1} \mod M \quad (1)$$

We denote the corresponding offset gate class by $G_{\text{SExt}}^{\text{offset}}$ and the offset functions by:

$$g_{\text{SExt}, m, n}^{[\alpha, \beta_1, \beta_2]}(x) = \beta_2 + g_{\text{SExt}, m, n}(x - \beta_1)$$

$$= (x' - \beta_1) \mod M - 2^{m-1} + \beta_2$$

where $x' = x + 2^{m-1} \mod M$. Observe that for $a, b \in \mathbb{U}_M$, the following equations hold for arithmetic in $\mathbb{Z}$:

$$(a - b) \mod M = a - b + M \cdot 1\{a < b\} \quad (2)$$

So, on using the above equation, we get:

$$g_{\text{SExt}, m, n}^{[\alpha, \beta_1, \beta_2]}(x) = x' - \beta_1 + M \cdot 1\{x' < \beta_1\} - 2^{m-1} + \beta_2$$

$$= (a - b) \mod M = a - b + M \cdot 1\{a < b\} \quad (3)$$

We present our construction of the FSS gate for Signed-Extension in Figure 1 and provide the summary of cost along with proof of security in Theorem 2. The proofs for subsequent FSS protocols in our paper can be derived in a similar manner, and we omit them.

![Fig. 1. FSS Gate for Signed-Extension $G_{\text{SExt}}$. $b$ refers to party id.](image)

**Theorem 2.** There is an FSS gate $(G_{\text{SExt}}, \text{Eval}_{\text{SExt}})$ for $G_{\text{SExt}}$ with keysize of $n$ bits plus the keysize of DCF$_{m, \mathbb{G}_N}$ and 1 invocation of DCF$_{m, \mathbb{G}_N}$ in Eval$_{\text{SExt}}$.

**Proof.** For $b \in \{0, 1\}$, the simulator $\text{Sim}_b$ for signed-extension gate is given $x$ and $u_b$ (the input and output of the ideal functionality). It first invokes $\text{Sim}_b$, the simulator for DCF over $m$ bits, which outputs a DCF key $k'_b$. It then computes $x'$ and $t$ in the same manner as Steps 2 and 3 in Eval$_{\text{SExt}}$ in Figure 1. Finally, it computes $r_b = u_b - b \cdot x' - t$ and outputs $k'_b||r_b$. One can...
easily see that the output of $\text{Sim}_b$ is computationally indistinguishable from $k_0$, the output of $\text{Gen}_{S\text{Ex}}^n$. □

### 3.2 Truncate-Reduce

The functionality of Truncate-Reduce (TR) for an $n$-bit number $x \in \mathbb{S}_N$ by $s$-bits is defined as dropping the last $s$ bits and returning the result as a $(n-s)$-bit number $y \in \mathbb{S}_{2^n-s}$.

The $\mathcal{G}_{\text{TR}}$ gate is the family of functions $\mathcal{G}_{\text{TR},n,s} : \mathbb{S}_N \rightarrow \mathbb{S}_{2^n-s}$ parameterized by input group $\mathbb{G}^n = \mathbb{S}_N$ and output group $\mathbb{G}^\text{out} = \mathbb{S}_{2^n-s}$, and defined as $\mathcal{G}_{\text{TR},n,s}(x) := x_{[s,n]}$. One straightforward way to realize the FSS gate for Truncate-Reduce is to use an FSS gate for arithmetic right shift operation that produces the result in $n$ bits (see Table 1) followed by a local modulo operation to get rid of higher order $s$ bits. In particular, $\mathcal{G}_{\text{TR},n,s}(x) = x \gg_A s \mod 2^{n-s}$, where $\gg_A$ represents the Arithmetic Right Shift (ARS) of $x$ by $s$ bits. This construction would have a total key size of $n$-bits plus the key size of $\text{DCF}_{s,\mathbb{S}_{2^n-s}}$ and $\text{DDCF}_{n-1,\mathbb{S}_N \times \mathbb{S}_N}$. In the following text, we provide a new construction that only uses a single DCF key, $\text{DCF}_{s,\mathbb{S}_{2^n-s}}$.

We denote the corresponding offset gate class by $\hat{\mathcal{G}}_{\text{TR}}$ and the offset functions by:

$$\hat{\mathcal{G}}_{\text{TR},n,s}^{[p,n,\text{out}]}(x) = (x - r_{[s,n]}^n) + r_{\text{out}}^n \mod 2^{n-s}$$

$$= (x + y)_{[s,n]} + r_{\text{out}}^n \mod 2^{n-s}$$

where $y = 2^n - r_{[s,n]}^n$. Using the relation from [30], we can re-write Truncate-Reduce as follows.

$$\hat{\mathcal{G}}_{\text{TR},n,s}^{[p,n,\text{out}]}(x) = x_{[s,n]} + y_{[s,n]} + 1\{x_{[0,s]} + y_{[0,s]} > 2^{s-1}\} + r_{\text{out}}^n \mod 2^{n-s}$$

Based on this, we present our construction of the FSS gate for Truncate-Reduce in Figure 2 and summarize its cost below:

**Theorem 3.** There is an FSS gate $(\text{Gen}_{\text{TR}}^n, \text{Eval}_{\text{TR}}^n)$ for $\hat{\mathcal{G}}_{\text{TR}}$ with key size of $(n - s)$ bits plus the key size of $\text{DCF}_{s,\mathbb{S}_{2^n-s}}$ and 1 invocation of $\text{DCF}_{s,\mathbb{S}_{2^n-s}}$ in $\text{Eval}_{n,s}$.

### 4 Linear Layers

Machine learning models that use low bitwidth fixed-point numbers to represent the model parameters, i.e., weights, as well as activations, inherently rely on very
Truncate-Reduce Gate (GenTR, EvalTR)

\footnotesize
\begin{align*}
\text{GenTR}_{m,s}(1^n, r^{in}_b, \alpha^{out}): & \\
1: & \text{Let } y = (2^n - r^m) \in \mathbb{S}_N \text{ and } \alpha^{out} = y|_{0,s} \\
2: & (k_0^{(s)}, k_1^{(s)}) \leftarrow \text{Gen}^\alpha_{m,s}(1^n, \alpha^{out}, 1, \mathbb{S}_{2^m-1}) \\
3: & \text{Sample random } r_0, r_1 \leftarrow \mathbb{S}_{2^n-s} \text{ s.t. } \\
& r_0 + r_1 = \alpha^{out} + y|_{s,n} \mod 2^n - s \\
4: & \text{return } (k_0, k_1), \text{ where } k_b = k_b^{(s)} \| r_b \text{ for } b \in \{0,1\}.
\end{align*}

\textbf{EvalTR}_{m,s}(b, k_b, x):

\begin{align*}
1: & \text{Parse } k_b \text{ as } k_b^{(s)} \| r_b. \\
2: & \text{For } x^{(s)} = 2^x - x|_{(0,s)} - 1, \text{ } t_b \leftarrow \text{Eval}_s^\alpha(b, k_b^{(s)}, x^{(s)}) \\
3: & \text{return } b \cdot x|_{(n,s)} + r_b + t_b.
\end{align*}

\textbf{Fig. 2.} FSS Gate for Truncate-Reduce \(G_{\text{TR}}\), \(b\) is party id.

precise computation of intermediate operations [16, 17, 24]. For linear layers, this corresponds to values being multiplied and accumulated over high bitwidths, before being truncated to required output bitwidth. Below, we discuss our FSS protocols for such linear layers, starting with the basic operation of element-wise multiplication followed by matrix multiplications and convolutions. Since all our benchmarks use signed arithmetic, we focus on signed multiplications below. Unsigned operations are analogous and disussed in Appendix A.

4.1 Signed Multiplication

We define the signed multiplication gate \(G_{\text{mult}}\) as the family of functions \(g_{\text{mult},m,n} : \mathbb{S}_M \times \mathbb{S}_N \to \mathbb{S}_L\) with \(L = M \cdot N\) parameterized by input group \(G_{\text{in}} = \mathbb{S}_M \times \mathbb{S}_N\) and output group \(G_{\text{out}} = \mathbb{S}_L\), and given by \(g_{\text{mult},m,n}(a,b) := \sin_m(a) \ast \sin_n(b)\). Intuitively, this says that finite-bit signed values \(a\) and \(b\) are lifted to \(\mathbb{Z}\) and multiplied (note that taking a modulo with \(L\) for \(\ell = m+n\) does not loose any bits).

For \(a' = a + 2^{m-1} \mod M\) and \(b' = b + 2^{n-1} \mod N\), using Equation (1) we get:

\[
g_{\text{mult},m,n}(a,b) = (a' - 2^{m-1}) \cdot (b' - 2^{n-1}) \mod L = a' \cdot b' - a' \cdot 2^{n-1} - b' \cdot 2^{m-1} + 2^{m+n-2} \mod L
\]

We denote the corresponding offset gate class by \(G_{\text{mult}}^{\text{out}}\) and the offset functions by:

\[
g_{\text{mult},m,n}^{\text{out}}(x,y) = g_{\text{mult},m,n}(x - r^{in}_1, y - r^{in}_2) + r^{out} = (x' - 2^{m-1}) \mod M \cdot (y' - 2^{n-1}) \mod N - 2^{m-1} + r^{out} \mod L
\]

where \(x' = x + 2^{m-1} \mod M\) and \(y' = y + 2^{n-1} \mod N\). We used Equation (1) in the above expression. Now, using Equation (2) we have:

\[
g_{\text{mult},m,n}^{\text{out}}(x,y) = \left(x' - r^{in}_1 - 2^{m-1} + 2^{m} \cdot 1\{x' < r^{in}_1\}\right) \cdot (y' - r^{in}_2 - 2^{n-1} + 2^{n} \cdot 1\{y' < r^{in}_2\}) + r^{out} \mod L
\]

We observe that the above relation requires two comparisons, one for \(x'\) with \(r^{in}_1\) and second for \(y'\) with \(r^{in}_2\). However, if we implement above relation naively with FSS it will require a 2-round protocol. The first round will compute shares of comparison outputs and the second round will do the multiplications (with shares of \(r^{in}_1\) and \(r^{in}_2\), resp.). Below, we provide a 1-round protocol for this gate using the observation that the values that need to be multiplied with \(1\{x' < r^{in}_1\}\) (resp., \(1\{y' < r^{in}_2\}\)) are either known to the servers or the dealer. With this observation, we can increase the DCF payload to 2 group elements each and compute the whole expression in a single round. First, we re-arrange the above expression to separate out the terms known to the servers and the dealer.

Based on above re-arrangement, we present our construction of the FSS gate for Signed Multiplication in Figure 3 and summarize its cost below:

\textbf{Theorem 4.} There is an FSS gate \((g_{\text{mult}}^{\text{out}}, \text{Eval}_{m,n}^{\text{mult}})\) for \(G_{\text{mult}}\) which has a total keysize \((3 \cdot m+n)\) bits plus the key size of \(\text{DCF}_{m,2^n}\) and \(\text{DCF}_{n,2^n}\) and requires 1 invocation each of \(\text{DCF}_{m,2^n}\) and \(\text{DCF}_{n,2^n}\) in \(\text{Eval}_{m,n}^{\text{mult}}\).

Remark: We use the above protocol for signed multiplication to realize element-wise multiplications occurring in Hadamard Product layers in our benchmarks.

4.2 Matrix Multiplication and Convolution

Consider signed matrix multiplication of \(A \in \mathbb{S}^{d_1 \times d_2}\) and \(B \in \mathbb{S}^{2^{d_2} \times d_3}\). In the resulting matrix \(C = A \cdot B\) each element is a result of \(d_2\) multiplications and \(d_2 - 1\) additions. Even if we store the result of multiplication in a larger ring \(\mathbb{S}_{2^{m+n}}\), similar to signed multiplication, and do \(d_1\) additions over these, the result can still overflow due to additions. To avoid such an overflow, underlying
libraries assume that computation is happening over a sufficiently large domain. Note that \(\ell = m + n + \lfloor \log d_2 \rfloor\) suffices to avoid any overflows during whole dot product computation. Even though the compute looks similar to what we discussed in last section, matrix multiplication over mixed bitwidths cannot be computed in a single round, due to the need for a ring larger than \(\mathbb{Z}_{2^{m+n}}\). In particular, the element-wise multiplications of \(x\) and \(y\) performed during dot-product would also have a term \(2^m \{x' < \ell_1\} \cdot 2^n \{y' < \ell_2\}\) that does not become 0 when taking a modulo over \(L > 2^{m+n}\). Hence, we need to perform an explicit multiplication between 2 comparison outputs that are secret shared between the servers, and this leads to an additional round of interaction.

Since we are going to take 2 rounds, we use the approach of \textit{extend-then-multiply}. For this, there are two ways. The first one follows the approach of \cite{30} that extends the entries of one of the matrix by minimal amount, i.e., \(\lfloor \log d_2 \rfloor\) then performs mixed bitwidth multiplication as discussed in previous section. However, unlike \cite{30} this approach is quite expensive for FSS, as the payload of DCF key for comparison used for a value, say \(a\) in matrix \(A\), grows with number of elements in \(B\) it is multiplied with, i.e., \(d_3\). Similarly, for elements in \(B\), the payload of DCF key grows with \(d_1\). So, say, we extend the entries of \(B\) from \(n\) to \(n' = n + \lfloor \log d_2 \rfloor\) and then carry out a protocol similar to that of signed multiplication. Then, key would have size, roughly, \(d_2d_2(n' + \text{DCF}_{n, U_{\ell'}})\) for extensions plus \(d_1d_2\text{DCF}_{m, U_{\ell}}\) for comparisons on entries in \(A\) plus \(d_2d_3\text{DCF}_{n', U_{\ell}}\) for comparisons on entries in sign-extended \(B\) plus \(\ell(d_1d_2 + d_2d_3 + d_3d_1)\) per party.

Now we describe the second approach that is much more efficient for FSS. In the first round, we sign extend entries of both \(A\) and \(B\) to \(\ell = m + n + \lfloor \log d_2 \rfloor\)-bits using Signed-Extension gate from Section 3.1. In the second round, we do uniform bitwidth matrix multiplication as in \cite{7} (see Table 1). While the round complexity of this FSS protocol is still 2, the total keysize is only \(d_1d_2(\ell + \text{DCF}_{m, U_{\ell}}) + d_2d_3(\ell + \text{DCF}_{n, U_{\ell}})\) for sign extensions plus \(\ell(d_1d_2 + d_2d_3 + d_3d_1)\) for matrix multiplication.

Convolutions. These can be directly reduced to matrix multiplications, but a trivial translation expands the input matrices and amounts to larger than necessary key size. To reduce keysize, following the ideas from \cite{25}, we sign extend and secret-share the input masks (as we do in uniform bitwidth matrix multiplication) for the input matrices instead of expanding and then calling signed extension and mask sharing.

Fixed-Point Mixed-Bitwidth Matrix Multiplication. When operating over fixed-point arithmetic, the input matrices apart from the specified bitwidths also have respective scales, say, \(s_m\) and \(s_n\). After following the above procedure of extend-then-multiply, the result is in bitwidth \(\ell = m + n + \lfloor \log d_2 \rfloor\) and scale \(s = s_m + s_n\). The fixed-point model would also specify the required bitwidth and scale for the output, say, \(n_\ell\) and \(s_\ell\). Now we adjust to this using appropriate truncation operations as follows: We can safely assume that \(s_\ell \leq s\) and \(n_\ell \leq \ell\). In the third round, we reduce the scale of output by \(s_\ell = s - s_\ell\) and adjust the bitwidth to \(n_\ell\). We have the following two cases:

\(- \quad n_\ell \leq (\ell - tr): In this case, parties first compute a Truncate-Reduce gate to truncate and reduce by \(tr\) bits to obtain the result in \(\ell - tr\) and scale \(s_\ell\). Next, parties locally compute a modulo \(2^{n_\ell}\) to obtain output with correct bitwidth and scale.

\(- \quad n_\ell > (\ell - tr): Here, parties compute an arithmetic right shift by \(tr\) bits to obtain the result in \(\ell\) bits and scale \(s_\ell\). This is followed by a local modulo operation by \(2^{n_\ell}\) to obtain the output with correct bitwidth and scale.

\begin{landscape}
\begin{figure}[h]
\centering
\begin{tabular}{|c|c|}
\hline
Signed Multiplication Gate \(\text{Gen}_{\text{smult} \ n, n}^\text{smult} \), \(\text{Eval}_{\text{smult} \ n, n}^\text{smult}\) & \\
\hline
\text{Gen}_{\text{smult} \ n, n}^\text{smult}(1^\lambda, r_1^\ell, r_2^\ell, \text{rout}): & \\
\hline
1: Sample random \(r_{10}, r_{11} \leftarrow S_{L}\) s.t. & \\
\hspace{1cm} \(r_{10} + r_{11} = \text{sin}_m(-r_{in}) \mod L.\) & \\
2: Sample random \(r_{20}, r_{21} \leftarrow S_{L}\) s.t. & \\
\hspace{1cm} \(r_{20} + r_{21} = \text{sin}_m(-r_{in}) \mod L.\) & \\
3: Sample random \(r_0, r_1 \leftarrow S_L\) s.t. & \\
\hspace{1cm} \(r_0 + r_1 = \text{sin}_m(r_{in}) \cdot \text{sin}_n(r_{in}) + \text{rout} \mod L.\) & \\
4: Let \(\beta_1 = (1, -r_{in}) \in S_M\) and \(\beta_2 = (1, -r_{in}) \in S_M.\) & \\
5: \((k_{10}, k_{11}) \leftarrow \text{Gen}_{\text{smult} \ n, n}^\text{smult}(1^\lambda, r_{in}, \beta_1, S_M^\lambda).\) & \\
6: \((k_{20}, k_{21}) \leftarrow \text{Gen}_{\text{smult} \ n, n}^\text{smult}(1^\lambda, r_{in}, \beta_2, S_M^\lambda).\) & \\
7: For \(b \in \{0, 1\}, \) let \(k_b = k_{1b} | k_{2b} | r_{1b} | r_{2b} | r_b.\) & \\
8: \text{return} \((k_0, k_1).\) & \\
\hline
\end{tabular}
\caption{FSS Gate for Signed Multiplication \(\mathcal{G}_{\text{smult}}, b\) is party id.}
\end{figure}
\end{landscape}
Unidirectional Bitwidth Linear Layers. For our benchmarks that have linear layers working over uniform bitwidth and scale, say \( n \) and \( s \), (signed) arithmetic, we use the known matrix multiplication FSS gate from [7] to obtain output in \( n \)-bits and scale \( 2s \) followed by truncation by \( s \) using arithmetic right shift gate from [3] to adjust the output scale to \( s \) (see Table 1 to obtain the costs).

5 Math Functions

In this section, we first discuss our novel FSS-based protocols for precise math functions for the same input and output domains considered in [30]. To quantify how precise our math function implementations are, we use the standard notion of ULP error (defined formally in [14]) that we discuss below. Then, we provide a high-level overview for the design of our math functions, followed by mixed-bitwidth splines that are crucial to obtain low-bitwidth splines that are good approximations to math functions. Finally, we provide details on FSS-friendly math function design for popular activations – sigmoid and tanh – used in neural networks.

ULP Error. It is impossible to represent an irrational value exactly using finite number of bits. Therefore, it is important to quantify the deviations between exact real result and the output of a math library in finite-bit representation. There are various notions of errors one can use – absolute, relative and ULP error. Standard math libraries use ULP error as a metric to determine whether the real output of a math function is approximately close to the finite-bit output that the library produces [1, 38]. The lower the ULP value, the higher is the precision and accuracy of the implementation of that math function. At a high level, ULP error between the exact real result \( r_1 \) and library output \( r_2 \) is equal to the number of representable values between \( r_1 \) and \( r_2 \) [14]. We use the same notion to quantify the precision of our math functions that use fixed-point as the finite-bit representation.

Design of Math Functions. Although our techniques are general, for a high level discussion, let us assume that we want to approximate sigmoid within 4 ULPs of error over fixed-point inputs and outputs with bitwidth 16 and scale 12. There are multiple design choices possible in coming up with such an implementation. For instance, SIRNN [30] used the recipe of first obtaining a good initial approximation followed by Goldsmith's iterations where the number of iterations depend on final output scale precision desired. However, this approach leads to large number of online rounds and communication due to the iterative nature of the algorithm. Our first design choice is to use piecewise polynomials or splines to approximate math functions as these allow for one-round low communication protocols using FSS techniques [3, 7]. However, we notice that for obvious reasons, uniform bitwidth splines cannot be used to obtain low ULP errors. In particular, for the above mentioned case, we cannot find a reasonable spline that uses coefficients on 16-bits, does all arithmetic over 16-bits, and provides at most 4 ULP error for scales 12. Similar to [30], the polynomial computation in splines needs to happen with intermediate results in higher bitwidths and final result needs to be truncated (to reduce bitwidth and adjust scale). Here, intuitively, while evaluating polynomials one keeps accumulating bitwidth and only reduces the final result.

We discuss details of our mixed-bitwidth splines followed by our protocols for precise math functions.

5.1 Mixed-Bitwidth Splines

We first discuss the cleartext functionality for the mixed-bitwidth splines followed by their FSS implementation. For ease of exposition, we discuss the splines over integers (no scale) and later show how to handle fixed-point arithmetic that has an associated scale with each value.

Suppose the spline under consideration is composed of \( m \) polynomials \( f_1, f_2, \ldots, f_m \) (with degree \( d \) and coefficients of bitwidth \( n_c \)), and a set of \( m+1 \) knots \( P \). Let the input to the spline be \( x \in S_{N_Z} \), where the bitwidth of \( x \) is \( n_T \) and \( N_T = 2^{n_T} \). Let \( n \) be a sufficiently large bitwidth that prevents overflow of values during polynomial evaluation. Note that \( n = n_c + d \cdot n_T \) suffices for this purpose. The functionality \( S_{\text{spline},(n_z,n_c),m,d,P,(f_i),} : S_{N_Z} \to S_N \) for mixed-bitwidth splines is defined as follows. First, sign extend the input \( x \) from \( n_T \) bits to \( n \), resulting in \( \text{sint}(x) \mod N \). Then, sign extend the coefficients of all polynomials from \( n_c \) bits to \( n \), and sign extend all knots from \( n_T \) bits to \( n \). Finally output the result of uniform bitwidth spline functionality on input \( \text{sint}(x) \) using the new (sign extended) coefficients and knots. Note that this evaluation procedure (mod \( N \)) is the same as doing all computation over \( \mathbb{Z} \).

Now we describe a simple (yet non-optimized) 2-round FSS-based protocol for this spline evaluation. For the first round, parties call the FSS gate for Signed-Extension \( S_{\text{Ext},n_z,n_c} \) with input \( x \in S_{N_Z} \) masked by \( r^n \), and reconstruct \( \bar{x} = \text{sint}(x - r^n) + r_{\text{temp}} \mod N \in S_N \).
Here, \(r_{\text{temp}} \in S_N\) is chosen randomly by the dealer during key generation. For the second round, let \(\bar{f}_i\) be the polynomial corresponding to \(f_i\) with (publicly known) coefficients sign extended to \(n\)-bits from \(n_c\) bits. Also, let \(P\) be the sign extended (publicly known) knots to \(n\)-bits. Next, parties call the FSS gate for uniform-bitwidth splines \(g_{\text{spline}, n, m, d, P, \{\bar{f}_i\}}\) from [3]. The input to this gate is \(\bar{x} \in S_N\) from the first round, masked by \(r_{\text{temp}}\). The output would be the spline evaluated on \((\bar{x} - r_{\text{temp}})\) masked by \(r_{\text{out}} \in S_N\).

### 5.1.1 Optimizations

We propose two optimizations to the protocol described above that significantly reduce the FSS key size and offline and online computational cost.

**Optimization 1.** Here, we reduce the FSS key size for the spline gate used in the second round. In the above construction, the DCF key used in the spline operates over inputs from \(S_N\) (obtained after sign extending the original input) and uses the sign extended knots during the key generation by dealer and evaluation by servers. Our observation is as follows: Even though the parties need to learn the shares of coefficients of the correct polynomial to be evaluated in the larger domain, i.e., \(S_N\), the DCF input (and hence, its depth, etc) and knots themselves can come from the original domain \(S_{N_T}\) where \(N_T = 2^{n_T}\). In particular, we replace \(\text{DCF}_{n_i, n_T, m, d, P, \{\bar{f}_i\}}\) in the above unoptimized scheme by \(\text{DCF}_{n_i, S_{N_T}, m, d, P, \{\bar{f}_i\}}\) and evaluate it on the original input \(x\) instead of \(\bar{x}\). This reduces the key size and the number of PRG calls made in key generation and evaluation of the uniform bitwidth spline gate (used in the second round of our protocol) by a factor of \(n/n_T\). For instance, for the case of sigmoid, this reduction factor is \(4\times\).

**Optimization 2.** This optimization significantly reduces the number of PRG calls made by the servers during the online phase. Recall that the FSS gate for \(g_{\text{spline}, n, m, d, P, \{\bar{f}_i\}}\) used in the second round uses a DCF with a payload of \(m(d+1)\) bits. This DCF key gets evaluated \(m\) times and the output of each invocation is \(m(d+1)\) bits. Let the output of the \(i^{th}\) invocation be \(s_1^{(i)}, \ldots, s_m^{(i)}\) (using the same notation as Figure 5 in [3]). We observe that only \(s_{i-1}^{(i)}, s_i^{(i)}\), i.e., \(2(d+1)m\) bits, are used during evaluation and other values are discarded.

The \(\text{Gen}_{s_i}\) algorithm of the DCF construction of [3] generates a key \(k_b\) for each party \(b \in \{0, 1\}\) such that \(k_b\) consists of a random seed \(s_b\) and \(n+1\) correction words \(CW_i \in G\) for \(i \in \{1\ldots n+1\}\) (where \(G\) denotes the output group). The seed generates a binary tree with \(2^n\) leaves and each node is associated with a tuple \((s_b, t_b, V_b)\) with an invariant that the sum of \(V_b\) along the evaluation path for an input \(x\), form secret shares of \(f_{\text{spline}}^x\). Hence, in \(\text{Eval}_{n_i}^x\), it suffices to perform this addition along the respective path to get the desired output. In the case of splines, where \(G\) is a vector of group elements, these additions are performed element-wise. Since we need only \(2\) out of the \(m\) elements in the output of \(\text{Eval}_{n_i}^x\), we can tweak \(\text{Eval}_{n_i}^x\) to only do the required additions and PRG calls. This change reduces the number of PRG calls in the spline evaluation by roughly a factor of \(m/2\). For the above example of sigmoid, this factor is roughly \(10\times\).

**Overall Performance Improvement.** The two optimizations discussed above are compatible with each other and together lead to roughly a factor \(n/n_T\) reduction in FSS key size, \(n/n_T\) reduction in PRG calls by the dealer in key generation and \(n/n_T \cdot m/2\) reduction in PRG calls by the servers in the online phase. For the case of sigmoid, this amounts to \(4\times\) reduction in key size, \(4\times\) fewer PRG calls by the dealer and \(40\times\) fewer PRG calls by the servers.

We now discuss how we can easily extend our protocol to work over fixed-point arithmetic as required.

### 5.1.2 Fixed-Point Arithmetic

In the context of fixed-point arithmetic, let the scale of the input \(x\) be \(s_T\), and that of the spline coefficients be \(s_c\). In symbolic notation, let \(f_i(x) = \sum_{j=0}^d a_{i,j} \cdot x^j\). The first round of our FSS protocol remains the same. Note that during polynomial evaluation in the spline, we require all the summands to have the same scale. This requires a small change in the dealer as follows: The dealer sign extends the coefficients from \(n_c\) to \(n\) bits and also left shifts \(a_{i,j}\) by \((d-j)\cdot s_T\) bits. So, during evaluation the scale of each summand of the polynomial is the same, viz. \(s = s_c + d \cdot s_T\). Now, after running the above described protocol, we have the output with bitwidth \(n\) and scale \(s\).

Next, suppose the desired output is required to have bitwidth \(n_Q\) and scale \(s_Q\). We do this adjustment in the final round as follows: We can safely assume that \(s_Q \leq s\) as to obtain precise output with scale \(s_Q\), scales of the coefficients will have to be appropriately large as well. Now, in the third round, we reduce the scale of output by \(tr = s - s_Q\) and adjust the bitwidth to \(n_Q\) by using appropriate truncation operations as discussed in “fixed-point mixed-bitwidth matrix multiplication”
Theorem 5. Let \( \text{params} = (n_I, s_I, n_O, s_O, n_c, s_c) \), \( n = n_c + d \cdot n_I \), \( \text{tr} = s_c + d \cdot s_I - s_O \). There is a 3-round FSS protocol \( G(\text{params}, m, d, P) \) for mixed-bitwidth splines over fixed-point that has a total key size of \( 2mn(d+1) + n \) bits, plus the key size of DCF \( n_c + d \cdot s_I + (d+1) \) + \( \text{tr} \), plus the key sizes of FSS gates for \( g_{\text{Ext}}, n_c, n \) and \( g_{\text{TR}}, n, \text{tr} \). Let \( \ell = \lceil \frac{2n(d+1)}{s_c \cdot \lambda} \rceil \), where \( \lambda \) is the security parameter. The online phase makes single evaluations of Sign-Extension and Truncate-Reduce gates and at most \( m(n_O(1+\ell) + \ell) \) calls to PRG \( G \) (used in DCF) during spline evaluation (in the second round).

5.2 Math Functions

In this section, we discuss our approach for computing math functions using FSS techniques – in particular, sigmoid and tanh. We use mixed-bitwidth splines over fixed-point as approximations to math functions that can be realized directly using the 3-round protocol described in the previous section. Below, we discuss how we obtain the required splines for each of the math functions. We use sigmoid to illustrate this.

Sigmoid. Over the reals, \( \text{sigmoid}(x) = 1/(1 + e^{-x}) \) and tends to 0 for small values of \( x \) and tends to 1 for large values of \( x \). Our task is as follows: Given the bitwidths and scales for the inputs and outputs, find a spline that approximates the real result with at most 4ULPs of error (see beginning of this section for the definition of ULP error). The first step is to clip the input domain to an interesting interval as follows: We find the largest \( x_L \) and the smallest \( x_R \) such that if we set \( \text{sigmoid}(x) = 0 \) (with appropriate fixed-point representation of outputs) for all \( x \leq x_L \) and set \( \text{sigmoid}(x) = 1 \) for all \( x \geq x_R \), the resulting ULP error \( \leq 4 \). In the second step, we start with a choice of degree of the polynomials, \( d \), and number of knots, \( m \), and run an off-the-shelf tool Octave [12] to find a best fit spline for sigmoid for the reduced domain. Note that this step, returns a floating-point spline, i.e., both polynomial coefficients as well as knots are floating-point values.

In the third step, we quantize this spline, i.e., represent it over fixed-point as follows: We quantize the knots to have the same bitwidth and scale as our inputs. We linearly search over bitwidths and scales for the coefficients. For a choice of \( n_c, s_c \), we exhaustively run the mixed-bitwidth spline cleartext algorithm for all inputs and check for their ULP error w.r.t. the output of a high-precision math library [13]. We crucially note that since sigmoid (and also tanh) are well-behaved functions with bounded outputs, and the output scale \( s_O \leq 14 \), this exhaustive testing is feasible. If the maximum ULP error \( \leq 4 \) we stop. Otherwise, we increase the value of either \( n_c \) or \( s_c \) until \( n_c = 32 \). If we do not find a good approximation, we increase the number of knots, \( m \), until 100; even if this is unsuccessful, we increment the degree \( d \) and go back to the second step of spline finding.

Following the above procedure, we successfully find splines with \( d = 2, m \leq 52 \) for the sigmoid function for input and output scales such that \( 0 \leq s_I, s_O \leq 14 \) and this suffices for our benchmarks as well as benchmarks considered in prior works. As is expected from a 2D graph of sigmoid, we were unable to find linear splines for sigmoid (and also tanh) with even 100 knots.

Tanh. Over the reals, \( \text{tanh}(x) = (e^x - e^{-x})/(e^x + e^{-x}) \) and tends to \(-1\) for small values of \( x \) and \( 1 \) for large values. Our procedure for tanh is identical to sigmoid except for a straightforward change to clipping in terms of outputs on inputs with large magnitude.

Reciprocal Square Root. Over reals, \( \text{rsqrt}(x) = 1/\sqrt{x}, x > 0 \). To avoid division-by-zero error when \( x \) is very small, we assume that all inputs to \( \text{rsqrt} \) satisfy \( x \geq \epsilon \), where \( \epsilon \) is a small public constant. As in the case of SIRNN [30], we set \( \epsilon = 0.1 \). The procedure to find splines is similar to sigmoid and tanh with one difference. We observe that since the precision of input \( x \) is \( s_I \), it suffices to compute the output with precision \( s_T/2 \). Hence, the ULP error of spline obtained is computed over bitwidth \( n_O \) and scale \( s_T/2 \), instead of \( s_O \). Later, we adjust the scale of spline output to \( s_O \) by left-shifting the output by \((s_O - s_T/2) \) bits.

Sample Choice of Parameters. Table 2 lists the choice of our spline parameters that give at most 4 ULP error for various configurations of math functions required by our benchmarks in Section 6.2. In Appendix
LLAMA: A Low Latency Math Library | 286

Table 2. Spline parameters (degree $d$, number of intervals $m$, coefficient bitwidth $n_c$, coefficient scale $s_c$) with at most $4$ ULPs error, for varying input bitwidth $n_I$, output bitwidth $n_O = n_I$, input scale $s_I$, output scale $s_O$.

<table>
<thead>
<tr>
<th>Function</th>
<th>$n_I = n_O$</th>
<th>$s_I$</th>
<th>$s_O$</th>
<th>$d$</th>
<th>$m$</th>
</tr>
</thead>
<tbody>
<tr>
<td>Sigmoid</td>
<td>16</td>
<td>8</td>
<td>14</td>
<td>2</td>
<td>34</td>
</tr>
<tr>
<td></td>
<td>16</td>
<td>9</td>
<td>14</td>
<td>2</td>
<td>34</td>
</tr>
<tr>
<td></td>
<td>16</td>
<td>11</td>
<td>14</td>
<td>2</td>
<td>34</td>
</tr>
<tr>
<td></td>
<td>16</td>
<td>13</td>
<td>14</td>
<td>2</td>
<td>29</td>
</tr>
<tr>
<td></td>
<td>16</td>
<td>12</td>
<td>12</td>
<td>2</td>
<td>19</td>
</tr>
<tr>
<td></td>
<td>37</td>
<td>12</td>
<td>12</td>
<td>2</td>
<td>20</td>
</tr>
<tr>
<td>Tanh</td>
<td>16</td>
<td>8</td>
<td>8</td>
<td>2</td>
<td>10</td>
</tr>
<tr>
<td></td>
<td>16</td>
<td>9</td>
<td>9</td>
<td>2</td>
<td>12</td>
</tr>
<tr>
<td></td>
<td>16</td>
<td>11</td>
<td>11</td>
<td>2</td>
<td>20</td>
</tr>
<tr>
<td></td>
<td>16</td>
<td>12</td>
<td>12</td>
<td>2</td>
<td>26</td>
</tr>
<tr>
<td></td>
<td>16</td>
<td>13</td>
<td>13</td>
<td>2</td>
<td>12</td>
</tr>
<tr>
<td></td>
<td>37</td>
<td>12</td>
<td>12</td>
<td>2</td>
<td>26</td>
</tr>
<tr>
<td>Reciprocal square root</td>
<td>16</td>
<td>10</td>
<td>9</td>
<td>2</td>
<td>10</td>
</tr>
<tr>
<td></td>
<td>16</td>
<td>12</td>
<td>11</td>
<td>2</td>
<td>10</td>
</tr>
</tbody>
</table>

In this section, we perform a novel evaluation of LLAMA and compare its performance with relevant prior works. In Section 6.1, we provide microbenchmarks that compare our protocols for mixed-bitwidth arithmetic and math functions with SIRNN [30] and MP-SPDZ [23], which are the prior state-of-the-art systems for precise implementations of these functions. SIRNN is a 2PC system in the semi-honest setting and MP-SPDZ is run in the semi-honest 2PC setting with trusted dealer. (SIRNN is optimized for end-to-end latency, while MP-SPDZ, like LLAMA, considers an offline-online split.) For these microbenchmarks, LLAMA reduces the online communication by two orders of magnitude and latency by $1.9 - 10\times$.

Next, in Section 6.2, we use LLAMA to run end-to-end secure inference on various neural network benchmarks and compare its performance with appropriate prior works that considered same benchmarks. We evaluate on the benchmarks considered in SIRNN that use math functions and/or mixed-bitwidth computations. We observe that online latency and communication using LLAMA is up to $57\times$ and $12000\times$ lower than SIRNN. To demonstrate scalability of LLAMA, we evaluate it on large convolutional networks such as ResNet-50 for ImageNet and contrast its performance with recent systems such as CrypTFlow [25] for 3PC and CrypTFlow2 [31] for 2PC. We also compare with DELPHI [28], a 2PC work that explicitly considers the question of online latency of 2-party secure inference. Finally, we compare LLAMA with AriaNN [34], an FSS-based secure inference framework (with erroneous ReLU and truncations and no support for math functions or mixed-bitwidth operations), on a benchmark considered in that work.

**Implementation Details.** LLAMA is implemented as a C++ library with ~7000 lines of code. The code is publicly available at https://github.com/mpc-msri/EzPC/tree/master/FSS. In addition to the FSS protocols for mixed-bitwidth and math functions, we also implement the relevant FSS schemes and gates proposed in [3], as their work does not have an implementation. All APIs for key generation and evaluation are parameterized by input and output bitwidths and scales, to easily support both uniform and mixed-bitwidth operations. As suggested in [39, Section 6], we use the Matyas-Meyer-Oseas one-way compression function (which uses AES in fixed-key mode) to generate pseudorandomness. This is done to avoid multiple expensive AES initialization operations in the DCF computation, which is the main cost of FSS protocols.

We have integrated LLAMA as a cryptographic backend to EzPC [8]. This allows us to compile fixed-point inference code written in EzPC, into FSS-friendly C++ code. Various frontends such as CrypTFlow [25] and SeeDot [17] can be easily used to obtain fixed-point EzPC code (with carefully chosen bitwidths and scales that preserve accuracy) for arbitrary machine learning network architectures.

**Experimental Setup.** We run our benchmarks on 3 virtual machines (one dealer and 2 servers), each with a 4-core 3.7 GHz Xeon processor and 16 GBs of RAM. In the LAN setting, all VMs are connected in a network with average bandwidth of 340 MBps and RTT of 0.96 ms, while in the WAN setting the corresponding numbers are 120 MBps and 72.3 ms respectively. The mean and standard deviation of both offline and online execution times are calculated over 100 runs.

### 6.1 Microbenchmarks

In this section we microbenchmark LLAMA on individual functions used in mixed-bitwidth arithmetic as well as math functions. For precise math functions, we compare with SIRNN [30] and MP-SPDZ [23]. Although SIRNN is a standard 2PC system and LLAMA is a 2PC system in the dealer model, SIRNN is the only work that...
considers secure computation of mixed-bitwidth operations, and hence we compare with it for these blocks. Microbenchmarks for bitwidth changing functions, i.e., signed-extension and truncate-reduce, and math functions, i.e., sigmoid, tanh and reciprocal square root are provided in Table 3, while those for mixed-bitwidth matrix multiplication are presented in Table 4. For Table 3, the choice of parameters for bitwidths and scales is shift amount. For Sigmoid, Tanh and Reciprocal square root, \( n_x = 16, s_x = 9, s_O = 14 \) and we evaluate for batch sizes of 100 and 1000. For the math functions, for these choice of bitwidths and scales, Table 2 provides details on the spline chosen by LLAMA, i.e., degree and number of knots, as well as coefficient bitwidths and scales. From Table 3, LLAMA is up to 40\( \times \) better than MP-SPDZ in online communication, up to 51\( \times \) better in terms of online rounds, and up to 12\( \times \) better in online runtime. As seen in Table 3, LLAMA communicates between 105 – 251\( \times \) less than SIRNN in the online phase and has between 13 – 61\( \times \) fewer rounds of online communication. In terms of performance, LLAMA is between 3.7 – 43\( \times \) faster than SIRNN in the LAN setting. Finally, as expected, while the total communication of LLAMA (i.e. communication including the offline key size as well) can be comparable to SIRNN in a few cases (e.g. Truncate-reduce), LLAMA does have a larger total communication (by up to 4.4\( \times \)) in other cases.

Table 4 summarizes our microbenchmarks for mixed-bitwidth matrix multiplication (multiplying a \( d_1 \times d_2 \) matrix with a \( d_2 \times d_3 \) matrix). The input/output bitwidths for all experiments are 8, while the scale is 6; however, due to \( d_2 \) being different in each case, the intermediate bitwidth \( (16 + [\log d_2]) \) in each computation is different. As can be seen from the table, LLAMA com-

<table>
<thead>
<tr>
<th>Layer</th>
<th>Batch Size</th>
<th>Technique</th>
<th>Communication (in KB)</th>
<th>Online Rounds</th>
<th>LAN (in milliseconds)</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td>Offline</td>
<td>Online</td>
<td>Offline</td>
</tr>
<tr>
<td>Signed-Extension ((m = 8, n = 21))</td>
<td>100</td>
<td>LLAMA</td>
<td>35</td>
<td>0.8</td>
<td>1</td>
</tr>
<tr>
<td></td>
<td>1000</td>
<td>LLAMA</td>
<td>352</td>
<td>7.8</td>
<td>1</td>
</tr>
<tr>
<td>Truncate-Reduce ((n = 21, s = 13))</td>
<td>100</td>
<td>LLAMA</td>
<td>47</td>
<td>0.2</td>
<td>1</td>
</tr>
<tr>
<td></td>
<td>1000</td>
<td>LLAMA</td>
<td>466</td>
<td>2</td>
<td>1</td>
</tr>
<tr>
<td>Sigmoid ((n_x = n_O = 16, s_x = 9, s_O = 14))</td>
<td>100</td>
<td>LLAMA</td>
<td>3297</td>
<td>3.5</td>
<td>3</td>
</tr>
<tr>
<td></td>
<td>1000</td>
<td>LLAMA</td>
<td>33044</td>
<td>35</td>
<td>3</td>
</tr>
<tr>
<td>Tanh ((n_x = n_O = 16, s_x = s_O = 9))</td>
<td>100</td>
<td>LLAMA</td>
<td>1320</td>
<td>3.5</td>
<td>3</td>
</tr>
<tr>
<td></td>
<td>1000</td>
<td>LLAMA</td>
<td>13219</td>
<td>35</td>
<td>3</td>
</tr>
<tr>
<td>Reciprocal square root ((n_x = n_O = 16, s_x = 12, s_O = 11))</td>
<td>100</td>
<td>LLAMA</td>
<td>1138</td>
<td>3.5</td>
<td>3</td>
</tr>
<tr>
<td></td>
<td>1000</td>
<td>LLAMA</td>
<td>11375</td>
<td>35</td>
<td>3</td>
</tr>
</tbody>
</table>

Table 3. Performance comparison for bitwidth changing and math functions. For Signed-Extension, \( m, n \) are input, output bitwidths. For Truncate-Reduce, \( n \) is input bitwidth and \( s \) is shift amount. For Sigmoid, Tanh and Reciprocal square root, \( n_x, n_O, s_x, s_O \) denote input/output bitwidths and scales. ** denotes that the value was not reported by the code.
municates between $59 - 208 \times$ less than SIRNN in the online phase and has $10 - 13 \times$ fewer rounds. LLAMA also performs between 2.2 - 7.3\times better in the LAN setting. Further, in these microbenchmarks, LLAMA also has $1.3 - 5.4 \times$ lower total communication.

### 6.2 Benchmarks

In this section, we evaluate and compare the performance of LLAMA on several machine learning inference algorithms. We provide details on the benchmarks considered in Appendix D and summarize the findings in Table 5. We split the discussion below into two kinds of benchmarks. Our main focus is the networks with math functions or networks that use low bitwidths for activations and weights for efficiency. We also consider simple convolutional neural networks (CNNs) from prior works to demonstrate our generality and scalability.

**Neural Networks with Math Functions/Mixed-Bitwidth Arithmetic.** First, to illustrate the performance of LLAMA on algorithms that use mixed-bitwidth arithmetic and/or math functions (tanh, sigmoid, or reciprocal square root), we run it on the following end-to-end inference benchmarks: DeepSecure B4 [32], that enables embedded sensors to classify various physical activities, as well as on an RNN algorithm [26] that enables keyword spotting on the Google-30 dataset [40]. We compare the performance of LLAMA with SIRNN and observe that LLAMA has up to 4 orders of magnitude lower online communication, up to 22\times fewer online rounds, and up to 57\times faster runtime. We also evaluate LLAMA on the sigmoid/tanh layers of the MiniONN LSTM [27] (a language model for word predictions) and the Industrial-72 benchmarks [24, 30] (a model that provides feedback for quality of shots in a sports game), as well as the reciprocal square root layers from the Heads model [35] (a model for counting the number of people in an image). Here, we show that, in comparison with SIRNN, the online communication of LLAMA is at least 200\times less, the number of rounds is at least 43\times better and the performance is at least 15\times and 43\times better in the LAN and WAN settings.

**Other Neural Networks.** While not the primary focus of this work, for the sake of completeness, we also compare LLAMA with prior systems on neural networks not requiring mixed-bitwidth arithmetic or math functions. We compare with DELPHI [28] – a 2PC system designed specifically with online cost in mind – and show $\approx 24\times$ better communication and $\approx 3 - 5 \times$ better runtime for online phase. To illustrate that LLAMA can scale to large benchmarks, we run it on the ResNet-50 CNN on the ImageNet dataset [18], and compare with both CrypTFow (a 3PC system) as well as CrypTFow2 (a 2PC system)$^6$. Finally, we consider AriaNN [34], which like LLAMA is an FSS based secure inference system (in the trusted dealer model), but does not support mixed-bitwidth arithmetic or math functions. Here alone, since AriaNN code [33] does not support execution on different VMs, we ran all parties in both AriaNN and LLAMA on the same VM and appropriately set the latency and bandwidth on the VM using the `tc` command$^7$. On the ResNet-18 benchmark on Hymenoptera dataset, we show that LLAMA outperforms AriaNN by about 3x in online communication and 1.7x in online runtime (despite AriaNN using prob-

---

$^6$ In very recent work, Cheetah [19] show an improvement of $\approx 12\times$ in communication and 4-5\times in runtime over CrypTFow2. We do not directly compare with this work, as it is orthogonal to the focus of this work; however, even in comparison to Cheetah, we note that LLAMA has much lower communication and is expected to outperform it.

$^7$ The end-to-end code execution time in AriaNN took around 40 minutes. In Table 5, we report the offline and online times (around 350 seconds and 13 seconds respectively) that is output by their code. Due to longer execution times, the mean and standard deviation of runtimes are calculated over 25 iterations.

### Table 5. Comparison for mixed-bitwidth matrix multiplication for dimensions $d_1 \times d_2$ and $d_2 \times d_3$, using bitwidths of 8 and scale 6. 

<table>
<thead>
<tr>
<th>Technique</th>
<th>Communication (in MB)</th>
<th>Online Rounds</th>
<th>LAN (in milliseconds)</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Offline</td>
<td>Online</td>
<td>Online</td>
</tr>
<tr>
<td>LLAMA</td>
<td>10</td>
<td>200</td>
<td>1000</td>
</tr>
<tr>
<td>SIRNN</td>
<td>-</td>
<td>97.51</td>
<td>31</td>
</tr>
<tr>
<td>LLAMA</td>
<td>10</td>
<td>2000</td>
<td>100</td>
</tr>
<tr>
<td>SIRNN</td>
<td>-</td>
<td>108.36</td>
<td>39</td>
</tr>
<tr>
<td>LLAMA</td>
<td>200</td>
<td>200</td>
<td>200</td>
</tr>
<tr>
<td>SIRNN</td>
<td>-</td>
<td>206.87</td>
<td>41</td>
</tr>
</tbody>
</table>

---
<table>
<thead>
<tr>
<th>Network</th>
<th>Technique</th>
<th>Communication (in MB)</th>
<th>Online Rounds</th>
<th>LAN Time (in seconds)</th>
<th>WAN Time (in seconds)</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Offline</td>
<td>Online</td>
<td>Rounds</td>
<td>Offline</td>
<td>Online</td>
</tr>
<tr>
<td>DeepSecure B4</td>
<td>LLAMA</td>
<td>183</td>
<td>0.15</td>
<td>21</td>
<td>0.78 ± 0.20</td>
</tr>
<tr>
<td></td>
<td>SIRNN</td>
<td>-</td>
<td>1844</td>
<td>379</td>
<td>-</td>
</tr>
<tr>
<td>Google-30</td>
<td>LLAMA</td>
<td>882</td>
<td>8.6</td>
<td>2687</td>
<td>5.31 ± 0.94</td>
</tr>
<tr>
<td></td>
<td>SIRNN</td>
<td>-</td>
<td>415</td>
<td>59899</td>
<td>-</td>
</tr>
<tr>
<td>MiniONN LSTM</td>
<td>(only Sigmoid, Tanh)</td>
<td>LLAMA</td>
<td>49.7</td>
<td>0.04</td>
<td>6</td>
</tr>
<tr>
<td></td>
<td>SIRNN</td>
<td>-</td>
<td>9.7</td>
<td>403</td>
<td>-</td>
</tr>
<tr>
<td>Industrial-72 (only Sigmoid, Tanh)</td>
<td>LLAMA</td>
<td>19.4</td>
<td>0.03</td>
<td>42</td>
<td>0.09 ± 0.03</td>
</tr>
<tr>
<td></td>
<td>SIRNN</td>
<td>-</td>
<td>7.9</td>
<td>1847</td>
<td>-</td>
</tr>
<tr>
<td>Heads (only Reciprocal square root)</td>
<td>LLAMA</td>
<td>30</td>
<td>0.09</td>
<td>9</td>
<td>0.11 ± 0.03</td>
</tr>
<tr>
<td></td>
<td>SIRNN</td>
<td>-</td>
<td>18</td>
<td>545</td>
<td>-</td>
</tr>
<tr>
<td>MiniONN CNN</td>
<td>LLAMA</td>
<td>1084</td>
<td>8.2</td>
<td>25</td>
<td>4.61 ± 1.12</td>
</tr>
<tr>
<td></td>
<td>DELPHI</td>
<td>3258</td>
<td>196</td>
<td>**</td>
<td>36.73 ± 0.70</td>
</tr>
<tr>
<td></td>
<td>CrpyTFlow2</td>
<td>-</td>
<td>340</td>
<td>345</td>
<td>-</td>
</tr>
<tr>
<td>ResNet-50 (ImageNet)</td>
<td>LLAMA</td>
<td>78848</td>
<td>745</td>
<td>280</td>
<td>427.94 ± 45.32</td>
</tr>
<tr>
<td></td>
<td>CrpyTFlow2</td>
<td>-</td>
<td>31502</td>
<td>4053</td>
<td>-</td>
</tr>
<tr>
<td></td>
<td>CrpyTFlow</td>
<td>-</td>
<td>6549</td>
<td>&gt;7400</td>
<td>-</td>
</tr>
<tr>
<td>ResNet-18 (Hymenoptera)</td>
<td>LLAMA (local truncation)</td>
<td></td>
<td>5243</td>
<td>45</td>
<td>48</td>
</tr>
<tr>
<td></td>
<td>AriaNN</td>
<td>6702</td>
<td>148</td>
<td>**</td>
<td>339.02 ± 18.51</td>
</tr>
</tbody>
</table>

Table 5. Secure inference benchmarks using LLAMA and prior works. ** denotes that the value was not reported by the code.

7 Conclusion

This paper proposes LLAMA, an FSS-based 2PC secure inference system in the semi-honest, trusted dealer setting. The main design goal of LLAMA is to minimize online complexity. LLAMA proposes novel FSS-based constructions for signed-extension and truncate-reduce, which facilitate mixed-bitwidth operations and precise math functions based on low bitwidth splines. Due to the emphasis on lower online complexity, the offline phase in LLAMA can incur significant memory and bandwidth requirement (primarily due to large DCF keysize). It would be interesting to optimize memory usage in the offline phase in LLAMA.

8 Acknowledgement

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors. We thank Rahul Sharma for his help in spline error calculation and Deevashwer Rathee, Mayank Rathee and Rahul Kranti Kiran Goli for their help in understanding existing code bases.

References

[1] Intel SVML. https://software.intel.com/content/www/
A Unsigned multiplication

Unsigned Multiplication of two values \( x \in \mathbb{U}_M, y \in \mathbb{U}_N \) refers to the multiplication of the two values \( \text{uint}_m(x) \) and \( \text{uint}_s(y) \) carried out in the group \( \mathbb{U}_L \), where \( L = M \cdot N \), which is equivalent to \( \text{uint}_m(x) \cdot \text{uint}_s(y) \).

The unsigned multiplication gate \( G_{\text{umult}} \) is the family of functions \( g_{\text{umult},m,n} : \mathbb{U}_M \times \mathbb{U}_N \rightarrow \mathbb{U}_L \) parameterized by input group \( G^m = \mathbb{U}_M \times \mathbb{U}_N \) and output group \( G^L = \mathbb{U}_L \), and given by \( g_{\text{umult},m,n}(x, y) := \text{uint}_m(x) \cdot \text{uint}_s(y) \).

We denote the corresponding offset gate class by \( \tilde{G}_{\text{umult}} \) and the offset functions by

\[
\tilde{g}_{\text{umult},m,n}(x, y) = g_{\text{umult},m,n}(x - r_1, y - r_2) + r_{\text{out}} = ((x - r_1) \mod M) \cdot ((y - r_2) \mod N) + r_{\text{out}} \mod L
\]

On further expanding we get:

\[
\tilde{g}_{\text{umult},m,n}(x, y) = 2^n \cdot 1 \{ y < r_2 \} \cdot x + 2^n \cdot 1 \{ x < r_1 \} \cdot y + 2^n \cdot 1 \{ x < r_1 \} \cdot (-r_1) + 2^n \cdot 1 \{ y < r_2 \} \cdot (-r_2) + (-r_1) \cdot y + (-r_2) \cdot x + r_1 \cdot r_2 + r_{\text{out}} + x \cdot y \mod L
\]

We omit the last term (i.e. \( 2^n \{ x < r_1 \} \cdot 2^n \{ y < r_2 \} \)) as it gets cancelled out by the modulus. Based on above rearrangement, we present our construction of the FSS gate for unsigned multiplication in Figure 4.

B Example of a Spline

In Table 6, we describe the mixed-bitwidth spline for tanh for the following parameters: \( n_x = n_O = 16, s_T = s_O = 8, n_c = 32, s_c = 18, d = 2, m = 10 \) (see Table 2, Section 5.2). The coefficients \( a_2, a_1, a_0 \) (corresponding to decreasing powers of \( x \)) for each spline polynomial are fixed-point numbers with bitwidth 32 and scale 18. The endpoints of spline intervals are 16-bit fixed point numbers with scale 8.

C Simple Activations

C.1 Average Pool

The average pool functionality requires signed division of a ring element by a public positive integer. Hence, we start by defining the division functionality, show how to realize it using FSS techniques, and then present our protocol for average pool.

We follow the notation and definitions from [31]. Let \( \text{idiv} : \mathbb{Z} \times \mathbb{Z} \rightarrow \mathbb{Z} \) denote signed integer division, that is,

<table>
<thead>
<tr>
<th>Spline Endpoints</th>
<th>Spline coefficients ( a_2x^2 + a_1x + a_0 )</th>
</tr>
</thead>
<tbody>
<tr>
<td>Left</td>
<td>Right</td>
</tr>
<tr>
<td>0</td>
<td>198</td>
</tr>
<tr>
<td>199</td>
<td>398</td>
</tr>
<tr>
<td>399</td>
<td>598</td>
</tr>
<tr>
<td>599</td>
<td>798</td>
</tr>
<tr>
<td>799</td>
<td>32766</td>
</tr>
<tr>
<td>32767</td>
<td>-800</td>
</tr>
<tr>
<td>-799</td>
<td>-601</td>
</tr>
<tr>
<td>-600</td>
<td>-401</td>
</tr>
<tr>
<td>-400</td>
<td>-201</td>
</tr>
<tr>
<td>-200</td>
<td>-1</td>
</tr>
</tbody>
</table>

Table 6. Spline intervals and coefficients for tanh with \( n_x = n_O = 16, s_T = s_O = 8, n_c = 32, s_c = 18, d = 2, m = 10 \).
Let $\text{rdiv}(a,d) = b$ where $a = b \cdot d + r$, $r \in \mathbb{Z}$, $0 \leq r < d$. Let $\text{rdiv} : S_N \times Z \to S_N$ denote (signed) division of a ring element by a positive integer, defined as

$$\text{rdiv}(a,d) := \text{idiv}(\text{rint}(a),d) \mod N$$

where $0 < d < N$. Let $(a_0, a_1) \in S_N$ denote additive secret shares of $a$ over $S_N$, i.e., $(a_0)$ and $(a_1)$ are random elements of $S_N$ subject only to the constraint that $(a_0 + a_1) \mod N = a$. The following theorem [31] allows expressing $\text{rdiv}(a,d)$ in terms of $(a_0)$ and $(a_1)$.

**Theorem 6** (Division of ring element [31]). Let shares of $a \in S_N$ be $(a_0), (a_1) \in S_N$, for some $N = N_1 \cdot d + N_0 \in Z$, where $N_1, N_0, d \in Z$ and $0 \leq N_0 < d < N$.

Let the unsigned representation of $a, \langle a_0 \rangle, \langle a_1 \rangle$ in $S_N$ lifted to integers be $a_0, a_0, a_1 \in \{0, 1, \ldots, N - 1\}$, respectively, such that $a_0 = a_1 + d \cdot a_0$ and $a_1 = a_1 + d \cdot a_0^2$, where $a_0, a_0, a_1, a_1, a_1, a_1 \in Z$ and $0 < a_0, a_0, a_1, a_1 < d$. Let $N' = [N/2]$. Let $\text{corr}, A, B, C \in Z$ be defined as

$$\text{corr} = \begin{cases} -1 & (a_0 \geq N') \land (a_0 < N') \land (a_1 < N') \\ 1 & (a_0 < N') \land (a_0 \geq N') \land (a_1 > N') \\ 0 & \text{otherwise} \end{cases}$$

$$A = a_0 - a_0^2 - \{1 \{a_0 \geq N'\} + 1\{a_1 \geq N'\} - \text{corr} \} \cdot N_0,$$

$$B = \text{idiv}(a_0 - 1\{a_0 \geq N'\} \cdot N_0, d) + \text{idiv}(a_0 - 1\{a_1 \geq N'\} \cdot N_0, d),$$

$$C = \{A < d\} + \{A < d\} + \{A < -d\}.$$ 

Then, $\text{rdiv}(a,d) = \text{idiv}(a_0, d) + \text{idiv}(a_1, d) + (\text{corr} \cdot N_1 + 1 - C - B) \mod N$.

In the FSS setting, the dealer holds $r_{\text{in}}, r_{\text{out}} \in S_N$, while the two parties $P_0$ and $P_1$ hold $x \in S_N$, with the goal of computing $\text{idiv}(x - r_{\text{in}}^d, d) + r_{\text{out}}$. We will set $(a_0) = x$ and $(a_1) = -r_{\text{in}}$ in Theorem 6 (i.e. $a = x - r_{\text{in}}$ mod $N$).

We will first compute $A$ in the above theorem. Let $w = 1\{a_0 + a_1 \geq N\}$, then, as per [31], $\text{corr} = 1\{a_0 \geq N'\} + 1\{a_1 \geq N'\} - (w - 1\{a_0 \geq N'\})$. Now, using DCF$_{n, S_N}$, the parties can compute shares of $w = 1\{a_0 + a_1 \geq N\} = 1\{N - 1 - a_0 < a_1\}$. Similarly, shares of $1\{a \geq N'\}$ can be computed using the FSS gate for $g_{\text{DCF}}$, which has a total key size of $n(d - 1)$ bits plus $(d - 1)$ times the key size of FSS gate for $g_{\text{DCF}}$, which requires (d - 1) invocations of $\text{Eval}_n^{\text{DCF}}$.

**Theorem 7.** There is a 2-round FSS protocol $(\text{Gen}_{n,d}^{\text{div}}, \text{Eval}_{n,d}^{\text{div}})$ for $g_{\text{div}}$ which has a total key size of $6n$ bits, plus the key size of DCF$_{n, S_N}$, plus the key sizes of FSS gates for $g_{\text{corr}, n, [N/2], N - 1}$ and $g_{\text{CMP}, n}$. This protocol requires 1 invocation of DCF$_{n, S_N}$, 1 invocation of Eval$_{n, [N/2], N - 1}$ and 3 invocations of Eval$_n^{\text{CMP}}$.

Remark. In the special case when $d$ is a power of 2, we have $\text{idiv}(a,d) = (a \gg \log d)$, and it is more efficient to use the (single round) arithmetic right-shift (ARS) gate from [3] to perform signed division.

**Average Pool.** The family of functions $g_{\text{avgpool}}$ to compute the average of $d$ elements is defined as $g_{\text{avgpool}}(x_1, x_2, \ldots, x_d) = \langle \sum_{i=1}^d x_i \rangle / d = \text{rdiv}(\sum_{i=1}^d x_i, d) \in S_N$, where $x_1, x_2, \ldots, x_d \in S_N$. It is straightforward to derive a 2-round FSS protocol for $g_{\text{avgpool}}$ from the protocol for signed division.

**Theorem 8.** There is a 2-round FSS protocol $(\text{Gen}_{n,d}^{\text{avgpool}}, \text{Eval}_{n,d}^{\text{avgpool}})$ for $g_{\text{avgpool}}$ which has the same key size and evaluation cost as $(\text{Gen}_{n,d}^{\text{div}}, \text{Eval}_{n,d}^{\text{div}})$.

**C.2 ReLU, Maxpool and Argmax.** We use the FSS gate for $g_{\text{ReLU, n}}$ from the work of [3] for ReLU. With this gate, one can construct an FSS gate to compute the maximum of two elements by defining the function in terms of ReLU – i.e., $g_{\text{max, n}}(x_1, x_2) = \text{ReLU}(x_1 - x_2) + x_2$. We then build upon this to construct an FSS protocol for Maxpool (i.e. the function that computes the maximum out of $d$ elements) by computing the maximum of 2 elements at a time in a tree-like manner, resulting in $(d - 1)$ comparisons done over $\lceil \log d \rceil$ rounds. Finally, Argmax (that computes the index with the maximum value out of $d$ elements) is computed in a similar manner to Maxpool, in $2\lceil \log d \rceil$ rounds.

**Theorem 9.** There is a $\lceil \log d \rceil$-round FSS protocol $(\text{Gen}_{n,d}^{\text{maxpool}}, \text{Eval}_{n,d}^{\text{maxpool}})$ for maxpool on $d$ elements, which has a total key size of $n(d - 1)$ bits plus $(d - 1)$ times the key size of $\text{FSS}$ gate for $g_{\text{ReLU, n}}$, and requires $(d - 1)$ invocations of $\text{Eval}_n^{\text{ReLU}}$.

**Theorem 10.** There is a $2\lceil \log d \rceil$-round FSS protocol $(\text{Gen}_{n,d}^{\text{argmax}}, \text{Eval}_{n,d}^{\text{argmax}})$ for $g_{\text{argmax}}$ which has a total key size of $n(d - 1)$ bits, plus the key size of FSS protocol for $g_{\text{maxpool, n, d}}$, plus $(d - 1)$ times the key sizes of FSS protocols for $g_{\text{CMP, n}}$ and $g_{x, n}$. The protocol requires $(d - 1)$ invocations of $\text{Eval}_n^{\text{ReLU}}, \text{Eval}_n^{\text{CMP}}$ and $\text{Eval}_n^{\text{x argmax}}$ each.
D Description of Benchmarks

DeepSecure B4. This benchmark from [32] contains 3 fully connected layers, 2 tanh layers with 2000 and 500 instances each and argmax. The uniform bitwidth is set to 16. The input/output scales for tanh are set to 12.

Google-30. This benchmark is an RNN taken from [26] with 99 sigmoid and 99 tanh layers with 100 instances over the Google-30 [40] dataset. The input and output scales for sigmoid are 9 and 14 respectively. The input and output scales for tanh are 9. These two layers operate on bitwidth 16. The network also contains other layers like hadamard product and agrmax.

MiniONN LSTM. This benchmark contains the sigmoid and tanh layers of LSTM from [27] (see Figure 14). It contains a sigmoid layer with 600 instances and a tanh layer with 400 instances. The bitwidth used is 37 and input/output scales are 12 for both layers.

Industrial-72. Since the benchmark is not public, we evaluate the math functions alone for this benchmark as described in [30]. As stated, it contains 7 layers of sigmoid and tanh each with 64 instances in each layer. The bitwidth is used 16 and input/output scales for sigmoid are 8 and 14 respectively. The input/output scales for tanh is set to 8.

Heads. Similar to above, description for this benchmark is not available publicly. This is the only benchmark which contains L2 normalization layers that use reciprocal square root computation. We use this benchmark to evaluate our protocol for this function. The benchmark contains 3 layers of reciprocal square root each with 1200, 1200 and 300 instances. The first and third layers have input/output scales of 12 and 11. The second layer has input/output scales of 10 and 9. The input/output bitwidths are 16 for all layers.

MiniONN CNN. This is a 7 layer CNN benchmark from [27] (see Figure 13) over CIFAR-10 dataset. This CNN was used as one of the benchmarks in [28]. It contains convolutions, ReLU, and Maxpool layers. The fixed bitwidth and scale of 41 and 15 is used.

ResNet-50. This is a 50 layer CNN from [18] over ImageNet [11] dataset. The code is generated from the publicly available ONNX files. Bitwidth of 37 is used, as done in [31] with a scale of 12.

ResNet-18. This is a 18 layer CNN from [18] over the Hymenoptera dataset. The code is generated from publicly available ONNX files. Bitwidth of 32 is used, as done in [34] with a scale of 10.

E Mixed-Bitwidth splines

Figure 6 describes the 3-round FSS protocol for fixed-point mixed-bitwidth spline from Section 5.1 with the text in magenta denoting modifications over the FSS gate for uniform bitwidth splines [3, Figure 6].
Fixed-point mixed-bitwidth spline protocol (Gen\textsuperscript{(mixed-fixed)}-spline)(\(n_x,s_x,n_y,s_y,n_c,s_c\),m,d,\(p_i\),i,eval\textsuperscript{(mixed-fixed)}-spline)(\(n_x,s_x,n_y,s_y,n_c,s_c\),m,d,\(p_i\),i) 

\begin{algorithmic}[1]
  \STATE Set \(N_T = 2^{n_x}\), \(N_O = 2^{n_y}\), \(n = n_c + d \cdot n_x\), \(\text{tr} = s_c + d \cdot s_T - s_O\).
  \STATE Sample random \(r_{\text{temp}} \leftarrow S_N\).
  \STATE \(k_{i,10},k_{11} \leftarrow \text{Gen}_{\text{Ext}}^{\text{N},n}(\lambda, r_{\text{in}}, r_{\text{temp}})\).
  \FOR{\(i \in \{1,\ldots,m\}\), let \(\bar{f}_i\) be the polynomial corresponding to \(f_i\), with coefficients sign extended to \(n\)-bits from \(n_c\) bits.}
    \FOR{\(i \in \{1,\ldots,m\}\), let \(\beta_i = (f'_1,\ldots,f'_{i,0}) \in S^{(d+1)}\), be the coefficient vector of \(f'_i\) s.t. \(f'_i(x) = \bar{f}_i(x - r_{\text{temp}})\).}
    \ENDFOR
    \STATE Set \(\beta = (\beta_1,\ldots,\beta_m) \in S^{(N-1)}\) and \(\gamma = (N_T - 1) + r_{\text{in}} \in S_N\).
    \STATE \(k_{(N-1),1}, k_{1(N-1)} \leftarrow \text{Gen}_{\text{Ext}}^{\text{N},n}(\lambda, \gamma, \beta, S^{m(d+1)})\).
  \ENDFOR
  \STATE Sample random \(r_{\text{temp}2} \in S_N\), and random \(r_{10},r_{11} \leftarrow S_N\) s.t. \(r_{10} + r_{11} = r_{\text{temp}2}\).
  \IF{\(n_O \leq n - \text{tr}\)}
    \STATE \(k_{20},k_{21} \leftarrow \text{Gen}_{n,\text{tr}}^{\text{TR}}(1, \lambda, r_{\text{temp}2}, r_{\text{out}})\).
  \ELSE \(n_O > n - \text{tr}\) then
    \STATE \(k_{20},k_{21} \leftarrow \text{Gen}_{n,\text{tr}}^{\mathcal{A}}(1, \lambda, r_{\text{temp}2}, r_{\text{out}})\).
  \ENDIF
  \STATE Sample random \(r_0,r_1 \leftarrow S_{N_0}\) s.t. \(r_0 + r_1 = r_{\text{out}}\).
  \STATE For \(b \in \{0,1\}\), let \(k_b = k_{b(N-1)}\langle e_{i,b} \rangle | \langle \beta_{i,b} \rangle \rangle | r_b | \langle k_{1b} \rangle | \langle k_{2b} \rangle | r_{1b}\).
  \STATE \text{return} \((k_0,k_1)\).
\end{algorithmic}

Eval\textsuperscript{(mixed-fixed)}-spline(\(n_x,s_x,n_y,s_y,n_c,s_c\),m,d,\(p_i\),i) \((b,k_b,x)\):

\begin{algorithmic}[1]
  \STATE \text{Parse} \(k_b = k_{b(N-1)}\langle e_{i,b} \rangle | \langle \beta_{i,b} \rangle \rangle | r_b | \langle k_{1b} \rangle | \langle k_{2b} \rangle | r_{1b}\) and set \(\bar{x}_b = \text{Eval}_{\text{N},n}^{\text{Ext}}(b, k_{1b}, x)\).
  \STATE \text{Reconstruct} \(\bar{x} = \bar{x}_0 + \bar{x}_1\).
  \FOR{\(i \in \{1,\ldots,m\}\) do}
    \STATE Set \(x_i = x + (N_T - 1 - (p_i - 1 + 1)) \in S_{N_x}\).
    \STATE Set \(s_{(i-1),b}^{(i)}|s_{(i),b}^{(i)} \leftarrow \text{Eval}_{\text{N},n}^{\mathcal{A}}(b, k_{b(N-1)}, x_i)\).
  \ENDFOR
  \FOR{\(i \in \{1,\ldots,m\}\) do}
    \STATE Set \(w_{(i)} = (w_{(i)}^{(0)}, \ldots, w_{(i)}^{(0)}) = e_{x,i} \cdot \beta_{i,b} - s_{(i),b}^{(i)} + s_{(i),b}^{(i)} + e_{i,b}\).
  \ENDFOR
  \STATE Set \(y_b = (t_{(d,b)}, \ldots, t_{(0,b)}) = \sum_{i=1}^{m} w_{b}^{(i)} \in S^{(d+1)}\). Set \(y_b = r_{1b} + \sum_{i=0}^{d}(t_{i,b} \cdot \bar{x}^i) \in S_N\).
  \STATE \text{Reconstruct} \(y = y_b + y_{1b} \in S_N\).
  \IF{\(n_O \leq n - \text{tr}\)}
    \STATE \(z_b = \text{Eval}_{n,\text{tr}}^{\mathcal{TR}}(b, k_{2b}, y_b)\).
  \ELSE \(n_O > n - \text{tr}\) then
    \STATE \(z_b = \text{Gen}_{n,\text{tr}}^{\mathcal{A}}(b, k_{2b}, y_b)\).
  \ENDIF
  \STATE \text{return} \(z_b \mod N_O\).
\end{algorithmic}

Fig. 6. FSS protocol for fixed-point mixed-bitwidth spline \((\text{Gen}^{\text{mixed-fixed}})\text{-spline})(\(n_x,s_x,n_y,s_y,n_c,s_c\),m,d,\(p_i\),i,\(f_i\),i), \(b\) refers to party id.