Secure and Accurate Summation of Many Floating-Point Numbers

Motivated by the importance of floating-point computations, we study the problem of securely and accurately summing many floating-point numbers. Prior work has focused on security without accuracy or accuracy without security, whereas our approach achieves both. Specifically, we show how to implement floating-point superaccumulators using secure multi-party computation techniques, so that a number of participants holding secret shares of floating-point numbers can accurately compute their sum while keeping the individual values private.


INTRODUCTION
Floating-point numbers are the most widely used data type for approximating real numbers, with a wide variety of applications; see, e.g., [30,41,51]. A (radix-2) floating-point number x is a tuple of integers (s, m, e) such that x = (−1)^s · m · 2^e, where s ∈ {0, 1} is a sign bit, m is the ℓ-bit mantissa (which is also known as the significand), and e is the k-bit exponent.
A well-known issue with floating-point arithmetic is that it is not exact. For example, summing two floating-point numbers can incur a roundoff error, and these roundoff errors can propagate and even become larger than the computed result when performing a sequence of many floating-point additions. As a consequence, floating-point addition is not associative [37].
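The non-associativity is easy to observe in any IEEE 754 environment; for instance, in double precision:

```python
# Floating-point addition is not associative: the grouping of the
# operands changes which roundoff errors occur.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False
```

Both groupings approximate 0.6, but they round differently, so a bitwise comparison of the two results fails.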
Floating-point arithmetic has applications in many areas, including medicine, defense, economics, and physics simulation (e.g., in the NVIDIA Omniverse [33]). Thus, there is considerable need for computing sums of many floating-point numbers as accurately as possible. For example, the accuracy of any computation that involves high-dimensional dot products or matrix multiplications, such as in machine learning (see, e.g., [26,32]), depends on the accuracy of computing the sum of many floating-point numbers. Similarly, computations in computational geometry involve computing determinants, whose accuracy also depends on computing the sum of many floating-point numbers; see, e.g., [22,45,49].
In addition, the fact that floating-point addition is not associative presents problems related to the reproducibility of computations; see, e.g., [17, 21–23]. For example, a secure contract involving the summation of floating-point numbers may need to be verified after it has been signed. If the contract depends on the summation of floating-point values, performing the summation on different computers could produce different outcomes, which could cause participants to reject an otherwise valid digital contract.
Competing with this issue is that some applications of floating-point arithmetic have computer-security requirements, including integrity, confidentiality, and privacy. For example, computing the probability of satellites colliding could involve security and privacy considerations when the satellites belong to competing companies or adversarial nation-states; see, e.g., [36]. Thus, there is a need for protocols for computing sums of many floating-point numbers as securely as possible. This holds for other domains where computation on private data is performed using floating-point arithmetic, including applications in medicine and privacy-preserving training of machine-learning models on distributed sensitive data.
In spite of the importance of accuracy and security for summing floating-point numbers, we are not aware of any prior work that simultaneously achieves both for summing many floating-point numbers. As we review below, there is considerable prior work on methods for accurately summing many floating-point numbers, but the methods used do not lend themselves to transformations into secure computations. Likewise, there is considerable prior work on securely computing sums of pairs of floating-point numbers, but these prior methods do not consider the propagation of roundoff errors and can lead to inaccurate results for summing many floating-point numbers. Such inaccuracies can arise after adding numbers of significantly different magnitudes, where the values of the largest magnitude have opposite signs and significantly exceed the other summation operands. Adding the values one at a time using floating-point addition can therefore leave us with noise, while implementing addition exactly retains the necessary number of summation bits. Thus, in this paper, we are interested in methods for summing many floating-point numbers that are both secure and accurate.
Related Prior Work. Neal [42] describes algorithms using a number representation called a superaccumulator to exactly sum n floating-point numbers, with the result then converted to a faithfully-rounded floating-point number. Unfortunately, while Neal's superaccumulator representation reduces carry-bit propagation, it does not eliminate it, as is needed for the purposes of this work. A similar idea has been used in ExBLAS [17], an open-source library for floating-point computations. Shewchuk [45] describes an alternative representation for exactly representing intermediate results of floating-point arithmetic, but the method also does not eliminate carry-bit propagation in summations; hence, it also does not satisfy our accuracy constraints. In addition to these solutions, there are a number of adaptive methods for exactly summing n floating-point numbers using various other data structures for representing intermediate results, which do not consider the security or privacy of the data. Further, these methods, which include ExBLAS [17] and algorithms by Zhu and Hayes [52,53], Demmel and Hida [21,22], Rump et al. [46], Priest [43], Malcolm [39], Leuprecht and Oberaigner [38], Kadric et al. [35], and Demmel and Nguyen [23], are not amenable to conversion to secure protocols with few rounds.
While integer arithmetic in secure multi-party computation has been extensively investigated, secure floating-point arithmetic has only gradually attracted attention in the last decade. Catrina and Saxena [16] extended secure computation from integer pairwise arithmetic to fixed-point pairwise arithmetic and applied it to linear programming [15]. Franz and Katzenbeisser [28] proposed a solution, based on homomorphic encryption and garbled circuits, for floating-point pairwise operations in the two-party setting, with no implementation or performance results. Aliasgari et al. [3] designed a set of protocols for basic floating-point operations based on Shamir secret sharing and developed several advanced operations such as logarithm, square root, and exponentiation of floating-point numbers. Their solution was later improved and extended for other settings and applications [2,8,36,47]. Dimitrov et al. [24] proposed two sets of protocols using new representations to improve efficiency, but did not follow the IEEE 754 standard representation. Archer et al. [6] measure performance of floating-point operations in different instantiations using a varying number of computation participants and corruption thresholds. Rathee et al. [44] design secure protocols in the two-party setting and exactly follow the IEEE standard rounding procedure. In addition to the above works on improving the efficiency of unary/binary floating-point operations, Catrina [11–13] proposed and improved several multi-operand operations such as sum, dot product, and polynomial evaluation. Nevertheless, because these solutions are still based on traditional floating-point pairwise addition, round-off errors inevitably accumulate in each addition operation.
Our Results. In this paper, we develop new secure protocols for summing many floating-point numbers that outperform other approaches. We design a superaccumulator-based solution that privately and accurately computes summations of many private arbitrary-precision floating-point numbers, and we empirically evaluate the performance of our solution on varying input sizes and precisions. Unlike standard floating-point addition, our approach performs summation exactly, without introducing round-off errors.
Our superaccumulator-based approach and most of the protocols we develop can be instantiated with building blocks based on secret sharing in different settings, including computation with or without honest majority and in the semi-honest and malicious adversarial models. Some of the design choices are made in favor of reducing communication, and one efficient low-level building block, conversion of shares of a bit from binary to arithmetic sharing, is in the three-party setting with honest majority based on replicated secret sharing in the semi-honest model (as defined below). We implement the construction in that setting and show that its runtime is faster than the state of the art implementing floating-point operations [12,44]. Thus, we are able to implement exact addition while simultaneously improving performance.

FLOATING-POINT SUMMATION CONSTRUCTION

The Expand-and-Sum Solution
There is a simple naïve solution for exactly summing a set of n floating-point numbers, {x_1, x_2, ..., x_n}, which we refer to as the expand-and-sum solution. It is reasonable for low-precision floating-point representations and is given as Algorithm 1. That is, for each floating-point number x_i, we convert the representation of x_i into an integer I_i, with as many bits as is possible based on the floating-point type being used for the x_i s. Then we sum these values exactly using integer addition and convert the result back into a floating-point number.
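A minimal clear-text sketch of the expand-and-sum idea (not the secure protocol itself) can use Python's arbitrary-precision integers as the expanded representation:

```python
import math

def expand_and_sum(values):
    """Sum doubles exactly by expanding each one into an integer.

    math.frexp writes x = f * 2**e with 0.5 <= |f| < 1, so
    m = f * 2**53 is an integer for IEEE 754 doubles; aligning all
    values to the smallest exponent reduces the task to exact
    integer addition.
    """
    parts = []
    for x in values:
        f, e = math.frexp(x)
        parts.append((int(f * (1 << 53)), e - 53))
    min_e = min(e for _, e in parts)
    total = sum(m << (e - min_e) for m, e in parts)  # exact integer sum
    return math.ldexp(total, min_e)                  # convert back to float

print(expand_and_sum([1e16, 1.0, -1e16]))  # 1.0, while naive fp summation gives 0.0
```

Here the expanded integers can grow very large, which is exactly the drawback discussed below.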
The I_i s would have the following sizes based on the IEEE 754 formats:
• Half: a half-precision floating-point number in the IEEE 754 format has 1 sign bit, a 5-bit exponent, and a 10-bit mantissa. Thus, representing it as an integer requires 1 + 2^5 + 10 = 43 bits.
• Single: a single-precision number has 1 sign bit, an 8-bit exponent, and a 23-bit mantissa, requiring 1 + 2^8 + 23 = 280 bits.
• Double: a double-precision number has 1 sign bit, an 11-bit exponent, and a 52-bit mantissa, requiring 1 + 2^11 + 52 = 2101 bits.
Further, there are also even higher-precision floating-point representations, which would require even more bits to represent as fixed-precision or integer numbers; see, e.g., [10,27,29,31,50]. Implementing a summation using this representation would involve performing many operations on very large numbers using secure multi-party computation techniques, thus degrading performance.
Of course, applications with high-precision floating-point numbers are likely to be applications that require accurate summations; hence, we desire solutions that can work efficiently for such applications without requiring ways of summing very large integers.
In particular, summing very large integers requires techniques for dealing with cascading carry bits during the summations, and performing all these operations securely is challenging for very large integers. Thus, we consider this expand-and-sum approach for summing n floating-point numbers as integers to be limited to low-precision floating-point representations.

Superaccumulators
An alternative approach, which is better suited for use with conventional secure addition when applied to high-precision floating-point formats, is to use a superaccumulator to represent floating-point summands, e.g., see [17,18,42].This approach also uses integer arithmetic, but with much smaller integers.More importantly, it can be implemented to avoid cascading carry-bit propagation.
In a superaccumulator, instead of representing a floating-point number as a single expanded (very-large) integer, we represent that integer as a sum of small components maintained separately. That is, we represent the expanded integer V, corresponding to a floating-point number x, as a vector of 2b-bit integers ⟨V_α, V_{α−1}, ..., V_1⟩, where V = Σ_{i=1}^{α} 2^{b(i−1)} V_i and α = ⌈(2^k + ℓ)/b⌉, so that we cover all possible exponent values. Also, note that if we convert a floating-point number to a superaccumulator, then at most t = ⌈(ℓ+1)/b⌉ + 1 of the entries will be non-zero. We can choose b based on the underlying mechanism for achieving security and privacy. For example, if we want to use built-in 64-bit integer addition, we can choose b to be 32. In addition, we say that V is regularized if −2^b < V_i < 2^b for i = 1, ..., α. At a high level, in our scheme, we start with a regularized representation for each floating-point number x_i, and then we perform summations on an element-by-element basis. Finally, we regularize the partial sums by shifting "carry" values to neighboring elements. As we show, this approach prevents carry values from propagating in a cascading fashion after performing a group of sums, which allows us to achieve efficiency for our secure summation protocols.
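In the clear, the representation can be sketched as follows; the block parameters B and ALPHA below are illustrative choices, not values fixed by the construction:

```python
B = 32       # block parameter b; each block is stored in a 2b-bit word
ALPHA = 40   # number of blocks (alpha); illustrative value

def to_superaccumulator(v):
    """Split a signed expanded integer into blocks <V_alpha, ..., V_1>
    with V = sum_i 2**(B*(i-1)) * V_i and |V_i| < 2**B."""
    sign = -1 if v < 0 else 1
    v = abs(v)
    blocks = []
    for _ in range(ALPHA):
        blocks.append(sign * (v & ((1 << B) - 1)))  # low b bits, signed
        v >>= B
    return blocks  # blocks[0] holds V_1, the least significant component

def from_superaccumulator(blocks):
    return sum(v_i << (B * i) for i, v_i in enumerate(blocks))

v = 123456789012345678901234567890
assert from_superaccumulator(to_superaccumulator(v)) == v
assert from_superaccumulator(to_superaccumulator(-v)) == -v
```

The round trip shows that the vector of small components carries exactly the same information as the single expanded integer.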
As we show, because of the way that we regularize superaccumulators, the "carry" values c_i will not propagate in a cascading way, and the result of the above summation will be regularized. This allows us to complete the sum in a single communication round.
Further, for practical values of b, the constraint that n ≤ 2^{b−2} is not restrictive. For example, if b = 32, this implies we can sum up to one billion floating-point numbers in a single communication round. Thus, to sum larger groups of numbers, we can group the summations in a tree where each internal node has 2^{b−2} children, and perform the sums in a bottom-up fashion. The important property, though, is that summing n ≤ 2^{b−2} regularized superaccumulators as above and then adding the carry values c_i (some of which may be negative) to the neighboring elements results in a regularized superaccumulator. The following theorem establishes this property.

Theorem 2.1. If n ≤ 2^{b−2}, then summing n regularized superaccumulators using the above algorithm will produce a regularized result.
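The summation and single-pass carry adjustment can be sketched in the clear as follows. This is an illustrative model only: the bound of Theorem 2.1 is what guarantees that one carry pass suffices, and the sketch demonstrates that each carry moves a single position while the represented value is preserved:

```python
B = 8  # illustrative block parameter b; carries stay small when n <= 2**(B-2)

def add_and_regularize(accs):
    """Blockwise sum of superaccumulators followed by a single carry pass:
    each block's overflow moves only to the immediately more significant
    neighbor and does not cascade further."""
    alpha = len(accs[0])
    s = [sum(a[i] for a in accs) for i in range(alpha)]  # element-by-element
    carries = [s_i >> B for s_i in s]                    # signed carry (floor)
    resid = [s_i - (c << B) for s_i, c in zip(s, carries)]
    return [resid[i] + (carries[i - 1] if i > 0 else 0) for i in range(alpha)]

def value(acc):
    return sum(v << (B * i) for i, v in enumerate(acc))

accs = [[200, 150, 0, 0], [100, -120, 3, 0], [250, 250, 1, 0]]
out = add_and_regularize(accs)
assert value(out) == sum(value(a) for a in accs)  # exact, no value lost
```

Because all carries are computed from the blockwise sums at once, the adjustment is a single parallel step rather than a sequential ripple, which is what makes the secure version possible in one communication round.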

SECURE COMPUTATION PRELIMINARIES

Security Setting
We use a conventional secure multi-party setting with n parties running the computation, t of which can be corrupt. Given a function f to be evaluated, the computational parties securely evaluate it on private data such that no information about the private inputs, or information derived from the private inputs, is revealed. More formally, a standard security definition requires that the view of the participants during the computation be indistinguishable from a simulated view generated without access to any private data.
Most of the protocols developed in this work can be instantiated in different adversarial models, but our implementation and one low-level building block are in the semi-honest model, in which the participating parties are expected to follow the computation, but might try to learn additional information from what they observe during the computation. Then the security requirement is that any coalition of at most t conspiring computational parties is unable to learn any information about the private data that the computation handles. Achieving security in the semi-honest setting first is also important if one wants to have stronger security guarantees, and many of the protocols developed in this work would also be secure in the malicious model when instantiated with stronger building blocks.
The focus of this work is on precise (privacy-preserving) floating-point summation, and this operation is typically a part of a larger computation. For that reason, the inputs into the summation would be the result of other computations on private data. Therefore, we assume that the inputs into the summation are not known by the computational parties and are instead entered into the computation in a privacy-preserving form. Similarly, the output of the summation can be used for further computation and is not disclosed to the parties. In other words, we are developing a building block that can be used in other computations, where the computational parties are given a privacy-preserving representation of the inputs, jointly produce a privacy-preserving representation of the output, and must not learn any information about the values they handle. This permits our solution to be used in any higher-level computation and abstracts the setting from the way the inputs are entered into the computation (which can come from the computational parties themselves or from external input providers).
In our solution, we heavily rely on the fact that composition of secure building blocks is also secure.As part of this work, we develop several new building blocks to enable the functionality we want to support.

Secret Sharing
To realize secure computation, we utilize (n, t)-threshold linear secret sharing. Secret sharing offers efficiency due to the information-theoretic nature of the techniques and, consequently, the ability to operate over a small field or ring. Many of the protocols developed in this work can be realized using any suitable type of secret sharing (e.g., with or without honest majority and in the semi-honest or malicious settings), and by [x] we denote a secret-shared representation of value x, which is an element of the underlying field or ring. The expected properties are that (i) each of the n computational parties P_i holds its own share such that any combination of t shares reveals no information about x and (ii) a linear combination of secret-shared values can be computed by each party locally on its shares. SPDZ_{2^k} [19] is one example of a suitable framework.
For performance reasons, many recent publications utilize computation over the ring Z_{2^k} for some k ≥ 1, which permits the use of native CPU instructions for performing ring operations. This is also the setting that we utilize for our experiments and use to inform certain protocol optimizations. Conventional techniques such as Shamir secret sharing [48] cannot operate over Z_{2^k}, and thus we rely on replicated secret sharing [34] with a small number of parties. Specifically, we use the setting with honest majority, i.e., where t < n/2, and are primarily interested in the three-party setting, i.e., n = 3. All parties P_1, ..., P_n are assumed to be connected by pair-wise secure authenticated channels.
There is a need to secret share both positive and negative integers, and the ring/field space is used to naturally represent all values as non-negative elements. In that case, the most significant bit of the representation determines the sign.
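For example, with a small illustrative ring (the two's-complement embedding used below is the standard one this representation corresponds to):

```python
GAMMA = 8            # illustrative element bitlength
RING = 1 << GAMMA

def to_ring(x):
    """Embed a signed integer into Z_{2^GAMMA} (two's complement)."""
    return x % RING

def sign_bit(v):
    """The most significant bit of the representation gives the sign."""
    return v >> (GAMMA - 1)

assert to_ring(-3) == 253
assert sign_bit(to_ring(-3)) == 1  # negative
assert sign_bit(to_ring(3)) == 0   # non-negative
```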
For efficiency reasons, portions of the computation proceed on secret-shared values set up over a different ring, most commonly Z_2. Thus, we use notation [x]^ℓ to denote secret sharing over Z_{2^ℓ} when ℓ differs from the default bitlength.

Building Blocks
In a linear secret sharing scheme, a linear combination of secret-shared values can be performed locally on the shares without communication. This includes addition, subtraction, and multiplication by a known element. Multiplication of secret-shared values requires communication, and the cost varies based on the setting. We use the multiplication protocol from [7] that works with any number of parties in the honest majority setting and communicates only one element in one round in the three-party setting, i.e., when n = 3, it matches the cost of three-party protocols such as [5]. The dot-product operation can also often be realized with the communication cost of a single multiplication, regardless of the size of the input vectors.
Our computation additionally relies on the following common building blocks:
• Equality. Tests two secret-shared values for equality and produces a secret-shared bit.
• MSB. Extracts the most significant bit of its input. When working with positive and negative values, MSB computes the sign and is equivalent to the less-than-zero operation. For that reason, the operation can also be used to compare two integers [x] and [y] by supplying their difference as the input into the function. We use the protocol from [7].
• BitDec. Performs bit decomposition of an ℓ-bit input [x] and outputs ℓ secret-shared bits. Our implementation uses the protocol from [20], with a modification that random bit generation is based on edaBits (see below) and the output bits are secret shared over Z_2 by skipping their conversion to the larger ring.
• Trunc. Truncates its input by a given number of bits. We invoke this function only on non-negative inputs. Our implementation augments randomized truncation TruncPr from [7] with BitLT implemented using a generic carry propagation mechanism.
• PrefixAND. On input bits [x_1], ..., [x_ℓ], outputs [y_1], ..., [y_ℓ] with y_j = ∧_{i=1}^{j} x_i. This is the same as y_j = ∏_{i=1}^{j} x_i when the x_i s are binary. PrefixAND can be realized as described in [14] using a generic prefix operation procedure (when operating over a ring). As the inputs are bits, for performance reasons this protocol is carried out over Z_2.
• PrefixOR. Defined analogously; this operation can also be implemented using a generic prefix operation mechanism and executed over Z_2.
• AllOr. Takes k bits [x_0], ..., [x_{k−1}] and produces 2^k bits [y_j] of the form ∧_{i=0}^{k−1} [z_i], where each z_i is either x_i or its complement ¬x_i, and the protocol enumerates all possible combinations. The important property is that only one element, at position j = Σ_{i=0}^{k−1} 2^i x_i, in the output array will be set to 1, while the remaining elements will be 0. The protocol is described in [9], which we implement over a ring.
We also develop several other building blocks as described in Section 4. Note that many of these building blocks can be implemented using different variants, where the mechanism for random bit generation plays a particular role. Using the edaBit approach as described above lowers the communication cost of protocols compared to generating each random bit separately with shares in the larger ring, but incurs a higher number of communication rounds. We make design choices in favor of lowering communication, but the alternative is attractive when summing a small number of inputs or when the latency between the computational nodes is high. Notation ← is used for functionalities that draw randomness (to produce randomized output or to compute a deterministic functionality that internally uses randomization) and notation = is used for deterministic computation.
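The one-hot property of AllOr can be checked in the clear; the sketch below enumerates, for each output index, the conjunction of each input bit or its complement:

```python
def all_or_clear(bits):
    """For input bits x_0..x_{k-1}, produce 2**k outputs y_j, where y_j is
    the AND of each x_i or its complement according to the bits of j.
    Exactly one output, at j = sum_i 2**i * x_i, equals 1."""
    k = len(bits)
    out = []
    for j in range(1 << k):
        y = 1
        for i in range(k):
            y &= bits[i] if (j >> i) & 1 else 1 - bits[i]
        out.append(y)
    return out

out = all_or_clear([1, 0, 1])  # x_0=1, x_1=0, x_2=1 encodes j = 5
assert out[5] == 1 and sum(out) == 1
```

The secure protocol computes these conjunctions on shares; the clear-text version only illustrates the input/output behavior.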

SECURE LARGE-PRECISION CONSTRUCTION
We are now ready to proceed with our solution for secure and accurate floating-point summation based on the superaccumulator structure of Section 2. The public parameters of the floating-point and superaccumulator representations are implicit inputs to the protocols that follow.
When constructing a privacy-preserving solution, the computation that we perform must be data-independent, or data-oblivious, so as not to disclose any information about the underlying values. In the context of working with the superaccumulator representation, we need to access all superaccumulator slots in the same way regardless of where the relevant data might be located. In particular, when converting a floating-point value to a superaccumulator, at most t slots will contain non-zero values, but their location cannot be disclosed. Similarly, when converting a regularized superaccumulator corresponding to the sum to its floating-point representation, only the most significant non-zero slots are of relevance, but we need to hide their position within the superaccumulator.
It is important to note that, unless specified otherwise, the computation is performed over 2b-bit shares (or the ring Z_{2^{2b}} in our implementation) to facilitate superaccumulator operations; we denote this default element bitlength by 2b. This default bitlength is sufficient to represent all values, with a single exception: the bitlength ℓ of the mantissa m in the floating-point representation can often exceed the value of 2b. For that reason, we represent mantissa m as a sequence of ⌈(ℓ+1)/b⌉, or t − 1, secret-shared blocks storing b bits of m per block. For clarity of exposition, the mantissa is written as a single shared value in FLSum, while in the more detailed protocols that follow we make this representation explicit.
For most protocols in this paper, including FLSum in Algorithm 2, security follows as a straightforward composition of the building blocks, assuming that the sub-protocols are themselves secure. Then, using a standard definition of security that requires a simulator without access to private data to produce a view of the corrupt parties that is indistinguishable from the protocol's real execution, we can invoke the simulators corresponding to the sub-protocols and obtain security of the overall construction. Thus, in the remainder of this work we discuss security of a specific protocol only when demonstrating its security involves going beyond a simple composition of its sub-protocols. In addition, for some protocols it is important to ensure that they are data-oblivious (i.e., data-independent), such that the executed instructions and accessed memory locations are independent of private inputs. Data obliviousness is necessary for achieving security because we need the ability to simulate the corrupt parties' view without access to private data.

Floating-Point to Superaccumulator Conversion
The first component is to convert floating-point inputs to their superaccumulator representation. Because this operation is rather complex and needs to be performed for each input, it dominates the cost of the overall summation, and thus it is important to optimize the corresponding computation. The conversion procedure takes a floating-point value and needs to produce a regularized superaccumulator as a vector of α 2b-bit integers, where α = ⌈(2^k + ℓ)/b⌉.

The Overall Construction. To perform the conversion, the computation needs to determine the position within the superaccumulator where the mantissa is to be written based on exponent [e], represent the mantissa as t superaccumulator blocks, and write the blocks in the right locations without disclosing what those locations within the superaccumulator are. The protocol details are given as protocol FL2SA (Algorithm 3), which we subsequently explain.
Recall that the superaccumulator's step is 2^b. This means that the k − log b most significant bits of the exponent [e] represent the index of the first non-zero slot in the accumulator. The log b least significant bits of the exponent are used to shift the mantissa so that it is aligned with the block representation of the superaccumulator. Thus, in the beginning of FL2SA we divide the exponent [e] into two parts: the most significant k − log b bits are denoted by [e_high] and the remaining log b bits are denoted by [e_low] (lines 1–2).
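In the clear, the split amounts to integer division by b (a power of two):

```python
def split_exponent(e, b=32):
    """Return (e_high, e_low): e_high selects the first non-zero slot,
    e_low is the within-block shift; requires b to be a power of two."""
    log_b = b.bit_length() - 1
    return e >> log_b, e & (b - 1)

e_high, e_low = split_exponent(100, b=32)
assert (e_high, e_low) == (3, 4)  # 100 = 3 * 32 + 4
```

The secure protocol performs the equivalent split on the shared exponent without revealing either part.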
The next task is to use the mantissa (represented as t − 1 blocks) and [e_low] to generate t superaccumulator blocks. First, recall that a normalized floating-point representation assumes that the most significant bit of the mantissa is 1 and is implicit in the floating-point representation. Thus, we need to prepend 1 as the (ℓ+1)st bit of m. In FL2SA we do this conditionally, only when the exponent is non-zero (lines 3–4), because when e = 0, normalization might not be possible (e.g., if the floating-point value represents a zero). Second, we need to shift the updated mantissa blocks by the private log b-bit value [e_low] to be aligned with the boundaries of the superaccumulator blocks and update each value to be b bits by carrying the overflow into the next block.
To perform re-partitioning, we considered solutions based on bit decomposition and on truncation, and the truncation-based approach was determined to be faster. Our final solution, a protocol called Shift, takes the original mantissa blocks, shifts the values by the private amount [e_low] (with b being the upper bound on the amount of shift), and re-aligns the blocks to contain b bits each using truncation. The details of the Shift protocol are deferred to the next sub-section. After producing the superaccumulator blocks (line 5), we update the sign of each block using bit [s] (lines 6–8). The desired superaccumulator representation is depicted in Figure 1, where the produced superaccumulator blocks are intended to be written in positions e_high + 1 through e_high + t.
The last task is to write the generated t superaccumulator blocks [u_j] into the right positions of our α-block superaccumulator, as specified by the value of [e_high]. Because the computation must be data-oblivious, the location of writing cannot be revealed and the access pattern must be the same for any value of e_high. To accomplish the task, we considered two possible solutions: (i) turning the value of e_high into a bit array of size α with the e_high-th bit set to 1 and all others set to 0, and using the bit array to create superaccumulator blocks, and (ii) creating a bit array with a single 1 in the first location and rotating the bit array by a private amount e_high. The first approach was determined to be faster and we describe it next.
The conversion of [e_high] + 1, the value of which ranges between 1 and α, to a bit array of private bits with the (e_high + 1)st bit set to 1 can be viewed as binary-to-unary conversion, denoted by B2U. Prior work considered this building block, specifically in the context of secure floating-point computation [3], but prior implementations were over a field. Because computation over a ring of the form Z_{2^k} can be substantially faster, we design a new protocol suitable for computation over a ring using recent results, as described later in this section. After the binary-to-unary conversion of e_high + 1 (line 9 of FL2SA), each slot [V_i] of the superaccumulator is computed as a dot product of the previously computed data blocks [u_j] and at most t of the bits [z_j] (lines 10–18), because the data blocks need to be written at t consecutive positions determined by e_high. In particular, for the middle superaccumulator blocks, there are t bits and data blocks to consider when creating each superaccumulator block [V_i], while the boundary blocks, such as [V_1], iterate over fewer options.

Figure 1: Illustration of floating-point to superaccumulator conversion.
All superaccumulator blocks are updated in parallel, with communication cost equivalent to that of α multiplications.

The Shift Protocol. It is implicit in the interface specification that each original block representation has (at least) b unused bits, so that the content of each block can be shifted by up to b positions without losing information. In particular, we assume that each block has b bits occupied, so that after the shift the intermediate result can grow to 2b bits before being reorganized to occupy b bits per block.
The computation, given in Algorithm 4, starts by bit-decomposing the private amount of shift [σ] and converting the resulting bits to ring elements (lines 1–5). The content of each block is shifted (via multiplication by a power of 2) by the appropriate number of positions depending on the value of each bit of the amount of shift: when bit [σ_i] is 0, the value is multiplied by 1; otherwise, it is multiplied by a power of 2 that depends on the index i (lines 6–8). We then truncate each shifted block (line 9) to split the value into the least significant b bits, which the block retains, and the most significant b bits, which form the carry for the next block. Each block is consequently updated by taking the carry from the prior block and keeping its b least significant bits (lines 11–15). Because we shift all blocks in the same way, this operation corresponds to a shift with block re-alignment on the boundary of b bits per block.
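In the clear, the shift-and-realign step can be sketched as follows (illustrative b; the secure protocol selects each factor obliviously from the shared bits of the shift amount rather than branching on them):

```python
def shift_blocks(blocks, sigma, b=8):
    """Multiply every block by 2**sigma, selecting the factor bit by bit,
    then realign so each block again holds b bits; the overflow of each
    block becomes the carry into the next (more significant) block."""
    shifted = list(blocks)
    for i in range(max(1, sigma.bit_length())):
        # when bit i of sigma is 0 multiply by 1, otherwise by 2**(2**i)
        factor = (1 << (1 << i)) if (sigma >> i) & 1 else 1
        shifted = [v * factor for v in shifted]
    out, carry = [], 0
    for v in shifted:
        out.append((v & ((1 << b) - 1)) + carry)  # keep low b bits + prior carry
        carry = v >> b                            # top bits move to the next block
    return out

blocks = [0xCD, 0xAB, 0x00]  # value 0xABCD in b=8 blocks, with headroom
out = shift_blocks(blocks, 4)
assert sum(v << (8 * i) for i, v in enumerate(out)) == (0xABCD << 4)
```

The headroom assumption mirrors the interface requirement above: each block starts with b occupied bits, grows to at most 2b bits after the multiplication, and is then split back into b-bit blocks.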
Our ring-based solution for binary-to-unary conversion B2U takes a private integer [a] and a public range ℓ, where 0 < a ≤ ℓ, and produces a bit array ⟨[y_1], ..., [y_ℓ]⟩ with the a-th bit set to 1 and all other bits set to 0. Our goal is to have a variant suitable for computation over the ring Z_{2^k} using the most efficient currently available tools. Our solution, shown as Algorithm 5, is based on ideas used in [9] for retrieving an element of an array at a private index.
The high-level idea consists of generating ⌈log ℓ⌉ random bits [r_i] that collectively represent a random ⌈log ℓ⌉-bit integer [r], computing the ⌈log ℓ⌉-ary OR of the bits of [r] − c for every ⌈log ℓ⌉-bit value c, and flipping the resulting bits. This creates a bit array with all values set to 0 except for the element at private location [r], which is set to 1. The ORs are computed simultaneously for all values using protocol AllOr. Consequently, the algorithm opens the value of c = a + r (modulo 2^{⌈log ℓ⌉}) and uses the disclosed value to position the only 1 bit of the array in location a (i.e., the bit will be set at position c − r = a + r − r = a).
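A clear-text sketch of this idea follows (0-based indices for simplicity; the real protocol keeps a and r secret-shared and builds the one-hot array with AllOr rather than by direct comparison):

```python
import random

def b2u_clear(a, ell):
    """Return a bit array of length ell with a 1 only at (0-based) index a.
    The index is masked by a random r, a one-hot array is built at r, and
    the public value c = a + r mod 2**K is used to reposition the 1."""
    K = max(1, (ell - 1).bit_length())
    L = 1 << K                                    # L >= ell
    r = random.randrange(L)                       # random mask (kept secret)
    onehot = [1 if j == r else 0 for j in range(L)]
    c = (a + r) % L                               # safe to open: masked by r
    # reposition using only the public c: entry j takes onehot[(c - j) mod L],
    # so the single 1 lands at j = c - r mod L = a
    rotated = [onehot[(c - j) % L] for j in range(L)]
    return rotated[:ell]

out = b2u_clear(5, 8)
assert out[5] == 1 and sum(out) == 1
```

Opening c leaks nothing about a because r acts as a one-time pad modulo 2^K, which is exactly the argument made for the secure protocol below.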
Note that the protocol explicitly calls edaBit for random bit generation (and inherits its properties), and there are alternatives. We enhance performance by carrying out the most time-consuming portion of the computation, namely AllOr, over the small ring Z_2, because the computation uses Boolean values. This means that after producing 2^{⌈log ℓ⌉} bits through a sequence of calls to edaBit, AllOr, and Open and an array rotation, we need to convert their shares from Z_2 to the larger ring, which we do using binary-to-arithmetic share conversion B2A (line 6). In addition, the reconstruction of c = a + r on line 4 needs to be performed using ⌈log ℓ⌉-bit shares to enforce modulo reduction and prevent information leakage, where share truncation prior to the reconstruction is performed by Open itself using the modulus specified as the second argument.
As far as security goes, we note that besides composing subprotocols, the protocol also reconstructs a value that is a function of the private input [𝑥] on line 8. Security is still achieved because [𝑟] is a private value uniformly distributed in Z_{2^⌈log ℓ⌉}. Thus, the value of [𝑥] is perfectly protected, and the opened element of Z_{2^⌈log ℓ⌉} is also uniformly distributed.
The simulator acts on behalf of party 𝑃_2. In the beginning of the computation, 𝑃_3 has access to [𝑎]^(1), [𝑏]^(2)_1, G_1, and G_2. It then receives a random [𝑢]^(2) from the simulator in the simulated view, while in the real execution the value is computed as 𝑢′ + 𝑢′′, where 𝑢′′ = G_3.next. Due to the security of the PRG, its output is pseudo-random and information-theoretically protects 𝑢′. We obtain that this value is indistinguishable from a truly random string to a computationally bounded 𝑃_3. Thus, 𝑃_3's views in the real execution and the simulation are computationally indistinguishable.
We conclude that our B2A protocol is secure in the presence of a single semi-honest adversary. □
B2A is an important building block of many other protocols, including truncation, ring conversion, and bit decomposition. Thus, the efficient three-party B2A above impacts the performance of much of the computation. For that reason, we analyze the performance of the building blocks and our protocols in the three-party setting using RSS, as given in Table 2. Note that we separate input-independent computation, which can be pre-computed, from the remaining (input-dependent) computation.
Random bit generation [𝑟] ← RandBit (as used, e.g., in MSB) is implemented by using local randomness to generate shares of [𝑟] over Z_2 and converting them to the larger ring using B2A. We favor the use of edaBit in sub-protocols in place of conventional RandBit random bit generation. This lowers the amount of communication, but increases the number of rounds.
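To make the bit-generation route concrete, the sketch below shows the arithmetic expression that a B2A-style conversion evaluates: the XOR of Boolean shares is rewritten over Z_{2^𝑘} using the identity a ⊕ b = a + b − 2ab. This is a plaintext illustration of the function being computed, not the share-level three-party protocol itself; names and the choice K = 60 are ours:

```python
import secrets

K = 60              # illustrative ring size, Z_{2^60}
MOD = 1 << K

def xor_to_arith(bits_mod2):
    """Fold an XOR-sharing of a bit into its value over Z_{2^K},
    using a XOR b = a + b - 2ab pairwise (applied to shares in B2A)."""
    acc = 0
    for b in bits_mod2:
        acc = (acc + b - 2 * acc * b) % MOD
    return acc

def rand_bit_arith():
    """RandBit in the clear: each of 3 parties samples a local bit; the XOR
    of the bits is a uniformly random bit, obtained in arithmetic form."""
    local = [secrets.randbelow(2) for _ in range(3)]
    return xor_to_arith(local)
```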
The cost of AllOr as specified in [9] varies based on the size given as input. For that reason, in Table 2 we list a range of constants covering the values used with single and double precision in this work (the smallest size, 9, used with single precision and 𝑘 = 32, results in the constant 1.5, and the largest size, 132, used with double precision and 𝑘 = 16, results in the constant 1.2).

Superaccumulator Summation
Once we convert the floating-point inputs into superaccumulators, the next step is to perform the summation and regularize the result. This corresponds to protocol SASum, given in Algorithm 7. The summation itself adds all input superaccumulators block-wise (line 2). The remaining computation regularizes the resulting superaccumulator. We first compute the absolute value of each block 𝑏_𝑗 (lines 3-4) and then split the result into its most significant bits (the carry for the next block [𝑐_{𝑗+1}]) and its 𝑘 least significant bits ([𝑏_𝑗]) using truncation (lines 5-6). The final block value is assembled from the carry of the prior block and the remaining portion of the current block using their corresponding signs (line 7). The carry into block 1 is 0.
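The following plaintext Python sketch captures the block-wise summation followed by carry regularization. It assumes non-negative block values for simplicity (the protocol handles signs via absolute values, as described above); K = 16 and the function names are illustrative:

```python
K = 16                      # payload bits per block (illustrative)

def regularize(blocks):
    """Propagate carries so that every block again fits in K bits.
    blocks[0] is least significant; the encoded value is preserved."""
    out, carry = [], 0
    for b in blocks:
        v = b + carry
        carry = v >> K              # bits above K move to the next block
        out.append(v - (carry << K))
    return out

def sa_sum(accs, n_blocks):
    """Block-wise addition of superaccumulators, then regularization."""
    total = [sum(a[i] for a in accs) for i in range(n_blocks)]
    return regularize(total)
```

Because block-wise addition and carry propagation both preserve the encoded integer, the sum is exact, which is the property that eliminates the round-off errors of conventional floating-point addition.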
Recall that each superaccumulator block is represented as a 2𝑘-bit integer, so we can add at most 2^{𝑘−2} inputs without an overflow. If one needs to sum more than 2^{𝑘−2} inputs, the computation proceeds in layers: we first sum the superaccumulators in batches of 2^{𝑘−2}, regularize each result, and then perform another layer of summation and regularization to arrive at the final regularized superaccumulator.

Superaccumulator to Floating-Point Conversion
What remains to discuss is the conversion of the regularized superaccumulator representing the sum into floating-point representation. To maintain security, our protocol needs to obliviously select 𝑤 superaccumulator blocks starting from the first non-zero block, without disclosing the location of the selected blocks. In the event that there are fewer than 𝑤 blocks to extract, the solution still returns 𝑤 blocks.
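In the clear, the selection logic amounts to locating the most significant non-zero block and assembling a mantissa and exponent from a window of 𝑤 blocks. The sketch below illustrates this; it assumes a regularized, non-negative superaccumulator, and the oblivious (data-independent) selection of the secure protocol is replaced by a direct scan:

```python
def sa_to_float(blocks, k, w):
    """Convert a regularized superaccumulator (blocks[0] least significant,
    each block < 2^k) to a float using w blocks from the first non-zero one."""
    ind = max((i for i, b in enumerate(blocks) if b != 0), default=0)
    lo = max(0, ind - w + 1)        # fewer than w blocks may remain below ind
    mantissa = 0
    for i in range(ind, lo - 1, -1):
        mantissa = (mantissa << k) | blocks[i]
    exponent = k * lo               # weight of the least significant kept block
    return mantissa * (2.0 ** exponent)
```

For actual single or double precision, 𝑤 is chosen so that the 𝑘·𝑤 mantissa bits cover the target significand length; blocks below the window are rounded away.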
The superaccumulator to floating-point conversion protocol SA2FL is given as Algorithm 8 and proceeds as follows. Let ind denote the (private) index of the first non-zero superaccumulator block. We restrict the value we work with to be in an appropriate range.
Because the protocol from [20] requires a ring larger by two bits, we set the computation over Z_{2^𝑘} = Z_{2^60}, and thus portions of the computation for the protocol from [20] are over Z_{2^62}. The implication is that both protocols can internally use 64-bit arithmetic and the increase in the ring size does not impact communication in bytes. Therefore, the communication and the number of rounds of the protocol from [20] are also the same as those of the protocol from [4]. Had we chosen 𝑘 = 32 or 𝑘 = 64, the gap in performance between our protocol and that from [20] would increase due to the need of the latter to increase the communication size and use a longer data type for the computation.
As we see from Figure 2, for smaller input sizes both solutions exhibit similar performance due to their equivalent round complexity. However, as the input size increases beyond 2^6 and communication and computation become the dominant factors in overall performance, our solution outperforms [20] by a significant margin. For instance, the performance gap between the two approaches is as large as a factor of four for input size 2^20, demonstrating the advantage of our B2A protocol even beyond savings in communication.
Performance of our superaccumulator-based floating-point summation for single and double floating-point precision is provided in Table 3. The performance is additionally visualized in Figures 3 and 4. We see that the bottleneck of the summation for both single and double precision is the conversion FL2SA, particularly when the input size 𝑛 is large. This is expected because we need to convert all 𝑛 inputs into superaccumulator representation. In contrast, superaccumulator to floating-point conversion SA2FL has a constant runtime for all input sizes because we only need to convert a single result, so the workload does not change. Although summation SASum has communication complexity independent of 𝑛, its local computation depends linearly on the input size, which makes its runtime increase with 𝑛.
If we compare the runtimes for different values of 𝑘, using 𝑘 = 16 results in a lower overall runtime with single precision, while 𝑘 = 32 is superior for double precision. The difference in performance mainly stems from the impact of the choice of 𝑘 on the performance of FL2SA and its dependence on parameters that 𝑘 directly influences.
We also compare the performance of our superaccumulator-based solution with the floating-point summations from [11,12,44]. We execute SecFloat's [44] pairwise addition in a tree-like manner to realize floating-point summation and measure its performance on our setup. Note that SecFloat is designed for the two-party setting (dishonest majority) and was implemented only for single precision. We also include published runtimes of the best performing solution, SumFL2, from [12], as that implementation has not been released. The experiments in [12] were run using three 3.6GHz machines connected via a 1Gbps LAN, where the round-trip time (RTT) measured via ping was reported to be 0.35ms (our RTT measured via ping averaged 0.25ms). We also calculate the communication cost of SumFL2 using the specified formula.¹ The results are given in Figure 3, where our single-precision solution uses 𝑘 = 16. As shown in the figure, our protocol has better runtime and communication costs than the other two solutions. Although [44] states that their implementation is not optimized for batch sizes smaller than 2^10, our protocol is still 5 times faster and uses 17 times less communication than [44] with 2^18 inputs. For input sizes larger than 2^14, both solutions demonstrate the same trend. We expect our advantages to be larger in a WAN setting, where bandwidth is limited and communication is the bottleneck.
Table 4: Runtime comparison with SumFL2 from [12] in ms.
Compared to [12], our best performing configuration has a better runtime despite running on slower machines, as additionally shown in Table 4. In [12], performance is reported for at most 100 inputs. When 𝑛 = 100, our solution demonstrates the largest improvement, being 5 times faster than SumFL2 from [12] for both single and double precision. We expect the improvement to be even larger as the number of inputs increases. Furthermore, we note that our solution enjoys higher precision, as the goal of this work was to provide better precision than what is achievable using conventional floating-point addition. Lastly, while [13] discusses additional optimizations to floating-point polynomial evaluation, it is difficult to extract times that would correspond to summation.

CONCLUSIONS
The goal of this work is to develop secure protocols for accurate summation of many floating-point values that avoid the round-off errors of conventional floating-point addition. Our solution uses the notion of a superaccumulator: the computation proceeds by converting the floating-point inputs into superaccumulator representation, performing exact summation, and converting the computed result back to a floating-point value. Despite providing higher accuracy, our solution outperforms state-of-the-art secure floating-point summation, as we demonstrate experimentally.

4.1.2 New Building Blocks. What remains is to describe our Shift and B2U protocols. The Shift protocol takes an integer value (the mantissa in the context of this work) stored in blocks [𝑏_{𝑛−1}], . . ., [𝑏_1], shifts the value right by a private amount specified by the second argument [𝑠], where the value of 𝑠 ranges between 0 and a maximum specified by the third argument, and outputs the new blocks [𝑏_{𝑛−1}], . . ., [𝑏_1].

Figure 3: Performance comparison with related work for single precision. [9]'s runtime uses different hardware.