HeLayers: A Tile Tensors Framework for Large Neural Networks on Encrypted Data

Privacy-preserving solutions enable companies to offload confidential data to third-party services while fulfilling their government regulations. To accomplish this, they leverage various cryptographic techniques such as Homomorphic Encryption (HE), which allows performing computation on encrypted data. Most HE schemes work in a SIMD fashion, and the data packing method can dramatically affect the running time and memory costs. Finding a packing method that leads to an optimal performant implementation is a hard task. We present a simple and intuitive framework that abstracts the packing decision for the user. We explain its underlying data structures and optimizer, and propose a novel algorithm for performing 2D convolution operations. We used this framework to implement an HE-friendly version of AlexNet, which runs in three minutes, several orders of magnitude faster than other state-of-the-art solutions that only use HE.


INTRODUCTION
Homomorphic Encryption (HE) schemes allow computations to be performed over encrypted data while providing data confidentiality for the input. Specifically, they allow the evaluation of functions on encrypted input, which is useful when outsourcing sensitive data to a third-party cloud environment. For example, a hospital that provides an X-ray classification service (e.g., COVID-19 versus pneumonia) can encrypt the images using HE, express the classification algorithm as a function, and ask a cloud service to evaluate it over the encrypted data without decrypting it. In this way, the hospital can use the cloud service while possibly complying with regulations such as HIPAA [9] and GDPR [18]. The proliferation of HE solutions in the last decade [20] shows that customers are eager to use them and that companies and organizations strive to provide secure and efficient solutions such as HEBench [2]. Nevertheless, it turns out that running large Neural Networks (NNs) using HE only is still considered an expensive task. For example, the best implementations [19] of AlexNet [31] with large inputs of 224 × 224 × 3 before this paper, was measured as taking more than 3 hours using only 80 security bits. This barrier forces users to search for other secure alternatives instead of enjoying the advantage of solutions that rely only on HE. Our proposed framework aims to narrow down this barrier, allowing the users to better utilize cloud capabilities while operating on their confidential data.
Some HE schemes, such as CKKS [13], operate on ciphertexts in a homomorphic Single Instruction Multiple Data (SIMD) fashion. This means that a single ciphertext encrypts a fixed size vector, and the homomorphic operations on the ciphertext are performed slotwise on the elements of the plaintext vector. To utilize the SIMD feature, we need to pack and encrypt more than one input element in every ciphertext. The packing method can dramatically affect the latency (i.e., time to perform computation), throughput (i.e., number of computations performed in a unit of time), communication costs, and memory requirements. The designer of an HE-based system is usually tasked with choosing the ciphertext-size. The size is bounded from above by the FHE-library implementation and from below by security parameters. Once selected, the ciphertext-size stays fixed for the entire process (with different packing options). To demonstrate the effect of different packing choices we use Cryp-toNets [21]. We summarize the results in Table 2, and observe that using two naïve packing solutions achieved latencies of 0.86 sec. and 11.1 sec., with memory usage of 1.58 GB and 14 GB, respectively. In comparison, a different non-trivial packing method achieved better latency of 0.56 sec. and memory usage of 0.73 GB. Section 7 provides more details about these three packing methods.
We also demonstrate why packing is hard by presenting the problems involved in encrypting and packing a matrix. A naïve packing option may keep every row or column in a different ciphertext. When the evaluated algorithm performs manipulations over rows, one packing option is clearly better than the other. However, when the matrix dimensions are small, the encrypted ciphertext may include many unused ciphertext slots. With the cost of performing extra rotations, it is possible to pack more than one row in a single ciphertext.
Designing a good packing method is not straightforward (e.g., [16,29,44]) and the most efficient packing method may not be the trivial one (see above). Moreover, different optimization goals may lead to different packing, e.g., as shown in Table 3. As the size of the HE code increases, it becomes harder to find the optimal packing. For example, finding the best packing for a large NN inference algorithm is hard since the input is typically a four or five dimensional tensor, and the computation involves a long sequence of operations such as matrix multiplication and convolution.
Non-interactive HE. Two approaches for HE computations are client-aided and non-client-aided. In the client-aided approach, during the computation the server asks the data owner or the end-user for assistance. I.e., the user is asked to decrypt the intermediate ciphertext results, perform some minor computation tasks, and re-encrypt the data using HE. This approach was implemented in GAZELLE [29] and NGraph [8] using Multi Party Computations (MPC). It has the drawback that the client must stay online during the computation. Moreover, this setting poses some security concerns [4] or can even make it easier to perform model-extraction attacks [35]. To avoid these limitations, we focus on the non-clientaided approach where the computation is done entirely under encryption, without interaction.
Using non-client-aided designs require using HE-friendly NN architectures; these replace nonlinear layers such as ReLU and MaxPooling with other functions. Today, these conversions have become common practice (see survey [43]), and they present a time versus accuracy tradeoff that is mostly analyzed in AI-related works (e.g., [6]). Our framework can work with any polynomial activation function. Hence, our current and future users can decide how to balance time and accuracy by choosing the HE architecture that best suits their needs. We leave the accuracy discussion outside the scope of this paper. Here, we emphasize that our security-oriented framework leverages the SIMD property of HE-schemes, which can independently benefit from any AI-domain improvement in model accuracy.
Using HE-friendly models may require retraining models before using them. Nevertheless, we argue that the proliferation of the HEdomain, together with future AI improvements, will bring about more solutions that prefer training HE-friendly models directly, from day one. With our focus on non-client-aided designs, we provide customers with better security guarantees and improved client usability. The potential small cost in model accuracy is expected to get smaller over time.
Some recent HE compilers [8,17] simplify the way users can implement NN solutions on encrypted data by allowing them to focus on the network and leaving the packing optimizations to the compilers. This is also the purpose of our tile tensor framework. It enables us to evaluate an HE-friendly version [6] of AlexNet [31] in only five minutes. To the best of our knowledge, this is the largest network to be implemented with a feasible running time, 128-bit security, in a non client-aided mode, and without bootstrapping (see [23]). In comparison, NGraph [8] reported their measurements for CryptoNets [21] or for MobileNetV2 [41] when using client-aided design, and CHET [17] reported the results for SqueezeNet [26], which has 50× fewer parameters than AlexNet. Another experiment using NGraph and CHET was reported in [43] using Lenet-5 [32], which is also a small network compared to AlexNet. See a complete comparison at Section 8.

Our Contribution
Our contributions can be summarized as follows: • A tile tensor based framework. We introduce a new packingoblivious programming-framework that allows users to concentrate on the NN design instead of the packing decisions. This framework is simple and intuitive, and is available for non-commercial use in [39]. • Tile tensors. We introduce tile tensors, a new generalized packing method that intuitively covers many of the previous packing optimizations. A tile tensor offers the user the API of an ordinary tensor, and is implemented using elegant algorithms. • Packing optimizer. We describe a packing optimizer that considers many different packing options. The optimizer estimates the time and memory needed to run a given function for every option, and reports the one that performs best per a given objective, whether latency, throughput, or memory. • A new method to compute convolution. We provide a new packing method and a new implementation of the 2D convolutional layer, which is a popular building block in NNs.
Our new packing and implementation are more efficient for large inputs than previous work. In addition, with this packing, we are able to efficiently compute a long sequence of convolution-layers. • Efficient HE-friendly version of AlexNet inference under encryption. We implemented an HE-friendly version of AlexNet.
To the best of our knowledge, this is the fastest non-clientaided evaluation of this network. • Packing notation. We present a language for describing packing details. It covers several known packing schemes, as well as new ones, and allows easy and intuitive HE circuit design.
The rest of the paper is organized as follows. Section 2 describes the notation used in the paper, and some background terminology. Section 3 provides an overview of the tile tensor framework. We describe the tile tensors data structure in Section 4 and the packing optimizer in Section 6. Section 5 describes our novel convolution algorithm, and Section 7 reports the results of our experiments when using CryptoNets and AlexNet. In Section 8, we provide an extended comparison of our methods to the state-of-the-art methods. Section 9 concludes the paper.

BACKGROUND 2.1 Notation
We use the term tensor as synonymous with multi-dimensional array, as this is common in the AI domain. We denote the shape of a k-dimensional tensor by [n 1 , n 2 , . . . , n k ], where 0 < n i is the size of the i'th dimension. For example, the shape of the 5 × 6 matrix M is [5,6]. We sometimes refer to a tensor M by its name and shape M [5,6] or just by its name M when the context is clear. For a tensor R[n 1 , . . . , n k ], we use R(j 1 , j 2 , . . . , j k ) to refer to a specific element, where 0 ≤ j i < n i . We use uppercase letters for tensors.
We write matrix multiplication without a multiplication symbol, e.g., M 1 M 2 stands for the product of M 1 and M 2 . We denote the transpose operation of a matrix M by M T and we use tags (e.g., M ′ , M ′′ ) to denote different objects. The operations M 1 + M 2 and M 1 * M 2 refer to element-wise addition and multiplication, respectively. The broadcasting operation takes two tensors with compatible but different shapes and expands every one of them to their mutual expanded shape.

Tensor Basic Operations
Definition 2.2 (Broadcasting). For a tensor A[n 1 , . . . , n k ] and a tensor shape s = [m 1 , . . . , m k ] with n i ∈ {1, m i } for each i = 1, . . . , k, the operation C = broadcast(A, s) replicates the content of A along the r dimension m r times for every r = 1, . . . , k and n r = 1 < m r . The output tensor C is of shape s. Example 1. The tensors A[3, 4, 1] and B [1,4,5] have compatible shapes. Their mutual expanded shape is s = [3,4,5] and broadcast(A, s) has the same shape s as broadcast(B, s).
We perform element-wise operations such as addition (A + B) and multiplication (A * B) on two tensors with compatible shapes A, B by first using broadcasting to expand them to their mutual expanded shape and then performing the relevant element-wise operation.
Definition 2.3 (Summation). For a tensor A[n 1 , . . . , n k ], the operation B = sum(A, t) sums the elements of A along the t-th dimension and the resulting tensor B has shape [n 1 , . . . , n t −1 , 1, . . . , n k ] and for an RGB image). In addition, we compute the convolution for a batch of b images and for f filters. Informally, the convolution operator moves each filter in F as a sliding window over elements of I starting at position (0, 0) and using strides of δ w and δ h . When the filter fits entirely in the input, the dot product is computed between the elements of the filter and the corresponding elements of I .
be two input tensors for the convolution operator representing images and filters, respectively. The results of the opera- , δ h and δ w are the strides and (2)

Homomorphic Encryption
An HE scheme is an encryption scheme that allows us to evaluate any circuit, and any function, on encrypted data. A survey is available in [23]. Modern HE instantiations such as CKKS [13] rely on the hardness of the Ring-LWE problem, support SIMD operations, and operate over rings of polynomials. They include the following methods Gen, Enc, Dec, Add, Mul and Rot which we now briefly describe. The function Gen generates a secret key public key pair. The function Enc gets a message that is a vector M[s] and returns a ciphertext. Here, s denotes the number of plaintext vector elements (slot count). It is determined during the key generation. An approximation scheme, such as CKKS [13], is correct up to some small error term, i.e., |M(i) − Dec(Enc(m))(i)| ≤ ϵ, for some ϵ > 0 that is determined by the key.

Reducing Convolution to Matrix-matrix Multiplication
In HE settings it is sometimes useful to convert a convolution operation to a matrix-matrix multiplication by pre-processing the input before encrypting it. One such method is image-to-column [10], which works as follows for the case c = b = 1. Given an image where each row holds the content of a valid window location in I flattened to a row-vector, and F ′ [w F h F , f ] contains every filter of F flattened to a column-vector.
The advantage of this variant is that the output is fully flattened to a column vector, which is useful in situations where flattening is costly (e.g., in HE, where encrypted element permutations are required). The drawback of the im2col method is that it is impossible to perform two consecutive convolution operators without costly pre-processing in between. Moreover, this pre-processing increases the multiplicative-depth, deeming it impractical for networks such as AlexNet, where the multiplicative-depth was already near the underlying HE-library's limit. In comparison, our convolution methods do not require preprocessing between two consecutive calls.

Threat Model
Our threat model involves three entities: An AI model owner, a cloud server that performs model inference on HE encrypted data using the pre-computed AI model, and users who send confidential data to the cloud for model inference. Among other, we take into account the following three models: a) the users and the model owner belong to the same organization, where both have access to the private key and are allowed to see the encrypted data; b) the user is the private-key owner. The model owner can use the public key to encrypt its model before uploading it to the cloud. This scenario involves a non-collusion assumption between the user and the cloud; c) the model-owner is the private-key holder. The user uses the public key to encrypt its samples before uploading them to the cloud. The model owner can decrypt and distribute the inference or post-processed results to the users. This scenario involves a non-collusion assumption between the model-owner and the cloud. In all cases, the cloud should learn nothing about the underlying encrypted data. In Scenarios b and c the users should not learn the model excluding privacy attacks, where the users try to extract the model training data through the inference results. We assume that communications between all entities are encrypted using a secure network protocol such as TLS 1.3, i.e., a protocol that provides confidentiality, integrity, and allows the model owner and the users to authenticate the cloud server. The model owner can send the model to the server either as plaintext or encrypted. When the model is encrypted, the server should learn nothing about the model but its structure. In both cases, we assume that the cloud is semi-honest (or honest-but-curious), i.e., that it evaluates the functions provided by the data owner and users without any deviation. We stress that our packing methods modify the data arrangement before encrypting it and thus do not affect the semantic security properties of the underlying HE scheme. Finally, in our experiments we target an HE solution with 128-bit level security.

OUR TILE TENSOR FRAMEWORK
HE libraries such as HElib [24] and SEAL [1] provide simple APIs for their users (e.g., encrypt, decrypt, add, multiply, and rotate). Still, writing an efficient program that involves more than a few operations is not always straightforward. Providing users with the ability to develop complex and scalable HE-based programs is the motivation for developing higher-level solutions such as our library, NGraph [8], and CHET [17]. These solutions rely on the low-level HE libraries while offering additional dedicated optimizations, such as accelerating NNs inference on encrypted data. Figure 1 provides a simplified schematic view of the layers we use in our library. The first two layers include the low-level HE libraries and their underlying software and hardware math accelerators. Our library [39] involves the three upper layers. At the bottom of these layers is the HE abstraction layer, which makes our library agnostic to the underlying HE library. The next layer is the tile tensor framework layer. It contains the tile tensor data structure (Section 4) that simplifies computation involving tensors, and the packing optimizer (Section 6) that searches for the most efficient tile tensor packing configuration for a given computation. The AI layer is built on top of these. It implements AI related functionality, such as reading NNs from standard file formats, encrypting their weights, and computing inference.
In this paper we focus on the tile tensor framework layer, and how it contributes to the optimization of NN inference computations. The optimizations this layer offers focus on packing and related algorithms, and can thus be combined with other types of optimizations in other layers. Note that our optimizer (Section 6) is simulation based, and therefore can take into account optimizations in any layer below this layer. We refer the reader to [40] for APIs and code of the HE abstraction layer.

TILE TENSORS
The tile tensor framework layer uses our new data structure, which named tile tensor. A tile tensor is a data structure that packs tensors in fixed size chunks, as required for HE, and allows them to be manipulated similar to regular tensors. It has an accompanying notation for describing the packing details.
We briefly and informally describe both data structure and notation. The notation is extensively used in the rest of the paper, and it is summarized for quick reference in Table 1. For completeness, we provide formal definitions in Appendix B.

Tiling Basics
We start by describing a simple tiling process in which we take a tensor A[n 1 , n 2 , . . . , n k ], and break it up into equal-size blocks that we call tiles, each having the shape [t 1 , t 2 , . . . , t k ]. In our implementation, a tile is an HE ciphertext.
We construct a tensor E[e 1 , e 2 , . . . , e k ], which we call the external tensor, such that each element of E is a tile, and e i = n i t i . Thus, All other slots in E that were not mapped to any element of A will be set to 0. Figure 2 demonstrates three examples for tiling a matrix M [5,6]. In Figure 2c, for example, the shape of the external tensor is [3,2], and the tile shape is [2, 4].

The Tile Tensor Data Structure
A tile tensor is a data structure containing an external tensor as described above, and public meta data called tile tensor shape. The tile tensor shape defines the shape of the tiles, the shape of the original tensor we started with, and some additional details about the organization of data inside the external tensor, which we describe later.
We use a special notation to denote tile tensor shapes. For example, [ n 1 t 1 , n 2 t 2 , . . . , n k t k ] is a tile tensor shape specifying that we started with a tensor of shape [n 1 , . . . , n k ] and tiled it using tiles of shape [t 1 , . . . , t k ]. In this notation, if t i = 1, then it can be omitted. For example, [ 5 1 , 6 8 ] can be written [5, 6 8 ]. Figure 2 shows three examples along with their shapes.
A tile tensor can be created using a pack operation that receives a tensor A[n 1 , . . . , n k ] to be packed and the desired tile tensor shape: The pack operator computes the external tensor using the tiling process described above, and stores along-side it the tile tensor shape, to form the full tile tensor T A . We can retrieve A back using the unpack operation: A = unpack(T A ). As with regular tensors, we sometimes refer to a tile tensor T A together with its shape:

Replication
For some computations it is useful to have the tensor data replicated several times inside the tile slots. The tile tensor shape indicates this by using the * t i notation. It implies that n i = 1, but each element of the original tensor is replicated t i times along the i'th dimension. When packing a tensor A[n 1 , . . . , n k ] and n i = 1, and (a) T M [5, 6 8 ]  [1,8], [8,1], and [2, 4], respectively. For these, the external tensor shape is [5,1], [1,6], and [3, 2], respectively. The value of M [5,6] in the ith row and jth column is the value ij. with a tile tensor shape specifying * t i , then the packing operation performs broadcast(A, [n 1 , . . . , t i , . . . , n k ]) and tiles the result. The unpacking process shrinks the tensor back to its original size. The replications can either be ignored, or an average of them can be taken; this is useful if the data is stored in a noisy storage medium, as in approximate HE schemes. Figure 3 demonstrates packing V [5, 1] with different tile tensor shapes. Note that we only replicate inside a tile. When broadcasting requires replicating whole tiles it is possible to hold only one copy of it and use it multiple times.

Unknown Values
When tensors are packed into tile tensors, unused slots are filled with zeroes, as shown in Figure 2. Section 4.5 shows how to apply operators on tile tensors, and then unused slots might get filled with arbitrary values. Although these unused slots are ignored when the tile tensor is unpacked, the presence of arbitrary values in them can still impact the validity or performance of applying additional operators. To reflect this state, the tile tensor shape contains an additional flag per dimension, denoted by the symbol "?", indicating the presence of unknown values. Figure 3d shows a tile tensor with the shape [ 5 2 , 1? 4 ]. The "?" in the second dimension indicates that whenever we exceed the valid range of the packed tensor along this dimension, we may encounter arbitrary unknown values. However, it still holds that V = unpack(T V ), as these unused slots are ignored.

Operators
Tile tensor operators are homomorphic operations between tile tensors and the packed tensors they contain. For two tile tensors T A and T B , and a binary operator ⊙, it holds that unpack( . Unary operators are similarly defined. Remark 2. The above operations involve two levels of homomorphism. One homomorphism is between direct operations on ordinary tensors A, B and operating on the tile tensors T A , T B that contain them. The tile tensor operators are implemented by performing operations on encrypted tiles. Due to the properties of homomorphic encryption (see Section 2.3), these are homomorphic to ordinary operations on the plaintext content of the tiles.
Binary elementwise operators are implemented by applying the operator on the external tensors tile-wise, and the tile tensor shape is updated to reflect the shape of the result. If the inputs have identical shapes then so do the results, e.g.,T ′′ M [ 5 2 , 6 4 ] of Figure 2c can be multiplied with an identically packed matrix T N [ 5 2 , 6 4 ], resulting in T R [ 5 2 , 6 4 ], where R = M * N . As with regular tensors, the tile tensor shapes need not be identical, but compatible. Compatible tile tensor shapes have the same number of dimensions, and for each dimension specification they are either identical, or one is * t i and the other is n i t i . The intuition is that if the tensor is already broadcast inside the tile, it can be further broadcast to match any size by replicating the tile itself. For example, for 6 4 ]. We can also compute T ′′ i.e., with unknown values in unused slots along the second dimension. This occurs because in T ′ V this dimension is filled with replicated values, and after the addition they fill the unused slots of the result. Computing Figure 3b) is illegal because their shapes are not compatible. For the full set of rules for elementwise operators see Appendix B.
The sum operator is also defined homomorphically: unpack(sum(T A , i)) = sum(unpack(T A ), i). It works by summing over the external tensor along the i'th dimension, then by summing inside each tile along the i'th dimension. In an HE environment, the latter summation requires using the rotate-and-sum algorithm [3]. Generally, the sum operator reduces the i'th dimension and the resulting tile tensor shape changes to 1? t i . However, there are some useful special cases. If t i = 1, then it is reduced to 1 1 or simply 1. When i is the smallest i such that t i > 1, the dimension reduces to * t i , i.e., the sum results are replicated. This is due to properties of the rotate-and-sum algorithms. It is a useful property, since this replication is sometimes needed for compatibility with another tile tensor. For example, let T A be a tile tensor with the shape [4, 3 8 , 5 16 ]. Then, sum( . Three other operators used in this paper do not change the packed tensor, just the external tensor and tile tensor shape. The clear (T A ) operator clears unknown values by multiplying with a mask containing ones for all used slots, i.e., it removes the "?" from the tile tensor shape. For example, Figure 3). The rep(T A , i) operator assumes the i'th dimension is 1 t i , and replicates it to * t i , using a rotate-and-sum algorithm. The f latten(T A , i, j) operator flattens dimensions i through j assuming they are all replicated. This is done trivially by just changing the meta data, e.g., f latten(

Higher Level Operators
Using elementwise operators and summation, we can perform various algebraic operations on tile tensors.
The above formula works for any value of a, b, t 1 , t 2 . This is because the tile tensor shapes of T M and T V are compatible, and therefore, due to the homomorphism, this computes R[a, 1] = sum(M[a, b] * V [1, b], 2), which produces the correct result as explained in Section 2.2.
A second option is to initially transpose both M and V and pack them in tile tensors . Now we can multiply them as: This computes the correct result using the same reasoning as before. The benefit here is that the result T R [ * t 1 , a t 2 ] is replicated along the first dimension due to the properties of the sum operator (Section 4.5). Thus, it is ready to play the role ofT V in Formula 3, and we can perform two matrix-vector multiplications consecutively without any processing in between. The output of Formula 3 can be processed to fit as input for Formula 4 using rep(clean(T R ), 2).
Matrix-matrix multiplication. The above reasoning easily extends to matrix-matrix multiplication as follows. Given matrices M 1 [a, b] and M 2 [b, c], we can compute their product using either of the next two formulas, where in the second one we transpose M 1 prior to packing. As before, the result of the second fits as input to the first.

Interleaved Tiling
Another option for tiling is denoted by the symbol "∼" in the tile tensor shape. This symbol indicates that the tiles do not cover a contiguous block of the tensor, but are spread out in equal strides. For each dimension, we can specify separately whether it is interleaved or not. For example, in [ 5 2 , 6∼ 4 ] only the second dimension is interleaved. Also, although with basic tiling it holds that e i = n i t i , for interleaved tiling it is sometimes useful to have larger values for e i . In this case, this value can be explicitly stated using the notation: Interleaved dimensions fit seamlessly with all the operators described above, and are useful for computing convolutions. See Section 5 for more details.

CONVOLUTION USING TILE TENSORS
In this section we describe our novel method to compute convolution. Compared to previous work (e.g. [29,44]) our method has two advantages: (i) it is more efficient when the input is a large image and (ii) it allows efficient computation of consecutive convolution layers in an HE only system.
We first consider in Section 5.1 the convolution problem in its simplest form: a single, one-channel image, and a single filter We extend it to convolution with strides and multiple channels, filters, and batching in Section 5.2.

Convolution with Interleaved Dimensions
We now show how interleaved dimensions (see Section 4.7) can be used to efficiently compute convolution. Figures 4a and 4b show a matrix M [6,6] packed in the tile tensor T M [ 6∼ 2 , 6∼ 4 ]. Here, the tile shape is [2, 4] and the external tensor shape is [3,2]. Every tile contains a 2 × 4 sub-matrix, but instead of being contiguous, it is a set of elements spaced evenly in the matrix. We use the same color to denote elements that are mapped to the same slot in different tiles. For example, the elements 00, 01, 10, 11, 20, 21 are all colored red as they are mapped to slot (0, 0) in 6 different tiles. This coloring divides the original matrix into contiguous regions.
The interleaved packing allows for a more efficient implementation of Equation 2 with respect to runtime and storage. Intuitively, we use the SIMD to compute multiple elements of the output in a single operation. The filter is packed simply as T F [w F , h F , * t 1 , * t 2 ]. I.e., it has w F h F tiles, each containing one value of the filter in all slots. This allows multiplying each image tile with each value of the filter.
For example, Figure 4c shows a computation of the convolution output when the filter is placed at the top left position. The SIMD nature of the computation computes the output in other regions as well. The result is a single tile, where each slot contains the convolution result of the corresponding region, such that this tile is packed in the same interleaved packing scheme as the input tiles.
A more complicated example is given in Figure 4d. Here the filter is placed one pixel to the right. As a result, the filter needs to be multiplied by elements that appear in different regions, i.e. they are mapped to slots of different indices. In this case we need to rotate the tiles appropriately. For example, placing the filter with its upper left corner on pixel (0, 1), the convolution is computed using the (0, 0) slot of tiles (0, 1) and (1, 1) and slot (0, 1) of tiles (0, 0) and (1, 0). The latter two are therefore rotated to move the required value to slot (0, 0) as well.
The total cost of convolution when using this packing is summarized in the following lemma.  The convolution is computed similarly to the description in Section 5.1, multiplying tiles of T I with the appropriate tiles of T F . The result is a tile tensor of shape , where the log t 3 ) factor comes from the sum-and-rotate algorithm [3].

A Sequence of Convolutions
In this section we discuss how to implement a sequence of multiple convolution layers. This is something that is common in neural networks. One of the advantages of our tile tensor method is that the output of one convolution layer can be easily adjusted to be the input of the next convolution layer.
Assume we are given an input batch tensor I [w I , h I , c, b] and a sequence of convolution layers with the l'th layer having a filter tensor For the first layer we have c 1 = c, and for l > 1 we have c l = f l −1 . As before, we pack the input tensor as . For odd layers, l = 2ℓ + 1, we pack the filter To make an output of an odd layer suitable for the next even layer, we clear the unknowns by multiplying with a mask and then replicate the channel dimension. We then get a tile tensor of this , which matches the input format of the next layer since f l = c l +1 . To make an output of an even layer suitable for the next odd layer, we similarly clean and replicate along the filter dimension.
We note that changing the order of the dimensions leads to a small improvement. The improvement comes because summing over the first dimension ends up with a replication over this dimension. Therefore, setting the channel dimension as the first dimension saves us the replication step when preparing the input to an even layer. We can skip cleaning as well, since the unknown values along the image width and height dimensions do no harm. Alternatively, the filter dimension can be set as first and then the replication step can be skipped when preparing the input for an odd layer.

Naïve Convolution Methods
The above method reduces to a simple method known by various names such as SIMD packing when t 1 = t 2 = t 3 = t 5 = 1. In this case, every element in the tensors for the images and filters is stored in a separate ciphertext, and the slots are only used for batching. In this paper, we further use the reduction to matrix multiplication as described in Section 2.4. It is applicable only for NNs with one convolutional layer.

THE OPTIMIZER
The packing optimizer is responsible for finding the most efficient packing arrangement for a given computation, as well as the optimal configuration of the underlying HE library. This relieves the users from the need to handle these HE related complexities. The users only need to supply the model architecture e.g., a NN architecture, in some standard file format. The optimizer automatically converts it to an HE computation with optimal packing and optimal HE library configuration. Users can further supply constraints such as the required security level or maximal memory usage, and choose an optimization target, whether to optimize for CPU time, latency or throughput, or optimize for memory usage. Our optimizer has many similarities to the one of CHET. However, it has one principal difference, it operates over and understands our concept of tile tensors. This allows it, for example, to explore different intermediate batching levels and to suggest different tradeoffs between latency and amortized latency, which was not discussed or analyzed in CHET.
The optimizer needs to choose among different possible configurations of the HE library, as well as different packing techniques to support certain operators (see Section 5). It also chooses the tile shape, i.e., the values of t 1 , t 2 , . . ., in the tile tensor shapes. For example, consider an HE scheme configured to have 16,384 slots in each ciphertext. Since our convolution operator uses fivedimensional tiles, the number of possible tuples t 1 , . . . , t 5 such that i t i = 16,384 is log 2 (16,384)+5−1 5−1 = 3,060. We use the term configuration to refer to a complete set of options the optimizer can choose (HE configuration parameters, tile shape, and other packing options).  Step 4, see also Alg 2 in Appendix C), including the packing details and HE configuration details applicable for this architecture. The simulator unit tests every such configuration and outputs the following data (res at Step 6) for each: the computation time of the different stages including encrypting the model and input samples, running inference, and decrypting the results; the throughput; the memory usage of the encrypted model; input; and output; and more. The optimizer passes this data to the cost evaluator for evaluation (Steps 7-16). Finally, it returns the configuration option that yields the optimal cost to the user (among the tested configurations), together with the simulation output profile (Step 17).
Configuration generator. The configuration generator unit receives the model architecture, and generates all applicable configurations for it. For example, if the model has a single convolutional layer it will generate three basic configurations with three possible convolution implementations: the im2col based method, and the two options of our novel method (see Section 5). If the model has multiple convolutional layers, the im2col based method will not be applicable. From each of these three basic configurations the generator will create multiple complete configurations by exploring all possible tile shapes. The generator explores possible tile shape using one of two strategies. The first involves brute forcing over all valid options for tile shapes (Alg. 1). Since these may be numerous, a second strategy searches using a "steepest ascent hill climbing" local search algorithm, which requires replacing Step 4 in Alg. 1 with an adaptive GenerateConf function call.
The local search starts with a balanced tile shape, where the number of slots in every dimension is of the same order. This is a heuristic designed to avoid evaluating tile shapes that are likely to be computationally costly at the beginning of the search. We then iteratively evaluate all the neighbor tile shapes of the current shape and continue to the best-improving neighbor as long as one exists. We consider two tile shapes as neighbors if we can obtain one shape from the other by multiplying or dividing the size of some of its dimensions by two. It is possible to use other small multipliers and dividers, but these may involve unnecessary complex computations. We consider one shape as better than another shape based on the costs received from the cost evaluator. Using the local search algorithm highly speeds up the search process and we found empirically that it often results in a global optimum. This was the case in our AlexNet, SqueezeNet-CIFAR and CryptoNets benchmarks.
Simulator. The simulator receives as inputs the model architecture and a configuration option. At this stage, we can evaluate the configuration by running it on encrypted input under HE. To reduce computational costs, the simulator uses pre-calculated benchmark values such as the CPU time of every HE operation and the memory consumption of a tile (i.e., the memory consumption of a single ciphertext). Then, it evaluates the model on mockup tile tensor objects using these benchmarks. These mockup tile tensors contain only meta data and gather performance statistics. Using this approach, the simulator can simulate an inference operation several order-of-magnitudes faster than when running the complete model on encrypted data. Appendix D reports the simulator accuracy on AlexNet. We stress that performing an analytical evaluation instead of an empirical one over a large NN requires considering many parameters, which is eventually equivalent to running our simulator using the mockup tile tensors. To allow scientific comparisons of different packings, the simulator also outputs the total number of rotations and multiplications per evaluated circuit and per layer.
Cost evaluator. The cost evaluation unit evaluates the simulator output data considering the constraints and optimization targets provided by the user. After testing all possible configurations, the highest scoring configuration(s) is sent back as output to the user.
Evaluating the optimizer performance. To demonstrate the advantage of using both the local search algorithm and the simulator, we performed experiments using AlexNet (see Section 7.2 and Appendix A). Here, we fixed the number of slots to 16,384, the minimal feasible size for a NN that deep, and set the batch size to 1. The number of configuration options was 1360, with 680 different tile shapes for each convolution packing method. An exhaustive search that uses simulations took 5.1 minutes. In contrast, the local search algorithm took only 6.4 seconds and returned the same result. It did so after evaluating only 40 tile shapes.
Running the local search method on actual encrypted data took 9.95 hours. Using the simulator time estimations, we predict that exhaustive search on encrypted data would take ∼ 167 days (assuming unlimited memory). This demonstrates the importance of the mockup-based simulator.

EXPERIMENTAL RESULTS
Our experiments involve the model weights of a small NN (Cryp-toNets [21]) and a large NN (AlexNet [31]) that we trained on the MNIST [33], and COVIDx CT-2A [22] data sets, respectively. We report the results of performing model inference using these weights in encrypted and unencrypted forms. We use AlexNet to demonstrate the power of our framework and CryptoNets to demonstrate the effect of different packing on the computation performance and memory. Technical details of the environment we used for the experiments are described in Appendix A.

CryptoNets
The CryptoNets [21] architecture and the HE parameters that we use in our experiments are described in Appendix A.1. Generally speaking, this network has a convolutional layer followed by two fully connected layers. Table 2 reports the latency and memory usage for performing a model inference with different tile shapes when t 3 = b = 1. For brevity, we only consider t 1 to be at the extreme points (e.g., t 1 = 1, 8,192) or t 1 value that led to best performing solution, and some additional samples. The best latency and memory usage are achieved for t 1 = 32, which allows packing the tensors I, F,W 1 using the minimal number of tiles. Table 3 reports the latency, amortized latency, and memory usage for performing a model inference with different t 3 = b values. For every such value, we only report the t 1 , t 2 values that led to the optimal solutions. Unlike the case where b = 1, here every choice of t 3 leads to a different trade-off between the performance measures. For example, when increasing t 3 , the latency and memory consumption increase, but the per-sample amortized latency decreases. The encryption and decryption time also increase with t 3 , except for the case t 3 = 8,192, where we use the naïve SIMD convolution operator.

AlexNet Benchmark
For this benchmark, we used a variant of AlexNet network [31] that includes 5 convolution layers, 3 fully connected layers, 7 ReLU activations, 3 BatchNormalization layers, and 3 MaxPooling layers. Following [5,6], we created a CKKS-compliant variant of AlexNet by replacing the ReLU and MaxPooling components with a quadratic polynomial activation and AveragePooling correspondingly along with some additional changes. We trained and tested it on the COVIDx CT-2A dataset, an open access benchmark of CT images   [22]. We resized the images to 224 × 224 × 3 to fit the input size expected by AlexNet. See more details in Appendix A.1. For the convolutional layers, we used the packing methods described in Section 5.3. The biases were packed in 5-dimensional tile tensors with compatible shapes, allowing us to add them to the convolution outputs. The fully connected layers were handled using the matrix-matrix multiplication technique of Section 4.6. The input to these layers arrives from the convolutional layers as a 5-dimensional tile tensor, [ * through 4, then flattened using the flatten operator to , from which we can continue normally. We measured the accuracy of running vanilla AlexNet [31] and the HE-friendly AlexNet [6] using PyTorch 1 over a plaintext testset. The results were 0.861 and 0.806, respectively. We did not observe additional accuracy degradation when running the HEfriendly AlexNet using our framework over encrypted data. We emphasize that the above accuracy-drop results from using HEfriendly NN and not from using our framework. We expect that future AI improvements will close this gap by offering improved HE-friendly NNs. Table 4 reports the time and memory consumption for the latter experiment using 4 configurations on a set of 10b representative samples. The configurations involve unencrypted model weights (PT ) and encrypted model weights (CT ) optimized for low latency (Latency) or high throughput (TP). For these configurations, we also compared the inference results with the inference results of running HE-Friendly AlexNet on PyTorch over the plaintext test-set by calculating the Root Mean Square Error (RMSE). These were always less than 4e−3. ]. Similar to [25], we assume the naïve technique uses the naïve O(d 3 ) algorithm over ciphertexts that hold one value each. Some previous works bound the dimension d by some function of s.

COMPARISON WITH STATE-OF-THE-ART 8.1 Matrix Multiplication
In [25], the authors assume d < s, and in [28], the authors assume that d < √ s. The latter bound was also used in [12] for a version addressing a malicious adversary, and in [11] for a version addressing multi-key encryption. In both cases, s can be made arbitrarily large in two ways. We explain and compare these works to our technique, which does not place any bound on d.
Increasing the scheme's parameters. By increasing the parameters of the scheme, s can be made arbitrarily large. However, this increases the relative overhead of every HE operation, i.e., increasing s by a factor α increases the time of every HE operation by a factor α ′ > α. Moreover, in practice, there is a maximal setting that every HE library supports. Our technique is superior because it enables us to use the smallest s allowed by the security requirement for the multiplication.
Arbitrarily simulating many slots.  ), which is similar to that of [28].But again, our technique has a lower multiplication depth because it does not involve masking.
When considering complete NN, the authors of [28] argue that it is possible to convert fully-connected and convolutional layers to 1 PyTorch library 1.5.1 https://pytorch.org matrix-matrix multiplications. In that sense, our method, as shown above, are better.

Convolution
The state-of-the-art in implementing convolutional layers includes [29,30,44]. We compare these to our implementation by distinguishing between two cases: simple single-input-single-output (SISO) and multiple-input-multiple-output (MIMO).

SISO.
The simple SISO case involves one image, one input channel, and one filter, b = c = f = 1. Here, all previous approaches pack the image into a single ciphertext and compute the convolution using w F h F − 1 rotations (excluding the 0 rotation) and w F h F multiplications. However, the assumption that the image fits in a single ciphertext limits the maximal allowed image size. As mentioned above, all HE libraries limit the size of their ciphertexts. For example, SEAL [1] limits the ciphertext slots to 16,384, which is not large enough to contain 224 × 224 images such as those required for AlexNet. While it is possible to simulate a large ciphertext (see Section 8.1), this increases the multiplication depth by at least 1 per convolutional layer. For deep networks such as AlexNet, which have 5 convolutional layers, the extra cost of 5 multiplication levels makes the computation impossible when using libraries such as SEAL, where the maximal depth is bounded and bootstrapping is not supported. Our approach captures the method above as a special case where w I ≤ t 1 and h I ≤ t 2 , i.e., the image fits inside a single tile. Our generalization offers two advantages. First, it allows the network to efficiently handle images larger than the maximal ciphertext size; this was not described in previous works. As mentioned above, naïvely simulating larger ciphertexts using n smaller ones requires O(n) rotations and increases the multiplication depth. In contrast, in our method, if we divide the image into t square tiles, it multiplies the number of rotations by O( rotations. Second, our method makes it beneficial to split even medium size images into smaller tiles. This improves performance on three accounts: 1) smaller ciphertexts are more efficient per-slot, i.e., if an operation on a ciphertext of s slots costs t time, then the same operation on a ciphertext of 2s slots costs more than 2t time; 2) as mentioned above, the number of rotations increases sub-linearly; 3) many small operations are easier to parallelize than a few large operations. Table 6: Comparing time and number of operations for a SISO (b = c = f = 1) convolution benchmark of image of size 128 × 128, filter size 3 × 3, and different tile sizes. Each row has a different setting for t 1 , t 2 , and t 3 = t 4 = t 5 = 1 (see Section 5). The first row captures as a special case existing SISO methods [29,30,44]. Subsequent rows demonstrate the performance advantage of our generalized method.  Table 6 demonstrates this advantage. It shows a SISO convolution computed with different ciphertext sizes, starting with 16,384 slots shaped as a 128 × 128 tile. The computations go down to the smallest ciphertext of 2,048 slots, which still supports secure computation. The table shows the speedup (more than 6×), the sub-linear growth in the number of rotations, and the linear growth in the number of multiplications. If more computation depth is required, the ciphertexts cannot be made smaller due to security constraints. In this case, we can reduce t 1 and t 2 , while increasing t 5 , the batch dimension. Table 6 will stay the same, except the rows will support batch sizes of 1, 2, 4, 8 respectively, with amortized time per sample of 0.119, 0.068, 0.061, 0.016 seconds.

MIMO.
In MIMO, the number of channels c and filters f is larger than 1.
As a special case, the tile tensor and its accompanying notation captures the packing options described in [29,30,44], which is . Previous approaches restricted w I ≤ t 1 , h I ≤ t 2 , but c and t 3 can be set arbitrarily. If c ≤ t 3 then all channels are packed into a single ciphertext. If t 3 = 1, then each channel is in a separate ciphertext.
In HEAR [30] an additional packing optimization is described. They note that after a mean pooling layer with strides, the output comes out with strides δ , i.e., only the pixels at x and y coordinates such that x mod δ = 0 and y mod δ = 0 are populated with actual pixels. They use the vacant pixels to pack more channels. Using tile tensor shapes this is equivalent to w I w I , δ δ , h I h I , δ δ , c t 3 , where w I , h I are the dimensions of the output image, and the vacant slots are represented by two new dimensions, such that w I δ = t 1 and h I δ = t 2 . We fill them with channel data, having a total capacity of δ 2 c channels. Packing those channels together and then summing over them requires processing time. We note that the interleaved packing method described in Section 5 reduces the need for such a packing optimization. As with interleaved packing, if δ divides the number of tiles along a dimension, the stride can be implemented by simply discarding tiles; this saves computation time and does not create vacant pixels.
Previous approaches described numerous techniques for exploiting hoisted rotations, and for optimizing the number of rotations in MIMO convolutions. Since all existing packing methods are a special case of tile tensor capabilities, these methods are applicable within our framework. Integrating them within our optimizer is reserved for future work. Table 7 compares our framework, experiments, and results with prior art. The first column (NI) indicates whether the framework supports non-interactive secure computations. This is a major advantage that HeLayers provides over GAZELLE [29], nGraph-HE [8], SeaLion [42], etc. In general, using the client to assist the server reduces the latency, which in turn enables the evaluation of deep networks such as ResNet and VGG. However, it misses the paradigm of delegating the entire computations to the cloud.

Neural Network Inference
The second column (PL) indicates whether the reported experiments showed practical inference latency of less than two hours. We use this Boolean comparison metric instead of comparing exact latencies because different frameworks reported their measurements on different network architectures, over different datasets with images of different sizes when using different platform configurations such as number of CPUs and memory size. Furthermore, some of the schemes e.g., CHET [17] and SeaLION [42] are not freely available online nor do they provide open source code that we can evaluate, where implementing their code from scratch may result in a biased implementation. Using the PL metric allows us to distinguish HeLayers from some works, while we use other comparison metrics to show HeLayers' superiority over the other solutions.
The following two columns indicate whether the papers reported on a mechanism or a method to evaluate and select the packing choice that mostly fits the given criteria. Specifically, it indicates whether they support tradeoffs between latency versus memory (LM) and amortized-latency versus latency (ALL). While most schemes considered only one type of packing, as far as we know only CHET suggested using different packing and selecting among them according to some criterion such as latency, energy, or memory. However, they did not present any amortized latency tradeoffs in their paper and they did not discuss how this selection mechanism works.
The other comparison metrics we considered include whether the model is encrypted before the inference evaluation, whether the security strength was at least 128 bits, and whether the image size was at least 224x224x3. All of these parameters highly affect the performance of the experimental evaluation. It can be observed from the table that HeLayers is the only framework that provides a non-interactive solution over large images and an encrypted model that considered different trade-offs and reported practical results that measure less than several minutes.
When considering the different packing proposals. TenSEAL [7] uses diagonalization techniques for matrix-multiplication and im2col for convolution, assuming a single, image as input. nGraph-HE2 [8] uses SIMD packing, which is a special case of our framework when optimized for the largest possible batch size. The closest  Scheme Name NI PL LM ALL EM SB Evaluated Networks Image size GAZELLE [29] 128 CryptoNets 32 × 32 × 3 Jiang et al. [28] 128 CryptoNets 28 × 28 × 1 nGraph-HE2 [8] 128 128 DNN-30, DNN-100, CNN-16 28 × 28 × 1 TenSEAL [7] 128 CryptoNets 28 × 28 × 1 DOReN [38] 128 VGG7, ResNet20, Lola, GHE, SHE 32 × 32 × 3 REDsec [19] 80 BAlexNet * * 224 × 224 × 3 Lee et al. [34] 111.6 ResNet-20 32 × 32 × 3 CHET [17] ⊥ 128 LeNet-5, CIFAR-Squeezenet, Industrial 32 × 32 × 3 HEMET [37] 128 AlexNet, CIFAR-Squeezenet, InceptionNet 32 × 32 × 3 HeLayers (ours) 128 AlexNet, CIFAR-Squeezenet, CryptoNets 224 × 224 × 3 * The reported latency is 63 hours for a batch of 2048 images (1.84 minutes of amortized latency). * * BinaryAlexNet with binary weights and accuracy drop to 61.5% ⊥ CHET supports batching but the paper does not provide a description, analysis, or measurements. work to ours is CHET [17], which uses a similar approach of an abstract data structure, CipherTensor, combined with automatic optimizations. We believe CipherTensors are less flexible than tile tensors. They include a small fixed set of implemented layouts, each with its kernel of algorithms, whereas tile tensors offer a wider variety of options with a single set of generalized algorithms. Further, it was not demonstrated that CipherTensors offer an easy method to trade latency for throughput and control memory consumption, as is possible in tile tensors by controlling the batch dimension. Finally, CipherTensors require replication of the input data using rotations, whereas some of these replications can be avoided using tile tensors. It might be the case that by this time CHET has been enhanced to include other optimizations, but we only compare our solution to data that is publicly available, i.e., to [17]. CHET is the closest framework to ours, and even though it does not provide a freely available solution or an open source one, we attempted below to provide as fair a comparison as possible to the latency reported in [17]. The authors of CHET demonstrated its efficiency for large networks by using SqueezeNet-CIFAR, which is a reduced [15] SqueezeNet [26] model over CIFAR-10. It has at least 50× fewer parameters than AlexNet but it is a bit deeper (23 layers instead of 17). see Appendix A.1. We evaluated SqueezeNet-CIFAR on HeLayers and report the results in Table 8. For a fair comparison with CHET, we reduced the number of cores to 16 while disabling hyper-threading as in CHET. In addition, the SqueezeNet-CIFAR repository [15] contains two models, we used the larger model among the two (the smaller model has 1.2× less latency than the larger model). When optimizing for latency, our framework is 1.24× faster than CHET. In addition, we also provide the results when optimizing for throughput and for running with encrypted models, which show even better amortized results (an 11.5× speedup over CHET).

CONCLUSIONS
We presented a framework that acts as middleware between HE schemes and the high-level tensor manipulation required in AI.
Central to this framework is the concept of the tile tensor, which can pack tensors in a multitude of ways. The operators it supports allow users to feel like they are handling ordinary tensors directly. Moreover, the operators are implemented with generic algorithms that can work with any packing arrangement chosen internally.
The optimizer complements this versatile data structure by finding the best configuration for it given the user requirements and preferences. We demonstrated how this approach can be used to improve latency for small networks, adapt to various batch sizes, and scale up to much larger networks such as AlexNet. Our tile tensor shape notation proved very useful for both research and development. Having the notation used in debug prints and error messages, configured manually in unit tests, and printed out in the optimizer log files, helped reduce development cycles considerably. Also, in this paper we used it to concisely and accurately describe complicated computations (e.g., Figure 6). We hope the community will adopt it as a standard language for describing packing schemes.
Our framework is the first to report successful and practical inference over a large (in terms of HE) NN such as AlexNet over large images. Today, we are in the process of extending our library to support bootstrap capabilities. Combining this and our framework should enable us to run even larger networks such as VGG-16 and GoogleNet, which until now were only reported for client-aided designs or for non-practical demonstrations. Table 9: Network architecture used in our experiments. C, A, P, D, and F denote convolution, activation, pooling, dense, and fire module layers, respectively. A fire module has the form C-A-C2-A, where C2 represents two concated convolutions.

B TILE TENSORS DEFINITION
Definition B.1 (External tensor). A k-dimensional external tensor E is a k-dimensional tensor that each of its elements is itself a k-dimensional tensor, all having an identical shape. These internal tensors are referred to as tiles, their shape as the tile shape, and the shape of the external tensor as the external shape. A slot in E is identified by E(a 1 , . . . , a k )(b 1 , . . . , b k ) where a i are the external indices of a tile, and b i are the internal indices inside the tile.

Definition B.3 (External tensor logical indices).
Given a tile tensor shape S, and an external tensor E, and a specific slot in E specified by external indices (a 1 , . . . , a k ), and internal indices (b 1 , . . . , b k ), then this slot is associated with the logical indices (c 1 , . . . , c k ) with respect to S, computed as follows: For i = 1, . . . , k, if the interleaved indicator l i is true, then c i = b i e i +a i else c i = a i t i +b i . Definition B.4 (Validity relation, Packed tensor). A tile tensor shape S is valid for an external tensor E if their external shapes and tile shapes match, and there exists a tensor T [n 1 , . . . , n k ] such that for T 1 = broadcast(T , [n 1 r 1 , . . . , n 2 r 2 ]) it holds that E(a 1 , . . . , a k )(b 1 , . . . , b k ) = T 1 (c 1 , . . . , c k ) for all slots with internal, external, and logical indices a i , b i , c i , such that ∀ i c i ≤ n i r i . For all other slots of E, if ∀ i ((c i ≥ r i n i ) → ¬u i )) then these slots are set to zero. T is the packed tensor.
Definition B.5 (Tile tensor). Tile tensor is a pair (E, S) where E is an external tensor and S a tile tensor shape that is valid for it.
Definition B.6 (Unpack operator). Given a tile tensor T A = (E, S) the operator unpack(E) results with the packed tensor of T A .
Definition B.7 (Pack operator). Given a tensor A and a tile tensor shape S whose original shape matches the shape of A, then the pack operator pack(A, S) results with a tile tensor T A = (E, S) such that A is the packed tensor of T A .

B.1 Tile Tensor Shape Notation
A tile tensor shape can be specified with a special notation involving a list of symbols. Each element in the list specifies the details of one dimension. n i t i specifies the original and tile shape along this dimension, and r i = 1, e i = n i t i , l i = u i = f alse. * r i t i further specifies the replication count and n i = 1, and * t i specifies n i = 1, r i = t i . n i ∼ t i specifies l i = true, and n i ∼e i t i specifies a value for e i other than the default mentioned above. For any of the above mentioned options a "?" symbol above the line indicates u i = true.

B.2 Operators
Definition B.8 (Tile tensor shape compatibility). Tile tensor shapes S and S ′ are compatible for all i, Definition B.9 (Tile tensor addition). Let T = (E, S) and T ′ = (E ′ , S ′ ) be tile tensors with compatible shapes, then T + T ′ = T ′′ = (E ′′ , S ′′ ) such that Definition B.10 (Tile tensor elementwise multiplication). Let T = (E, S) and T ′ = (E ′ , S ′ ) be tile tensors with compatible shapes, then T * T ′ = T ′′ = (E ′′ , S ′′ ) such that Definition B.11 (Tile tensor summation). Let T = (E, S) be a tile tensor such that for a given index i it holds that u i = f alse, r i = 1. Then T ′ = sum(T , i) is a tile tensor T ′ = (E ′ , S ′ ) computed as follows. Let E 1 = sum(E, i). E ′ is computed from E 1 by summing over the dimension i of every tile L of E 1 using the rotate-and-sum algorithms [3]. S ′ is identical to S except n ′ i = 1, and if ∀ j <i t j = 1 and t i is a power of 2, then r ′ i = t i , else u ′ i = (t i > 1). Remark 3. The output tile tensor shape of tile tensor summation is due to the behaviour of rotate-and-sum algorithms as explained in [3]. In environments where summing inside a tile can be performed differently, the shape might be different. Specifically, Some HE systems support native multi-dimensional structure to the ciphertexts, allowing rotating a tile along one of its dimensions directly. This allows having replicated output for any dimension. Definition B.12 (Tile tensor replication). Let T = (E, S) be a tile tensor and i be an index such that n i = 1, r i = 1, and ∀ j u j = f alse. Then T ′ = rep(T , i) is a tile tensor T ′ = (E ′ , S ′ ) computed as follows. E ′ is computed from E by applying replication along dimension i for on every tile L of E using the rotate-and-sum algorithms of [3]. S ′ is identical to S except r ′ i = t i .

B.3 Tile Tensor Glossary
Below is a short summary of tile tensor terminology.
• Tile A tensor of some fixed shape, stored flattened inside a vector and operated on in SIMD fashion.
• Tile tensor A data structure containing an external tensor as data and a tile tensor shape as meta data. • External tensor A tensor with tiles as elements.
• Tile shape The shape of tiles in the external tensor. • Tile tensor shape Meta data specifying the original shape, tile shape, and additional packing details. • Packed tensor The tensor that will be the result of unpacking a tile tensor. • Original shape The shape of the packed tensor.

C OPTIMIZER ALGORITHMS Algorithm 1 The optimizer operation
Input: m -a JSON description of a machine learning model.
Req -a key-value dictionary of user requirements.
O -an ordered list of optimization targets e.g., 'latency' and 'memory'.

D OPTIMIZER ACCURACY
The optimizer's simulator (Section 6) estimates the time and memory usage for a given configuration option on a single CPU thread.
For that, it relies on pre-benchmarked measures of the different HE operations. To assess the accuracy of these estimations, we performed the following experiment on HE-friendly AlexNet using encrypted model. We chose the four configuration options that achieved the lowest estimated latency when using local search (Section 6) and compared the inference time and the encryption time of the input and the model between the simulation output and an actual run over encrypted data. Table 10 summarizes the results. Our empirical tests show that the simulator provides relatively accurate time estimations for all four configurations. The average estimated time deviation is -15.8%, -11.9%, and -7.2% for inference, model encryption, and batch input encryption, respectively. We note that the simulated storage matches the measured storage for all configurations, thus we do not include this data in Table 10.