Hardware/Software Co-Design of an Accelerator for FV Homomorphic Encryption Scheme Using Karatsuba Algorithm

Somewhat Homomorphic Encryption (SHE) schemes allow operations to be carried out on data in the cipher domain. In a cloud computing scenario, personal information can be processed secretly, ensuring a high level of confidentiality. For many years, practical parameters of SHE schemes were overestimated, so that only the FFT algorithm was considered to accelerate SHE in hardware. Nevertheless, recent work demonstrates that parameters can be lowered without compromising security [1]. Following this trend, this work investigates the benefits of using the Karatsuba algorithm instead of FFT for the Fan-Vercauteren (FV) Homomorphic Encryption scheme. The proposed accelerator relies on a hardware/software co-design approach, and is designed to perform fast arithmetic operations on degree 2,560 polynomials with 135-bit coefficients, allowing small algorithms to be computed homomorphically. Compared to a functionally equivalent design using FFT, our accelerator performs a homomorphic multiplication in 11.9 ms instead of 15.46 ms, and halves the logic utilization and register usage on the FPGA.


INTRODUCTION
Homomorphic Encryption schemes are considered promising in modern cryptography, because they directly allow operations to be carried out on data in the cipher domain. Figure 1 illustrates a basic client/server transaction in a homomorphic scenario. The most flexible ones, called Fully Homomorphic Encryption (FHE) schemes, are able to process unlimited additions and multiplications secretly, and so make it possible to address a wide range of algorithms. To reduce computation times, many applications only consider Somewhat Homomorphic Encryption (SHE) schemes, which bound the number of operations to reduce the complexity. While classical cryptographic schemes sometimes have homomorphic properties, for addition [2] or multiplication [3] operations, it was not until 2009 and the breakthrough of C. Gentry [4] that a way was found to perform both types of operations with limited restrictions. He provided an SHE scheme based on hard lattice problems, and then turned it into an FHE scheme by using the bootstrapping technique. However, due to the bootstrapping cost, FHE schemes are considered less practical than SHE schemes. To reduce the complexity of Homomorphic Encryption, FHE/SHE schemes have been successfully adapted to related problems. The most popular ones are based on the Approximate Greatest Common Divisor (a-GCD) problem [5] [6], the NTRU problem [7] [8] and the Ring-Learning With Errors (R-LWE) problem [9] [10] [11] [12] [13] [14]. In the following, we only consider SHE schemes with polynomial arithmetic, that is to say NTRU-based and R-LWE-based SHE schemes. Until recently, the most promising SHE scheme with polynomial arithmetic was YASHE' [7], closely followed by FV [11].

Fig. 1: Presentation of client/server transactions in a homomorphic encryption scenario.
However, the recent so-called subfield attack [15] considerably reduced the security of YASHE', in particular by weakening the Decision Small Polynomial Ratio (DSPR) assumption [16] on which the security of the scheme relies. Thus, FV regains interest because it does not suffer from the weakness of DSPR. Nevertheless, its ciphertexts are twice as large as those of YASHE', because a ciphertext is composed of 2 polynomials instead of 1 for YASHE'. Because YASHE' was considered more practical, previous hardware implementations of SHE schemes with polynomial arithmetic targeted YASHE' [17] [18], leaving a lack of hardware implementations of FV. To our knowledge, only software implementations of FV have been proposed. In [19], an implementation of FV using the multipurpose FLINT library [20] performs a homomorphic multiplication of FV in 148 ms for degree 4096 polynomials with 125-bit coefficients. Then, FV was implemented in NFLlib [21], an efficient C++ implementation of ideal lattice cryptography. Its authors can perform a homomorphic multiplication in 17.2 ms for the same parameters. Finally, the work in [22] proposes to avoid the multi-precision arithmetic required by FV by using a Residue Number System (RNS) variant of FV. Its authors report a homomorphic multiplication of FV for degree 4096 polynomials with 168-bit coefficients in 7.68 ms. However, this implementation has some limitations compared to the proposed accelerator, which will be discussed in Section 5.2. Due to the proximity between YASHE' and FV, all previous hardware implementations of YASHE' are still relevant but need to be adapted. Thus, timings for hardware acceleration of the homomorphic multiplication of YASHE' are not directly comparable to the timings given above for software implementations of FV.
Hardware accelerators for YASHE' implement fast arithmetic on polynomials of degree $n \in [4096, 32768]$ with coefficients of size $\log_2 q \in [125, 1228]$, depending on the required security and the complexity of the algorithm to be homomorphically performed. To our knowledge, all implementations are based on the FFT algorithm [23]. In [17], a classical but optimized FFT implementation is presented for two parameter sets. That accelerator performs a homomorphic multiplication in 6.5 ms for $n = 4096$ and $\log_2 q = 125$ bits, and in 48 ms for $n = 16384$ and $\log_2 q = 512$ bits. The authors of [17] implemented $512 \times 512$-bit multipliers with a small modular reduction by selecting a Solinas prime modulus [24]. Due to the size of polynomials and coefficients, a cache is implemented to connect the external memory used to store intermediate coefficients. They also reported a bottleneck due to the division and rounding required by YASHE', especially for large integers. That is why in [18] a pre-computation is performed on polynomials to reduce the size of coefficients. They split a ciphertext into a few polynomials by using the Chinese Remainder Theorem (CRT) on each coefficient. The overall architecture is based on an array of crypto-units, which gives some flexibility to process several residue polynomials in parallel. For parameters $n = 32768$ and $\log_2 q = 1228$ bits, their accelerator performs a homomorphic multiplication in 121 ms, including 25 ms spent on the CRT. Due to the security issue on YASHE', this paper proposes to accelerate the FV scheme in hardware using the Karatsuba algorithm [25]. To our knowledge, this is the first use of Karatsuba for R-LWE-based SHE schemes, and the first hardware accelerator dedicated to the FV scheme. Compared to previous work using Karatsuba for polynomial multiplication, for example in elliptic curve and pairing-based cryptography, this work investigates polynomials with much larger degrees and with arbitrary-size coefficients.
Our accelerator implements fast polynomial arithmetic for degree 2560 polynomials with coefficients of size 125 bits, allowing homomorphic circuits of depth up to 4. This choice is motivated by the fact that for lower depths, alternatives exist, in particular the BGN-based scheme in [26]. We demonstrate that for Homomorphic Encryption with low multiplicative depth circuits, Karatsuba can be a good alternative to FFT. We also evaluate the scalability and the limits of our approach compared to the FFT. In order to fairly compare our approach with previous works, we propose a hardware implementation on the DE5-net platform from Terasic, which embeds an Altera Stratix V FPGA. However, our hardware accelerator does not require such a large FPGA and can possibly be implemented on smaller ones.

The main contributions of this work are as follows:
• A complete study of the adaptation of the Karatsuba algorithm to SHE.
• An end-to-end solution for accelerating FV with a hardware/software co-design using the Karatsuba algorithm.
• A latency-efficient software implementation of the Karatsuba algorithm.
• A lightweight Karatsuba hardware accelerator.

This paper is organized as follows. Section II recaps some key information on SHE cryptosystems based on the R-LWE problem. Section III presents some optimizations of FV with the Karatsuba algorithm. Section IV details the proposed architecture and provides both software and hardware implementation details. Section V provides several discussions on our Karatsuba approach, in particular the scalability of the design. Section VI draws some conclusions.

Notation
In the following, a polynomial is represented with an uppercase letter and its coefficients with a lowercase letter. For a polynomial $A$, $a_i$ represents its $i$-th coefficient. A vector of polynomials is noted in bold. For a vector $\mathbf{A}$, $\mathbf{A}[i]$ is the $i$-th polynomial of the vector. For a set $R$ and a polynomial $A$, $A \xleftarrow{U} R$ represents a uniformly sampled polynomial in $R$ and $A \leftarrow \chi_\sigma$ a polynomial sampled from a discrete Gaussian distribution with standard deviation $\sigma$. For a coefficient $a_i$ of polynomial $A$, $a_{i,(j..k)}$ corresponds to the binary string extraction of $a_i$ between bits $j$ and $k$. This notation is extended to a polynomial $A$, where $A_{(j..k)}$ is the sub-polynomial obtained by applying the binary string extraction to each coefficient. Other standard operators are represented as follows. A modular reduction by an integer $q$ is noted $[\,\cdot\,]_q$. For an integer $a$, the operators $\lfloor a \rfloor$, $\lceil a \rceil$ and $\lfloor a \rceil$ are respectively the floor, ceiling and nearest rounding operations. This is extended to polynomials by applying the operation to each coefficient. For vectors $\mathbf{A}$ and $\mathbf{B}$, $\langle \mathbf{A}, \mathbf{B} \rangle$ represents their scalar product.

Karatsuba Algorithm
The Karatsuba algorithm is an improvement of the classical polynomial multiplication algorithm which reduces the number of sub-products. For simplicity, we only address polynomials with an even number of coefficients, but Karatsuba can easily be adapted to odd ones by manipulating unbalanced sub-polynomial multiplications, or by using zero padding. Input polynomials $A$ and $B$ of degree $n - 1$ are split into two parts of equal size, that is to say $n/2$ coefficients. Let $A_H$ and $A_L$ be the two polynomials composed respectively of the coefficients of highest degree and of lowest degree of $A$. In the same way, one constructs $B_H$ and $B_L$. Input polynomials are now expressed as $A = A_L + A_H x^{n/2}$ and $B = B_L + B_H x^{n/2}$. When multiplying $A$ and $B$ by the standard approach, the resulting decomposition is given by:

$$AB = A_L B_L + (A_L B_H + A_H B_L)\, x^{n/2} + A_H B_H\, x^{n}$$

The Karatsuba optimization is based on noticing that the middle factor $(A_L B_H + A_H B_L)$ can be cleverly computed by:

$$A_L B_H + A_H B_L = (A_L + A_H)(B_L + B_H) - A_L B_L - A_H B_H$$

After several Karatsuba recursions, one has to perform many low-degree polynomial multiplications instead of one large polynomial multiplication. This recursiveness allows computations to be shared between the software and the hardware. For example, several recursions can be performed in software and the remaining ones in hardware. Because each Karatsuba recursion halves the size of the sub-polynomials, Karatsuba can achieve a polynomial multiplication of degree $2^r (p + 1) - 1$, where $r$ is the number of Karatsuba recursions and $p$ the degree of the smallest sub-polynomial.
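To make the recursion concrete, the following is a minimal Python sketch of the algorithm described above, not the implementation used in this work. The function names `schoolbook` and `karatsuba` and the list-of-integers representation (lowest degree first) are assumptions made for illustration only.

```python
def schoolbook(a, b):
    """Classical product of two coefficient lists (lowest degree first)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def karatsuba(a, b, recursions):
    """Multiply two equal-length coefficient lists with `recursions` Karatsuba levels.

    len(a) must be divisible by 2**recursions; the remaining small
    products are computed with the classical algorithm.
    """
    n = len(a)
    if recursions == 0:
        return schoolbook(a, b)
    half = n // 2
    a_lo, a_hi = a[:half], a[half:]
    b_lo, b_hi = b[:half], b[half:]

    # Pre-computation: the sums that feed the middle product.
    a_sum = [x + y for x, y in zip(a_lo, a_hi)]
    b_sum = [x + y for x, y in zip(b_lo, b_hi)]

    # Three half-size products instead of four.
    low = karatsuba(a_lo, b_lo, recursions - 1)
    high = karatsuba(a_hi, b_hi, recursions - 1)
    mid = karatsuba(a_sum, b_sum, recursions - 1)

    # Post-computation: recover the middle factor and recombine.
    out = [0] * (2 * n - 1)
    for i in range(n - 1):  # each sub-product has n - 1 coefficients
        out[i] += low[i]
        out[i + half] += mid[i] - low[i] - high[i]
        out[i + n] += high[i]
    return out
```

With 40 coefficients and 3 recursions, the leaves are products of 5-coefficient (degree 4) sub-polynomials, which mirrors the shape of the hardware part discussed later; the modular reduction of coefficients and the reduction by the ring polynomial are deliberately left out of this sketch.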

R-LWE Problem
An R-LWE instance is constructed in the ring $R_q = \mathbb{Z}_q[X]/f(X)$, where $\mathbb{Z}_q = \mathbb{Z}/q\mathbb{Z}$ and $f(X)$ is an irreducible polynomial of degree $n$ in $\mathbb{Z}_q[X]$. Solving an R-LWE problem consists in recovering a polynomial $S$ from the pair $(AS + E, A)$, where $S \leftarrow \chi_{key}$, $A \xleftarrow{U} R_q$ and $E \leftarrow \chi_{err}$. If $\chi_{key}$ and $\chi_{err}$ are suitably chosen, the R-LWE pair is mostly indistinguishable from the uniform distribution and solving it is considered as hard as worst-case lattice problems. Usually, $S$ is chosen from a binary set, and $\chi_{err}$ with a standard deviation $\sigma_{err} > 2\sqrt{n}$.
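As an illustration of the pair $(AS + E, A)$, the sketch below builds a toy instance in Python. Several choices are assumptions made only for this example: the ring polynomial is fixed to $f(X) = X^n + 1$, the error is drawn from a rounded continuous Gaussian as a stand-in for a discrete Gaussian sampler, and the helper names `ring_mul` and `rlwe_sample` are hypothetical. The parameters used are far too small to be secure.

```python
import random


def ring_mul(a, b, q):
    """Product in Z_q[X]/(X^n + 1), with n = len(a) = len(b)."""
    n = len(a)
    out = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                out[k] = (out[k] + a[i] * b[j]) % q
            else:
                # X^n = -1 in this ring, so wrapped terms change sign.
                out[k - n] = (out[k - n] - a[i] * b[j]) % q
    return out


def rlwe_sample(n, q, sigma, rng):
    """Return the pair (A*S + E, A) together with the secret S and the error E."""
    s = [rng.randint(0, 1) for _ in range(n)]            # binary secret
    a = [rng.randrange(q) for _ in range(n)]             # uniform element of R_q
    e = [round(rng.gauss(0, sigma)) for _ in range(n)]   # rounded Gaussian stand-in
    b = [(x + y) % q for x, y in zip(ring_mul(a, s, q), e)]
    return (b, a), s, e
```

Whoever knows $S$ can subtract $AS$ from the first component and is left with the small error $E$; without $S$, the pair is intended to look uniform.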

Cryptosystem FV
FV is a transposition of the scale-invariant Brakerski scheme [10] into the R-LWE problem. Let $\lambda$ be the security parameter that determines $(q, n) \in \mathbb{Z}^2$, the parameters of an R-LWE instance. Let $t \in \mathbb{Z}$ with $1 < t < q$ be an integer which provides the upper bound of the message size, and $\omega \in \mathbb{Z}_q$ the word that splits an element of $\mathbb{Z}_q$ into $l_{\omega,q} = \lceil \log_2 q / \log_2 \omega \rceil$ elements. The public key is an R-LWE instance, and the secret key is the polynomial $S$. This setup inevitably introduces an error term $E$ called noise. During computations, the noise grows until it possibly makes the decryption procedure faulty. A homomorphic addition is considered not critical because the noise is just added. For the homomorphic multiplication, the noise is multiplied, which limits the number of operations achievable. Because the noise is mainly driven by the number of multiplications performed on a ciphertext, namely the multiplicative depth $L$, the impact of the homomorphic addition is usually neglected. In practice, this impact can reduce the multiplicative depth if a significant number of homomorphic additions are performed. Table 1 provides some parameters for FV extracted from [19] satisfying a security level $\lambda$ of 80 bits. In particular, we used Equation (2) to calculate the upper bound of the modulus for a given degree and security level, and the equations in Section 3.5 to extract the multiplicative depth for a given set of parameters. We also set $\omega$ to 27 bits to efficiently use the hardware resources of the Stratix V. Additional information is given in Table 1, which will be discussed in Section 2.5.
While a homomorphic addition is just a polynomial addition of ciphertexts, a homomorphic multiplication requires an extra step after the polynomial multiplication, called relinearization.
To understand why a relinearization step is required, it is important to notice that a ciphertext is proportional to the secret key $S$, plus an error. When multiplying two ciphertexts, the resulting polynomial is of the form $A + BS + CS^2$, and is thus proportional to $S^2$. To continue homomorphic operations, the ciphertext needs to recover its initial form, and thus the knowledge of $S^2$ would be required on the server side, which is not acceptable for security purposes. Instead of manipulating $S^2$ directly, a sub-instance of R-LWE is created in order to hide $S^2$. However, creating a sub-instance of R-LWE introduces an error term, which penalizes the multiplicative depth.
To address this issue, several optimizations are performed during the relinearization step, based on two functions, FV.PowersOf$_{\omega,q}$ and FV.WordDecomp$_{\omega,q}$:

By cleverly using FV.WordDecomp$_{\omega,q}$, one can perform a scalar product with sub-polynomials whose coefficients have size $\log_2 \omega$ instead of $\log_2 q$ and, in the context of FV, multiply the error polynomial $E$ by a polynomial with coefficients of size $\log_2 \omega$ instead of $\log_2 q$. All primitives of FV are as follows:

As one can see, all FV primitives are based on a few polynomial additions and multiplications. This is why speeding up polynomial multiplication is a good choice.
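The role of the two functions can be illustrated on a single coefficient. The sketch below assumes the usual definitions from the FV literature, which may differ in detail from the ones used here: word decomposition returns the base-$\omega$ digits of a value, and powers-of returns that value multiplied by successive powers of $\omega$ modulo $q$. It also assumes $q = 2^{125}$ and $\omega = 2^{27}$, and the function names are hypothetical.

```python
def word_decomp(a, omega_bits, q_bits):
    """Base-omega digits of a, lowest word first (omega = 2**omega_bits)."""
    count = -(-q_bits // omega_bits)  # ceil(log2 q / log2 omega)
    mask = (1 << omega_bits) - 1
    return [(a >> (i * omega_bits)) & mask for i in range(count)]


def powers_of(b, omega_bits, q_bits):
    """The values [b * omega**i] mod q for i = 0 .. count - 1 (q = 2**q_bits)."""
    count = -(-q_bits // omega_bits)
    q = 1 << q_bits
    return [(b << (i * omega_bits)) % q for i in range(count)]


def scalar_product(xs, ys, q_bits):
    """Scalar product of two vectors, reduced modulo q = 2**q_bits."""
    return sum(x * y for x, y in zip(xs, ys)) % (1 << q_bits)
```

Under these assumptions, `scalar_product(word_decomp(a, 27, 125), powers_of(b, 27, 125), 125)` equals `a * b` modulo $q$, while every left operand of the scalar product is only 27 bits wide. This is the property that keeps the noise added by relinearization small.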

Choosing Parameters
In SHE, polynomial multiplication is typically implemented with the FFT algorithm. To be efficient, the FFT requires a defining polynomial whose irreducible factors have very small degree. That is why $x^n - 1$ and $x^n + 1$ are often chosen, because they can be completely factorized into degree 1 factors. Moreover, because $x^n + 1$ is also a cyclotomic polynomial, this method provides a solution where the polynomial reduction is directly integrated into the computation. This special FFT is called Negative Wrapped Convolution (NWC) and requires an FFT of size $n$ instead of $2n$ in the standard case. However, this cyclotomic polynomial has an important issue: when factoring $x^n + 1$ modulo 2, the resulting polynomial is $(x + 1)^n$, which has a unique factor, namely $(x + 1)$. This is incompatible with the CRT on polynomials, because the latter requires distinct factors. Thus, the NWC, which is optimized for performance, cannot pack several messages inside one ciphertext using the CRT. This packing technique, called batching, allows the same homomorphic operation to be performed on each message in parallel. For further explanation on how to use the CRT in the context of Homomorphic Encryption, the reader can refer to [27].
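The factorization issue can be checked numerically. The short Python sketch below expands $(x + 1)^n$ with binomial coefficients and reduces them modulo 2; the function name is hypothetical. The identity $(x + 1)^n \equiv x^n + 1 \pmod 2$ is verified here for powers of two, which is the case where $x^n + 1$ is cyclotomic.

```python
from math import comb


def x_plus_one_power_mod2(n):
    """Coefficients of (x + 1)**n reduced modulo 2, lowest degree first."""
    return [comb(n, k) % 2 for k in range(n + 1)]


def is_x_n_plus_one(coeffs):
    """True if the coefficient list represents x**n + 1."""
    n = len(coeffs) - 1
    return coeffs == [1] + [0] * (n - 1) + [1]
```

For `n = 4096`, `is_x_n_plus_one(x_plus_one_power_mod2(4096))` holds, so modulo 2 the only irreducible factor available is $(x + 1)$ and the CRT has nothing to split.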
Because we address the problem with the Karatsuba algorithm, we have no particular restriction on input polynomials, and we can choose a cyclotomic polynomial with batching capabilities. However, for Karatsuba efficiency, polynomials with a degree of $2^i p$ are preferable, with $(i, p) \in \mathbb{Z}^2$. As can be noticed in Table 1, many multiplicative depths require $n$ to be relatively distant from a power of two. Critical cases are when $n$ is just above a power of two, as for multiplicative depths of 4 and 7, where FFT is inefficient.
To the best of our knowledge, no particular security weakness has been demonstrated on the modulus of R-LWE instances, thus we set it to a power of two. Usually, $q$ is prime because of the FFT. In order to demonstrate the interest of the proposed approach based on Karatsuba, we choose a multiplicative depth of 4, which corresponds to a parameter set of $n = 2515$ and $\log_2 q = 125$ bits. To be as close as possible to the required $n$, we set the smallest sub-polynomial to degree 4, with 9 recursions of Karatsuba. That allows a polynomial multiplication of degree at most $2^r (p + 1) - 1 = 2^9 \cdot (4 + 1) - 1 = 2559$. Thus, the associated irreducible polynomial can be selected in the range $[2515, 2560]$. For $n = 2560$, one can find a cyclotomic polynomial with 5 coefficients, and thus the polynomial reduction can be implemented efficiently. If one wants batching capabilities, setting $n$ to 2560 allows at least 64 bits to be packed in a ciphertext in a batching fashion.

PROPOSED OPTIMIZATIONS
The proposed optimizations focus on two FV primitives: FV.Mult and FV.Relin. Even if accelerating the polynomial multiplication impacts all FV primitives, a homomorphic server will mainly perform homomorphic additions and multiplications. Accelerating polynomial addition is also possible, but is not relevant compared to the complexity of a polynomial multiplication.

FV.Mult
Referring to FV.Mult, a rounding is required after the polynomial multiplication. This operation can be time consuming for FFT implementations because the modulus has to be prime. For Karatsuba, it can be set to a power of two and thus the $\frac{t}{q}(\cdot)$ operation becomes very simple, corresponding to a shift of $\log_2 q - 2$ bits. In parallel, one can also execute the modular reduction to further optimize the operation. Finally, computing $\left[\left\lfloor \frac{t}{q}(\cdot) \right\rceil\right]_q$ is equivalent to extracting several bits, as shown in Figure 3. The 4 polynomial multiplications in FV.Mult can also be reduced to 3 with the help of the Karatsuba algorithm. In fact, the computations of $C_0$, $C_1$ and $C_2$ can be seen as a computation of Karatsuba sub-factors. In that way, the polynomial $C_1$ can be expressed as:
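Both ideas of this subsection can be sketched in a few lines of Python. The sketch rests on assumptions that are not taken from the original text: $q$ and $t$ are both powers of two, the two input ciphertexts are written as hypothetical pairs `(a0, a1)` and `(b0, b1)`, and the function names are illustrative.

```python
def scale_round_pow2(x, q_bits, t_bits):
    """Nearest rounding of (t / q) * x for q = 2**q_bits, t = 2**t_bits, x >= 0."""
    shift = q_bits - t_bits
    return (x + (1 << (shift - 1))) >> shift  # add one half, then shift


def reduce_pow2(x, q_bits):
    """Reduction modulo q = 2**q_bits, i.e. keeping the low q_bits bits."""
    return x & ((1 << q_bits) - 1)


def poly_mul(a, b):
    """Classical polynomial product (lowest degree first)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def three_product_tensor(a0, a1, b0, b1):
    """C0, C1, C2 of a ciphertext product with 3 multiplications instead of 4."""
    c0 = poly_mul(a0, b0)
    c2 = poly_mul(a1, b1)
    cross = poly_mul([x + y for x, y in zip(a0, a1)],
                     [x + y for x, y in zip(b0, b1)])
    # Karatsuba middle factor: a0*b1 + a1*b0 = cross - c0 - c2.
    c1 = [m - lo - hi for m, lo, hi in zip(cross, c0, c2)]
    return c0, c1, c2
```

With `t_bits = 2`, the shift in `scale_round_pow2` is $\log_2 q - 2$ bits, which would correspond to the value quoted above; combining it with `reduce_pow2` amounts to selecting a window of bits from each coefficient.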

FV.Relin
FV.Relin requires $l_{\omega,q}$ polynomial multiplications of degree $n - 1$, with sub-products of size $\log_2 \omega \times \log_2 q$ bits, and $l_{\omega,q} - 1$ polynomial additions of degree $n - 1$. Because $\log_2 q \times \log_2 q = (l_{\omega,q} \times \log_2 \omega) \times \log_2 q$, the numbers of elementary operations in the polynomial multiplication and in the relinearization step are theoretically equivalent. However, the FFT algorithm cannot be optimized in that way because coefficients are in a different space. Thus, sub-polynomial multiplications must be performed separately. Because the Karatsuba algorithm does not have such a limitation, it is possible to adapt the architecture to perform the relinearization step with limited modifications. The optimization relies on two properties:
1. Product/accumulation can be performed on coefficients instead of sub-polynomials.
2. The polynomial product/accumulation required by the relinearization step can be performed on sub-polynomials.
Assertion 1 can be easily demonstrated by writing the definition of the standard polynomial multiplication algorithm: for a given set of sub-polynomials C

As one can see, because each sum is independent, one can swap product/accumulation at coefficient level. For Assertion 2, it is required to expand the product/accumulation with Karatsuba:
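Both assertions can be checked on small inputs. The Python sketch below is an illustration under assumptions, not the design of this work: `cs` stands for the word-decomposed polynomial to be relinearized, `ks` for the relinearization keys, and all function names are hypothetical. The first function multiplies each pair separately and then adds the results, the second accumulates over the keys inside each coefficient product (Assertion 1), and the third accumulates the three Karatsuba sub-products over the keys and applies a single post-computation (Assertion 2, shown for one recursion).

```python
def poly_mul(a, b):
    """Classical polynomial product (lowest degree first)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def relin_separate(cs, ks):
    """Sum of full polynomial products, one product per word/key pair."""
    acc = [0] * (2 * len(cs[0]) - 1)
    for c, k in zip(cs, ks):
        for i, v in enumerate(poly_mul(c, k)):
            acc[i] += v
    return acc


def relin_coefficient_mac(cs, ks):
    """Assertion 1: multiply-accumulate over the keys at coefficient level."""
    n = len(cs[0])
    out = [0] * (2 * n - 1)
    for i in range(n):
        for j in range(n):
            out[i + j] += sum(c[i] * k[j] for c, k in zip(cs, ks))
    return out


def relin_karatsuba_shared_post(cs, ks):
    """Assertion 2: accumulate Karatsuba sub-products, recombine once."""
    n = len(cs[0])
    half = n // 2
    low, mid, high = ([0] * (n - 1) for _ in range(3))
    for c, k in zip(cs, ks):
        c_sum = [x + y for x, y in zip(c[:half], c[half:])]
        k_sum = [x + y for x, y in zip(k[:half], k[half:])]
        for i, v in enumerate(poly_mul(c[:half], k[:half])):
            low[i] += v
        for i, v in enumerate(poly_mul(c[half:], k[half:])):
            high[i] += v
        for i, v in enumerate(poly_mul(c_sum, k_sum)):
            mid[i] += v
    out = [0] * (2 * n - 1)
    for i in range(n - 1):
        out[i] += low[i]
        out[i + half] += mid[i] - low[i] - high[i]
        out[i + n] += high[i]
    return out
```

The three functions return the same polynomial because the Karatsuba post-computation is linear: adding sub-products before recombination gives the same result as recombining each product and adding afterwards.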

IMPLEMENTATION
The complete implementation relies on a hardware/software co-design approach. The software runs a complete SHE library and offloads some specific polynomial multiplications to the FPGA when needed. The flow chart in Figure 5 shows how the different operations are dispatched between hardware and software. The proposed architecture is composed of an Intel Core i7-4910MQ CPU with 4 cores running at 2.9 GHz, connected to the hardware platform through PCIe 3.0 with 8 lanes. The hardware platform embeds a powerful FPGA, a Stratix V GX from Altera. Our accelerator is designed to fully speed up both FV.Mult and FV.Relin, but can accelerate any operation which requires polynomial multiplication, that is to say almost all steps of FV. However, this study only focuses on FV.Mult and FV.Relin.

Software Implementation Details
The software part of the FV implementation uses the NTL library [28] compiled with GMP [29] support (64-bit version, -O3 option). We also ported the FLINT [20] cyclotomic calculation to NTL, in order to process any cyclotomic polynomial at runtime. Because the bottleneck of SHE schemes is the homomorphic multiplication, much effort has been devoted to optimizing the pre- and post-computations of Karatsuba.

[Algorithm 1: recursive PRE COMPUTATION]

Table 2 provides the latencies of software pre- and post-computations with various optimizations. Each line of the pre-computation section in Table 2 represents the latency for one polynomial, and must be multiplied by two for a complete evaluation of the pre-computation cost during a polynomial multiplication. In contrast, the "fully-threaded" approach in line 7 processes two pre-computations at a time with 6 threads in parallel. For a deeper analysis of software performance, results are provided for two different numbers of Karatsuba recursions. As one can see in lines 1 and 2 of Table 2, pre-allocation of sub-polynomials is crucial due to the size of ciphertexts.

[Algorithm: recursive POST COMPUTATION, called on the three sub-branches (i_0, j_0), (i_0, j_1) and (i_0, j_2) when i > 1]

Because Karatsuba is constructed with recursive calls (lines 9, 10, 11 of Algorithm 1), there are no data dependencies between calls. Thus, multi-threading can be used to reduce latency. Based on our experiments, applying multi-threading beyond the first recursion of Karatsuba is counterproductive. Moreover, threading each sub-call of PRE COMPUTATION (line 6 in Table 2) is not efficient. To further reduce pre-computation latency, an optimized version of the presented algorithm has been designed and implemented, which reduces both pre-allocation and latency. If one carefully examines lines 6 and 7 of Algorithm 1, these steps duplicate a given polynomial into two sub-polynomials. This is counterproductive because no operation is performed. To avoid this issue, we added a few extra parameters to the PRE COMPUTATION function in order to pass the index and the number of coefficients instead of duplicating them. Figure 4 presents the proposed optimization, where dashed polynomials represent polynomials which are duplicated in the initial algorithm. This strategy saves 66% of memory and reduces latency by 13% compared to the basic one (lines 2 and 4 in Table 2). Overall, the optimizations reduce the computation time of the pre-computation by 62%. We also tried various optimizations to reduce the computation time of post-computations; however, no particular method gave sufficient results to be implemented in the final design, except multi-threading, which is very efficient for this step. This implies that performing extra post-computations in hardware may be a good alternative to improve the performance of the overall acceleration.

Hardware Implementation Details
Our hardware implementation of Karatsuba is based on a DE5-450 Terasic platform with an Altera Stratix V GX (5SGXEA7N2F45C2) FPGA. The DE5 platform is plugged in as a peripheral of a computer, and communication between the software and the hardware is done through PCIe. The advantage of the Karatsuba algorithm is that it can be scaled to the available hardware resources. If one has limited hardware resources and can only compute polynomials of relatively small degree, the software part can compute extra pre- and post-computations at the cost of a higher computation time. However, if too many pre- and post-computations are performed in software, the total computation time can become higher than that of a pure software polynomial multiplication. As stated before, for $n = 2560$, our Karatsuba setup requires 9 Karatsuba recursions, with smallest sub-polynomials of degree 4. In order to be competitive, 6 recursions are made in software, and the 3 remaining ones in hardware. With this setup, $3^6 = 729$ sub-polynomials of degree $2560/2^6 - 1 = 39$ are sent to the accelerator for each input polynomial, corresponding to $729 \times 40 = 29160$ coefficients per polynomial (that is, $2 \cdot 3^6 = 1458$ sub-polynomials for the two operands). Figure 6 provides a high-level overview of the hardware accelerator, where input sub-polynomials are named $P$ and $Q$. The accelerator has been designed to perform the operation on sub-polynomials as soon as they arrive. Thus, transfer latency is completely hidden during Karatsuba computations. After the pre-computation and the pre-crossbar, the accelerator generates several lines of sub-polynomials, which are multiplied in parallel. The post-crossbar and the post-computation perform the reconstruction of the polynomial before sending it through the PCIe. In the following, an architecture of Karatsuba with degree 3 sub-polynomials instead of 4 is presented in order to simplify the explanation, and intermediate pipeline stages are not represented for the same reason.
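The figures quoted for the software/hardware split follow from simple arithmetic, which the hedged Python helper below reproduces. The function name and the dictionary keys are hypothetical; the only facts used are that each Karatsuba recursion triples the number of sub-polynomials and halves their length.

```python
def split_workload(n, total_recursions, software_recursions):
    """Sizes seen by the accelerator for one n-coefficient input polynomial."""
    sub_length = n // 2 ** software_recursions       # coefficients per sub-polynomial
    sub_count = 3 ** software_recursions             # sub-polynomials after software
    return {
        "hardware_recursions": total_recursions - software_recursions,
        "sub_polynomials_per_operand": sub_count,
        "sub_polynomial_degree": sub_length - 1,
        "coefficients_per_operand": sub_count * sub_length,
        "smallest_degree": n // 2 ** total_recursions - 1,
    }
```

For `split_workload(2560, 9, 6)` this gives 3 hardware recursions, 729 sub-polynomials of degree 39 and 29160 coefficients for one operand, with degree 4 polynomials at the leaves.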

Bus constraint
Because the bus is based on PCIe 3.0 [...] In our setup, coefficients have a size of 125 bits, so the bus is wide enough to send two coefficients at a time. However, implementing a $125 \times 125$-bit multiplier is not efficient in practice, which is why all elementary operators have been serialized in the proposed architecture. Because embedded DSPs are optimized to perform a $27 \times 27$-bit integer product, we decided to split coefficients into 27-bit parts. This choice allows 9 simultaneous inputs/outputs and gives some flexibility to implement multiple Karatsuba operations in parallel. The remaining bandwidth is also beneficial for FV.Relin, making it possible to send the relinearization keys $\gamma$ during the transfer of the polynomial to be relinearized, which avoids storing them temporarily in hardware. In the following, every addition, subtraction or multiplication operator is considered to be serialized, with carry propagation.

Pre-computation
The pre-computation step is the first block of the Karatsuba accelerator and must be applied to the two input polynomials $P$ and $Q$, preferably simultaneously in order to limit temporary storage of polynomials. This is also an important step because this stage determines the parallelism of the design. Our implementation is based on a recursive structure whose elementary unit is visible in Figure 7(b). Because coefficients are sent in ascending order and because we need to add low-order coefficients to high-order coefficients, a FIFO is implemented to temporarily store the coefficients that arrive first. This leads to two outputs: the first one is just a copy of the input and refers to polynomials $P_L$ and $P_H$ of the Karatsuba algorithm, while the second one refers to the polynomial $P_{L+H}$. For the next recursion, a pre-computation needs to be applied to $P_L$, $P_H$ and $P_{L+H}$. This can be easily performed by implementing a pre-computation unit on each output of the first unit. A minor modification of the FIFOs is required, because input polynomials are half the size of those of the first unit. This approach has been applied once more in the proposed accelerator to implement 3 Karatsuba recursions.
As can be easily noticed, the more recursions we perform, the more outputs the unit has and, so, the more parallelism is reached. For 1 recursion, the unit has 2 outputs, for 2 recursions 4 outputs, and so on. It is also important to notice that each output creates a valid sub-polynomial which can be multiplied with the related output of the other pre-computation unit. However, because the branch $P_{L+H}$ creates a valid polynomial only half of the time, many outputs are used inefficiently. Figure 7(a) shows this phenomenon. That is why a scheduling is implemented to reorder sub-polynomials and reduce the number of outputs, as shown in Figure 8.
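The behaviour of one elementary pre-computation unit can be modelled in software. The Python generator below is a behavioural sketch only, not a description of the actual hardware: it ignores pipeline stages and serialization, and the names `precompute_unit` and the tuple layout are assumptions made for illustration.

```python
from collections import deque


def precompute_unit(stream, n):
    """Stream one n-coefficient polynomial (ascending order) through one unit.

    Yields (copy, summed, summed_valid) for every incoming coefficient:
    `copy` carries P_L then P_H, `summed` carries P_{L+H} once it is valid.
    """
    fifo = deque()
    half = n // 2
    for index, coeff in enumerate(stream):
        if index < half:
            fifo.append(coeff)            # store low-order coefficients
            yield coeff, None, False      # second output idle for now
        else:
            yield coeff, fifo.popleft() + coeff, True
```

Feeding `[1, 2, 3, 4]` with `n = 4` yields the copies 1, 2, 3, 4 on the first output and the sums 4 and 6 on the second output during the last two cycles only, which illustrates why the $P_{L+H}$ branch is valid half of the time.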

Pre-crossbar
After the pre-computations, 8 outputs are generated because 3 recursions of Karatsuba are applied, and so 8 sub-polynomial multipliers are required if no optimization is done. This implies a multiplier idleness of $1 - \frac{3^3}{2^3 \cdot 2^3} = 58\%$, which is not efficient. The reordering of sub-polynomials is performed by a simple crossbar because the pre-computation is deterministic. Figure 8 presents the strategy adopted for the scheduling of sub-polynomials, requiring 4 polynomial multipliers instead of 8. The new multiplier idleness is $1 - \frac{3^3}{2^3 \cdot 2^2} = 16\%$, which is much more acceptable. In order to reduce this idleness even more, implementing extra pre-computations in hardware provides more flexibility to efficiently schedule sub-polynomials. Table 3 recaps the minimum number of outputs for a given number of Karatsuba recursions in hardware, according to our architecture. As can be noticed, when sufficient pre-computations are performed in hardware, the multiplier usage can tend toward 100%, at the cost of a complex crossbar.
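The two idleness figures can be reproduced with the small Python helper below. Its interpretation of the formula is an assumption: over a window of $2^r$ sub-polynomial slots, $r$ hardware recursions produce $3^r$ useful sub-products, while the multipliers offer $2^r$ slots each. The function name is hypothetical.

```python
def multiplier_idleness(hardware_recursions, multipliers):
    """Fraction of multiplier slots left unused for a given number of multipliers."""
    useful_products = 3 ** hardware_recursions
    available_slots = 2 ** hardware_recursions * multipliers
    return 1 - useful_products / available_slots
```

With 3 recursions, 8 multipliers leave about 58% of the slots idle and 4 multipliers about 16%. Fewer than 4 multipliers would give a negative value, meaning that the sub-products no longer fit in the window.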

Serial polynomial Multiplier
The implemented degree 4 polynomial multipliers are based on the standard polynomial multiplication algorithm. In order to be able to send polynomials without interruption, a fully parallel design is implemented, which requires 5 serial integer multipliers in parallel. By doing so, the accelerator can benefit from the full potential of PCIe and its high throughput. Figure 9(a) presents the elementary operations required for a polynomial multiplication of degree 3. Each column of the elementary operations section represents the output of a serial integer multiplier. The polynomial multiplier itself is split into three distinct parts. First, a scheduling of input coefficients is performed. Then, a coefficient-wise multiplication is performed using serial integer multipliers. Finally, a reconstruction step, which consists of additions between coefficients, is applied. When sub-polynomials are sent successively, the polynomial multiplication produces output coefficients while the next input polynomials are being received. This requires careful management of the output of the serial integer multipliers in order to avoid any overlapping. To address this issue, a demultiplexer is implemented just after the serial integer multipliers, and dispatches coefficients into two branches. To further reduce the complexity of the architecture, demultiplexers also set their outputs to zero when no coefficients are sent, which allows very simple elementary units to be implemented in the reconstruction step. Figure 9(b) shows the proposed polynomial multiplier. This architecture is very flexible and can be scaled to the size of sub-polynomials.

Serial integer multiplier
As stated before, embedded DSPs are optimized for $27 \times 27$-bit integer multiplications, so coefficients are split into 27-bit segments. This implies that coefficients are divided into $\lceil 125/27 \rceil = 5$ parts. Similarly to the serial polynomial multiplier, serial integer multipliers are based on a standard multiplication approach. This leads to very similar architectures, as shown in Figure 9(c). The main difference lies in the management of carry propagation between intermediate coefficients. Now, one needs to decide whether the $\frac{t}{q}(\cdot)$ operation is done at this point or later. By implementing it now, the remaining computations are performed on smaller coefficients, which reduces hardware consumption. By implementing it later, this operation can be scheduled more efficiently, or can even be computed in software. Because this operation is very simple in our setup, the reduction is executed right after the integer multiplication, as can be seen in Figure 9(b). Section IV discusses some cases where implementing the reduction just after the integer multiplication is not necessarily the best choice, especially for higher multiplicative depths.
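The arithmetic performed by such a multiplier can be modelled in Python. The sketch below only reproduces the data flow (27-bit segments, partial products of the size a DSP handles, carry propagation); it is not cycle-accurate and does not describe the serialized hardware. The constants and function names are assumptions for illustration, with $q = 2^{125}$.

```python
LIMB_BITS = 27
Q_BITS = 125
LIMB_COUNT = -(-Q_BITS // LIMB_BITS)  # ceil(125 / 27) = 5
LIMB_MASK = (1 << LIMB_BITS) - 1


def to_limbs(x):
    """Split a coefficient into LIMB_COUNT segments of 27 bits, lowest first."""
    return [(x >> (LIMB_BITS * i)) & LIMB_MASK for i in range(LIMB_COUNT)]


def limb_multiply(a, b):
    """Full product of two coefficients from 27 x 27-bit partial products."""
    a_limbs, b_limbs = to_limbs(a), to_limbs(b)
    columns = [0] * (2 * LIMB_COUNT - 1)
    for i, ai in enumerate(a_limbs):
        for j, bj in enumerate(b_limbs):
            columns[i + j] += ai * bj  # one DSP-sized product per pair
    result, carry = 0, 0
    for k, column in enumerate(columns):
        total = column + carry
        result |= (total & LIMB_MASK) << (LIMB_BITS * k)
        carry = total >> LIMB_BITS     # propagate to the next segment
    return result | (carry << (LIMB_BITS * len(columns)))


def limb_multiply_mod_q(a, b):
    """Product reduced modulo q = 2**125 by keeping the low 125 bits."""
    return limb_multiply(a, b) & ((1 << Q_BITS) - 1)
```

The structure mirrors the polynomial multiplier, with segments playing the role of coefficients; the carry loop is the part that has no counterpart in the polynomial case.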

Post-crossbar
Because a scheduling has been applied to sub-polynomials after the pre-computation, a reverse scheduling is required to realign sub-polynomials before post-computations. However, this step requires much more storage than the pre-crossbar, because all sub-polynomials must be aligned with the one that was delayed the most during the pre-crossbar. Two strategies can be used here. The first consists of implementing a complex crossbar that directly produces well-aligned outputs, as in the pre-crossbar. The second consists of implementing successive stages of post-crossbars and post-computations in order to realign as few sub-polynomials as possible. This can reduce storage requirements because many polynomials are already well aligned for a given recursion, considering that only 33% of sub-polynomials have been moved during the pre-crossbar.

Post-computation
Post-computation follows the same approach as pre-computation and is built as a recursive architecture. Figure 10(a) shows the elementary operations required for a post-computation stage, and Figure 10(b) the related architecture. As can be noticed, many more operations are required than for pre-computations: six coefficient additions and 14 coefficient subtractions.
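To make the roles of pre- and post-computation concrete, a generic recursive Karatsuba multiplication can be sketched as follows. This is a textbook two-way formulation given as an assumption-laden illustration: it does not reproduce the exact operation counts of Figure 10 or the crossbar scheduling, and it assumes operand lengths of the form BASE_LEN × 2^k so that every split is even.

```python
BASE_LEN = 5  # degree 4 base case, handled by standard multiplication


def schoolbook(a, b):
    """Standard polynomial multiplication used at the bottom of the recursion."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def karatsuba(a, b):
    """Multiply two equal-length coefficient lists (length = BASE_LEN * 2**k)."""
    n = len(a)
    if n <= BASE_LEN:
        return schoolbook(a, b)
    h = n // 2
    a_lo, a_hi = a[:h], a[h:]
    b_lo, b_hi = b[:h], b[h:]

    # Pre-computation: one extra operand per input, built with additions only.
    a_mid = [x + y for x, y in zip(a_lo, a_hi)]
    b_mid = [x + y for x, y in zip(b_lo, b_hi)]

    # Three half-size products instead of four.
    lo = karatsuba(a_lo, b_lo)
    hi = karatsuba(a_hi, b_hi)
    mid = karatsuba(a_mid, b_mid)

    # Post-computation: subtractions recover the cross term, then additions
    # recombine the three overlapping partial results.
    cross = [m - l - u for m, l, u in zip(mid, lo, hi)]
    out = [0] * (2 * n - 1)
    for i, v in enumerate(lo):
        out[i] += v
    for i, v in enumerate(cross):
        out[i + h] += v
    for i, v in enumerate(hi):
        out[i + 2 * h] += v
    return out
```

In this formulation, only one of the three operands produced by each pre-computation is new, while post-computation needs both subtractions and overlapping additions, which is consistent with post-computation being the more expensive of the two steps.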

Adapting FV.Relin in hardware
Several modifications to our design are required to use Karatsuba for relinearization. First, a pre-computation and a pre-crossbar are needed for each relinearization key. Second, the integer multiplier needs to be adapted. However, no modifications are needed for post-computations. By sending polynomials as before, the FV.WordDecomp ω,q operation is already done. Because the first relinearization key must be multiplied by the first segment of the polynomial to be relinearized, the second key by the second segment and so on, relinearization keys are not sent at the same time but shifted, as shown in Figure 11(a). As for the standard polynomial multiplier, the architectures of the polynomial multiplier and of the integer multiplier are shown in Figures 11(b) and 11(c). The polynomial relinearizer multiplier is quite similar to the standard one, except for the left operand operations, which are performed on each relinearization key. The integer multiplier now has as many inputs as there are relinearization keys, and FIFOs are added after the DSPs in order to realign coefficients, as can be seen at the bottom of Figure 11(a). By adding simple switches before and after the DSPs, the serial polynomial multiplier and the polynomial relinearizer multiplier can share the same logic, which limits hardware resource consumption as much as possible.

[...] larger polynomial multiplication, these two designs are functionally equivalent. Indeed, to multiply two degree 2560 polynomials, one needs an 8192-FFT or a 4096-NWC. The only difference is that our design supports the batching technique. However, even in the case of the NWC, our accelerator reduces computation time by 23% for the homomorphic multiplication in FV, and reduces ALMs by 57%, registers by 46%, embedded memory by 99.95% and DSPs by 30%. This is because our accelerator is hardware/software co-designed and some computations are executed in software, whereas the FFT requires an autonomous implementation in hardware. The large memory saving also follows from the fact that our accelerator operates as a stream and so does not need to store large banks of coefficients.

[...] from the CPU load, extra post-computations in hardware may be a good solution.

Comparison to Software Implementation of FV
Recent work on a pure software implementation of FV in [22] reports very promising computation times for low multiplicative depths. For the parameter set (n = 4096, log2 q = 168), the authors of [22] achieve a homomorphic multiplication in 7.68 ms. They implement a full RNS variant of the scheme. The authors of [22] set ω to 62 bits, which implies a larger modulus but a much smaller relinearization key. Compared to the FFT hardware implementation in [17] with the parameter set (n = 4096, log2 q = 125), the pure software solution is two times faster. Our accelerator nevertheless remains of interest, because our design supports the batching technique whereas the optimized software implementation does not. To allow batching, the NWC-based approach requires the size of the FFT to be doubled. Thus, to compare the two approaches fairly, software timing results for the parameters (n = 8192, log2 q = 168) would be required. Since no such software implementation results are available, we only provide an estimate of the computation time based on the NTT complexity given in [22]. Because the NTT has a complexity of O(n log2 n), increasing the NTT size from 4096 to 8192 increases the complexity by a factor of about 2.16. Thus, we can estimate the computation time of the full RNS software implementation with batching at 7.68 ms × 2.16 = 16.58 ms, and so our accelerator remains competitive. Moreover, the size of ciphertexts is smaller in our case, both because ω is smaller and because the polynomial degree is smaller. Furthermore, several optimizations can be made to our accelerator, in particular to the software part, in order to further improve its competitiveness.
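The scaling argument above can be reproduced with a few lines of code. The sketch below only assumes that the running time is proportional to n log2 n; the function names are hypothetical. Without intermediate rounding, the ratio is exactly 13/6 ≈ 2.17 and the estimate is about 16.6 ms, which is consistent with the truncated values 2.16 and 16.58 ms quoted in the text.

```python
import math


def nlogn_cost(n):
    """Operation count proportional to n * log2(n)."""
    return n * math.log2(n)


def scaled_time_ms(measured_ms, n_from, n_to):
    """Scale a measured time by the ratio of the n * log2(n) costs."""
    return measured_ms * nlogn_cost(n_to) / nlogn_cost(n_from)


# (8192 * 13) / (4096 * 12) = 13 / 6
ratio = nlogn_cost(8192) / nlogn_cost(4096)

# Estimated software time with batching, starting from the 7.68 ms reported in [22].
estimate_ms = scaled_time_ms(7.68, 4096, 8192)
```

This is a first-order estimate only: it ignores the constant factors and memory effects that a real implementation at n = 8192 would exhibit.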

Scalability of the Proposed Accelerator
Our implementation results demonstrate that for the proposed homomorphic scenario, that is to say circuits with multiplicative depths up to 4, our accelerator reduces both computation times and hardware resources on the FPGA compared to the FFT. However, a main concern is the scalability of the architecture for higher multiplicative depths. Due to the asymptotic complexity of the FFT, it is clear that Karatsuba will stop being competitive beyond a certain degree. To estimate that degree, we implemented various configurations of our accelerator until one matched an existing FFT implementation in terms of both hardware resource consumption and computation time. We found the turning point of our Karatsuba approach at degree 6144 polynomials with 512 bits coefficients. With such parameters, an FFT using the batching technique must be of size 16384 with 512 bits coefficients. Table 5 provides implementation results of our accelerator for the parameter set (n = 6144, log2 q = 512) compared to the FFT implementation in [17] with the parameter set (n = 16384, log2 q = 512). As one can see, the hardware resource consumption is equivalent, with comparable computing time. The main limit to Karatsuba scalability is clearly the relinearization. The relinearization takes 3 times longer than the hardware computation of the polynomial multiplication. Indeed, due to the limited bandwidth of the PCIe, we are not able to send the complete relinearization key. Because the PCIe is equivalent to a 250 MHz bus with a 256 bits width on the FPGA side, and because our polynomial coefficients are split into 27 bits segments, we can only send 9 polynomials in parallel (9 × 27 = 243 < 256). Thus, when the number of relinearization sub-keys exceeds 8, the hardware relinearization process must be started again with the remaining sub-keys. For the parameter set (n = 6144, log2 q = 512), the number of relinearization sub-keys is 19, requiring 3 hardware relinearizations.
The software computation time of post recursions is also an important issue, but can be compensated by additional efforts on the software part.
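The bandwidth argument can be checked numerically with the short sketch below. It rests on two assumptions that are not stated explicitly in the text: the relinearization word size equals the 27 bits segment size, and one of the 9 parallel streams carries the polynomial being relinearized, leaving 8 streams for sub-keys. The function names are hypothetical.

```python
import math

BUS_WIDTH_BITS = 256  # PCIe-side bus width quoted in the text
SEGMENT_BITS = 27     # coefficient segment width quoted in the text


def parallel_streams(bus_bits=BUS_WIDTH_BITS, seg_bits=SEGMENT_BITS):
    """Number of 27-bit segments that fit side by side on the bus."""
    return bus_bits // seg_bits


def relin_subkeys(log2_q, word_bits=SEGMENT_BITS):
    """Number of words, and thus sub-keys, when q is cut into word_bits-bit words."""
    return math.ceil(log2_q / word_bits)


def relin_passes(log2_q):
    """Number of hardware relinearization runs needed for a given modulus size."""
    # Assumption: one stream carries the polynomial to be relinearized,
    # the remaining streams carry relinearization sub-keys.
    keys_per_pass = parallel_streams() - 1
    return math.ceil(relin_subkeys(log2_q) / keys_per_pass)
```

Under these assumptions, `relin_passes(512)` gives the 3 hardware relinearizations reported above, while a 135 bits modulus needs a single one.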

Pros and Cons of Karatsuba Compared to FFT
As established in Section 4, Karatsuba can be more efficient than the FFT in terms of both computation time and resource utilization for specific parameters. Karatsuba has several advantages compared to FFT, despite its higher asymptotic complexity. First, Karatsuba is a simple algorithm, with basic pre- and post-computations, and so can be easily implemented. Second, Karatsuba can perform polynomial multiplications with non-power of two degrees, which allows the required parameters to be matched more precisely. Table 6 provides examples of polynomial multiplications achievable with Karatsuba. Third, the modulus can be freely selected, unlike with FFT, which reduces the division and rounding operation required by the FV scheme to a simple shift. Moreover, the division and rounding in the FFT case is reported to be an important bottleneck in [17]. Fourth, thanks to the use of Karatsuba, several computations can be hidden during transfers. For our accelerator, the sub-polynomial multiplication performed on the FPGA is hidden by the transfers through the PCIe. Fifth, the relinearization can be efficiently adapted to Karatsuba, as explained in Section 3.2.
Karatsuba also has some limitations. First, Karatsuba requires a software/hardware co-design approach to reach competitive computation times, which is not the case for FFT. Second, as stated in Section 5.3, Karatsuba is a good alternative to FFT only up to a certain degree. We estimate this degree at 6144, subject to change if improvements are made to Karatsuba or FFT implementations. Third, because the polynomial multiplication degree achievable by FFT is often over-sized for a given multiplicative depth, changing the multiplicative depth only requires changing the modulus, assuming that the degree does not exceed the size of the FFT. For Karatsuba, each multiplicative depth requires a specific configuration, which implies a substantial modification of the hardware accelerator to change the lowest sub-polynomial multiplication degree.
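The third advantage listed above, namely division and rounding reduced to a shift, can be illustrated as follows. The sketch assumes that both the modulus q and the plaintext modulus t are powers of two; this is a hypothetical choice made for illustration only, and the parameters actually used by the accelerator may differ. The function names are likewise hypothetical.

```python
from fractions import Fraction
import math


def scale_round_shift(x, log2_q, log2_t):
    """Compute round(t * x / q) for q = 2**log2_q and t = 2**log2_t, with t < q.

    Adding half of the divisor before shifting implements round-half-up.
    """
    s = log2_q - log2_t
    return (x + (1 << (s - 1))) >> s


def scale_round_exact(x, q, t):
    """Reference implementation: round-half-up of t * x / q with exact rationals."""
    return math.floor(Fraction(t * x, q) + Fraction(1, 2))
```

With an arbitrary modulus, the same operation requires a full-width multiplication followed by a division, which is the cost that a power-of-two choice avoids in this sketch.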

CONCLUSION
In this paper, we demonstrate that in some cases, especially when the polynomial degree is just above a power of 2 and less than 6144, the Karatsuba algorithm can be a good alternative to FFT. The study provides a complete implementation of a software/hardware co-design approach of Karatsuba for degree 2560 polynomials with 135 bits coefficients, allowing homomorphic operations with the FV scheme for algorithms with a multiplicative depth up to 4. We also provide information on the scalability of our approach and an estimate of the degree at which Karatsuba becomes less efficient than FFT. Compared to previous state-of-the-art contributions, and especially the implementation in [17], our accelerator performs a homomorphic multiplication of FV in 11.9 ms, whereas a functionally equivalent design using FFT requires about 15.46 ms for a multiplicative depth up to 4, and it halves the hardware resource consumption. Moreover, our approach goes in the right direction, considering that recently published Homomorphic Encryption schemes have a lower polynomial degree than previous ones [1]. Future work will consist of evaluating the proposed solution on a more constrained architecture. We will also investigate how to improve the scalability of the design.