A Programmable SoC Implementation of the DGK Cryptosystem for Privacy-Enhancing Technologies

Additively homomorphic encryption has many applications in privacy-enhancing technologies because it allows a cloud service provider to perform simple computations with users' data without learning the contents. The performance overhead of additively homomorphic encryption is a major obstacle for practical adoption. Hardware accelerators could reduce this overhead substantially. In this paper, we present an implementation of the DGK cryptosystem for programmable systems-on-chip and evaluate it in real hardware. We demonstrate its efficiency for accelerating privacy-enhancing technologies by using it for computing squared Euclidean distances between a user's input and a server's database. We also provide comparisons with a recent implementation of the Paillier cryptosystem and show that DGK offers major speedups. This work represents the first implementation of the DGK cryptosystem that uses hardware acceleration and demonstrates that DGK benefits greatly from the hardware/software codesign approach.


I. INTRODUCTION
Users' data is increasingly stored and processed in the cloud. While this trend has many obvious benefits, it comes at a heavy cost to privacy: cloud service providers hold much of users' (sensitive) data and can exploit it, e.g., for marketing purposes. Encrypting the stored data with normal encryption would solve the privacy problems but prevent all processing and, consequently, nullify most of the benefits of cloud services. Privacy-Enhancing Technologies (PETs) are methods to protect users' privacy without sacrificing the functionality of services. Many PETs are based on homomorphic encryption, a type of encryption that allows computations with encrypted data without revealing its contents.
Additively Homomorphic Encryption (AHE) allows additions in the encrypted domain and makes it possible to implement practical PETs, particularly when combined with certain other multi-party computation technologies. HW implementations of cryptography offer both speed and security improvements compared to software (see, e.g., [1]). AHE introduces significant performance overheads and, hence, improving its performance with HW acceleration may substantially improve the efficiency of PETs and help their practical adoption. So far, only a few works have considered HW acceleration of AHE schemes and PETs; notable exceptions are [2], [3] that study acceleration of the Paillier cryptosystem [4] using Field Programmable Gate Arrays (FPGAs).
In this paper, we focus on the Damgård-Geisler-Krøigaard (DGK) cryptosystem [5], [6]. It is an AHE scheme that provides fast encryption and decryption combined with small ciphertext sizes, but offers a smaller plaintext space compared to many other AHE schemes, most notably the Paillier cryptosystem [4]. From the computational point of view, DGK requires modular arithmetic with large integers similarly to the Paillier cryptosystem but with smaller operand sizes. We propose an implementation of DGK in a Xilinx Zynq-7020 programmable System-on-Chip (SoC) using a multi-core design for large integer arithmetic reported in [3] and previously used for the Paillier cryptosystem. We show the suitability of the DGK implementation for PETs by studying privacy-preserving computations of Squared Euclidean Distances (SEDs) between a user's input and entries of a server's database.
We provide the following contributions:
• We present an efficient implementation of DGK in the HW/SW codesign originally introduced in [3] for the Paillier cryptosystem. To the best of our knowledge, this is the first published hardware implementation of DGK.
• We show that DGK and our implementation can be efficiently used for privacy-preserving computation of SEDs, which are commonly used in PETs.
• We compare the DGK and Paillier AHE schemes and conclude that DGK provides significantly faster encryption, decryption, and SED computations.
The remainder of this paper is organized as follows. Section II presents the preliminaries on the DGK cryptosystem and privacy-preserving SEDs. Section III introduces the architecture of the HW/SW codesign and our implementation of DGK and SEDs. Section IV presents the results and analysis of the implementations and, finally, we draw conclusions in Section V.

II. PRELIMINARIES

A. DGK Cryptosystem
• Key generation. Choose two t-bit primes v_p and v_q and an ℓ-bit prime u. Then, choose two κ/2-bit primes p and q so that v_p | p − 1 and v_q | q − 1 as well as u | p − 1 and u | q − 1. Then, compute N = p · q.
Choose g ∈ Z_N^* of order u·v_p·v_q and h ∈ Z_N^* of order v_p·v_q. The public key is pk = (N, g, h, u) and the secret key is sk = (p, q, v_p, v_q). The parameter sizes should be chosen so that factoring a κ-bit N is hard, t is chosen based on the logarithm of the size of the subgroup of Z_N^*, and ℓ defines the plaintext space. We use κ = 2048, t = 224, and ℓ ∈ [16, 22] in this paper, as we target the 112-bit security level.
• Encryption. Take a message m ∈ Z_u and a public key pk = (N, g, h, u) as inputs and select a 2.5t-bit random r, as instructed in [6]. Then, compute and return the ciphertext:

c = g^m · h^r mod N.   (1)

This process is denoted by c = Enc(pk, m).
• Decryption. Take a ciphertext c ∈ Z_N^* and the secret key sk = (p, q, v_p, v_q) as inputs and compute:

m' = c^{v_p} mod p.   (2)

If m' = 1, then m = 0. To decrypt other values, keep computing g^{v_p i} mod p for i ∈ Z_u until g^{v_p i} = m', then m = i. This works because g^{v_p} has order u and there is a one-to-one correspondence between values of m and g^{v_p m} mod p [5], and it is computationally feasible because Z_u is small. This process is denoted by m = Dec(sk, c).
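As a concrete illustration, the key generation, encryption, and decryption steps above can be sketched in Python with toy-sized parameters. This is an illustrative sketch under stated assumptions, not the paper's implementation: the parameter sizes are far below κ = 2048 and t = 224, and all helper functions and concrete values are our own choices.

```python
import random

def is_prime(n):
    # trial division; adequate only for the toy sizes used here
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def prime_with_divisor(d):
    # smallest prime p with d | p - 1 (search p = d*f + 1)
    f = 2
    while not is_prime(d * f + 1):
        f += 1
    return d * f + 1

def element_of_order(factors, p):
    # random element of Z_p^* whose order is exactly prod(factors)
    d = 1
    for q in factors:
        d *= q
    assert (p - 1) % d == 0
    while True:
        b = pow(random.randrange(2, p - 1), (p - 1) // d, p)
        if all(pow(b, d // q, p) != 1 for q in factors):
            return b

def crt(ap, aq, p, q):
    # combine residues mod p and mod q into one residue mod p*q
    return (ap + p * ((aq - ap) * pow(p, -1, q) % q)) % (p * q)

# --- toy key generation (real sizes: kappa = 2048, t = 224) ---
u, vp, vq = 97, 101, 103                  # toy primes; plaintext space Z_u
p = prime_with_divisor(u * vp)            # u | p-1 and v_p | p-1
q = prime_with_divisor(u * vq)            # u | q-1 and v_q | q-1
N = p * q
g = crt(element_of_order([u, vp], p), element_of_order([u, vq], q), p, q)
h = crt(element_of_order([vp], p), element_of_order([vq], q), p, q)

def enc(m, r=None):
    # c = g^m * h^r mod N  (the paper uses a 2.5t-bit random r)
    r = r if r is not None else random.randrange(1, vp * vq)
    return pow(g, m, N) * pow(h, r, N) % N

def dec(c):
    # m' = c^{v_p} mod p; then search i with (g^{v_p})^i = m'
    mp = pow(c, vp, p)
    gv = pow(g, vp, p)      # generator of the order-u subgroup mod p
    acc = 1
    for i in range(u):
        if acc == mp:
            return i
        acc = acc * gv % p
    raise ValueError("decryption failed")
```

Decryption works because, modulo p, the blinding factor h^r vanishes after raising to v_p, leaving only g^{m·v_p}, which lies in the small order-u subgroup.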
Decryption can be carried out by using a precomputed Decryption Look-Up Table (DLUT) that stores the pairs (δ_i, i), where δ_i = g^{v_p i} mod p, for all i ∈ Z_u. In fact, it suffices to store truncated δ'_i = lsb_λ(δ_i), where lsb_λ(δ_i) denotes the λ Least Significant Bits (LSBs) of δ_i and λ is selected as the smallest value ensuring that all δ'_i are unique. In our case, we use a DLUT where λ is a multiple of 8 (a byte) and the entries (δ'_i, i) are sorted by δ'_i to ensure fast searches. We consider two decryption variants:
• a fast variant that computes m' with (2), finds (δ'_i, i) such that δ'_i = lsb_λ(m'), and directly returns m = i;
• a safe variant that, before returning m = i, computes m'' = g^{v_p m} mod p and verifies that m'' = m'. This removes the possibility of a corrupted ciphertext c returning a valid decryption result due to the truncation.
Let ⟦m⟧ denote the DGK encryption of a plaintext m. The DGK cryptosystem is additively homomorphic under multiplication modulo N and, therefore,

⟦m_1⟧ · ⟦m_2⟧ mod N = ⟦m_1 + m_2 mod u⟧.   (3)

It directly follows from (3) that homomorphic multiplications by a scalar k can be computed via exponentiation:

⟦m⟧^k mod N = ⟦k · m mod u⟧.   (4)

DGK has the advantage that encryption, decryption, and homomorphic operations are faster than for Paillier because the operations are carried out in Z_N instead of Z_{N^2}. Another advantage is that the ciphertexts are smaller by 50% (Z_N vs. Z_{N^2}). The fact that the plaintext space is significantly smaller (Z_u vs. Z_N) can be a disadvantage, as will be discussed later.

B. Privacy-Preserving Distances for PETs
Let x = (x_0, x_1, ..., x_{m−1}) be a query vector and Y a database consisting of n vectors of length m: Y = (y_0, y_1, ..., y_{n−1}), where y_i = (y_{i,0}, y_{i,1}, ..., y_{i,m−1}). W.l.o.g., we assume that x_j, y_{i,j} ∈ Z_{2^τ} and use τ = 4 and τ = 8 as examples in this paper. Fig. 1 shows a use-case scenario where Y is in the possession of a server and x is a query sent by a user. The user wants to find out which y_i is closest to x under some distance metric. Both parties want to keep their inputs secret from the other party.
To ensure the secrecy of x, the user encrypts each x_j with his/her public key pk before sending them to the server: ⟦x⟧ = (⟦x_0⟧, ⟦x_1⟧, ..., ⟦x_{m−1}⟧) = (Enc(pk, x_0), Enc(pk, x_1), ..., Enc(pk, x_{m−1})). The server calculates encrypted distances ⟦d_i⟧ between x and each vector y_i in Y by utilizing the homomorphic properties of the encryption scheme. Typical applications do not use the distances as such, but require the indices of the k smallest distances, i.e., the k nearest neighbors (kNN). The indices cannot be retrieved directly with AHE. The kNN search is implemented, e.g., so that the server masks the encrypted distances with random masks and Yao's garbled circuits [7] are used for removing the masks and finding the minima.

Algorithm 1: A straightforward algorithm for computing the encrypted middle-terms ⟦Δ_{i,2}⟧ for squared Euclidean distances (SEDs) [3].

1) Squared Euclidean Distances:
The SED is a distance metric that is commonly used in published PETs, e.g., for privacy-preserving fingerprint matching [9], face recognition [8], [10], indoor localization [11]-[13], and user matching [14]. It can be decomposed into three terms in the following way [10]:

d_i = Σ_{j=0}^{m−1} (x_j − y_{i,j})^2 = Σ_{j=0}^{m−1} x_j^2 + Σ_{j=0}^{m−1} (−2 x_j y_{i,j}) + Σ_{j=0}^{m−1} y_{i,j}^2 = Δ_1 + Δ_{i,2} + Δ_{i,3}.   (5)

The terms Δ_1 and Δ_{i,3} depend only on the inputs of one party. Consequently, they can be computed in the plaintext domain and then encrypted with pk. Because Δ_1 is the same for all distances d_i, the user sends only one value ⟦Δ_1⟧. The middle term Δ_{i,2} requires inputs from both parties. The user sends ⟦−2x_j⟧ for j = 0, ..., m − 1 and the server then uses (3) and (4) to compute:

⟦Δ_{i,2}⟧ = Π_{j=0}^{m−1} ⟦−2x_j⟧^{y_{i,j}} mod N.   (6)

Finally, the server adds the encryptions of Δ_1, Δ_{i,2}, and Δ_{i,3}:

⟦d_i⟧ = ⟦Δ_1⟧ · ⟦Δ_{i,2}⟧ · ⟦Δ_{i,3}⟧ mod N.   (7)

2) Algorithms for SEDs: The straightforward and optimized algorithms for computing (6) for i = 0, ..., n − 1 were presented in [3] and they are shown in Alg. 1 and Alg. 2, respectively, where MM and ME stand for Modular Multiplication and Modular Exponentiation, respectively.
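The three-term decomposition of the SED can be sanity-checked in the plaintext domain. This is a plain-arithmetic sketch; the function names are ours and carry no encryption.

```python
def sed(x, y):
    # squared Euclidean distance, computed directly
    return sum((xj - yj) ** 2 for xj, yj in zip(x, y))

def sed_decomposed(x, y):
    d1 = sum(xj * xj for xj in x)                    # Delta_1: user's inputs only
    d2 = sum(-2 * xj * yj for xj, yj in zip(x, y))   # Delta_{i,2}: needs both parties
    d3 = sum(yj * yj for yj in y)                    # Delta_{i,3}: server's inputs only
    return d1 + d2 + d3
```

Only the middle term mixes inputs from both parties, which is why it is the only term that must be computed in the encrypted domain.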
The straightforward algorithm in Alg. 1 computes each ⟦d_i⟧ separately by using (6) in a direct manner. The advantages are simple control and a low memory footprint. The disadvantage is that the same exponentiations are repeated several times when the y_{i,j} are the same for many values of i. The optimized algorithm in Alg. 2 removes this disadvantage by scanning the server's database in the other direction and multiplying ⟦−2x_j⟧^k into all distances for which y_{i,j} = k. This approach is particularly efficient compared to Alg. 1 when the y_{i,j} are small and n is large. The disadvantages are the more complicated control and parallel processing as well as a larger memory footprint.

Algorithm 2: An optimized algorithm for computing the encrypted middle-terms ⟦Δ_{i,2}⟧ for squared Euclidean distances (SEDs) [3].

III. ARCHITECTURE AND IMPLEMENTATION

A. High-Level HW/SW Codesign
This section briefly revisits the HW/SW codesign accelerator presented in [3], where more details are available. The overall architecture is shown in Fig. 2. It divides into HW and SW sides so that the computationally heavy long integer modular arithmetic is performed by the HW side (FPGA) and the control of the HW side and auxiliary operations are performed by the SW side. The architecture is generic and suits different programmable SoCs with minor modifications. In this paper, we use a Xilinx Zynq-7020 [15] in ZedBoard [16] as the implementation platform.
AHE computations in PETs (e.g., SED computations) typically include a lot of inherent parallelism. For this reason, the HW side is a multi-core architecture that includes multiple parallel and programmable Cryptography Processor (CP) cores designed to strike a good balance between performance and area requirements. The CP cores are arranged into M clusters, each including N CP cores. All blocks in the HW side are connected in an Advanced eXtensible Interface (AXI) based structure. Each CP core can be individually programmed via microcode updates.
The HW/SW codesign includes a multi-level memory structure to overcome the bottleneck of data communication between the CP cores and the SW side. The three-level Data Memory (DMEM) is divided as follows. The HW side has a Level-1 DMEM (L1-DMEM) in each CP core and a Level-2 DMEM (L2-DMEM) that is shared by all CP cores. The SW side includes Level-3 Memory (L3-MEM) that consists of both on-chip and off-chip memory (i.e., DDR3 in ZedBoard). Each CP core also includes an Instruction Memory (IMEM) for storing microcodes loaded into the CP core from L3-MEM by the SW side. Data communication between memory levels uses High Performance (HP) AXI interfaces, whereas the General Purpose (GP) AXI interfaces are used for transferring commands and status between the SW and HW sides (see Fig. 2).
The SW side is responsible for controlling the HW side and external peripherals. Specifically, the SW side performs the high-level control and manages the execution flow of the specific computations. These controlling operations include sending and receiving data and microcodes to/from the CPs, issuing commands to the CPs, offline and online programming of the CP cores (via the microcodes) and other modules in the HW side, receiving the status of the CPs and other modules from the HW side, and making control decisions based on the received status.

B. Cryptography Processor
The CP core is an efficient programmable processor for large integer modular arithmetic optimized for the resources of modern FPGAs (e.g., for DSP slices, BRAMs, etc.). The CP core is based on a micro-programming architecture which provides both flexibility (for parameter sizes) and programmability (for different algorithms) combined with a small area footprint. The architecture of the CP core is shown in Fig. 3.
The CP core contains an external interface unit, an arithmetic unit, a data memory unit (L1-DMEM), a control unit, and an instruction memory unit (IMEM). The arithmetic unit contains Modular Multiply-Add Accumulator (MMAA) and Modular Adder/Subtractor (MAS) blocks for computing modular arithmetic. The inputs and output of the arithmetic unit are connected to L1-DMEM, which stores the data required during an algorithm run. Two words can be read from and one word written to L1-DMEM simultaneously to facilitate efficient modular arithmetic. The microcodes stored in IMEM are sequences of instructions for the units of the CP core. Each instruction consists of different fields, such as arithmetic, control, next IMEM address, DMEM address, DMEM, and IMEM fields. These fields provide all required control signals for the units for one working cycle of the CP core. The microcodes are generated by hand through a customized platform and scripts. The external interface unit is the top module and a wrapper for the other units in the CP core architecture. The main tasks of this unit are receiving/sending commands/status from/to external module(s), supporting AXI-based read and write interfaces with the SW side, supporting read and write interfaces with the shared L2-DMEM, and controlling the other units of the CP core.

C. Implementation of Target Applications
In this section, we describe the details of implementing the target applications using the HW/SW codesign. These implementations are carried out with software updates for the SW side and microcode updates for the HW side and do not require any modifications to the HW/SW codesign described in [3], which demonstrates its flexibility.
1) DGK Encryption and Decryption: DGK encryption is computed completely in the HW side (i.e., the FPGA) with consecutive Modular Multiplications (MMs) and Modular Exponentiations (MEs). The SW side controls the overall computation process as well as data and microcode transfers between the SW and HW sides.
DGK decryption consists of two phases: (1) precomputation and (2) main computations. The precomputation is performed only once for each key; it constructs the DLUT and then sorts and stores it in L3-MEM. The main computations consist of two steps for the fast variant and three for the safe variant. First, an ME is performed in the HW side to calculate m' using (2). Second, a binary search with lsb_λ(m') is performed on the DLUT in the SW side to obtain m. Third, for the safe variant, the validity of m is verified with an extra ME in the HW side.
2) Squared Euclidean Distance: The SED computation consists of two parts: first, the middle-terms of (6) using Alg. 1 or Alg. 2 and, second, the final distances of (7).
a) Alg. 1: A distance between the user's input x and a vector y_i in the server's database is computed in a single CP core. Each CP core computes consecutive MEs and MMs (i.e., lines 3-5 of Alg. 1) for a specific i and then two MMs to compute (7). That is, the computation proceeds in a row-wise manner where Y is seen as a matrix with n rows (vectors) and m columns (vector elements). All computation happens in the HW side and the SW side performs simple control and data transfer tasks.
b) Alg. 2: The CP cores operate in a mixed column- and row-wise manner. First, they operate column-wise so that columns (i.e., different iterations of the for-loop in line 2) are assigned to different CP cores. Each CP core computes MMs (i.e., lines 3-7) for a specific column j. According to Alg. 2, different CP cores (different j) must contribute to the same t_i. This is implemented so that each CP core has a local copy of t_i until the end of the column-wise processing, after which the copies are combined with MMs in a row-wise manner. Second, the CP cores compute (7) in a row-wise manner similarly to Alg. 1. Obviously, the SW side now performs more control and data transfer tasks, but the HW side performs fewer computations.

IV. RESULTS AND ANALYSIS
To evaluate the performance of DGK and to demonstrate the efficiency of the implementation for accelerating PETs, we implemented it on real hardware. We targeted Xilinx Zynq programmable SoCs, specifically the Avnet ZedBoard Zynq Evaluation and Development Kit [16] that includes a low-cost Xilinx Zynq-7020 xc7z020clg484-1 [15]. The target chip includes a dual-core ARM Cortex-A9 (the SW side) and an Artix-7 FPGA (the HW side). For the SW side, we used C++ and Xilinx SDK for developing software for a Real-Time Operating System (RTOS). For the HW side, we used Verilog and Vivado 2016.3. The resource requirements of the HW side in the Xilinx Zynq-7020 device are summarized in Table I. The clock frequencies for the FPGA and the ARM are 122 and 667 MHz, respectively. Based on Vivado, the total power consumption is about 3.2 W. The following results are final post-place&route results validated with real hardware.

A. Results
Table II shows the encryption and decryption timings. Encryption, which requires operations modulo N (κ = 2048 bits), is expectedly slower than decryption, which operates with the smaller modulus p (κ/2 = 1024 bits). As most of the delay for encryption is in the HW side, multiple encryptions can be performed in parallel by utilizing the N · M CP cores of the multi-core architecture. However, the SW side delays dominate decryptions, making parallel processing more difficult. Table II presents decryption results for a 16-bit u. As will be seen below, this precision is too small for certain SED computations that require up to 22-bit precision. The effect of the increased precision on the timings is small: encryption and decryption times increase only about 1% and 12%, respectively, for ℓ = 22 (and λ = 48) compared to Table II.
Similarly to [3], Table III presents the timings of the SED computations. Additionally, we provide results with two database precisions, τ = 4 and τ = 8, for y_{i,j} ∈ Z_{2^τ}; [3] considered only τ = 8, but this precision requires an up to 22-bit precision (ℓ = 22) for the distances d_i.
Hence, we also provide results for the smaller precision τ = 4, which allows using a 16-bit plaintext space for all database variants and also has practical relevance as shown, e.g., in [13].

B. Comparisons
Tables II and III also collect results from [3] to provide a comparison between DGK and Paillier. Because both are implemented on the same HW with the same HW/SW codesign, an easy and fair comparison between the two cryptosystems is possible. The tables show that DGK is significantly faster: encryption is more than 12 times faster, decryption speedups range from 34 times (for the safe variant) up to 72 times (for the fast variant), and SED computations are about 2 to 3.5 times faster. These speedups are mainly due to the smaller operand sizes compared to the Paillier cryptosystem. The memory consumption of the DGK-based SED computation is also smaller thanks to the smaller ciphertext sizes.
Although these numbers suggest that DGK is clearly superior to Paillier, the situation is not as undisputed in reality. The small plaintext space of DGK prevents its use in applications requiring high(er) calculation precision. We already saw this limitation because we had to increase the precision up to 22 bits for our large datasets, leading to a large DLUT of about 19 MB of RAM for the specific u that we used. The small plaintext space also prevents packing of several ciphertexts, a technique that applies to the Paillier cryptosystem and reduces both the communication overhead and the number of decryptions that need to be computed by the receiver. Hence, the decryption speedups calculated directly based on Table II may not fully reflect the speedups in real applications. On the other hand, packing is a computationally intensive operation that increases the load of the sender (the server) [3]. Finally, we emphasize that the precision offered by DGK (e.g., 22 bits) is enough for many practical applications, and in those cases it offers major benefits.

TABLE III: Performance characteristics of the squared Euclidean distance (SED) computations on the multi-core HW/SW codesign system (12 CP cores: M = 6 and N = 2) for DGK (Alg. 1 + (7) and Alg. 2 + (7)) and Paillier (straightforward and optimized algorithms) [3].

C. Security Model
Because PETs are the primary target applications, we consider a security model where the adversary is the other party of the protocol. I.e., an adversarial user aims to find out the server's database Y and an adversarial server aims to find out either (a) the user's secret key sk or (b) the user's input x (note that (a) implies (b), but not vice versa). We assume that DGK and the protocol are secure and focus on information leakage from the implementation. Furthermore, we assume that the adversary lacks physical access to the other party's computation platform (including via malware) and limit our analysis to remote timing side-channel attacks. Thus, Table II also includes results for Constant Time (CT) variants. The basic modular arithmetic (multiplication, addition, and subtraction) is CT by default, but MEs have two versions: square-and-multiply (non-CT) and square-and-multiply-always (CT).
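The difference between the two ME versions can be sketched as follows. This is an illustrative software model of the two strategies: the actual CP cores implement them in microcode/HW, and Python itself gives no strict constant-time guarantees.

```python
def me_sqmul(base, exp, mod, bits):
    # square-and-multiply (non-CT): the multiplication happens only on
    # 1-bits, so run time depends on the exponent's Hamming weight
    r = 1
    for i in range(bits - 1, -1, -1):
        r = r * r % mod
        if (exp >> i) & 1:
            r = r * base % mod
    return r

def me_sqmul_always(base, exp, mod, bits):
    # square-and-multiply-always (CT): one squaring and one
    # multiplication per bit regardless of the bit's value; only the
    # result selection depends on the bit (HW would use a mux here)
    r = 1
    for i in range(bits - 1, -1, -1):
        r = r * r % mod
        t = r * base % mod          # always computed
        r = t if (exp >> i) & 1 else r
    return r
```

The CT variant pays roughly one extra multiplication per exponent bit, which is the performance cost of removing the exponent-dependent timing.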
Although Enc(pk, m) does not use the secret key sk, it should still be CT to prevent information leakage about m via a timing channel. A CT encryption follows directly from CT MEs. It also suffices to use CT MEs to protect Dec(sk, c) from leaking sk. However, the timings of DLUT searches may leak information about m. Table lookups are notoriously difficult to make fully CT in SW; while it is easy to make a search (in our case, a binary search) with a constant number of iterations, the memory hierarchy of modern processors (with caches) may incur data-dependent timing variations. Arguably, this leakage is not significant enough in our setting to compromise x, for a number of reasons. First, the server can only measure the timing between sending the ciphertexts (see Fig. 1) and the beginning of the following phase (typically the garbled circuit phase). This timing, therefore, sums over n decryptions and prevents the server from collecting information about individual x_j. Second, the ciphertexts contain distances that include only indirect information about x. Third, the server cannot cheat by using fabricated database entries or incorrect calculations to increase the dependencies between d_i and x, because the user would immediately notice it as poor quality of service. Nevertheless, the timings of DLUT lookups should be carefully considered for each particular application.
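One way to avoid a data-dependent memory access pattern in the lookup is to scan the whole table and select the match arithmetically, trading speed for uniformity. This is an illustrative sketch of the idea only; as noted above, even this pattern is hard to make strictly CT on a processor with caches, and the function name is ours.

```python
def scan_lookup(table, key):
    # touch every (key, value) entry exactly once and select the match
    # without data-dependent branching, so the access pattern does not
    # depend on which key is being looked up
    result = 0
    found = 0
    for k, v in table:
        eq = int(k == key)      # 0 or 1 selector
        result |= eq * v
        found |= eq
    return (result, bool(found))
```

Compared to the binary search, this costs a full pass over the DLUT per decryption, which illustrates why the paper treats the CT question as an application-specific trade-off.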
An adversarial user wants to find out Y by exploiting the timing of computing the SEDs. Again, the user can measure only the overall timing of computing all distances and, hence, can make estimates, e.g., on the density of Y and the sum of the Hamming weights of all y_{i,j}. It is evident that this leakage is not enough to compromise Y or to construct a Y' ≈ Y that functions similarly. Even this small leakage can be prevented by using a CT variant of Alg. 1, i.e., by using CT MEs for the computations with y_{i,j}. The cost depends particularly on the density of Y but also on how the y_{i,j} are distributed in Z_{2^τ}.

V. CONCLUSIONS
We presented an efficient implementation of DGK on a programmable SoC and demonstrated its feasibility for accelerating privacy-preserving computation of SEDs, an important operation for many PETs. To the best of our knowledge, this was the first reported implementation of DGK that uses HW acceleration. We also provided comparisons between two AHE schemes: DGK [5], [6] and Paillier [4] cryptosystems.
DGK benefits greatly from the HW/SW codesign paradigm. The HW side can efficiently accelerate modular arithmetic (especially exponentiations), and privacy-preserving SED computations can efficiently utilize the parallel processing capabilities of the multi-core architecture. Implementing DGK decryption efficiently in HW alone would be difficult because it requires either large DLUTs or brute-force searches through the plaintext space. Hence, efficient decryption requires interplay between the HW and SW sides.
The comparison between the DGK and Paillier cryptosystems showed that DGK is significantly faster: encryption is more than 12 times faster, decryption speedups are 34-72 times, and SED computations are about 2 to 3.5 times faster. Although the numbers clearly favor DGK over Paillier, the small plaintext space of DGK may become an issue in practice.
To summarize, we showed that the DGK cryptosystem is suitable for an efficient HW/SW codesign implementation. It provides a good alternative to the more traditional Paillier cryptosystem used in many PETs. If the application is such that the small plaintext space does not become an issue, then significant speedups can be achieved by using DGK instead of Paillier and accelerating it with our implementation.