Compact and Programmable yet High-Performance SoC Architecture for Cryptographic Pairings

Cryptographic pairings are important primitives for many advanced cryptosystems. Efficient computation of pairings requires the use of several layers of algorithms as well as optimizations at different algorithm and implementation levels. This makes implementing cryptographic pairings a difficult task, particularly in hardware. Many existing hardware implementations fix the parameters of the pairing to improve efficiency, but this significantly limits the generality and practicality of the solution. In this paper, we present a compact and programmable yet high-performance architecture for programmable system-on-chip platforms designed for efficient computation of different cryptographic pairings. We demonstrate with real hardware that this architecture computes optimal ate pairings on a Barreto-Naehrig curve with 126-bit security in 2.18 ms on a Xilinx Zynq-7020 device while occupying only about 3200 slices, 36 DSPs, and 18 BRAMs. We also show that the architecture can support different types of pairings via microcode updates and can be implemented on other reprogrammable devices with very minor modifications.

On the implementation side, efficient implementations have been presented both in software [18]-[22] and hardware [23]-[37], the latter including both Field Programmable Gate Arrays (FPGAs) and Application Specific Integrated Circuits (ASICs). Pairings are very complicated operations involving multiple layers of algorithms (e.g., [22] utilizes 31 algorithms to compute an optimal ate pairing), and efficient pairing computation requires careful choices of parameters and algorithmic tricks. Consequently, their implementation is notoriously difficult and laborious, especially in hardware.
While software provides natural flexibility and allows support for multiple pairings as well as easy updates of pairing algorithms, hardware is significantly more rigid. Although fixing parameters leads to a more efficient implementation, it may come with a significant penalty in practical feasibility because it reduces flexibility regarding the types of pairings and curves and hinders the adoption of new algorithms. In theory, flexibility could be provided with reprogrammable hardware but, in practice, this is hard because pairings are complicated algorithms and designing separate implementations for all pairing types and parameter sets would be a daunting task. Hence, there is a clear need for flexible high-performance implementations that can be used for efficiently computing different cryptographic pairings. Typically, pairings are only a part of a cryptosystem, and an implementation must also support other operations in order to realize the cryptosystem (e.g., identity-based encryption, searchable encryption, or functional encryption schemes). Hence, an implementation of pairings should be compact and achieve a good speed-area tradeoff.
The Hardware/Software (HW/SW) codesign paradigm is suitable for pairing computations and their use in larger cryptosystems because complicated control flows can be implemented in software while still receiving the benefits of hardware acceleration through efficient yet compact accelerator cores. This is particularly due to the fact that the complicated state machines required for controlling complex pairing computations are easy and efficient to implement in software, whereas they incur significant area overheads in hardware. A HW/SW codesign is also scalable in the sense that it can be extended with additional cores for parallel pairings and/or other operations needed by the cryptosystem. In this paper, we focus on programmable System-on-Chip (SoC) platforms (e.g., Xilinx Zynq SoCs) that realize the HW/SW codesign paradigm with hardwired processors (typically ARM cores) and reprogrammable hardware (i.e., FPGAs).
To keep the discussion concise and clear, we focus particularly on optimal ate pairings [16] over BN curves [17] with the specific parameters used by Beuchat et al. [22]. Nevertheless, we emphasize that the implementation is generic and can be used for implementing various pairings on different curves.
In this paper, we provide the following contributions:
• We describe a compact programmable SoC architecture for cryptographic pairings that achieves high performance and a very good speed-area tradeoff. The architecture is optimized for the resources of modern reprogrammable SoCs such as DSPs, BlockRAMs, and hard ARM cores.
• The architecture supports microcode updates that can be used for supporting different cryptographic pairing algorithms, curves, and other parameters with the same accelerator architecture. This makes our architecture significantly more viable for practical deployment than hardware implementations with fixed parameters.
• We evaluate the proposed HW/SW codesign system on real hardware using an Avnet ZedBoard featuring a Xilinx Zynq-7020 programmable SoC chip and showcase the above-mentioned benefits.
The rest of this paper is organized as follows. We briefly survey the relevant algorithmic background in Section II. We present the architecture of our implementation and the computation procedures in Section III followed by results and analysis in Section IV. Finally, we end the paper by drawing conclusions in Section V.

II. PRELIMINARIES OF PAIRINGS
A cryptographic pairing is a bilinear map G_1 × G_2 → G_3, where G_1 and G_2 are additive groups and G_3 is a multiplicative group. In the context of optimal ate pairings on BN curves, G_1 and G_2 are additive groups of points on the elliptic curves E(F_p) and E(F_p^k), and G_3 is the multiplicative group of F_p^k. The parameters must be chosen so that discrete logarithms in all three groups are infeasible; e.g., for an approximately 128-bit security level, we need a 256-bit prime p and k = 12.
The algorithm for computing an optimal ate pairing over BN curves is given in Alg. 1. The two main operations in the algorithm are the Miller loop in lines 2-5 and the final exponentiation in line 9. The former consists of elliptic curve arithmetic in E(F_p^2) and line evaluations in F_p^12 that can be interleaved. The latter is an exponentiation in F_p^12 that can be decomposed into f^((p^6-1)(p^2+1)(p^4-p^2+1)/r), of which the first two factors can be efficiently computed with Frobenius operators and conjugations. The last factor is called the hard part and is computationally the most demanding.
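The decomposition of the final exponent can be checked numerically. The following sketch (illustrative only) instantiates the standard BN polynomial parameterization with the curve parameter x = -(2^62 + 2^55 + 1) used by Beuchat et al. [22]; the variable names are ours.

```python
# Numeric check of the final-exponentiation decomposition. The BN
# polynomial parameterization is standard; x = -(2**62 + 2**55 + 1) is
# the curve parameter used by Beuchat et al. [22].
x = -(2**62 + 2**55 + 1)
p = 36*x**4 + 36*x**3 + 24*x**2 + 6*x + 1   # field characteristic (254 bits)
r = 36*x**4 + 36*x**3 + 18*x**2 + 6*x + 1   # prime group order
t = 6*x**2 + 1                              # trace of Frobenius

assert p.bit_length() == 254 and r.bit_length() == 254
assert r == p + 1 - t                       # E(F_p) has exactly r points

# (p^12 - 1)/r factors into the "easy" part (p^6 - 1)(p^2 + 1), handled
# with Frobenius operators and conjugations, and the "hard" part:
assert p**12 - 1 == (p**6 - 1) * (p**2 + 1) * (p**4 - p**2 + 1)
assert (p**4 - p**2 + 1) % r == 0           # embedding degree 12: r divides Phi_12(p)
hard_exponent = (p**4 - p**2 + 1) // r
```

The last assertion is exactly what makes the hard part an integer exponent that must be handled by generic exponentiation techniques.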
In [22], F_p^12 is represented as a tower of extension fields defined by irreducible binomials (given explicitly in [22]). Consequently, arithmetic operations in the above fields are computed with series of operations in F_p. In particular, the Karatsuba-like construction allows multiplications in the quadratic extensions F_p^2 and F_p^12 to be computed with three multiplications (and additions/subtractions) in the underlying fields F_p and F_p^6, respectively. Multiplications in F_p^6 require six multiplications in F_p^2 [22]. Lines 3, 5, 7, and 8 are computed using formulae from [18], [38]. Line 6 requires only three multiplications and two negations in F_p. Line 9 follows the ideas of [39] and consists of multiple low-level algorithms and optimizations.
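The Karatsuba-like F_p^2 multiplication can be sketched as follows. We assume the common tower choice u^2 = -1 for illustration; [22] fixes the actual binomials.

```python
# Karatsuba-like multiplication in F_p^2 = F_p[u]/(u^2 - beta): three F_p
# multiplications instead of the four of the schoolbook method.
# beta = -1 is an assumption for illustration only.
def fp2_mul(a, b, p, beta=-1):
    a0, a1 = a                            # a = a0 + a1*u
    b0, b1 = b                            # b = b0 + b1*u
    v0 = a0 * b0 % p                      # F_p multiplication 1
    v1 = a1 * b1 % p                      # F_p multiplication 2
    v2 = (a0 + a1) * (b0 + b1) % p        # F_p multiplication 3
    c0 = (v0 + beta * v1) % p             # constant coefficient
    c1 = (v2 - v0 - v1) % p               # coefficient of u
    return (c0, c1)
```

The same three-multiplication pattern lifts to F_p^12 over F_p^6, as stated above.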

III. ARCHITECTURE AND IMPLEMENTATION
In this section, we present our architecture for pairings using the HW/SW codesign approach in a programmable SoC. Most of the existing hardware-based pairing implementations have focused on maximizing speed at the expense of resource utilization and programmability. The few flexible designs that support different pairings and parameters are significantly slower. The HW/SW codesign approach allows an efficient tradeoff combining high performance with low resource usage and flexibility. This is particularly true for pairing computations, where the main difficulty is the high number of different algorithms that must be supported but where the computations mostly rely on the same low-level operations (i.e., F_p arithmetic).

A. High-Level HW/SW Codesign
Our architecture is constructed as a generic HW/SW codesign and can be instantiated in various programmable SoCs with minor modifications. However, in this paper, we consider mainly instantiations in Xilinx all-programmable SoCs because we use Avnet ZedBoard and Xilinx ZCU102 evaluation kits for prototyping. We will refer to the specific features of those programmable SoCs whenever such a distinction is required. To provide programmability and to decrease resource utilization, the HW part of our architecture uses a microprogramming approach instead of implementing hardwired Finite State Machines (FSMs) for the specific algorithms of the pairing computations, because microprogramming provides flexibility, scalability, and programmability combined with a small area footprint that would be hard to achieve with algorithm-specific FSMs in hardware.
Fig. 1 illustrates the high-level architecture of the HW/SW codesign, which is divided into two main parts: the SW and HW sides (called Processing System (PS) and Programmable Logic (PL) in Xilinx terminology, respectively). The SW side consists of ARM core(s), on-chip and off-chip (i.e., DDR3) memories, and other interconnection and control. The HW side consists of the Pairing Cryptography Processor (PCP) and supporting modules (i.e., Xilinx IP cores including Direct Memory Access (DMA), memory and peripheral interconnects, General Purpose Input/Output (GPIO), and processor system reset). The data and control communication between the SW and HW sides is based on the capabilities of the specific programmable SoC, and we use the Advanced Extensible Interface (AXI) High Performance (HP) and General Purpose (GP) interfaces of Xilinx SoCs. The HP interface is employed for high-performance transfers of data and microcodes, and the GP interface is used for transferring commands and statuses (see Fig. 1). The SW side is responsible for controlling the HW side and external peripherals.
Specifically, the SW side performs the high-level control and manages the execution flow of the pairing computation. These operations include sending and receiving data and microcode packets to/from the PCP, issuing commands to the PCP, offline and online programming of the PCP (via the microcodes) and of other modules in the HW side, receiving the status of the PCP and other modules from the HW side, and making control decisions based on the received status. As shown in Fig. 1, all modules in the HW side are connected in an AXI-based structure. The high-performance data and microcode communication between the SW side and the PCP in the HW side is done via the HPx interface, which connects to the AXI memory interconnect block, which in turn connects to the PCP core via an AXI DMA block. Furthermore, the command and status communication is handled via the AXI peripheral interconnect block in the HW side, which is also used for controlling the AXI DMA block used for the high-speed data and microcode transfers.

B. Pairing Cryptography Processor (PCP)
The cost of a pairing computation is generally expressed by the total number of required field operations (i.e., multiplications, additions/subtractions, constant-multiplications, and inversions). Moreover, the efficiency of the architecture and of the scheduling of field operations are the main factors that determine the overall performance of a pairing implementation [37]. The main objective in designing the PCP is to achieve a good trade-off between programmability, speed, and area requirements and to efficiently utilize the resources of modern FPGAs (e.g., DSPs and BRAMs) for implementing the base field arithmetic (i.e., arithmetic in F_p). Because the tower extension field arithmetic is ultimately based on F_p arithmetic, this allows us to efficiently implement the different arithmetic operations in F_p^2, F_p^4, F_p^6, and F_p^12 (tower field arithmetic). Fig. 2 depicts the architecture of the PCP, which contains the external interface, arithmetic (datapath), control, Data Memory (DMEM), and Instruction Memory (IMEM) units. The external interface unit is used for command, status, data, and microcode communication with the external modules. The IMEM contains a 1024 × 72-bit simple dual-port RAM and a controller for different address branch scenarios. IMEM stores the microcodes for the algorithm(s) run on the PCP. Each instruction in the microcodes consists of several fields that apply the required commands to the corresponding units for one working cycle of the PCP. The IMEM is partitioned into 32 segments (i.e., 32 × 2.25 Kb = 72 Kb), where each segment can be loaded separately via the external interface unit at runtime. In addition, full microcode loading of the IMEM can be done by the SW side directly at runtime.
The control unit generates addresses for DMEM and makes decisions for loop iterations and conditional statements. The inputs and outputs of the arithmetic unit are connected to DMEM, which stores the data required during an algorithm run. DMEM is a duplicated 1024 × 256-bit true dual-port RAM with two independent read and write ports and supports "4-read", "2-write", or "2-read and 1-write" operations from/to DMEM. This facilitates efficient scheduling and parallelization of F_p arithmetic. DMEM is also interfaced with the external interface unit for communicating data with the SW side.

1) Arithmetic Unit:
The datapath is shown in the top right corner of Fig. 2 and supports arithmetic in F_p with arbitrary primes p of up to 256 bits (i.e., up to 128-bit security). It consists of three parts: source registers, arithmetic blocks, and output selectors. The arithmetic blocks comprise three Montgomery Modular Multiplier Blocks (MMMBs) and two Modular Adder/Subtractor Blocks (MASBs), which can operate in parallel and independently of each other. The inputs of all arithmetic blocks can be loaded from DMEM, but the inputs of the MASBs can additionally be loaded from the outputs of the arithmetic blocks. This arrangement, together with the multi-read/write feature of DMEM, allows efficient computation of tower extension field arithmetic. E.g., an F_p^2 multiplication requires three F_p multiplications, which can be computed in parallel on the three MMMBs. The modulus p and the precomputed Montgomery constant p' are registered into the arithmetic unit.
a) MASB: The structure of the MASB with a two-stage pipeline is also illustrated in Fig. 2. Addition and subtraction in F_p are realized by two consecutive adder/subtractor circuits, which produce the result in two cycles. Due to the pipeline, the throughput is one F_p addition/subtraction per cycle. Using two MASBs and feeding the outputs of the datapath back to its inputs facilitates efficient field arithmetic operations such as F_p^2 addition/subtraction/negation, F_p^2 multiplication/squaring, and multiplications by small constants.
b) MMMB: Fig. 3 shows the structure of the MMMB for computing F_p multiplications/squarings. It contains three nested parts, organized bottom-up. The bottom part, the MAAB, consumes most of the FPGA resources, has the highest dynamic power consumption, and also contains the critical path of the PCP. In order to maximize its efficiency, it is implemented using DSP slices. In the middle part, the MAAB is complemented with an accumulation operation (i.e., the MAAAB).
The lower part of the MAAB result is accumulated with the previous higher part as well as with the previous most significant bit of the accumulation result (i.e., the input carry). The output carry and the higher part of the MAAB result are stored for the next accumulation (see Fig. 3). The latencies for computing r_low and r_high are five and six clock cycles, respectively. This accumulation method and the one-clock-cycle difference between r_low and r_high are essential for an efficient implementation of the high-radix Montgomery modular multiplication algorithm [40]. Finally, in the top part, the MAAAB (as the main computing core) together with multiplexers, registers, and FSMs implements radix-2^64 Montgomery modular multiplication [40]. The MMMB computes a multiplication/squaring in F_p with a total latency of 43 clock cycles, but a new multiplication/squaring can be started after only 38 clock cycles thanks to the pipelined scheme.
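The radix-2^64 interleaved Montgomery multiplication that the MMMB realizes can be modeled in software as follows. This is an illustrative sequential sketch of the standard word-level algorithm, not the exact pipeline of Fig. 3.

```python
# Word-level (radix-2^64) Montgomery multiplication: a software model of
# the interleaved algorithm [40]. Returns a*b*R^-1 mod p for
# R = 2^(64*n_words), with p odd and a, b < p.
W = 64
MASK = (1 << W) - 1

def mont_mul(a, b, p, p_prime, n_words=4):
    acc = 0
    for i in range(n_words):
        b_i = (b >> (W * i)) & MASK          # next 64-bit digit of b
        acc += a * b_i                       # multiply-accumulate
        m = ((acc & MASK) * p_prime) & MASK  # m = acc * (-p^-1) mod 2^64
        acc = (acc + m * p) >> W             # low word becomes zero; shift
    return acc - p if acc >= p else acc      # final conditional subtraction

# The constant p_prime = -p^-1 mod 2^64 is precomputed once per modulus,
# matching the registered Montgomery constant mentioned in the text.
def mont_constant(p):
    return (-pow(p, -1, 1 << W)) % (1 << W)
```

Operands stay in Montgomery form (aR mod p) throughout a pairing computation, so the R^-1 factor introduced by each multiplication cancels.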

2) Working Principle and Scheduling of the Architecture:
The initialization step configures both the SW and HW sides of the HW/SW codesign and must be done only once for every pairing algorithm and curve parameter set. It includes loading all inputs and curve parameters into the DMEM of the PCP core in the HW side. To implement a specific pairing algorithm, an in-depth analysis of the algorithm is performed and all algorithms of the pairing are translated into microcodes (i.e., several segment and/or full sub-routine packs). The microcodes are sequences of instructions for the different units of the PCP core. Each instruction consists of fields such as arithmetic, control, next IMEM address, DMEM address values, DMEM, and IMEM fields. These fields apply all required control signals to the units for one working cycle of the PCP core. The microcodes are generated by hand with the help of a customized platform and scripts. In this architecture, each instruction is 72 bits long and is divided into 14 fields. The microcodes are stored in the (off/on-chip) SW-side memory (i.e., DDR3). Whenever a (set of) particular computation(s) needs to be executed in the PCP, the corresponding microcode(s) are loaded into IMEM by the SW side through the external interface unit, as explained before. Fig. 4 details the memory taxonomy in the SW and HW sides, the SW/HW interaction principles, and the PCP instruction format. Clearly, the efficiency of the microcodes for computing tower extension field arithmetic largely determines the overall performance of a pairing computation and, therefore, special care should be taken in scheduling operations and maximizing the utilization of the datapath for these operations. Fig. 5 illustrates how to efficiently implement and schedule the tower extension field arithmetic (from F_p to F_p^12) on the BN126 curve [22], which is the main focus of this paper.
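To illustrate the microcode format, the sketch below packs and unpacks a 72-bit instruction word. The field names and widths here are hypothetical, chosen only to sum to 72 bits; the actual 14-field layout is defined by the PCP design.

```python
# Hypothetical sketch of a 72-bit PCP instruction word. The real design
# uses 14 fields (arithmetic, control, next IMEM address, DMEM
# addresses, ...); names and widths below are invented for illustration.
FIELDS = [                # (name, width in bits) -- hypothetical layout
    ("arith_ctrl", 12),   # opcodes/selects for the MMMBs and MASBs
    ("next_imem", 10),    # next IMEM address (IMEM has 1024 entries)
    ("dmem_rd0", 10),     # up to four DMEM read addresses (4-read mode)
    ("dmem_rd1", 10),
    ("dmem_rd2", 10),
    ("dmem_rd3", 10),
    ("dmem_wr", 10),      # DMEM write address
]
assert sum(w for _, w in FIELDS) == 72

def pack(values):
    """Pack a dict of field values into one 72-bit integer."""
    word = 0
    for name, width in FIELDS:
        v = values[name]
        assert 0 <= v < (1 << width)
        word = (word << width) | v
    return word

def unpack(word):
    """Inverse of pack(): recover the field values from a 72-bit word."""
    out = {}
    for name, width in reversed(FIELDS):
        out[name] = word & ((1 << width) - 1)
        word >>= width
    return out
```

A microcode pack is then simply a sequence of such words streamed into IMEM.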
On the top, it shows how to maximize the usage and scheduling of the datapath for F_p arithmetic (i.e., multiplications/squarings, additions/subtractions) by utilizing parallelism and pipelining. The datapath effectively hides the costs of additions/subtractions because they can be computed simultaneously with multiplications, and this is exploited for efficient computation of tower field arithmetic. In the middle, Fig. 5 depicts the realization of F_p^2 arithmetic using F_p arithmetic and shows that a new F_p^2 multiplication/squaring can be started every 38 clock cycles and, also, that up to eleven F_p^2 additions/subtractions can be done during each F_p^2 multiplication/squaring. The F_p and F_p^2 operations are further used for implementing F_p^4, F_p^6, and F_p^12 arithmetic (addition, subtraction, negation, constant-multiplication, multiplication, squaring, exponentiation, and inversion).
3) Optimal Ate Pairing Computation Steps: The implementation of the optimal ate pairing algorithm (Alg. 1) in our HW/SW codesign consists of three levels. The first level consists of the implementations of F_p and F_p^2 arithmetic discussed above. The second level consists of elliptic curve doubling/addition steps, line evaluation functions, Frobenius operators, and F_p^12 arithmetic operations, which utilize the optimized and interleaved algorithms as well as the tower extension field arithmetic hierarchy. Finally, the third level controls the high-level operations of Alg. 1, which are the Miller loop (lines 2-5), the Frobenius operators and final addition steps (lines 6-8), and the final exponentiation (line 9). They are efficiently realized on the HW/SW codesign by using the algorithms from [22].
In the final exponentiation, an F_p^12 inversion is required. Thanks to the tower extension field arithmetic, it can be decomposed into several additions/subtractions, multiplications, squarings, and a single inversion in F_p, which dominates the computational cost. We compute the inversion in F_p with Fermat's little theorem, which gives a^(-1) ≡ a^(p-2) (mod p). We compute this exponentiation using the right-to-left square-and-multiply algorithm because it allows computing multiplications in parallel with squarings, resulting in a more efficient implementation. Because the cost of the multiplications is hidden, an inversion costs only log_2(p) - 1 squarings. All operations are implemented as different full and segment microcode packs. The entire implementation of Alg. 1 contains 8 full and 24 segment packs. The Miller loop, the Frobenius operators and final addition steps, and the final exponentiation consist of 1 full / 12 segment, 1 full / 2 segment, and 6 full / 10 segment packs, respectively. Alg. 1 is executed in the HW/SW codesign so that, first, all microcode packs are stored in the SW-side memory (i.e., DDR3). Then, for executing a specific computation step, the related full pack is loaded directly into IMEM and the related segment packs are stored in the Auxiliary RAM block of the external interface unit (which can hold up to 16 segment packs). Whenever these segment packs are needed, they are loaded into the IMEM by the external interface unit. Loading of the full packs is done by the SW side, but the segment packs are loaded internally by the HW side without interaction with the SW side (see Fig. 4). The latencies of the full and segment microcode pack transfers are reported and analyzed in the next section. It should be noted that the latencies of preparing and sending/receiving each command/status in the SW and HW sides are 45 and 5 clock cycles, respectively, which are small compared to the computation latencies in the PCP core.
Furthermore, after initializing the PCP core, we need 26 command and status transfers in total for an optimal ate pairing computation. The clock cycles required for transferring the full and segment microcode packs, commands, and statuses between the SW and HW sides are included in the times reported in the next section (i.e., Section IV).
Alg. 1 contains 8 parts (and as many full packs). The Miller loop (lines 2-5) and lines 6-8 are the first two parts. The final exponentiation (line 9) divides into 6 consecutive parts, which are computed following the procedure in [39] and the specific algorithms from [22]. Alg. 1 over the BN126 curve with the parameters from [22] needs, in total, 14300 multiplications/squarings in F_p and only one inversion in F_p.
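The right-to-left square-and-multiply inversion used in the final exponentiation can be sketched as follows. This is a sequential model of the data flow; in the PCP, each multiplication uses only an already-available power of a, so it can be scheduled in parallel with the next squaring, which is why the multiplication cost is hidden.

```python
# Fermat inversion a^-1 = a^(p-2) mod p via right-to-left
# square-and-multiply. The squaring chain a, a^2, a^4, ... advances
# independently of the result accumulator, so hardware can overlap the
# multiplications with the squarings; here we model the flow serially.
def fermat_inverse(a, p):
    e = p - 2
    result = 1
    square = a                         # running chain of squarings
    while e:
        if e & 1:
            result = result * square % p   # absorbed in parallel in HW
        square = square * square % p       # next element of the chain
        e >>= 1
    return result
```

With a left-to-right ladder, each multiplication would instead depend on the preceding squaring, which is why the right-to-left variant is preferred here.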

IV. RESULTS AND ANALYSIS

A. Implementation Setup and Results
To evaluate the performance of the HW/SW codesign, we implemented it on real hardware using an Avnet ZedBoard with a low-cost Xilinx Zynq-7020. The target chip includes a dual-core ARM Cortex-A9 and an Artix-7 FPGA. For the SW side, we used C++ and the Xilinx Software Development Kit to develop software for a real-time operating system (RTOS). For the HW side, we used Verilog (HDL) and Vivado to implement the design on the FPGA. The resource requirements are summarized in Table I. The maximum clock frequencies for the FPGA and the ARM are 105 and 667 MHz, respectively. According to Vivado, the total power consumption of the chip is about 1.9 W. All results are final post-place&route results validated on real hardware, unless mentioned otherwise. Table II gives the numbers of clock cycles to compute the different parts of Alg. 1 over the BN126 curve from [22] on the FPGA.
To demonstrate the generality of our HW/SW codesign and its efficiency on a modern programmable SoC, we implemented it on a Xilinx Zynq UltraScale+ MPSoC ZU9EG chip featuring a quad-core ARM Cortex-A53 processor running at up to 1.5 GHz in the SW side and a 16nm FinFET+ based FPGA in the HW side. Such a programmable SoC platform allows a significantly more powerful instance of our HW/SW codesign to be implemented in a single chip. Furthermore, to enable fair comparisons with other pairing designs, we implemented the HW side of the SoC architecture on a Xilinx Virtex-6 FPGA device. Table III reports the performance characteristics of the implementations on the three above-mentioned platforms. It should be noted that the results are for the optimal ate pairing implementation over the BN126 curve.

B. Related Work, Comparison, and Discussion
In this section, we consider three main categories of pairing implementations: FPGA (and SoC), ASIC, and pure SW implementations. There are many FPGA, ASIC, and SW implementations of different pairing algorithms over various curves and parameters but, for brevity, we cite only those for optimal ate pairings over BN curves that are relevant for comparisons with our implementations. Table IV shows a comparative analysis of hardware and software results of optimal ate pairings over BN curves with 126-128-bit security levels.
In the first part of Table IV for the FPGA implementations, our Virtex-6 FPGA design, which occupies only 3072 slices, 36 DSPs, and 18 BRAMs, compares favorably to other designs with respect to flexibility, scalability, programmability, and area, and still offers comparable speed. Only [36] and [41] are flexible, but they are considerably slower. Flexibility is very important for HW implementations because otherwise a different HW component is needed for each combination of pairings and parameters.
In the second part of Table IV, [28] and [26] focus on inflexible, customized ASIC implementations of pairing cryptographic processors. [28] achieved a high-speed pairing implementation at the expense of a large area footprint. In addition, [30] proposes a flexible and programmable ASIP for cryptographic pairings over BN curves on an ASIC platform, but it is considerably slower. To compare the flexibility and programmability of our HW/SW codesign with the ASIP approach, we note that our design is fundamentally a HW/SW codesign because we need the SW side, e.g., for scheduling data and pack transfers from the main memory. This allowed us to use a compact PCP core in the HW side while providing more flexibility than the ASIP approach. Of course, it would be possible to remove the SW side by extending the existing PCP core with more complex control logic (e.g., for scheduling data transfers from the main memory), similarly to an ASIP-like approach, but this would make the PCP more complex and less flexible/programmable.
Finally, in the third part of Table IV, there are SW implementations of pairings with higher speed than our flexible and programmable hardware (i.e., FPGA) design. SW is also always flexible and programmable by nature. However, HW has many advantages besides pure speed, such as energy per operation and price per core. E.g., as reported above, our Zynq-7020 design consumes about 1.9 W according to Vivado, whereas high-performance CPUs consume up to a few hundred watts under full load. Comparing our PCP core with high-performance CPUs is not entirely fair because the PCP is optimized for the speed-area tradeoff. Besides, a single FPGA (even a Zynq-7020) can host parallel PCPs to further increase the speed. Moreover, a HW implementation (e.g., on an FPGA) is usually also easier to protect against side-channel attacks than a pure SW implementation. Note that, in all three parts of Table IV, some works are faster but not flexible/programmable.

C. Computational Costs of Different Pairing Algorithms
It is common and effective to analyze the computational costs of pairing algorithms by expressing the costs of the different parts of the algorithms as numbers of F_p and/or F_p^2 arithmetic operations. As explained before, the costs of additions/subtractions are hidden in our datapath. Furthermore, F_p^2 multiplication and squaring have the same cost due to the pipelined scheme (see Fig. 5). Hence, we can estimate the costs of the different steps using only the numbers of F_p^2 multiplications/squarings. Because each F_p^2 multiplication/squaring contains three parallel F_p multiplications/squarings (in our datapath) and the design computes an F_p^2 multiplication/squaring with a latency of L_M = 38, we estimate the total number of clock cycles of a pairing algorithm in our HW/SW codesign as

T = C * (⌈M_p/3⌉ * L_M + I_p * L_I),    (4)

where M_p and I_p are the numbers of multiplications and inversions in F_p, L_I is the latency of an inversion in F_p, and C is the overhead of loading microcode packs. Based on our experiments, C ≈ 1.1. For the optimal ate pairing from [22] considered in this paper, we have M_p = 14300, I_p = 1, and L_I = 11938, and (4) gives T = 212393. Measurements from real hardware show that the real number is 208146 clock cycles; hence, the estimate given by (4) has an error of about 2%. Table V collects estimates of the computational costs of different pairing algorithms in our HW/SW codesign obtained by using (4) and the F_p operation counts available in [30].
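The estimate of (4) can be reproduced directly from the counts reported in the text (M_p = 14300, I_p = 1, L_M = 38, L_I = 11938, C ≈ 1.1); the short script below is only a restatement of that arithmetic.

```python
import math

# Cycle-count estimate of (4) with the numbers reported in the text;
# grouping M_p into triples reflects the three F_p multiplications
# computed in parallel per F_p^2 multiplication/squaring.
M_p, I_p = 14300, 1          # F_p multiplications and inversions
L_M, L_I = 38, 11938         # latencies in clock cycles
C = 1.1                      # microcode pack loading overhead

T_est = C * (math.ceil(M_p / 3) * L_M + I_p * L_I)   # ~212392 cycles
T_meas = 208146                                      # measured on Zynq-7020

error = abs(T_est - T_meas) / T_meas                 # ~2%
```

The estimate overshoots the measurement slightly because the constant overhead factor C is itself only an empirical average.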

V. CONCLUSIONS
We presented a new HW/SW codesign for efficient computation of cryptographic pairing algorithms. It combines compact size, programmability, scalability, and flexibility (support for various pairing algorithms and curves) with relatively high performance. The architecture is optimized for the resources of modern reprogrammable SoCs such as DSPs, BlockRAMs, and hard ARM cores. In addition, the proposed SoC architecture supports microcode updates that can be used for supporting different cryptographic pairing algorithms, curves, and other parameters with the same accelerator architecture. We evaluated the proposed HW/SW codesign with real hardware and showed that the architecture can support different pairings via microcode updates and can be implemented on other reprogrammable platforms. We also investigated its efficiency in a high-end Xilinx UltraScale+ programmable SoC platform.
ACKNOWLEDGMENT
This work has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 780108 (FENTEC).