Functional Verification of Large-integers Circuits using a Cosimulation-based Approach

ABSTRACT


INTRODUCTION
Large-integer arithmetic is a set of operations like addition, multiplication, modular reduction, etc that involves integers larger than the native word size of the general purpose processors, typically, 64-bit. Depending on the target application requirements, integer operands may have 163-bit, 192 bit, 512-bit, 1024bit of length, and more. One place where large integers are used is cryptography, especially in the public-key family like RSA [1] and Elliptic Curve Cryptography [2], [3]. Large integers are also used in complex research, high performance computing (HPC) and computational algebra. Large integers operations know a continuous development in mathematical algorithms [4][5][6]. Hardware-based implementations of such algorithms have proved to be more efficient than equivalent software's programs in terms of speed and resources usage. This is mainly due to exploring new design architectures [7][8][9][10][11][12]. Such designs are generally written in hardware description languages (HDLs).
To verify that a design works as intended, two technologies are commonly used; Simulation-based verification and formal verification. The simulation-based verification is the technique generally used for complex designs. Formal verification, which consists in mathematically checking the functional correctness of the design, is generally used to verify small designs and corner cases. However in the last decade, formal verification tools have seen their capacity to verify more complex designs improved to some extent, in part, because of its coupling with simulation (dynamic formal verification) and the standardization of some  Figure 1. The Control flow controls the process of verification through the platform. It fixes the settings; i.e. the parameters of the blocks constituting the platform, delay times, sampling times, etc. Data flow represents the transformations that data undergoes, starting from the generation of large-integer operands, passing through the input adapter, the DUV, the output adapter and, finally, entering the comparison/checking blocks. The third flow, Verification flow, verifies the functional correctness of the DUV. We chose to build the verification flow around two complementary simulationbased verification approaches: testbench and assertion-based verification. We guide the testbench via a verification structure with considerations of a coverage plan. When testbench is launched, outputs of DUV and reference model are compared. Results are then transferred to a Scoreboard to be analyzed. We write assertions in Property Specification Language (PSL) [24], standard assertion language, inside the DUV and represent a precise description of the DUV's behavior. Note that we chose PSL for property description as it‗s in widespread use in industry and compatible with many hardware description languages. PSL assertions are checked by the HDL simulator during the simulation. The assertions verification results (pass/fail) are also sent to the scoreboard to be analyzed and new stimuli are generated in the next testbench according to the updated functional coverage.

The Functional Verification Process
The purpose of the -Functional‖ verification process is to verify that the DUV matches its specification. This process should verify that the implemented functions behave correctly. The verification technology used is the simulation-based verification, more precisely a cosimulation between Matlab/Simulink and ModelSim, and simulated assertions written in PSL.
Globally, we followed a coverage-driven random-based verification approach. The level of verification can be of unit/sub-unit or cores/blocks level and two simulation-based verification techniques are used jointly, depending on the partition of DUV being verified. In fact, a common practice in the integrated circuits design community is to divide designs into datapath and control logic ( Figure 2). Because of their differences, appropriate verification schemes can be applied to each. Datapath units which involve largeintegers processing can be verified using the Matlab/Simulink testbench where large-integers are supported as will be detailed in the next section. Datapath usually consists of uniform arrays of cells, such as bits in a register file, slices in an adder and so on. The remaining logic is regarded as control logic. An advantage of using HDL cosimulation with Matlab/Simulink testbench is the possibility of cross-level datapath verification, as will be more detailed later. This means that data's output of different hierarchical level can be probed and compared in run-time against Matlab models. On the other hand, control logic can be accurately specified by properties and assertions, and thus is verifiable using PSL. The DUV's control logic is specified by a set of proprieties written in PSL assertions.
The verification structure is the set of Matlab Function Blocks within the platform in charge of the verification plan. Figure 3 represents the architecture of the -Verification structure‖. The latter is divided in two block sets, connected to form a loop with the rest of the platform. The first set is composed of stimuli generation blocks while the second is composed of analysis blocks (-Comparator‖, a -Checker‖ and a -Scoreboard‖). Within the first, the Data output of the DUV (Z_DUV), is verified against the Reference Model output (Z_Ref), the result of the comparison is transferred to the Scoreboard. According to the feedback, the first set generates new stimuli corresponding to the next coverage step and/or to the next property to be verified. The DUV is here a modular arithmetic operation. The objective is to ensure that DUV matches its specification.
The functional verification approach is a white-box approach (i.e. the HDL design is known to the verifier). Because of sampling time difference, control signals and data were assigned to separate blocks ( Figure 3). Both blocks are driven by a block called -Testbench Scenario Update‖. The role of the latter is to update control signals and data according to verification coverage. Inside the Data block, a Random Number Generator (RNG) produces constrained random largeinteger data (X,Y operands) using a VPI seed value. An overview of the functioning of the platform's blocks is given in Table 1. The comparator receives data from DUV and Reference Model, converts it to VPI data type, makes the comparison, and finally transfers the result to the Scoreboard. Sub-DUV Checker checks the functional correctness of internal operations of DUV. It receives the probed metadata from the DUV, converts it to VPI data type, calculates the internal operation bit-wise as a reference operation, makes the comparison and finally transfers the result to the Scoreboard. As illustrated in Figure 3, the comparator and the Sub-DUV Checker update the scoreboard with results as long as the cosimulation is going on. The control logic verification, not represented in Figure 3, is done with PSL assertions simulated in the HDL Simulator. new_seed=Compute_seed(feedback(i)) //constraint on seed to hit specific data interval Do{ CRLI_1 = randint(vpi(new_seed)); (*) // CRLI = Constrained Random Large-Integer CRLI_2 = randint(vpi(new_seed)); OK = Verify_seed (new_seed, CRLI_1,CRLI_2); } // to verify that RLI is constrained as desired While (NOT (OK)) // RLIs do not match the desired interval.

Large-integer Data Processing
Large-integers data processing is an important part of the platform. Processing is carried out in Matlab, Simulink and simulated hardware. We assume that the platform, shown in Figure 4, verifies the operation f: Z = f(X,Y), where: X, Y and Z are three large-integers. Control signals are reset and start. Done is an output signal that indicates the end of the DUV's operation. Simulation control is handled by Simulink.
The co-simulation stage (stage 5 in Figure 4) contains the Reference Model and the DUV. The Reference Model is the DUV's equivalent model written in Matlab inside a Matlab Function block. The DUV is represented by the HDL Cosimulation block. The DUV's simulator (ModelSim) is launched and linked to Simulink using a Matlab code based on a TCL script. When communication is established, the simulator functions as the server and Simulink as a client. The HDL simulator responds to simulation requests it receives from the Simulink Client. The communication between the HDL Simulator and Simulink is done through the HDL Verifier™ tool. The maximum length of integer data types supported by HDL Cosimulation Block in Simulink is 128-bits. To work around this limitation, the DUV was masked in an HDL wrapper that stacks the received data frames into a logic vector that matches the input data size of the DUV and vice-versa for the output data. Two kinds of adapters were used in the platform (Frontend and Backend adapters). The first one adapts data and control signals received in Matlab/Simulink formats to DUV supported formats. Within this stage, each data matrix is converted into a sequence of scalars using the Simulink's block -Unbuffer‖. The Unbuffer unbuffers an M-by-N input into a 1-by-N output ( Figure 5(a)). That is, inputs are unbuffered row-wise so that each matrix row becomes an independent time-sample in the IJECE ISSN: 2088-8708  Functional Verification of Large-integers Circuits using a Cosimulation-based Approach (Nejmeddine Alimi) 2197 output. As example, a 192-bit data fits into a 24-by-8 matrix, and the Unbuffer Block will unbuffer the 24by-8 input into an 8-bit length vector. Then, each data is converted to standard logic vector via the -Data Type Conversion‖ block. The Backend adapter adapts data and signals from DUV to the Comparator and Checker blocks. In this stage, the HDL block output data is re-buffered into a decimal matrix using a Simulink block -Delay Line‖. The latter performs the reverse task of the -Unbuffer Block‖, rebuffering a sequence of Mi-by-N matrix inputs into a sequence of Mo-by-N matrix outputs ( Figure 5(b)).  An attention should be given to the reading time of the -Delay Line‖ output so that the data matrix can be read entirely. In fact, the DUV wrapper, detailed in a latter paragraph, was designed to send each of Z_DUV frames at every clock's positive edge starting from instant when the output signal -done‖ is on and according to the -Delay Line‖ Block functioning, the entire matrix representing Z_DUV can be read by the Comparator Block at time T calculated in formula 2: Where: Time (done=1) is the time when done is set to ‗1', Nbr of Z frames is the total number of Z frames outputted from HDL Block, and T (Z sample period) is the sampling time of Z. The Figure 6 illustrates the communication between Simulink and the DUV via the wrapper. As shown, Inputs (reset, start_frames, start, X, Y) are stimuli from Matlab/Simulink, while Outputs (Done, Z_DUV, Probe1… Probe n) are results sent back to Matlab/Simulink for comparison and internal checking. The start_frames is an extra input to the DUV Wrapper to control the reception of frames from Data block.
The role of the Wrapper is to handle a cycle-accurate transfer of data between Simulink and the DUV without modifying the latter's description. The wrapper , written in VHDL, is based on an Input converter and two Output converters. One dedicated to DUV's result, the other to internal signals (Figure 7).

2199
-Output Converter‖ module receives the result, bits are added to the logic vector to bring it to the required size (f k+1 to f n frames) (-padding‖ operation). Then, the logic vector is sent in w-bit frames to the next stage (Figure 8(b)). Similarly, the -Debug Converter‖ module brings DUV's internal signals to the next stage. Figure 8. The DUV's Wrapper units.

Platform Control, Settings and Execution
An essential side of the platform is the control and settings. Platform control consists in controlling the execution of the testbench by scheduling the stimuli to the DUV/Reference Model and the outgoing signals/data to be verified. The challenge here is to synchronize the Matlab/Simulink blocks, which are inherently untimed, with an RTL-level design running in an event-based simulator (ModelSim). The Platform Control process is abstracted in the timed finite state machine (TFSM) represented in Figure 9. Because Matlab Function Block, and Matlab language in general, is untimed, the timing and the delivery of the data is controlled by the HDL simulator (when Matlab Function Block is located after the DUV) and/or by the Block's sampling time setting (when Matlab Function Block is located forward). Using a Simulink Digital Clock, the stimuli (control signals and data) are generated in specific simulation times. The transition delay times between the TFSM states are presented in Table 2.
Platform settings are the settings of parameters related to each block of the platform. That is, the Simulink blocks parameters (Unbuffer, Delay Line, etc.) and the sampling times for Matlab function blocks (Verification structure blocs). A graphical user interface (GUI) was developed to facilitate this task.
In addition to the choice of Simulink blocks and algorithms inside Matlab Function blocks, the functioning of the platform relies on the timing settings. In fact, for each block, a sample time needs to be specified. In Simulink, the sample time of a block is a parameter that indicates when, during simulation, the block is active and if appropriate, updates its internal state. For HDL cosimulation Block, a sample time has Where UBI is -Unbuffered Block Input‖ As UBI is connected to Data block, the value of Data block's sample time is the same as that of UBI. Because the Unbuffer only accepts fixed-size input, output of Data sub-block cannot be set to variable size type. Therefore, data bit-size (operands) has to be set manually by user for different operands bit-size. In practice, Sample time can be set automatically by working around the restriction of the Unbuffer. To do so, the size of all data transferred between blocks in the platform are chosen as a constant that holds all standard sizes of finite-field commonly used in Public-key Cryptography (for instance, 1024). Therefore, the Number of data matrix rows is also a constant (fixed size-input) and the Sample Time of UBI, becomes only DUV's Clock dependent. The choice of a unique size simplifies the transmission/reception of data inside each block and only bits corresponding to operands real size (e.g. 192-bit) are used. This task is done by the Wrapper as it brings operands to the required size by the padding and unpadding processes previously detailed. User has to place the VHDL design code(s) inside a specific folder for co-simulation, adjust wrapper's parameters to the size of the HDL design operands and connect wrapper's probes (outputs) to desired DUV's internal signals. The Output Data Adapter, represented by the Simulink block -Delay Line‖, executes the reverse task of the Unbuffer Block. That is, it transforms a sequence of data frames into a matrix.
When verification is launched, PSL assertions test results are processed in Matlab workspace. Then, results are carried to Scoreboard along comparator and checker results. The Scoreboard computes the new verification coverage and generates a feedback summarizing the coverage. According to the feedback, a new testbench scenario and parameters targeting the uncovered assertions and/or datapath logic not yet verified are set. Then, the next testbench will be ready to run.

Verification Platform with FPGA in-the-loop
Another aspect of reusability of the proposed platform is the possibility to switch from HDL cosimulation to real hardware testing while keeping the same verification platform. This option was tested with the -Hardware-in-the-loop‖ (HIL) option provided by Simulink for FPGA boards equipped with Gigabit Ethernet port (an Altera DE2-115 board with Cyclone IV EP4CE115 FPGA was used). This way enables controlling and verifying a design (a modular multiplier, more details in section 3) running on FPGA from the Matlab/Simulink platform with the design's real execution time ( Figure 10). However, this came at a cost as that internal verification (Sub-DUV Probing) becomes inaccessible due to the FPGA development's tool restrictions on design's coding style and wrapping.
To conclude this section, Table 3 gives a comparison between the present work and similar works from literature. The Table shows that the proposed platform while sharing some features with other works (supported HDL, cosimulation, etc.), stands out with more powerful HVL, unrestricted large-integer support and adaptability to HIL.

CASE STUDY & RESULTS
As case study of the platform, we consider the operation Z = f(X,Y) , where X, Y and Z are three large-integers. Control signals are reset and start, while Done is an output indicating the end of the operation. The goal is to evaluate the cost of the bit-size, the number of assertions and the internal signal probing on the platform.

Large-integer Arithmetic Background
Large-integer arithmetic has a variety of applications in cryptography. Among these, AES, RSA and ECC. As illustrated in the Figure 11, ECC schemes are based on Point operations, primarily on the point multiplication and also on the operations on which it point multiplication relies, i.e. point addition and doubling. In turn, those point operations are made on finite-fields arithmetic, a particular field of largeintegers. This implies that finite-field arithmetic are determinant to design an efficient elliptic curve cryptosystem. Finite-field arithmetic is the arithmetic of integers modulo a large prime p. Arithmetic in a finite-field is different from standard integer arithmetic and all operations performed in the finite-field result in an element within that field. Three kinds of fields that are used for efficient implementation of ECC systems are prime fields (F p ), binary fields (F 2 m ), and optimal extension fields (F p m ). Those fields were extensively studied and this has resulted in numerous algorithms. Finite-field arithmetic is a practical example of large-integer arithmetic usage and is the cornerstone of cryptographic schemes such as ECC.

The DUV
A hardware implementation of a Finite-field multiplication algorithm called the -Double, add, and reduce‖ (DAR) multiplier [27] was used as a DUV. The DAR multiplier is based on the Interleaving Multiplication Algorithm [28]. Given a k-bit natural x and a natural y the product z = x . y can be computed as follows formula 6: x.y = (x k-1 2 k-1 + x k-2 2 k-2 + … +x 0 2 0 ).y (6) The latter can also be expressed as in formula 7: x.y = (… ((0.2+ x k-1 y )2+ x k-2 y)2+…+ x 1 y)2+x 0 y If all operations (addition and doubling) are executed mod m, the result is product = x .y mod m. The corresponding (left to right) algorithm, written in ADA syntax, is presented in Listing 1. The function -mod_m_addition(x, y, m, k)‖ computes x + y mod m ; x, y, and m being k-bit numbers, according to the binary mod m Addition. This unit of the datapath represents an internal large-integer operation. In the design, operands were set to recommended sizes (192,384,512,1024) for cryptographic use by the NIST [29]. The datapath and a part of control logic corresponding to the hardware description of Algorithm 1 are shown in Figure 12. The DUV is an ideal case for the platform testing with internal large-integer operation and a distinct control units and datapath. In practice, in addition to the functional validation (comparing DUV against reference model), each partition modules were verified. For datapath, -Mod m Adder‖ module was the target of internal checking while the control logic units were verified with PSL assertions.

Tests, Results and Discussion
The DAR multiplier was verified with the platform on an 32-bit Intel Pentium Dual-Core processor (2,5 GHz, 2GB RAM, 2MB cache memory). The goal is to measure the cost over time of the bit-size, the number of assertions and the number of probes. A campaign of tests was carried out for each parameter.
Detailed execution times of the platform as a function of operands sizes were measured using Simulink's Profiler. The results are presented in Figure 13. The latter shows that the total recorded time increases quite linearly with operands size but is still acceptable for an operation on 1024-bit. For the four bit-sizes, the "HDL co-simulation block" occupies a small portion of the execution time (between 2.81 % and 5.18%) and remains rather constant. As can be seen in the same figure, the total recorded time is dominated by the initialization in average sizes (192 and 384). This aspect decreases in larger sizes in favor of the group of blocks " data, Scoreboard, and testbench Updater" reaching ≈40% of the total recorded time for 1024-bit size. This can be explained by the fact that the time used to the guided generation, adaptation and transmission of data stimuli increases with bit-size. It should be noted that in order to get a correct measure of bit-size cost, data across the platform was transmitted/received in the exact bit-size without using the padding/unpadding operations and related automated settings. concluded that the PSL assertions increase the execution time but the more the number gets larger (> 100), the more the impact on time is limited. It should be noted that the -steep slope‖ aspect of the curve is due to the high steps taken, on the horizontal axis, from the 200 th assertion.
To evaluate the impact of the number of Probes on the execution time, measurements of the platform time for a fixed DUV bit-size (1024 bit) function of the number of probes were made and the results are shown in Figure 15. In this test, the outputs of the "HDL co-simulation" block, the number of sub-blocks of the "Backend Adapter" stage and the inputs of the "Checker" were adjusted to match the corresponding number of probes. According to results, when the number of probes increases, the execution time increases linearly but with a low slope (α ≈ 0.4). It can be said that the number of probes increases the execution time of the verification process but does not penalize it especially because a small number of probes is generally needed for verification.
Analysis of the three tests campaign results indicates that the parameters (bit-size, number of assertions and number of probes) has only a moderate impact on the execution time of the verification platform, thus justifying its efficiency.

CONCLUSION
In this paper, we have presented a novel platform intended to verify hardware large-integer based designs, the first one based on Matlab/Simulink to the best of our knowledge. We demonstrated that the proposed platform holds a number of interesting aspects for the task of verification. First, this is run time and cycle-accurate verification. Second, flexibility, where minor adjustments in Matlab/Simulink blocks parameters, different bit-length can be verified with a moderate impact on execution time. Besides, testbench scenarios are adjustable to meet desired verification coverage where datapath and control logic can be verified simultaneously and in different level of the design hierarchy. Third, reusability: In this paper, we developed testcase on finite-field arithmetic but we also tested the platform to verify a scalar multiplication ( Figure 11) this proves that the platform is adapted to more complex systems like cryptographic primitives.
Future work will involve improvements like reducing synchronisation and data transfer overhead, limiting the complexity of the wrapping module, and enabling internal verification in HIL. Furthermore, an interesting area of application of the platform that would need further efforts is the verification of designs under development, with possibility of replacing unachieved blocks with equivalent Matlab/Simulink models. Another extension of the platform, in the field of cryptanalysis, could be using Matlab's data processing features to verify design robustness to side-channel and fault injection attacks in HIL.