FPGA based technical solutions for high throughput data processing and encryption for 5G communication: A review

Field programmable gate array (FPGA) devices are ideal solutions for high-speed processing applications, given their flexibility, parallel processing capability, and power efficiency. In this review paper, an overview of the key applications of FPGA-based platforms in 5G networks/systems is first presented, exploiting the improved performance offered by such devices. FPGA-based implementations of cloud radio access network (C-RAN) accelerators, network function virtualization (NFV)-based network slicers, cognitive radio systems, and multiple input multiple output (MIMO) channel characterizers are the main considered applications that can benefit from the high processing rate, power efficiency, and flexibility of FPGAs. Furthermore, implementations of encryption/decryption algorithms on the Xilinx Zynq Ultrascale+ MPSoC ZCU102 FPGA platform are discussed; we then introduce our high-speed and lightweight implementation of the well-known AES-128 algorithm, developed on the same FPGA platform, and compare it with similar solutions already published in the literature. The comparison results indicate that our AES-128 implementation enables efficient hardware usage for a given data rate (up to 28.16 Gbit/s), resulting in higher efficiency (8.64 Mbps/slice) than the other considered solutions. Finally, applications of the ZCU102 platform for high-speed processing are explored, such as image and signal processing, visual recognition, and hardware resource management.

communication system was proposed, equipped with Bluetooth connectivity. The system includes an AES encryption block to secure the data transfer between two radio stations; a prototype of the developed communication system was realized employing the RC10 FPGA development board. Compared with other frameworks presented in the literature, the developed one demonstrated a higher data rate and lower power consumption.
As previously discussed, NFV is a crucial element for developing 5G networks; Pinneterre et al. developed a new FPGA-based virtualization approach, called vFPGAmanager, that enables the orchestrated allocation of acceleration resources to virtual machines, unikernels, and containers [31]. The proposed framework enables dynamic remote orchestration using a set of commands associated with the FPGA accelerators and the virtual machine status, exchanged through an innovative communication interface. The experimental results demonstrated that the controller could serve the incoming commands from the orchestrator with negligible overhead (in the worst case, it took 18079 µs to perform 1000 commands).
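The reported worst-case figure translates into a per-command overhead by simple arithmetic on the numbers quoted above:

```python
# Worst-case orchestration overhead reported for vFPGAmanager [31].
total_time_us = 18079           # time to serve 1000 commands (worst case)
n_commands = 1000

per_command_us = total_time_us / n_commands            # ~18.1 microseconds each
commands_per_s = n_commands / (total_time_us * 1e-6)   # ~55,300 commands/s
```
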
Furthermore, FPGAs can be exploited for characterizing massive multiple input multiple output (MIMO) channels, both for local data elaboration and for the standardization of the technology [32]-[35]. In this context, Huang et al. discussed the development of an efficient geometry-based complex MIMO channel emulator, using an iteration-based method [36]. Besides, they analyzed and studied the trade-off between resource utilization and channel accuracy that characterizes the emulator performance. The proposed emulation approach employs an iterative framework for generating the geometry-based channels and optimizes the word length and refresh rate to reduce the FPGA's memory and hardware utilization. The experimental results indicated that the developed emulator, implemented on an FPGA device (Virtex 4 VFX100), can process up to 19 TX/RX antenna pairs in real time. Also, in [37], the authors presented a 2x2 MIMO generalized frequency division multiplexing (GFDM) transceiver based on National Instruments USRP-RIO platforms, which use a Xilinx Kintex-7 FPGA, combined with LabVIEW software, thus obtaining a communication chain deployed both in hardware and in software. A prototype of the developed transceiver, operating in the 1.2-6 GHz frequency interval with 40 MHz bandwidth, allowed the performance of GFDM to be assessed, namely low latency, low out-of-band (OOB) emission, and high reliability. In particular, experimental tests demonstrated that -48 dBm OOB radiation was obtained thanks to the pulse-shape filtering applied to the sub-carriers, enabling the application of GFDM in highly fragmented spectrum scenarios or cognitive radio frameworks [38]-[40]. Ayouby et al. 
introduced a novel combining framework for the optimal generalized diversity receiver for 5G MIMO channels, called generalized maximum ratio combining (GMRC), in which each channel, represented by a single input multiple output (SIMO) and binary phase shift keying-spatially modulated (BPSK-SM) channel, is obtained from a proper combination of diversity channels [41]. This work, derived from a previous implementation [42], recast all operations as additions and multiplications, which are implemented more efficiently in hardware by an FPGA platform. Furthermore, a pipelined framework was used in the proposed solution, resulting in a higher throughput value. The FPGA-based implementation of the proposed combination scheme proved efficient in terms of resource utilization and speed, reaching an operating frequency higher than 180 MHz thanks to the above-described solutions. Given the superiority of the GMRC scheme with respect to the MRC one and the efficient FPGA implementation, the GMRC solution is suitable for future 4G or 5G wireless MIMO receivers [43].
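The classic maximum ratio combining (MRC) rule that GMRC generalizes can be sketched in a few lines (an illustrative model, not the hardware datapath of [41]; the channel gains and the BPSK symbol below are made up):

```python
def mrc_combine(h, r):
    """Classic maximum ratio combining (MRC) for a SIMO channel.

    h: list of complex channel gains; r: received samples, r[i] = h[i]*s + noise.
    Returns s_hat = sum(conj(h_i) * r_i) / sum(|h_i|^2), the SNR-maximizing
    weighted sum of the diversity branches.
    """
    num = sum(hi.conjugate() * ri for hi, ri in zip(h, r))
    den = sum(abs(hi) ** 2 for hi in h)
    return num / den

# Noise-free example: a BPSK symbol s = -1 seen over two diversity branches.
h = [0.8 + 0.3j, 0.2 - 0.5j]
s = -1.0
r = [hi * s for hi in h]
s_hat = mrc_combine(h, r)      # recovers s exactly in the absence of noise
```

Note that the combiner only needs complex multiplications and additions, which is precisely the property the GMRC work exploits for an efficient FPGA mapping.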
To summarize this section, different scientific works involving FPGA devices in communication applications, mainly in 5G networks, have been analyzed and discussed. All the applications exploit the reconfigurability, high operating frequency, and aptitude for parallel computing offered by FPGA platforms. Table 1, shown in the Appendix, summarizes the scientific works discussed above, highlighting the application typology, the employed FPGA platform, and the benefits of the FPGA implementation.

IMPLEMENTATION OF ENCRYPTING/DECRYPTING ALGORITHMS WITH XILINX ZYNQ ULTRASCALE+ MPSoC ZCU102 PLATFORM
In this section, the state of the art of innovative cypher/decipherer implementations is reported, all hinged on the Xilinx Zynq Ultrascale+ MPSoC ZCU102 FPGA platform, described in detail in sub-section 3.1. Afterwards, in sub-section 3.2, we present our novel architecture for the AES-128 algorithm, featuring high data throughput (up to 28.16 Gbit/s) and low utilization of FPGA hardware resources (only 3262 slices), and compare it with other similar solutions reported in the literature.
In [44], the authors introduced a multi-processor architecture to implement the computationally expensive Fan-Vercauteren (FV) homomorphic encryption scheme, employing both an FPGA and an ARM-based processor to carry out several homomorphic processes in the cloud. They used the Halevi, Polyakov, and Shoup optimization techniques to reduce the computational load of the parallel polynomial multiplication algorithm with high-precision arithmetic. Specifically, a Xilinx Zynq UltraScale+ MPSoC ZCU102 board was deployed for developing and testing the proposed programmable architecture. Through parallel computation cores and block-level pipelining, a 200 MHz operating frequency was reached, achieving a rate of 400 homomorphic multiplications per second, thirteen times faster than the most heavily optimized software implementation on an Intel i5 processor. A dataset partitioning solution based on the FPGA platform has been presented in [45] to execute software programs in parallel. To benchmark this architecture, the authors used four different applications: a 256-bit AES encryption algorithm, a hotspot application for thermal simulation, an NBody application measuring particle behaviour under the influence of a force, and a general matrix multiplication (GEMM) application multiplying two 1024x1024 floating-point matrices. The dataset was distributed into fixed-length blocks and, at each iteration, several data chunks were entrusted to a computing unit, and only that subset could be loaded into the unit. Filter classes and a pipeline derived from Intel Threading Building Blocks (Intel TBB) were employed to implement parallelism through a two-stage pipeline. The first stage is a serial filter that calculates the data chunk size and allocates it to the next idle computing unit. The second stage is a parallel filter that simultaneously communicates and processes the data chunks in all the computing units.
The obtained results demonstrated that the developed architecture could provide up to 86.23% of the achievable throughput for the AES application, up to 82.50% for the hotspot application, up to 94.06% for the GEMM application, and up to 111.51% for the NBody application.
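The two-stage pipeline of [45] can be mimicked in plain software (a toy sketch using Python threads in place of the Intel TBB filters and the FPGA computing units; the chunk size and the summing workload are arbitrary choices of this example):

```python
from concurrent.futures import ThreadPoolExecutor

def two_stage_pipeline(dataset, chunk_size, work_fn, n_units=4):
    """Toy analogue of the two-stage pipeline of [45]: a serial first
    stage slices the dataset into fixed-length chunks, and a parallel
    second stage dispatches each chunk to an idle "computing unit"
    (here, a worker thread)."""
    # Serial stage: compute chunk boundaries one at a time.
    chunks = [dataset[i:i + chunk_size]
              for i in range(0, len(dataset), chunk_size)]
    # Parallel stage: process chunks concurrently on the computing units.
    with ThreadPoolExecutor(max_workers=n_units) as pool:
        partials = list(pool.map(work_fn, chunks))
    return partials

# Example: sum a dataset in 4-element chunks, then combine the partial sums.
data = list(range(100))
partial_sums = two_stage_pipeline(data, chunk_size=4, work_fn=sum)
total = sum(partial_sums)      # equals sum(range(100)) = 4950
```
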
In [46], the authors presented an AES-GCM architecture with efficient utilization of digital signal processor (DSP) slices and block random access memory (BRAM) tiles, implementing Drimer's round-based architecture to perform both the AES round and the multiplication in 10 cycles. Furthermore, a fully unrolled pipelined architecture, employing the AES Tbox approach, was reported to carry out the AES encryption and the multiplication in 1 cycle, appending the GCM mode of operation with an optimized GF(2^128) (Galois field) multiplier. In the round-based architecture, the sequence of the SubBytes and ShiftRows steps was exchanged, and the 128-bit data blocks were split into 1-byte chunks applied to the ShiftRows function. To derive the Sbox outputs and the multiplied versions of the Sbox outputs, each byte was provided to its corresponding Tbox. Each column was constituted by four adjacent Tbox outcomes, combined to obtain the corresponding MixColumns output. In the last round, all the Tboxes generated only the Sbox output and the MixColumns function was skipped. In the unrolled pipelined architecture, the operation flow was the same as in the round-based one, with all the rounds implemented in an unrolled pipelined modality for faster execution at the cost of more area. The results demonstrated that the round-based architecture uses 899 LUTs, 1036 FFs, 139 BRAMs, and 685 DSPs. In contrast, the unrolled pipelined architecture employs 785 LUTs, 1043 FFs, 17.5 BRAMs, and 72 DSPs, resulting in lower resource usage.
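The Tbox idea, merging the Sbox lookup with the MixColumns constant multiplications into a single table, can be sketched in software; the following is a minimal illustration (the actual designs of [46] realize these tables in BRAM, and only the first of the four rotated tables is built here):

```python
def _xtime(b):
    # Multiply by x (i.e., by 2) in GF(2^8) with the AES polynomial 0x11B.
    return ((b << 1) ^ 0x1B) & 0xFF if b & 0x80 else b << 1

def _build_sbox():
    """Generate the AES S-box from the GF(2^8) inverse plus the affine map,
    using log/antilog tables over the generator 3."""
    exp, log = [0] * 256, [0] * 256
    x = 1
    for i in range(255):
        exp[i], log[x] = x, i
        x ^= _xtime(x)                      # x *= 3  (i.e., x*2 XOR x)
    sbox = [0] * 256
    for b in range(256):
        inv = 0 if b == 0 else exp[(255 - log[b]) % 255]
        s, r = inv, inv
        for _ in range(4):                  # affine: s ^= rotl(inv, 1..4)
            r = ((r << 1) | (r >> 7)) & 0xFF
            s ^= r
        sbox[b] = s ^ 0x63
    return sbox

SBOX = _build_sbox()

# T0 fuses SubBytes with the MixColumns column {02, 01, 01, 03}: one table
# lookup replaces an S-box lookup plus two GF(2^8) multiplications.
T0 = [(_xtime(s) << 24) | (s << 16) | (s << 8) | (_xtime(s) ^ s)
      for s in SBOX]
```

A full round then reduces to four Tbox lookups and three XORs per output column, which is exactly what makes the approach attractive for pipelined FPGA datapaths.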
Kim et al. proposed SafeDB (Spark acceleration on FPGA clouds with enclaved data processing and bitstream protection), a complete and systematic security framework for the confidential bitstream data handled by on-cloud applications [47]. The framework employed a 256-bit AES algorithm to ensure bitstream security and two asymmetric-key-based security systems, namely public-key infrastructure (PKI) and elliptic-curve cryptography (ECC); these were used for sharing the authentication key between the FPGA and the user, generated in hard-wired logic. The AES key protection was guaranteed for each device by a physically unclonable function (PUF)-based scheme, where the private key, employed for the bitstream decryption, was derived and stored in the FPGA non-volatile memory, whereas the publicly shared key was generated from the private key. Figure 1 depicts the data and bitstream flow from a client to the cloud service provider (CSP); the latter processed the data and executed the client's application, constituted by two sections, namely the house-keeping and kernel codes. The first is a software application executed on the CSP, whereas the kernel code, built with the Xilinx CAD tool, elaborates the user data in the FPGA on the CSP. The FPGA locally decrypted the incoming encrypted data (using the AES decryption module), executed the kernel code, and encrypted the outgoing data again (using the AES encryption module). A management tool, named FPGA-as-a-Service (FaaS), carried out the initial protection configuration (such as passing metadata) of the end-to-end communication between the FPGA and the CSP. The performance evaluation and the hardware utilization were estimated using three benchmark applications, namely word count (WC), Sobel filter (SF), and logistic regression (LR).
For the LR application with a 64 GByte computational load, a performance improvement of up to 1.36x over the baseline was obtained; instead, for the SF application with a 192 GB workload, only a 1.12x improvement in execution time was demonstrated.

Zynq Ultrascale+ MPSoC ZCU102 board: overview
The Xilinx Zynq Ultrascale+ MPSoC ZCU102 board enables the quick prototyping of industrial, automotive, communications, and video applications (Figure 2). It relies on the Zynq Ultrascale+ XCZU9EG-2FFVB1156E multiprocessor system-on-chip (MPSoC), which combines a powerful processing system (PS) and an efficient programmable logic (PL) section within a single package. The PS section includes three main processing units, described below: the application, real-time, and graphics processors. The ZCU102 board offers a PCIe (Peripheral Component Interconnect Express) slot, two mezzanine card interfaces for hardware expansion, universal serial bus (USB) and high-definition multimedia interface (HDMI) interfaces, and RJ45 ports (Registered Jack type 45) for the Ethernet connection.
The ZCU102 board provides several clock sources: the SI5341B clock generator (PS reference clock), an SI570 (manufactured by Silicon Labs, PL reference clock) I2C (Inter-Integrated Circuit) programmable oscillator (300 MHz default), and another SI570 programmable oscillator (156.2 MHz default). Furthermore, the board includes the SI5324 clock generator for HDMI clock recovery and variable clock oscillators for the XCZU9EG MPSoC. The board has SMA connectors, named J79 (P-side) and J80 (N-side), to provide external clocks for the GTH transceivers. Also, the ZCU102 platform is equipped with an MSP430 microcontroller that communicates with the onboard programmable devices over the I2C interface. The system-control user interface, provided by Xilinx, enables checking and managing the board's programmable features, such as the clocks, FMC (FPGA Mezzanine Card) functionalities, power systems, and the PS-side GTR (Gigabit Trans-Receiver) transceiver selection.
The embedded processor is a 64-bit quad-core ARM Cortex-A53, based on the ARMv8-A architecture, operating at a 1.5 GHz clock frequency in 64-bit or 32-bit modes. Each Cortex-A53 core is equipped with separate 32 KB L1 instruction and data caches and a shared 1 MB L2 cache. The real-time processor is a 32-bit dual-core ARM Cortex-R5 (each core equipped with the same cache memories as above and a 128 KB tightly coupled memory, TCM), based on the ARMv7-R architecture and reaching a 600 MHz maximum clock frequency. The graphics processor is an ARM Mali-400, comprising a single geometry processor and two pixel processors and supporting a 667 MHz clock frequency; it also has a 64 KB L2 read-only cache and 4x/16x anti-aliasing support. The platform management unit (PMU), a dedicated user-programmable processor, handles the board's power usage monitoring, error management, and system initialization before the booting stage; it employs a battery power mode to maintain the security configuration and a real-time clock (RTC) even when the board is powered off. The PMU is equipped with a read-only memory containing a set of routines for the startup sequence, interrupts, and power-up/power-down requests; it also stores the system power state at all times and handles PS-level error propagation logic.
There are two memory typologies available in Zynq Ultrascale+ systems, namely a 256 KB on-chip RAM memory (OCM) and off-chip DDR memories. The first stage boot loader (FSBL) is loaded into the OCM from the boot device; after loading, either the APU or the RPU processor executes it. The PS is internally divided into three power regions, isolated from each other by the PMU, allowing functional isolation between regions; each power region can be associated with a power mode in which only some of the components are active. In full-power mode, all sections are fully operating, making it the most energetically expensive mode; in low-power mode, the active components include the RPU, OCM, TCMs, and all peripherals except serial advanced technology attachment (SATA) and DisplayPort. The battery-power mode features the lowest power consumption and includes a battery-backed RAM (BBRAM) to store the encryption key and an RTC, supported by an external clock generator, to keep time even when the system is turned off. The Zynq UltraScale+ MPSoC PS has four high-speed serial I/O (HSSIO) interfaces that support the following protocols: PCI Express®, SATA 3, DisplayPort (with video resolution up to 4K x 2K at a 30 Hz pixel rate), USB 3.0, and serial GMII (Gigabit Media Independent Interface).
The input-output processor (IOP) peripherals are interfaced with external devices through a shared bank of up to 78 dedicated multiplexed I/O pins. Each peripheral can be mapped concurrently onto multiple pin positions by using pre-set pin groups. The PL section can access most IOP interface signals; if the 78 pins are not enough, standard PL I/O pins can be used. Furthermore, the extended multiplexed I/O (EMIO) allows unmapped PS peripherals to access the PL I/O pins, extending the interfacing capability of the MPSoC. The ZU9EG uses a 16 nm FinFET technology. The PL section includes 208 HP and 120 HD I/O pins, 24 GTH 16.3 Gb/s high-speed transceivers, and a monitoring system for detecting the chip temperature as well as internal voltages and currents. It also contains new high-performance peripheral interfaces, such as 1G Ethernet and four Gen2 PCIe interfaces.
The overall FPGA resources are organized, according to the clock management, in a column-and-grid layout; some of the PL resources are dedicated to the processing system for implementing transceivers, memory interface logic, clocking circuits, and I/O interfaces. Other blocks, such as the PCIe interface, configuration logic, and monitoring system, are integrated into the SoC. Generally, FPGAs have dedicated clock routes, known as clock regions, distributed over the chip. In the UltraScale+ architecture, a clock region is 60 CLBs high, corresponding to a bank of 52 I/O interfaces, 24 DSP slices, 12 block RAMs, or four transceiver channels [48]. The clock region width, in terms of number of CLBs, affects the timing repeatability inside it, regardless of the availability and distribution of device resources. Each clock region includes vertical and horizontal clock routing to distribute the clock signal within the region. A multi-layered ARM advanced microcontroller bus architecture (AMBA) Advanced eXtensible Interface (AXI) bus connects the MPSoC blocks and the PL portion, allowing multiple simultaneous master-slave transactions. The AXI bus is designed with the shortest paths to connect the memory blocks and supports high-throughput connections to the slave blocks. The CPU, the direct memory access (DMA) controller, and a combined entity representing the masters in the IOP generate the AMBA AXI bus data, supervised by the interconnect's quality of service (QoS) block.

Comparison of proposed AES-128 encryption/decryption algorithm with other works presented in the scientific literature
As described above, in [16] we proposed a high-speed and resource-efficient implementation of the well-known AES-128 encryption/decryption algorithm, developed for a custom high-frequency (around 60 GHz), short-range (1-10 m) communication system named "wireless connector". A Xilinx ZCU102 development board was employed as the core section of the developed communication apparatus, implementing all the baseband tasks, such as modulation/demodulation, coding/decoding, and encryption/decryption to ensure communication security. In particular, the proposed AES-128 encryption/decryption system employs a pipelined approach, enabling concurrent processing of multiple data packets in each clock cycle within the 10-round elaboration that distinguishes the AES-128 cypher. Also, thanks to a fast implementation of the SubBytes operation through a 32-bit 16x16 Sbox matrix, the processing time of each AES-128 round is reduced to only one clock cycle. The behavioural and post-implementation simulations, along with on-field tests carried out after programming the board, demonstrated that a 220 MHz maximum clock frequency is sustained by both the cypher and the decipherer; also, only ten clock periods are needed to provide the encrypted and plaintext data packets, respectively, while loading a new data packet every clock cycle, thus resulting in a data throughput over 28 Gbit/s (i.e., 128 bit x 220 MHz = 28.16 Gbit/s). Also, a rapid key expansion algorithm has been developed that combines, through combinatorial operators, the current sub-key with the Sbox-processed sub-key of the previous step, thus obtaining the 44 sub-keys involved in the ten rounds of the AES-128 algorithm in only 174.55 ns.
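The headline figures of this implementation follow from back-of-the-envelope arithmetic (the ~8.63 Mbps/slice below matches the reported 8.64 Mbps/slice up to rounding):

```python
# Throughput and efficiency of the pipelined AES-128 core in [16]:
# one 128-bit block completes per clock cycle at the sustained 220 MHz.
block_bits = 128
f_clk_mhz = 220
slices = 3262                              # FPGA slices used by the design

throughput_mbps = block_bits * f_clk_mhz   # 28160 Mbit/s = 28.16 Gbit/s
efficiency = throughput_mbps / slices      # ~8.63 Mbps/slice

# First-block latency: the 10 pipeline rounds take 10 clock periods.
latency_ns = 10 * 1000 / f_clk_mhz         # ~45.5 ns
```
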
The proposed AES-128 cypher and combined encryption/decryption system are compared with different high-speed pipelined implementations presented in the scientific literature, supporting similar operating frequencies and throughputs (Table 2). Besides, the FPGA device employed in each considered work is indicated, because it affects the achievable performance. To compare the different solutions, the efficiency has been chosen as the figure of merit, since it jointly considers the obtained throughput and the used hardware resources. From the results in Table 2, it is evident that the AES-128 implementation proposed in [16] supports a relatively high data throughput (up to 28.16 Gbit/s) while using fewer hardware resources than the other similar works reported in the literature; therefore, it obtains a higher efficiency value. Comparing the proposed solution with the one reported in [52], the former supports a higher data throughput (i.e., +13.0%) while using fewer FPGA slices (i.e., -8.78%), thus obtaining a higher efficiency (i.e., +23.81%). Considering the combined encryption and decryption system and comparing it with the solution proposed in [53], the former supports a higher data throughput (i.e., +30.6%) while requiring lower FPGA hardware resource utilization (i.e., -6.7%), thus resulting in a higher efficiency (i.e., +40.5%).

HIGH-THROUGHPUT DATA PROCESSING APPLICATIONS USING XILINX ZYNQ ULTRASCALE+ MPSoC ZCU102
This section explores the applications of the Xilinx Zynq Ultrascale+ MPSoC ZCU102 FPGA platform for high-speed data processing in several fields, such as image and signal processing, visual recognition, and hardware resource management. In [55], the authors proposed a novel architecture for a parallel multi-view high-efficiency video coding (MV-HEVC) decoder, using the Xilinx Zynq UltraScale+ MPSoC ZCU102 board as a hardware accelerator for complex operations. The proposed method can optimally decompress three videos with 1920 x 1080 pixel resolution in real time using the low-power processor. To improve the compression efficiency, motion prediction between two consecutive frames is needed, which increases the computational load and waiting time of the inter-view parallel implementation. Therefore, the authors developed a method that relies on varying the decompression order, implementing an inter-frame parallel approach with no data dependency between the frames. The frame dependence was defined based on the waiting time and frame frequency, allowing the simultaneous elaboration of independent frames. The authors demonstrated that the MV-HEVC decoder provided a throughput eleven times higher than the 3D-HTM16 software and real-time decompression of a 388p 3-view video.
Huang et al. introduced an expandable FPGA-based digital pre-distortion (DPD) system, guaranteeing linear processing in 5G mm-wave transceivers transmitting wideband modulated signals [56]. The DPD engine architecture, implemented in the FPGA, processes multiple samples per clock cycle, operating at a clock rate of 300 MHz and achieving a scalable linearisation bandwidth up to 2.4 GHz. An undersampling transmitter observation receiver (TOR) was developed to dynamically update the DPD coefficients and capture the power amplifier (PA) distortion. The TOR is equipped with a single analog-to-digital converter (ADC) and a dedicated training algorithm to update the DPD coefficients as the PA non-linearity changes. The authors demonstrated that the DPD engine, using an envelope complexity-reduced Volterra series (ECRV), can linearise 5G mm-wave wideband OFDM signals, providing performance comparable to CRV methods while consuming less power and occupying fewer hardware resources.
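The principle behind DPD can be illustrated with a toy memoryless third-order PA model (far simpler than the ECRV engine of [56]; the coefficient a3 and the first-order inverse below are illustrative assumptions of this sketch):

```python
def pa(x, a3=0.1):
    """Toy memoryless PA model with third-order compression:
    y = x - a3 * x * |x|^2 (gain drops as amplitude grows)."""
    return x - a3 * x * abs(x) ** 2

def predistort(x, a3=0.1):
    """First-order inverse of the PA model: pre-expand the signal so
    that the PA compression is approximately cancelled."""
    return x + a3 * x * abs(x) ** 2

x = 0.5                       # input sample (normalised amplitude)
raw = pa(x)                   # distorted output: 0.4875
lin = pa(predistort(x))       # linearised output: much closer to 0.5
```

In a real DPD engine the inverse model is adaptive: the TOR feeds back the PA output so the coefficients can track the changing non-linearity, as described above.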
A 3D convolutional neural network (CNN) accelerator suitable for embedded systems is described in [57]. The accelerator employs a pipelined architecture and implements parallel computations using several multiply-and-accumulate (MAC) units to accelerate the CNN inference task; each unit performs three vector convolutions simultaneously. Furthermore, it carries out 3D convolution of feature maps up to 256x256 pixels over 64 AXI channels, working with a kernel size of 3 and 8-bit or 16-bit fixed-point logic. The optimization was realized using pragma directives and SDSoC functions. The proposed implementation stands out for low power consumption and excellent performance, reaching 32.08 GOP/s and an efficiency of 3.58 GOPs/W. Véstias et al. proposed an optimized and scalable FPGA-based architecture that improves the inference execution times of CNNs by using static and dynamic zero-skipping and weight pruning, and by applying an 8-bit fixed-point representation [58]. The architecture consists of two separate sections, one dedicated to convolutional layers and another to fully connected layers, allowing different optimization techniques to be applied independently. In the convolutional layers, the complete feature map is stored; then, the data from several blocks are loaded, including their weight coefficients, which are multiplied with the kernel weights, skipping zero activations in the calculation. The kernels and the activation memory are partitioned and stored in separate memories, allowing parallel reading. The convolutional layer module reads eight activations per clock cycle, but only one non-zero activation per cycle is sent. Furthermore, in the convolutional layers, to reduce the dispersion of weights in the kernels and the overhead due to the index information of the sparse weights' vector, they adopt the block pruning method.
This technique prunes blocks of weights instead of single weights, reducing the overhead data and enabling efficient parallel MACs. The fastest solution was the architecture implementing zero-skipping, static and dynamic pruning, and dual-rate memories, which reaches 464 GOP/s and 216 GOPs/W on the Zynq XC7Z020, and 1344 GOP/s and 145 GOPs/W on the Zynq XC7Z045.
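The zero-skipping principle of [58] can be sketched as follows (a scalar software analogue of the hardware scheme; the example activations and weights are made up):

```python
def sparse_mac(activations, weights):
    """Zero-skipping multiply-accumulate: only non-zero activations
    trigger a multiplication, mimicking the activation-skipping idea
    of [58]. Returns the dot product and the number of MACs spent."""
    acc = 0
    macs = 0
    for a, w in zip(activations, weights):
        if a != 0:              # zero activations are skipped entirely
            acc += a * w
            macs += 1
    return acc, macs

# After ReLU, feature maps are typically sparse; zeros cost no MACs here.
acts = [0, 3, 0, 0, 1, 0, 2, 0]
wts  = [5, 2, 7, 1, 4, 9, 3, 8]
result, mac_count = sparse_mac(acts, wts)   # 3*2 + 1*4 + 2*3 = 16 in 3 MACs
```

With 5 of 8 activations equal to zero, the same dot product is obtained with 3 MACs instead of 8, which is the saving the hardware exploits.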
In [59], the authors proposed a hardware implementation of the you only look once (YOLO) object detector based on a mixed-precision CNN. The authors used a half-precision (16-bit) CNN in parallel for both the classification and the localization, and a binary (1-bit) precision CNN for the feature extraction. A half-precision weight cache is used for the former convolutional circuit, whereas a binary one is used for the 2D convolutional binarised neural network. All weights are stored in the off-chip DDR memory. The authors trained the mixed-precision YOLOv2 with their own training system, using the Chainer deep learning framework; the results showed that the proposed framework achieved an 85.2% recognition accuracy. Janus et al. [60] presented two hardware implementations of a Gaussian mixture background-modelling algorithm for video streams: the first gives each pixel a dedicated background model and processes grayscale images. In contrast, the second gives each pixel a dedicated background model per clock cycle and processes red-green-blue (RGB) colour images along with grayscale ones. The background model is read via the AXI memory controller, and the Gaussian distributions are sorted using a simple bubble sort algorithm in parallel with the colour-space conversion. The authors employed a lossless compression algorithm to reduce the memory bandwidth (i.e., the RAM access time) required by the hardware implementation by reducing the background model's size. The whole system used 22 W and obtained 32.8 GOP/s for the first implementation and 20.7 GOP/s for the second.
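A single per-pixel update of a Gaussian mixture background model can be sketched as follows (a minimal, illustrative version; the learning rate, matching threshold, and replacement variance are choices of this sketch, not the parameters of [60]):

```python
def update_pixel_gmm(models, x, lr=0.05, match_thr=2.5):
    """One update step of a per-pixel Gaussian mixture background model.

    models: list of (weight, mean, variance) tuples, sorted by weight
    (the FPGA design keeps them sorted with a small bubble sort).
    Returns (updated models, is_background) for the new sample x."""
    matched = False
    out = []
    for w, mu, var in models:
        d2 = (x - mu) ** 2
        if not matched and d2 <= (match_thr ** 2) * var:
            matched = True
            w = w + lr * (1 - w)            # grow the matched component
            mu = mu + lr * (x - mu)         # drift mean towards the sample
            var = var + lr * (d2 - var)     # adapt the variance
        else:
            w = w * (1 - lr)                # decay unmatched components
        out.append((w, mu, var))
    if not matched:                          # replace the weakest component
        out[-1] = (lr, float(x), 100.0)
    out.sort(key=lambda m: -m[0])            # keep sorted by weight
    # Background if the sample matched the dominant component.
    is_bg = matched and abs(x - out[0][1]) <= match_thr * out[0][2] ** 0.5
    return out, is_bg

# Two components: a dominant "background" mode near 100 and a weak one.
models = [(0.7, 100.0, 20.0), (0.3, 10.0, 20.0)]
m_bg, bg = update_pixel_gmm(models, 101.0)   # close to the dominant mode
m_fg, fg = update_pixel_gmm(models, 200.0)   # far from every mode
```
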
In [61], the authors proposed two optimization methods for a virtual-channel (VC) router, one consisting of scoring crossbar arbitration and the other of arbitration interception. The first method processes the priority and round-robin arbitrations in parallel, assigning each packet a score based on both its priority and the current round-robin factor, used as a weight. The packets with the highest score are transmitted to the next router (Figure 3). To prevent lower-priority packets from waiting indefinitely in the virtual channels, the authors proposed the arbitration interception to overcome transmission congestion and latency, improving the overall performance. If the high-priority channel fails to pass the crossbar arbitration, an arbitration interception signal is sent to the VC allocator, disabling the request of the high-priority channel and granting the crossbar arbitration to the lower-priority one. The proposed router used 2644 LUTs and 1189 FFs and showed a 0.645 W power consumption. Shen et al. presented an FPGA-based serving gateway user plane (SGW-U) system for mobile edge computing (MEC) aimed at 5G scenarios [62]. The system includes a generic OpenFlow switch and, to offload computational tasks, a programmable GPRS tunneling protocol (GTP) processor for GTP packet encapsulation and decapsulation. The latter is programmed with programming protocol-independent packet processors (P4) code and consists of two sub-systems, called GTP Encap and GTP Decap. The OpenFlow switch manages the Ethernet packets by forwarding them between the backhaul and the edge servers. When transmitted from the backhaul to the edge servers, the packets are offloaded by the switch to the Encap sub-system, where they are arranged as GTP packets by adding the corresponding overhead. Similarly, the packets are offloaded to the Decap sub-system when transmitted from the edge servers to the backhaul, and the GTP headers are removed.
The system achieves a 10 Gb/s throughput per port, with a 5 μs processing latency, for a total of 40 Gb/s considering the four ports.
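The encapsulation/decapsulation performed by the GTP Encap and GTP Decap sub-systems can be sketched with the 8-byte mandatory GTPv1-U header (a minimal software sketch; the TEID value and payload are made up, and optional header fields are omitted):

```python
import struct

GTP_FLAGS = 0x30   # version 1, protocol type GTP, no optional fields
GTP_GPDU = 0xFF    # message type: G-PDU (encapsulated user packet)

def gtp_encap(payload, teid):
    """Prepend the 8-byte mandatory GTPv1-U header, as the GTP Encap
    sub-system does for packets heading to the edge servers."""
    header = struct.pack("!BBHI", GTP_FLAGS, GTP_GPDU, len(payload), teid)
    return header + payload

def gtp_decap(packet):
    """Strip the GTP-U header (the GTP Decap sub-system) and return
    (teid, inner payload)."""
    flags, msg_type, length, teid = struct.unpack("!BBHI", packet[:8])
    assert flags == GTP_FLAGS and msg_type == GTP_GPDU
    return teid, packet[8:8 + length]

pkt = gtp_encap(b"user-ip-packet", teid=0x1234)
teid, inner = gtp_decap(pkt)
```

The fixed header layout is what makes the operation attractive for a P4-programmed pipeline: encapsulation and decapsulation reduce to constant-offset header insertion and removal.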
A non-volatile memory express over remote processor messaging (NVMe-over-RPMsg) software solution is described in [63] for emulating a remote storage system without requiring virtual machines. The proposed solution is a highly scalable framework for virtualizing remote storage systems such as NVMe solid state drives (SSDs), overcoming the limitations of the NVMe SSD emulation method of the quick emulator (QEMU). A guest operating system (OS) runs local applications, while a remote OS, running the independent storage management software, replaces PCIe with the RPMsg protocol to deliver messages between the two OSs. The remote core manages the communication between the guest OS and the remote OS and implements the RPMsg endpoint as well as the front-end of NVMe-over-RPMsg. The remote OS implements the back-end by emulating an NVMe SSD controller and processing the NVMe commands received from the guest OS. The obtained results indicated a performance boost over QEMU-NVMe, reducing the read/write latency by 45.4% and scaling the read/write throughput up to 1.74x. In [64], the authors benchmarked computer vision algorithms on three accelerators for image processing applications: the ARM Cortex-A57 CPU, the ZCU102 FPGA, and the Jetson TX2 GPU, each used with its proprietary library (OpenCV, xfOpenCV, and VisionWorks, respectively). The algorithms are classified into six categories according to their functionality. The input processing category includes the pre-processing methods, including arithmetic methods, that convert the incoming format or several channels into another format. The image arithmetic category includes standard arithmetical or logical image processing operations; these algorithms can be distributed over different processing units regardless of data dependencies.
The image filter-type algorithms compute the correlation between an input image and a kernel. In the image analysis category, analytic kernels extract the image's features, such as the colour distribution and the maximum and minimum pixel values. The matrix product is included in the geometric transformation category. Finally, the composite kernels category comprises kernels composed of those in the above-described categories. The tests demonstrated the GPU's superiority for standard and easy-to-parallelize methods, achieving an energy/frame ratio 1.1-3.2 times lower than the CPU and FPGA implementations. On the other hand, the FPGA outperforms the other hardware accelerators for complex kernels, reducing the energy/frame ratio by 1.2-2.3 times.
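The per-window multiply-accumulate that the filter-type kernels map onto the three accelerators can be sketched in plain Python/NumPy; the function name and the restriction to the 'valid' output region are our choices for illustration:

```python
import numpy as np

def correlate2d_valid(img, kernel):
    """Direct 2-D correlation (no kernel flip), 'valid' region only:
    the sliding-window multiply-accumulate that filter-type image
    kernels implement on CPU, GPU, or FPGA."""
    kh, kw = kernel.shape
    oh = img.shape[0] - kh + 1
    ow = img.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # one output pixel = elementwise product of the window
            # with the kernel, summed
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out
```

Each output pixel is independent of the others, which is precisely what makes this class of kernel easy to parallelize on the GPU and to pipeline on the FPGA.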

TELKOMNIKA Telecommun Comput El Control
Xinkai et al. proposed an FPGA-based architecture with dedicated processing units, enabling parallel and pipelined processing with buffering capability. The method, called Wino-transCONV, splits and remaps the input feature map before applying the Winograd approach: it eliminates the multiplications with zero values and then applies the transposed convolution, through the classic Winograd-transformed processing, to the outputs of the splitting and remapping stage [65].
Finally, they presented a parallel-aware memory partition technique to coordinate parallel operations and achieve efficient data access. The dataflow of the developed Wino-transCONV algorithm is summarized graphically in Figure 4; as evident, the splitting (S1) and remapping (S5) steps are added around the fast Winograd algorithm (S2-S4). The S1 step divides the K×K complete filter window into four sub-filter windows; afterwards, each of them is transformed into the related sub-inFM window. The S2 step performs the matrix transformations of the input feature map and of the filter, whereas the S4 step carries out the matrix transformation of the output feature map. The S3 step performs the element-wise matrix multiplication (EWMM), whereas the S5 step rearranges the m×m intermediate output patterns computed by S4 into a 2m×2m outFM matrix. The proposed implementation reaches 639.2 GOP/s on the Xilinx ZCU102 board and 162.5 GOP/s on the VC706 FPGA-based platform, obtaining a 2.2-factor improvement in performance and up to an 11.7-factor improvement in processing throughput compared to other works in the literature. Table 3 summarizes the scientific works discussed above, classifying them in terms of application and benefits offered by the FPGA-based implementation compared to traditional ones [66], [67].

Ref. | Application | Benefits of the FPGA-based implementation
[59] | YOLOv2 object detector | Low recognition time (28 ms); higher accuracy (mAP, mean average precision, of 85.2%)
Janus et al. [60] | Gaussian mixture algorithm for video streams | Processes 4K video in real-time; high throughput (32.8 GOPS); high efficiency (6.98 GOPS/W); high accuracy (percentage of wrongly classified pixels from 6% to 11%)
Guo et al. [61] | Virtual-channel router | Reduced hardware resources (-10% LUTs compared to a traditional router); low packet latency; high throughput (+165% compared to a traditional router)
Shen et al. [62] | Processing engine for packing and unpacking of GTP packets | High throughput (10 Gbps); low latency (5 µs)
Zhang et al. [63] | NVMe for remote processing messaging (RPMsg) | Reduced latency (-45.4% compared to QEMU-NVMe system); high throughput (1.74x compared to QEMU-NVMe system)
Di et al. [65] | Accelerated architecture for Wino-transCONV of GANs | High throughput (8.6x-11.7x compared to conventional Conv baseline); high performance (2.2x compared to conventional Conv baseline)
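The S2-S4 core of the fast Winograd algorithm can be made concrete with the minimal one-dimensional case F(2,3), which produces two convolution outputs from four inputs and a 3-tap filter using four multiplications instead of six. This is a textbook sketch of the transform structure, not the 2-D tiling of [65]:

```python
def winograd_f23(d, g):
    """Winograd F(2,3): the element-wise multiplication (the EWMM of
    step S3) sits between the input/filter transforms (S2) and the
    output transform (S4)."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # S2 + S3: transformed inputs multiplied element-wise
    # (4 multiplications instead of the 6 of direct convolution)
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    # S4: output transform recombines the products into two outputs
    return (m1 + m2 + m3, m2 - m3 - m4)
```

The multiplication savings grow with the tile size, which is what makes the EWMM stage the natural target for the parallel processing units of the accelerator.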

CONCLUSION
FPGA-based platforms represent an ideal solution for high-speed processing due to their intrinsic capability for parallel processing, as well as their flexibility both at design time and at runtime. Thanks to these advanced features, FPGA devices are widely applied in 5G networks/systems to implement critical tasks, such as accelerators for C-RAN controllers, network slicers employing NFV, and characterization of MIMO channels. In this review paper, we explored the applications of FPGA devices for high-speed processing; in particular, an overview of the main applications of FPGA devices in 5G networks/systems was presented, exploiting their flexibility and advanced performance. These features enable network apparatuses to support different functionalities and operating modalities at runtime, dynamically responding to the communication requirements and improving resource and power efficiency. Thanks to hardware virtualization, FPGA applications are opening new frontiers for developing a new generation of network devices able to comply with the evolution of communication architectures. Afterwards, AES-128 cipher/decipher implementations on the Xilinx Zynq Ultrascale+ MPSoC ZCU102 FPGA board were presented, and we introduced our efficient and performant AES-128 encryption/decryption system, designed for a point-to-point, close-distance, high-throughput communication apparatus named "wireless connector". The encryption/decryption system is based on the ZCU102 FPGA development board and implements a pipelined strategy, enabling the parallel elaboration, in each clock cycle, of multiple rounds on distinct consecutive data packets, thus obtaining a higher data rate.
Specifically, the developed AES-128 cipher/decipher reaches a 220 MHz clock frequency, spending ten clock cycles for both encryption and decryption; a data rate greater than 28 Gbit/s is also achieved thanks to the employed pipelined solution and the rapid implementation of the Substitute Bytes step. Besides, our AES-128 encryption system features efficient hardware resource utilization, obtaining a higher efficiency (i.e., 8.64 Mbps/slice) than similar solutions reported in the scientific literature. Finally, further applications of the Xilinx Zynq Ultrascale+ MPSoC ZCU102 platform for high-speed processing were explored. Exploiting the wide range of peripherals and the advanced performance in terms of data throughput and storage capability, the ZCU102 board represents a powerful and versatile tool for implementing custom solutions in various operative scenarios, such as image and signal processing, visual recognition, and hardware resource management.

Huang et al. [36] | MIMO channel emulation | Xilinx Virtex-7 VH870 | Efficient emulation of a large number of rays; the AR (autoregressive) fading channel generator reduces the required memory by 95% compared to the traditional LUT-based approach; better scalability
Danneberg et al. [37] | 2x2
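The quoted data rate follows directly from the pipelined organization: with one pipeline stage per AES-128 round, a new 128-bit block completes every clock cycle once the pipeline is full. A back-of-the-envelope sketch of the arithmetic (no vendor tooling assumed):

```python
# Sanity check of the figures quoted for the pipelined AES-128 core.
CLOCK_HZ = 220e6     # achieved clock frequency
BLOCK_BITS = 128     # AES block size
ROUNDS = 10          # AES-128 rounds = pipeline depth = latency in cycles

# Fully pipelined: one 128-bit block retires per clock cycle,
# so sustained throughput = clock frequency x block size.
throughput_gbps = CLOCK_HZ * BLOCK_BITS / 1e9   # -> 28.16 Gbit/s
latency_ns = ROUNDS / CLOCK_HZ * 1e9            # per-block latency
```

The ten-cycle latency only affects the first block; the steady-state rate depends solely on the clock frequency and the block width, which is why raising the clock (and hence the Substitute Bytes speed) is the key lever for throughput.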