Hardware Acceleration of Biomedical Microwave Techniques using High Level Synthesis

Microwave radiation has been proven to be effective in biomedical applications, including brain stroke monitoring. However, the microwave algorithms in these applications are computationally expensive, with compute-intensive parts that are termed "kernels". To speed up the medical diagnosis, it is crucial to adopt new methodologies that accelerate the execution of these kernels by using specific hardware solutions. A recent trend in designing hardware accelerators is High Level Synthesis (HLS), which creates an implementation starting from a high-level description (C or C++) of an algorithm. In this paper, we first categorize the recurring medical microwave techniques and kernels. Then, we propose efficient hardware accelerators for these kernels in programmable devices by using HLS. Several hardware optimization strategies are introduced and their impact on the overall performance is explored. We believe that the analysis of these kernels and their corresponding hardware acceleration techniques can be greatly beneficial to the biomedical microwave research community.


I. INTRODUCTION
Microwave Imaging (MI) has been used progressively in recent years in various medical applications [1], [2]. Although other medical imaging modalities such as CT-scan or MRI are widely used in the same area, the non-ionizing, low cost, and non-invasive characteristics of MI systems make them a suitable complement to the traditional imaging systems. To obtain the required information about the internal structure of the body tissue, several antennas are used to emit and capture microwave radiations to and from the tissue. When the emitted radiations pass through the body, "scattering" phenomena occur due to the difference between dielectric properties of the body tissues. The final microwave measurements are thus the scattered radiations captured by antennas, represented usually by a scattering matrix [1].
Retrieving the original information about the properties of the object under investigation (body tissues) from the microwave measurements requires processing the scattering matrix. Several algorithms can be used for this purpose; they are explored in the next section. The computational complexity of these algorithms is one of the challenges of MI systems. To speed up their execution, these MI algorithms must be analyzed to determine the compute-intensive parts that are termed kernels. Hardware acceleration of MI kernels can boost their performance, making them suitable for low-cost, real-time, embedded systems. In this paper, we discuss the computational complexity of relevant biomedical microwave techniques and introduce specialized hardware acceleration methodologies for their kernels. By leveraging the advantages of High Level Synthesis (HLS), we apply various optimizations to design accelerators in specialized hardware like Field-Programmable Gate Arrays (FPGAs).
The rest of the paper is organized as follows. Sec. II introduces biomedical microwave techniques, Sec. III describes hardware acceleration methodologies, and Secs. IV and V present the results and conclusions, respectively.

II. BIOMEDICAL MICROWAVE TECHNIQUES
The general diagram of an MI device for brain stroke imaging is shown in Fig. 1. A similar diagram holds for MI devices used in other applications, like breast cancer imaging. The system consists of several antennas that transmit and receive microwave radiations and are connected to a Vector Network Analyzer (VNA) through a switching matrix. The VNA measures these radiations in the form of a scattering matrix. MI can be defined as an "inverse scattering" problem in which the input is the scattering matrix and the output is the required information about the internal structure of the body tissues (normal brain tissues and stroke).
MI algorithms used to solve the inverse scattering problem are usually divided into two categories: Qualitative and Quantitative imaging [3]. Machine Learning (ML) has recently been introduced as a third category. Besides specific tasks such as classification, it can also be used to learn the inverse scattering solutions of the two previous categories. In the following, each of these three categories and some of the compute-intensive parts of the corresponding algorithms are explained.

A. Qualitative medical diagnosis
Inverse scattering is inherently a non-linear problem. However, in qualitative imaging, it is approximated as a linear problem. The simplest approach for this linearization is the Born approximation. In this category, the presence and shape of the anomaly (i.e. brain stroke) are detected.
Linear Sampling Method (LSM) [4], [5], Factorization method [6], Truncated Singular Value Decomposition (TSVD) [7], Time Reversal (TR) techniques such as the multiple signal classification (MUSIC) algorithm [8] and Eigenvalue Decomposition (EVD) of the TR operation [9], Beamforming approaches [10], and several Radar-based methods [1] are among these linearized qualitative algorithms. Matrix multiplication is one of the critical parts of these algorithms due to the large data dimensions. Although qualitative methods are not highly accurate, they are well suited for weak scattering objects and low-contrast scenarios. Furthermore, in some algorithms it is possible to obtain quantitative results when the linear approximation holds.

B. Quantitative image reconstruction
In quantitative imaging, the exact values of image pixels are reconstructed by solving the non-linear inverse scattering problem. These methods are more accurate because they consider the non-linearities of the problem. However, they are computationally more intensive.
Some of the non-linear quantitative algorithms are Contrast Source Inversion [11], inexact Newton methods [12], and DBIM-TwIST [13], which belong to the deterministic Microwave Tomography approaches. In addition, stochastic techniques including Simulated Annealing and Genetic Algorithms are among the other quantitative MI methodologies [3].
The iterative nature of these algorithms leads to their high complexity. For example, as shown in Fig. 2, in each iteration of the image reconstruction, a forward solver such as Finite Difference Time Domain (FDTD) is used to update the solution, which is one of the most compute-intensive MI kernels.

C. Machine Learning
In recent years, ML has attracted attention in the field of medical Microwave Imaging. Different kinds of Deep Neural Networks (DNNs) are used to replace the computation of the forward solver or the inverse scattering solution [14]-[18]. Although these solutions can produce quantitative results, they are mainly evaluated on synthetic data because the extraction of training data for each application is not a trivial task. Nevertheless, ML algorithms can also be used for qualitative imaging. In this scenario, two steps are usually required: feature extraction and classification.

III. HARDWARE ACCELERATION METHODOLOGIES
In this section, we describe the methodologies to design hardware accelerators for the compute-intensive MI kernels. Despite being less complex, qualitative algorithms need hardware acceleration when used in low-cost embedded systems. Some of these qualitative kernels have already been considered for hardware acceleration, such as the FPGA accelerator in [8] for the MUSIC algorithm. In the following, other MI kernels and their hardware acceleration in FPGA are explored.

A. Matrix multiplication
Matrix multiplication is one of the building blocks for many MI kernels. Specifically, in this work, we focused on the matrix multiplication used in the computation of the covariance between two matrices, which is also useful in the PCA algorithm.
When the matrix dimensions are large, it is not possible to store all the elements in local memory. On the other hand, data transfer between external and local memories is time-consuming. Therefore, a trade-off must be found between local storage and data transfer. For this trade-off, we adopted a block-streaming methodology as shown in Figs. 3 and 4. The main idea is to partition the input data into several blocks and compute the diagonal and off-diagonal elements of the multiplication separately for each block.
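The block-streaming idea can be illustrated with a minimal software sketch. It assumes that the product of interest is the symmetric one, G = X^T X, and that X is streamed in blocks of rows; the function name blocked_gram and the block size are hypothetical, and the sketch is not the HLS design of Figs. 3 and 4.

```python
def blocked_gram(rows, block_size):
    """Accumulate G = X^T X while holding at most `block_size` rows of X."""
    rows = iter(rows)
    gram = None
    while True:
        # "Stream" the next block of rows from (simulated) external memory.
        block = []
        for row in rows:
            block.append(row)
            if len(block) == block_size:
                break
        if not block:
            break
        n = len(block[0])
        if gram is None:
            gram = [[0.0] * n for _ in range(n)]
        for i in range(n):
            # Diagonal element: sum of squares of column i within this block.
            gram[i][i] += sum(r[i] * r[i] for r in block)
            # Off-diagonal elements: computed once and mirrored by symmetry.
            for j in range(i + 1, n):
                partial = sum(r[i] * r[j] for r in block)
                gram[i][j] += partial
                gram[j][i] += partial
    return gram


example = [[1, 2], [3, 4], [5, 6]]
result = blocked_gram(example, block_size=2)
```

Only one block is resident at a time, so local storage scales with the block size rather than with the number of rows, at the cost of one accumulation pass per block.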

B. EVD and SVD
In several MI algorithms described in the previous section, the eigenvalues or singular values of a large matrix must be computed (MUSIC, TSVD, LSM). Therefore, accelerating the EVD and SVD kernels is highly beneficial to the overall speed of these MI algorithms. For the acceleration of SVD, we used a built-in function in Vivado HLS¹ and optimized its performance by applying individual HLS directives such as pipelining and unrolling, which are two of the most important HLS techniques for exploiting FPGA concurrency and parallelism. For EVD, we adopted a flexible design that can be used for both floating-point and fixed-point data precision [25].

C. 3D FDTD
Propagation of electromagnetic fields can be modeled by solving Maxwell's equations with the FDTD algorithm. As explained in Sec. II-B, 3D FDTD is the most critical part of the DBIM-TwIST algorithm, a non-linear quantitative MI reconstruction method. We adopted our previous work on FDTD acceleration in FPGA [26], shown in Fig. 5. To reduce the data transfer time, we used a spatial blocking methodology that reads a new 2D plane from the 3D simulation space while the update equations for the previous plane are being processed. These update equations are executed in each iteration for the electric and magnetic fields. In the general form of the update for the magnetic field in the x direction, $E_c$ and $E_{+x}$ are the electric fields of the central cell being calculated and of the next cell, respectively (similar equations hold for updating the magnetic field in the other directions and for updating the electric field). We modeled the boundary regions with Convolutional Perfectly Matched Layers (CPML) through the variables $\Psi_{Hxy}$ and $\Psi_{Hxz}$, which are used in the boundaries. In addition, unlike conventional FDTD accelerators, we considered the impact of dispersive materials and polarization currents on the electric fields.
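To make the structure of such update equations concrete, the following is a deliberately simplified sketch: a 1D, lossless FDTD loop in normalized units, without CPML, dispersive materials, or spatial blocking. It is an assumption-laden analogue and not the 3D accelerator of [26]; the function name fdtd_1d, the Gaussian source, and the Courant number of 0.5 are hypothetical choices.

```python
import math


def fdtd_1d(num_cells, num_steps, courant=0.5, source_cell=None):
    """Leapfrog update of Ez and Hy on a 1D grid (normalized units)."""
    ez = [0.0] * num_cells
    hy = [0.0] * (num_cells - 1)
    if source_cell is None:
        source_cell = num_cells // 2
    for step in range(num_steps):
        # Magnetic-field update: uses the electric field of the central
        # cell (k) and of the next cell (k + 1).
        for k in range(num_cells - 1):
            hy[k] += courant * (ez[k + 1] - ez[k])
        # Electric-field update for interior cells; the two end cells
        # stay at zero, which acts as a perfectly conducting boundary.
        for k in range(1, num_cells - 1):
            ez[k] += courant * (hy[k] - hy[k - 1])
        # Soft source: add a Gaussian pulse at the source cell.
        ez[source_cell] += math.exp(-((step - 30) / 10.0) ** 2)
    return ez, hy


ez, hy = fdtd_1d(num_cells=101, num_steps=60)
```

Each field update reads only a cell and its immediate neighbor, which is the property that makes streaming approaches such as plane-by-plane processing possible in the 3D case.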

D. PCA
In Machine Learning algorithms, PCA is used for data dimensionality reduction and feature extraction. It can be used in MI for extracting features from the input scattering matrix obtained by doing $n_s$ measurements (or samples) at $F$ different frequencies. Due to its symmetry, the number of independent elements of the scattering matrix at a specific frequency equals $N_d = N \times (N + 1)/2$ for a system with $N$ antennas. Therefore, the size of the input matrix for the PCA algorithm will be $n_s \times N_d \times F$.
¹ Vivado HLS is the HLS synthesizer for Xilinx FPGAs.
PCA consists of several Compute Units (CU) that are required to be accelerated in hardware. These CUs are Mean, Covariance, EVD (or SVD), and Projection computations.
All the CUs are implemented in HLS and are processed in parallel by using the Dataflow optimization technique. Specific HLS-based optimizations, such as array partitioning, function inlining, loop unrolling and pipelining are used in the hardware design to increase the efficiency [25].
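The four compute units can be sketched in plain software as follows. This is a sequential illustration under stated assumptions, not the Dataflow-parallel HLS implementation of [25]: power iteration is swapped in as a stand-in for the EVD (or SVD) unit and recovers only the leading principal component, and all function names are hypothetical.

```python
import math


def column_means(data):
    """Mean CU: average of each feature (column)."""
    return [sum(col) / len(data) for col in zip(*data)]


def covariance(data, means):
    """Covariance CU: sample covariance matrix of the centered data."""
    centered = [[x - m for x, m in zip(row, means)] for row in data]
    dim = len(means)
    return [[sum(r[i] * r[j] for r in centered) / (len(data) - 1)
             for j in range(dim)] for i in range(dim)]


def leading_eigenvector(matrix, iterations=200):
    """Stand-in for the EVD CU: power iteration for the top eigenvector."""
    dim = len(matrix)
    # Non-uniform start vector; assumed not orthogonal to the solution.
    vec = [1.0 / (i + 1) for i in range(dim)]
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(dim))
               for i in range(dim)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            break
        vec = [v / norm for v in nxt]
    return vec


def project(data, means, component):
    """Projection CU: score of each centered sample along the component."""
    return [sum((x - m) * c for x, m, c in zip(row, means, component))
            for row in data]


samples = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
means = column_means(samples)
component = leading_eigenvector(covariance(samples, means))
scores = project(samples, means, component)
```

In this toy dataset all samples lie on the line y = 2x, so the leading component points along (1, 2) and a single score per sample captures all of the variance.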

E. MLP
Detection and classification of anomalies from the scattering matrix can be done by using Neural Networks (NN). The Multi-Layer Perceptron (MLP) is a fully connected network that can be used in MI diagnosis. In [27], an MLP is used for breast cancer detection with microwave sensing. When designing a hardware accelerator for the MLP, there are several parameters that must be tuned to maximize the performance. We used a statistical approach based on Bayesian Optimization (BO) for the automatic selection of hardware configurations and training hyper-parameters, which can reduce the development time of the hardware design.

F. Exploring other kernels
In addition to traditional ML classifiers such as Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), and k-Nearest Neighbors (KNN), recently there has been a growing interest in using DNNs for the inverse scattering problem in MI. Due to the high non-linearity of the problem, a proper training step is required for the selected model, calling for a sufficiently large dataset.
Creating the training MI dataset for the inverse scattering problem requires collecting the scattering data from various dielectric profiles and can be done synthetically or by real measurements, each of which is time-consuming. Generation of additional training data from a few MI measurements (real or synthetic) can boost the performance of an ML model. Generative Adversarial Networks (GANs) are a promising candidate for generating training datasets. They consist of two competing networks that learn to produce realistic training data samples. The application of Machine Learning in Microwave Imaging and inverse scattering has been introduced in recent years, and there are several challenges in the design of these ML models that are still under investigation. In the future, we plan to further explore these ML models, their computational complexity, and their hardware acceleration methodologies.

IV. RESULTS
In this section, we demonstrate the advantage of using hardware acceleration for MI kernels by evaluating some of the biomedical microwave techniques in terms of hardware performance. For hardware implementation in FPGA, we used the Vivado 2018.2 tool.

A. 3D FDTD accelerator
For the evaluation of the FDTD algorithm, we used a synthetic MI dataset [26], which is a 3D model of the glycerol-water mixture that represents the dielectric characteristics of the human brain. We compared the performance of our FPGA accelerator for the FDTD kernel with CPU and GPU designs. Compared to an Intel CPU (Xeon Gold 5120) and a commercial GPU implementation (Acceleware), the FPGA achieved a 7.3× and a 1.67× improvement in the processing time per antenna, respectively. In addition, we extracted system-level performance results for the non-linear MI reconstruction algorithm shown in Fig. 2. Table I shows a summary of the results for our single-FPGA design in addition to our multi-FPGA accelerator with 8 FPGAs for 24 antennas, and a total dimension of 70 × 70 × 70.

B. PCA accelerator
Evaluation of the FPGA accelerator for PCA is useful not only for ML-based algorithms in MI, but also for other linear or qualitative MI kernels such as TSVD, because it consists of several widely used compute units including EVD and SVD. We compared the processing time and resource usage for each compute unit of PCA on a large Virtex FPGA (VC709U evaluation board), as shown in Fig. 6. We found that the main dimension of the PCA input (the number of independent elements of the scattering matrix) can have a maximum value of 300 ($N_d = 300$) in our target FPGA, which corresponds to an MI system with 24 antennas. By using the block-streaming strategy described in Sec. III-A, the Covariance computation obtains a low latency, and the most critical part of the design is SVD, as can be seen from Fig. 6, for which the optimized HLS built-in function is used.
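The correspondence between 300 independent elements and 24 antennas follows from the relation N_d = N × (N + 1)/2 given in Sec. III-D. A small sketch of this arithmetic, with hypothetical helper names, is:

```python
import math


def independent_elements(num_antennas):
    """N_d = N (N + 1) / 2 for a symmetric N x N scattering matrix."""
    return num_antennas * (num_antennas + 1) // 2


def max_antennas(max_elements):
    """Largest N whose N (N + 1) / 2 elements fit within max_elements."""
    # Positive root of N^2 + N - 2 * max_elements = 0, rounded down.
    return (math.isqrt(8 * max_elements + 1) - 1) // 2


def upper_triangle(matrix):
    """Independent elements of a symmetric matrix, diagonal included."""
    size = len(matrix)
    return [matrix[i][j] for i in range(size) for j in range(i, size)]


n_d = independent_elements(24)
n_max = max_antennas(300)
```

The same relation gives the length of the feature vector obtained from one scattering matrix at one frequency, which upper_triangle extracts explicitly.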

C. MLP accelerator
To design and evaluate the MLP accelerator, we used hls4ml [28], a tool that converts commonly used ML models from Python into synthesizable code that can be used in Vivado HLS. We used a dataset of microwave measurements that contains 4500 samples of the scattering matrix, each with 462 elements. It consists of 9 classes representing the presence, type, and location of the brain stroke. For feature extraction, PCA is used, which reduces the number of features to 110. We selected an MLP with 3 hidden layers. The numbers of neurons per layer are 220, 64, and 64, respectively. The target device is a Zynq SoC (ZedBoard), and we used fixed-point precision for the hardware implementation by leveraging hls4ml. The accuracy before and after hardware implementation, resource usage, and processing time are depicted in Table II and Figs. 7 and 8. Note the negligible accuracy loss in the hardware accelerator due to the reduced precision.
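The effect of reduced precision can be illustrated with a small software sketch that runs the same MLP once in floating point and once with values rounded to a signed fixed-point grid. Everything here is an assumption for illustration: the weights are random rather than trained, the 16-bit word with 10 fractional bits is a hypothetical choice, and the code does not use or reproduce the hls4ml API. Only the layer sizes (110 inputs, hidden layers of 220, 64, and 64 neurons, 9 classes) come from the text.

```python
import random


def quantize(value, total_bits=16, frac_bits=10):
    """Round to a signed fixed-point grid and saturate at the word limits."""
    scale = 1 << frac_bits
    max_int = (1 << (total_bits - 1)) - 1
    min_int = -(1 << (total_bits - 1))
    return max(min_int, min(max_int, round(value * scale))) / scale


def random_layers(sizes, seed=0):
    """Untrained placeholder weights and biases for each dense layer."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        bound = 1.0 / n_in ** 0.5
        weights = [[rng.uniform(-bound, bound) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [rng.uniform(-bound, bound) for _ in range(n_out)]
        layers.append((weights, biases))
    return layers


def mlp_forward(inputs, layers, quant=None):
    """Dense layers with ReLU on hidden layers and a linear output layer."""
    q = quant if quant is not None else (lambda v: v)
    activations = [q(v) for v in inputs]
    for index, (weights, biases) in enumerate(layers):
        is_last = index == len(layers) - 1
        outputs = []
        for w_row, bias in zip(weights, biases):
            acc = sum(q(w) * a for w, a in zip(w_row, activations)) + q(bias)
            if not is_last:
                acc = max(0.0, acc)
            # Quantize the result as it would be stored in hardware.
            outputs.append(q(acc))
        activations = outputs
    return activations


SIZES = [110, 220, 64, 64, 9]
layers = random_layers(SIZES)
rng = random.Random(1)
sample = [rng.uniform(-1.0, 1.0) for _ in range(SIZES[0])]
float_out = mlp_forward(sample, layers)
fixed_out = mlp_forward(sample, layers, quant=quantize)
max_error = max(abs(a - b) for a, b in zip(float_out, fixed_out))
```

Comparing float_out with fixed_out for different word lengths gives a quick, software-only feel for how much precision a given network tolerates before the predicted class starts to change.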

V. CONCLUSIONS
In this paper, we presented a brief overview of hardware-based methodologies to improve the performance of recurring biomedical microwave techniques. First, we categorized different MI algorithms and introduced their compute-intensive parts, termed kernels. Then, we presented specific hardware accelerators in FPGAs for each kernel to obtain the desired performance. Evaluation of the results for three different kernels (FDTD, PCA, MLP) shows the advantage of using hardware accelerators in MI. Specifically, we could achieve a maximum of 13.4× improvement for 3D-FDTD in the processing time per antenna compared to the commercial GPU design (Acceleware) by using the FPGA accelerator. For PCA, we could optimize the performance of the compute units including SVD and Covariance computation, and for MLP, we could use a low-cost Zynq SoC for the detection and diagnosis of brain stroke, which is useful in real-time embedded systems.
In the future, more kernels will be added to the library of microwave algorithms and efficient hardware accelerators will be designed for them. Specifically, the SVM classifier and deep neural networks including GANs are considered for future research.
ACKNOWLEDGMENT
This work was supported by the EMERALD project funded by the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 764479.