Input-Aware Approximate Computing

In the last decade, Approximate Computing (AxC) has been extensively employed to improve the energy efficiency of computing systems at different abstraction levels. The main goal of AxC is to reduce the energy budget used to execute error-tolerant applications, at the cost of a controlled and intrinsically tolerable quality degradation. A substantial amount of work has been devoted to proposing approximate versions of basic operations that use fewer resources. From a hardware standpoint, several approximate arithmetic operators have been proposed. Although effective, such approximate hardware operators are not tailored to a specific final application, so their effectiveness depends on the actual application using them. By taking into account the target application and the related input data distribution, the final energy efficiency can be pushed further. In this paper, we showcase the advantage of considering the data distribution by designing an input-aware approximate multiplier specifically intended for a high-pass FIR filter, where the input distribution pattern for one operand is not uniform. Experimental results show that we can significantly reduce the power consumption while keeping an error rate lower than that of state-of-the-art approximate multipliers.


I. INTRODUCTION
The well-known Moore's law drove the worldwide explosion of the integrated circuits market and the related exponential growth of Information and Communication Technologies (ICT). Computers are now everywhere and manage all aspects of our life: health monitoring, work, entertainment, home management, etc. Unfortunately, behind all these advantages, the dark side of ICT performance is the high energy consumption of digital circuits. Indeed, ICT devices and services are responsible for a substantial percentage of the total energy consumed in the world [1]. Even worse, this share is expected to grow to almost 21% by 2030 [1]. A direct consequence is the huge research field dedicated to the design, development and fabrication of energy-efficient digital circuits. The quest for emerging technologies and computing paradigms in pursuit of the energy-efficiency grail has provided meaningful solutions so far [2], [3].
Among them, Approximate Computing (AxC) has proved to be a very promising one [4]. The idea behind AxC is that several applications do not really need to be executed on "precise" and thus "energy-expensive" hardware. AxC aims at reducing the precision of the hardware in order to reduce energy consumption. Interestingly, the reduced precision leads to applications providing less accurate, but still good enough, results while reducing the required energy by orders of magnitude [5], [6]. Such applications are intrinsically resilient to noise and errors affecting the computation (i.e., those introduced by the less precise hardware). This inherent resiliency tightly depends on the application domain.
Well-known examples are algorithms dealing with noisy real-world input data (e.g., image processing, sensor data processing, speech recognition, etc.), or with outputs that require human interpretation, such as digital signal processing of images or audio; data analytics, web search and wireless communications exhibit an equivalent property [7]-[9]. Other examples are iterative applications that process large amounts of information, sample data, stop the convergence procedure early, or apply heuristics [10]. Most of the proposed techniques try to define new methods to generate alternative versions of a specific component (either hardware or software) that use fewer resources. For example, there are several proposals of approximate arithmetic operations [11]-[13]. Such variants differ from speculative implementations, which do not focus on generating alternatives but rather on recovering from the error that may be introduced [14]-[16]. Other techniques generate variants by considering a high-level description of the application or its low-level implementation [7]. Moreover, existing approaches target only implementations at a specific level of the computing stack, i.e., either software or hardware.
Even if the above existing techniques have proven to be effective, they have been developed without considering the final application. In other words, they are not customized with respect to the application and its workload (i.e., input data). We thus believe that there is room to introduce a novel and promising approach to power efficiency based on the knowledge of the input data distribution and the target application. Indeed, there are several applications where the inputs do not follow a uniform distribution pattern: for instance, an image or signal processing application that always works in similar environments with a limited range of inputs.
Let us resort to an example to clarify this point. We used the LeNet-5 [17] Convolutional Neural Network (CNN), which is composed of 3 convolutional (CONV) layers followed by 2 fully connected (FC) layers, with a total of 61,470 parameters.
To showcase the concept, we profiled the inference execution of LeNet-5 when the 10,000 MNIST test images are applied, and we obtained the data distribution per bit and per layer shown in Figure 1. On the X-axis, we show the 32 bits from the LSB (bit 0) up to the MSB (bit 31). On the Y-axis, we show the probability that bit i is equal to logic '1'. As a first comment, it can be noted that the distribution is quite similar across the layers. This means that it is possible to design a single approximate arithmetic circuit to be used in all layers (thus simplifying the overall design). A second comment is that, from bit 0 to bit 6, the probability of having logic '1' is quite low (smaller than 20%). Interestingly, the MSBs also have a low probability of being logic '1', especially for layer 0 and layer 4. These results confirm that it is possible to aggressively approximate the arithmetic circuits not only by working on the LSBs (as is usually done) but also on the MSBs.
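The per-bit profiling described above can be illustrated with a minimal sketch. It assumes the profiled values are already available as unsigned integers of a known width; the function names `bit_one_probability` and `low_activity_bits` are hypothetical and are not taken from the paper's tooling.

```python
from typing import Iterable, List


def bit_one_probability(words: Iterable[int], width: int = 32) -> List[float]:
    """For each bit position 0 (LSB) .. width-1 (MSB), return the fraction
    of words in which that bit is '1'."""
    counts = [0] * width
    total = 0
    for w in words:
        total += 1
        for i in range(width):
            counts[i] += (w >> i) & 1
    if total == 0:
        raise ValueError("no words to profile")
    return [c / total for c in counts]


def low_activity_bits(probs: List[float], threshold: float = 0.20) -> List[int]:
    """Bit positions whose probability of being '1' is below the threshold
    (the 20% default mirrors the figure quoted in the text)."""
    return [i for i, p in enumerate(probs) if p < threshold]


if __name__ == "__main__":
    sample = [0b1000, 0b1010, 0b1100, 0b1001]  # toy 4-bit data, not real traces
    probs = bit_one_probability(sample, width=4)
    print(probs)
    print(low_activity_bits(probs, threshold=0.5))
```

Bits that are rarely '1' (or almost always '1') are the candidates for simplification in an input-aware design.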
The main goal of this paper is to show that it is possible to take advantage of the data distribution to obtain a fine-tuned approximation and thus achieve better results (i.e., higher energy efficiency) with a lower impact on the application accuracy. In particular, by characterizing the input pattern distribution, we can design an approximate version of a given circuit exploiting the fact that some inputs with similar characteristics are processed more frequently than others. Therefore, in this work we present an input-aware approximate multiplier specifically designed for a case study design (a high-pass FIR filter), where the input distribution pattern for one operand is not uniform. The proposed design allows us to significantly reduce the power consumption while keeping an error rate lower than that of state-of-the-art approximate multipliers.
The rest of the paper is organized as follows. Section II describes the state of the art of approximate multipliers. Section III details the application case study, while Section IV presents the proposed input-aware approximate multiplier tailored for the case study. Section V reports the results and presents a comparison with state-of-the-art multipliers. Finally, Section VI concludes the paper.

II. RELATED WORK
Generally, a conventional multiplication operates in three steps. In the first step, the partial products (PPs) are generated by multiplying the multiplicand and the multiplier. In the second step, the PP tree is reduced by accumulation until only two rows remain. In the final step, the remaining two rows are summed by employing a carry propagation adder [18]. The approximation can be applied to each of these steps. The authors in [19] proposed an approximate 2x2 multiplier which is used to compose larger multipliers, as shown in Figure 2. In this 2x2 multiplier, accuracy reduction happens only when both inputs are "11": instead of "1001", the output is "111". This way, the output is reduced to three bits and the circuit is simplified. Another approximation method is applied in the PP tree. For example, [20] introduces an array multiplier where the least significant carry-save adders are removed from the circuit, both horizontally and vertically. A simpler method has been applied in [21], where the LSBs of the inputs are truncated. In the PP perforation-based multiplier, several consecutive rows of PPs are removed, not necessarily starting from the LSBs. In [22], an approximate Wallace tree multiplier is presented which utilizes carry-in prediction and bit-width-aware approximate multiplication. In this design, the n-bit multiplier is implemented by four n/2-bit submultipliers, and the most significant submultiplier (A_H B_H) is divided again into four n/4-bit ones. The n/4-bit multipliers can have different accuracies and the three remaining less significant multipliers [...] A different approach relies on inaccurate counters and compressors in the PP reduction stage. An approximate 4x4 multiplier is presented in [23] that uses an inaccurate 4:2 counter for PP reduction, as shown in Figure 3. This counter produces a wrong answer only when all the inputs are '1': instead of "100", the result is "10".
If the input distribution is uniform, the error rate of this 4x4 multiplier is 1/256. Larger multipliers can be built using this 4x4 approximate multiplier. In [24], a novel approximate adder is used to accumulate the PP tree by generating a sum and an error bit out of two adjacent inputs. Then, an error recovery scheme is applied to accumulate the error bits into the final result, using either only OR gates or approximate adders as well. Also, a truncated version of the same multiplier is presented in [25], [26], where the lower half of the PP tree is carved out of the circuit. In [27], the authors proposed a hybrid partial product-based approximate 4x4 multiplier that combines approximation in the PP tree accumulation with the removal of insignificant PP bits. This 4x4 multiplier is used to build larger blocks of multipliers. They altered the PP tree by transforming the PP elements using propagate and generate functions, and removed the PPs with a lower probability of being logic '1'. They also presented a novel half adder and full adder to accumulate the PP tree.
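As an illustration of the building-block idea attributed to [19] above, the following minimal sketch models an approximate 2x2 multiplier that returns "111" instead of "1001" for 3 x 3 and composes four such blocks into a 4x4 multiplier. The composition with exact adders and all function names are assumptions made for this example; they are not claimed to match the structure of Figure 2.

```python
def approx_mul2x2(a: int, b: int) -> int:
    """2x2 multiplier that is exact except for 3 x 3, where it returns
    7 ("111") instead of 9 ("1001")."""
    if not (0 <= a < 4 and 0 <= b < 4):
        raise ValueError("operands must be 2-bit values")
    return 7 if (a == 3 and b == 3) else a * b


def approx_mul4x4(a: int, b: int) -> int:
    """4x4 multiplier built from four approximate 2x2 blocks whose outputs
    are shifted and summed exactly (assumed composition)."""
    a_h, a_l = a >> 2, a & 0b11
    b_h, b_l = b >> 2, b & 0b11
    return ((approx_mul2x2(a_h, b_h) << 4)
            + ((approx_mul2x2(a_h, b_l) + approx_mul2x2(a_l, b_h)) << 2)
            + approx_mul2x2(a_l, b_l))


def error_rate(mul, width: int) -> float:
    """Fraction of all operand pairs (uniform inputs) with a wrong result."""
    n = 1 << width
    wrong = sum(mul(a, b) != a * b for a in range(n) for b in range(n))
    return wrong / (n * n)


if __name__ == "__main__":
    print(error_rate(approx_mul2x2, 2))
    print(error_rate(approx_mul4x4, 4))
```

Under uniform inputs, the 2x2 block is wrong for exactly one of its 16 operand pairs; the error rate grows once blocks are composed, which is why the input distribution matters when choosing where to approximate.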
The next section describes the application used as the case study and its characterization to obtain the input pattern distribution information.

III. CASE STUDY: FIR FILTER
The case study is a finite impulse response (FIR) filter described by Eq. 1:
$$y[n] = \sum_{i=0}^{N} b_i \, x[n-i] \qquad (1)$$
where:
• x[n] is the input signal;
• y[n] is the output signal;
• N is the filter order. In our case study N = 52;
• b_i is the value of the i-th coefficient of the filter.
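A minimal sketch of the filter, assuming Eq. 1 has the standard direct form y[n] = sum over i = 0..N of b_i · x[n − i] and that samples before the start of the signal are zero. The function name `fir_filter` is hypothetical, and floating-point values are used here only for readability (the paper's design works on 16-bit fixed-point operands).

```python
from typing import List, Sequence


def fir_filter(b: Sequence[float], x: Sequence[float]) -> List[float]:
    """Direct-form FIR: y[n] = sum_{i=0}^{N} b[i] * x[n - i],
    with x[k] taken as 0 for k < 0 (assumption)."""
    order = len(b) - 1  # N: a filter of order N has N + 1 coefficients
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i in range(order + 1):
            if n - i >= 0:
                # Each product b[i] * x[n - i] corresponds to one multiplier
                # in the filter architecture.
                acc += b[i] * x[n - i]
        y.append(acc)
    return y


if __name__ == "__main__":
    coeffs = [0.5, -0.25, 0.125]      # toy coefficients, not the paper's
    impulse = [1.0, 0.0, 0.0, 0.0]
    print(fir_filter(coeffs, impulse))  # impulse response reproduces coeffs
```

Because the coefficients are constants, one operand of every multiplication is known in advance, which is what makes profiling its bit pattern worthwhile.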
The FIR filter used as the case study comes from an audio application, and in particular the coefficients have been designed to [...] Fig. 4 shows the architecture of the filter. On this architecture, we intend to approximate the multipliers (i.e., b_i · x[n − i]). First, we profiled the coefficients, since they are the constant inputs of the multipliers, and we report the data distribution in Table I. The next section details the table and how it is used to design the approximate multiplier.

IV. INPUT AWARE APPROXIMATE MULTIPLIER
As already mentioned, the proposed multiplier intends to exploit the peculiar input distribution of the high-pass FIR filter. In particular, for one of the multiplication operands, five bits are set to logic '1' 100% of the time (B_10 to B_14); moreover, the MSB has a 98% probability of being '1'. The complete input distribution pattern is reported in Table I. The inputs in this filter are fixed-point 16-bit binary numbers with one bit for the integer part and 15 bits for the fraction. Multiplication of fixed-point numbers is performed in the same way as unsigned multiplication. However, since there are 5 bits always set to '1' (B_10 to B_14), there is no need to calculate the corresponding PPs. Instead, we can directly use the other operand's bit values. The resulting partial product tree for this multiplier is sketched in Figure 5.
If we consider input A as the one with a non-uniform distribution pattern, the squares represent the AND gates generating the PPs a_i b_j. The circles represent the related input bits of b, which are used directly because inputs a[14:10] are always '1' in this application. In this way, the number of AND gates needed to generate the PP tree is reduced by 80% without any accuracy loss.
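The lossless simplification described above can be checked with a small behavioural sketch. It assumes unsigned 16-bit operands and models each partial product as the operand b gated by one bit of a; the names `ALWAYS_ONE_BITS`, `input_aware_pp_rows` and `input_aware_mul` are hypothetical.

```python
ALWAYS_ONE_BITS = range(10, 15)  # a[14:10] are always '1' per the text


def input_aware_pp_rows(a: int, b: int, width: int = 16) -> list:
    """Partial products of a * b. For the always-one bits of a, b is wired
    in directly (no AND gates); elsewhere b is gated by the bit of a."""
    rows = []
    for i in range(width):
        if i in ALWAYS_ONE_BITS:
            rows.append(b << i)                       # AND gates removed
        else:
            rows.append((b if (a >> i) & 1 else 0) << i)
    return rows


def input_aware_mul(a: int, b: int) -> int:
    """Sum the partial products exactly."""
    return sum(input_aware_pp_rows(a, b))


if __name__ == "__main__":
    a = 0x8000 | 0x7C00 | 0x0123   # bits 14..10 set, as the design assumes
    b = 0x4321
    print(input_aware_mul(a, b) == a * b)  # exact when the assumption holds
    print(hex(input_aware_mul(0, 1)))      # wrong when it does not
```

The result is exact for every operand whose bits 14 to 10 are '1'; the design is only safe because the coefficient profile guarantees that condition.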
To approximate the multiplier, we have adopted the 4x4 multiplier design proposed in [23], introduced in Section II and shown in Figure 3. We used it to build the 8x8 multiplier illustrated in Figure 6. In particular, the least significant submultiplier (A_L B_L) is removed, and an accurate 4x4 Wallace multiplier is used for the most significant part (A_H B_H) to maintain acceptable accuracy.
The outputs of the 4x4 sub-multipliers are accumulated using accurate adders. Then, to implement the 16x16 multiplier, we used the approximate 8x8 multiplier for the two least significant parts of the multiplier (A_L B_L and A_L B_H). However, for the two more significant parts (A_H B_H and A_H B_L), we have adopted an input-aware accurate Wallace multiplier with the related AND gates removed. In this design, the related b input bits are directly used as the PPs since the corresponding bits of a are always '1' (Figure 7a). The final accumulation step is done with exact adders, as shown in Figure 7b.
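The 8x8 composition described above can be sketched behaviourally as follows. This is a sketch under stated assumptions: the internal design of the approximate 4x4 multiplier of [23] is not reproduced here, so an exact 4x4 product (`mul4x4_exact`, a hypothetical stand-in) is plugged in for all remaining sub-multipliers, and only the removal of A_L B_L is modelled.

```python
def mul4x4_exact(a: int, b: int) -> int:
    """Stand-in for a 4x4 sub-multiplier (exact here; an approximate
    design could be passed in instead)."""
    return a * b


def mul8x8_segmented(a: int, b: int,
                     mul_hh=mul4x4_exact, mul_mid=mul4x4_exact) -> int:
    """8x8 multiplier from 4x4 blocks with the A_L * B_L block removed.
    Sub-products are accumulated with exact additions."""
    a_h, a_l = a >> 4, a & 0xF
    b_h, b_l = b >> 4, b & 0xF
    return ((mul_hh(a_h, b_h) << 8)
            + ((mul_mid(a_h, b_l) + mul_mid(a_l, b_h)) << 4))
    # The term a_l * b_l is intentionally omitted.


if __name__ == "__main__":
    worst = max(a * b - mul8x8_segmented(a, b)
                for a in range(256) for b in range(256))
    print(worst)  # largest error distance caused by dropping A_L * B_L
```

With exact stand-ins, the error distance equals A_L · B_L and is therefore bounded by 15 x 15 = 225, which is small relative to the 16-bit product; any approximation inside the middle sub-multipliers would add to this.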
Since the probability of the MSB of the input being '1' is 98% for this high-pass filter, the final column of the PP tree is also a good candidate for removing the related AND gates. Although it is very unlikely for this bit to be '0', when this happens the Error Distance (ED) can be very high for this approximate method. We have also included this design in our comparison, and the results are presented in the next section.
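To see why the error distance is large in the rare case where the MSB is '0', the following minimal behavioural sketch treats a[15] as if it were always '1'. It assumes unsigned 16-bit operands; the function names are hypothetical.

```python
def mul_msb_assumed_one(a: int, b: int) -> int:
    """Product computed as if a[15] were '1' regardless of its real value,
    modelling the removal of the AND gates for the MSB of a."""
    return (a | 0x8000) * b


def error_distance(a: int, b: int) -> int:
    """Absolute difference between the exact and the approximate product."""
    return abs(a * b - mul_msb_assumed_one(a, b))


if __name__ == "__main__":
    b = 1234
    print(error_distance(0x8001, b))  # MSB is '1': no error
    print(error_distance(0x0001, b))  # MSB is '0': error equals b << 15
```

In this model, whenever the assumption fails, the result is off by b shifted left by 15 positions, i.e., an error in the most significant part of the product.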

V. EXPERIMENTAL RESULTS
For the circuit evaluation, we compared our 16-bit input-aware approximate multipliers with some of the multipliers from the EvoApproxLib [28] library. Besides the presented IAA multiplier, we have also simulated an accurate version of this multiplier (IAM 16), in which the PP tree is missing the respective AND gates. Since the probability of the MSB of input a being '1' is 98%, another version of the multiplier is also implemented in which the last column of AND gates is removed as well (IAM 16 V2). Three Wallace multipliers are also implemented and included in the comparison: (i) the conventional Wallace tree, (ii) an input-aware Wallace tree with 5 columns of AND gates removed, and (iii) the one with the MSB column removed as well. We have calculated the power consumption, delay, and area of these 16-bit multipliers by synthesizing them using the Synopsys Design Compiler (DC) at 45nm. The designs were described in Verilog at the gate level. These results are presented in Table II.
In order to assess the quality of the approximation, the evaluation employed the following error metrics:
• The error distance (ED): the arithmetic difference between the accurate result and the result from the multiplier for a given input.
• The mean error distance (MED): the mean of all possible EDs.
• The error rate (ER): the probability of producing an incorrect result.
• The normalized mean error distance (NMED): the MED normalized by the maximum output of the accurate design.
• The mean relative error distance (MRED): the average value of all possible relative error distances (REDs), where the RED is the ED divided by the accurate result.
For the error analysis, we have simulated the multipliers using the dataset from the Hamming high-pass filter as one of the operands and uniform random inputs for the other. As mentioned before, the inputs of the multiplier in this application are fixed-point binary numbers where the MSB is the only bit on the left side of the binary point and the remaining 15 bits form the fractional part. The multiplication process for fixed-point binary numbers is the same as unsigned multiplication. This means that we can ignore the binary point of a and b, perform the multiplication, and put the binary [...]
Error metrics of the EvoApproxLib multipliers [28] (partial table; the first numeric column is the ER):
[...] [28]        100.00%   2.472E-08   2.901E-08   5.75708E-18
mul16u AQ1 [28]   0.00%     1.059E-09   1.489E-09   2.466E-19
mul16u BMC [28]   0.00%     0           0           0
mul16u CK3 [28]   100.00%   3.208E-05   3.314E-05   7.47E-15
mul16u DAE [28]   100.00%   1. [...]
Our Input Aware approximate multiplier (IAAxC) has very low power consumption compared to the other multipliers and shows better accuracy than the ones with lower power consumption. If we compare only the area of the different versions of the Wallace multiplier, we can see that it is reduced by 7.5%, since we have removed the AND gates corresponding to the non-uniform input distribution pattern. Meanwhile, there is no accuracy loss in the input-aware versions of the Wallace and segmented multipliers (IA Wallace and IAM).
In the second version of these multipliers (IAM V2 and IA Wallace V2), the final column of AND gates in the PP tree is removed as well. Although the probability that the MSB of a is '0' is very low (1.98%), when this happens it results in a large error distance. However, in many applications, a low error rate is more important than low error distances.
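The error metrics used in this section can be computed with a short routine such as the one below. This is a sketch under stated assumptions: the ED is taken as an absolute value, operand pairs whose exact product is zero are skipped when averaging the RED, and the function name `error_metrics` and its parameters are hypothetical.

```python
from typing import Callable, Dict, Iterable, Tuple


def error_metrics(approx_mul: Callable[[int, int], int],
                  operand_pairs: Iterable[Tuple[int, int]],
                  max_output: int) -> Dict[str, float]:
    """Compute ER, MED, NMED and MRED of approx_mul over the given pairs."""
    pairs = list(operand_pairs)
    if not pairs:
        raise ValueError("no operand pairs given")
    eds, reds, wrong = [], [], 0
    for a, b in pairs:
        exact = a * b
        ed = abs(exact - approx_mul(a, b))   # assumption: absolute difference
        eds.append(ed)
        wrong += ed != 0
        if exact != 0:                       # assumption: skip zero products
            reds.append(ed / exact)
    med = sum(eds) / len(eds)
    return {
        "ER": wrong / len(pairs),
        "MED": med,
        "NMED": med / max_output,
        "MRED": sum(reds) / len(reds) if reds else 0.0,
    }


if __name__ == "__main__":
    # Toy approximate multiplier: drops the LSB of the exact product.
    drop_lsb = lambda a, b: (a * b) & ~1
    all_pairs = [(a, b) for a in range(4) for b in range(4)]
    print(error_metrics(drop_lsb, all_pairs, max_output=9))
```

To mirror the evaluation described in the text, `operand_pairs` would pair the profiled filter coefficients with uniformly random values instead of enumerating all inputs.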

VI. CONCLUSIONS
In this work, we have proposed the concept of input-aware approximation, which is a promising approach to obtaining more efficient computing systems. When the data set the system is dealing with is known, we can design and approximate the system by exploiting the fact that the input follows a specific behavior. The example we have adopted in this work is only one of the many applications with such a working environment, where the input distribution is not uniform. As future work, we plan to apply the IAA approach to a larger number of applications offering more opportunities for approximation. Work is also in progress on automating the approximation, so that the tool decides which part of a design should be approximated and with which method. Input distribution patterns can also be included in this decision-making process.
ACKNOWLEDGMENT
This work has received funding from the APROPOS project in the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 956090.