A Novel Posit-based Fast Approximation of ELU Activation Function for Deep Neural Networks

Nowadays, real-time applications exploit DNNs more and more for computer vision and image recognition tasks. Such applications pose strict constraints in terms of both fast and efficient information representation and processing. New formats for representing real numbers have been proposed and, among them, the Posit format appears to be very promising, providing the means to implement fast approximated versions of widely used activation functions in DNNs. Moreover, information processing performance is continuously being improved thanks to advanced vectorized SIMD (single-instruction multiple-data) processor architectures and instruction sets like ARM SVE (Scalable Vector Extension). This paper explores both approaches (Posit-based implementation of activation functions and vectorized SIMD processor architectures) to obtain faster DNNs. The two proposed techniques are able to speed up both the DNN training and inference steps.


I. INTRODUCTION
Nowadays, deep neural networks (DNNs) are employed as a pervasive tool to process data, and considerable effort is being put into them by both industry and academia. The most active research thread is bringing real-time DNN performance to the lowest possible cost in terms of power and resource consumption.
An important emerging industry trend in this sense is the progressive reduction of DNN complexity, reducing the number of bits used for information representation and trying to avoid complex high-precision arithmetic (e.g. double-precision, 64-bit arithmetic). Some formats have already been proposed, like Google's BFLOAT16 (i.e. a revised Float16 representation embedded in Tensor Processing Units) and Intel's Flexpoint [1,2]. NVIDIA has put effort into the transprecision neural network training field in its latest GPU architectures, enabling the use of INT32 down to INT4 integral types alongside the Float32 single-precision type [3,4]. Furthermore, one of the most promising alternative representations for low-precision real arithmetic is the Posit number system [5]-[8] (see details in Section II). When trying to address the bottlenecks in the training and evaluation of DNNs, we need to take into account two important components. One is the massive use of convolution, pooling and small matrix-vector product operations, which can largely be brought back to the acceleration of vector operations. The other, less impactful yet meaningful to address, is the extensive use of non-linear activation functions after nearly every layer of a DNN.
On the activation function side, the use of non-linear operators is mandatory to offer enough complexity to let the neural network learn. The most commonly used non-linear operators are Sigmoid, Tanh (Hyperbolic Tangent), ReLU (Rectified Linear Unit) and ELU (Exponential Linear Unit) [9,10]. When choosing activation functions, both the data distribution and the information representation must be taken into account. In particular, the possibility offered by Posits of efficient and hardware-friendly non-linear activation functions has to be investigated, in order to provide fast approximations of commonly used non-linear operators.
On the side of accelerating matrix and vector operations, a lot of work has been done using the ability of Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs) to massively parallelize operations on vectors and matrices, but at a great power cost. On the general-purpose CPU side, exploiting the vectorization levels offered by processors is critical. Both Intel [11] and ARM [12] provide specific compilers that offer a first level of vectorization called auto-vectorization, i.e. automatic loop unrolling followed by the generation of SIMD (single instruction-multiple data) instructions at compilation time. Thanks to this approach, the SIMD instructions operate on multiple data elements at the same time, thus increasing the efficiency of loop execution. Moreover, both Intel and ARM offer a set of high-level instructions (e.g. C/C++ directives) called intrinsics (i.e. AVX/AVX2 or SSE for Intel [13] and SVE/SVE2 or NEON for ARMv8 [14]). These instructions allow the developer to explicitly instrument the low-level vectorization engines through a high-level interface.
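To make the intrinsics-based approach concrete, the following minimal sketch (our own illustration, not taken from the paper) adds two float arrays with ACLE SVE intrinsics in a vector-length-agnostic way; it assumes a C++ translation unit compiled with SVE support (e.g. armclang++ -march=armv8-a+sve), and the function name vadd_sve is ours:

```cpp
#include <arm_sve.h>

// Vector-length-agnostic addition of two float arrays with SVE intrinsics.
// The loop advances by svcntw() lanes (the number of 32-bit elements per
// SVE register, known only at run-time), while the predicate returned by
// svwhilelt_b32 masks off the out-of-bounds lanes of the last iteration.
void vadd_sve(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += static_cast<int>(svcntw())) {
        svbool_t pg = svwhilelt_b32(i, n);       // active-lane predicate
        svfloat32_t va = svld1(pg, a + i);       // predicated loads
        svfloat32_t vb = svld1(pg, b + i);
        svst1(pg, c + i, svadd_x(pg, va, vb));   // predicated add and store
    }
}
```

The same binary runs unmodified on hardware with SVE registers from 128 up to 2048 bits, which is precisely the property exploited in the experiments of Section IV.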

II. POSIT FORMAT AND ARITHMETIC
The novel Posit format has been proposed by John L. Gustafson in [5]. The format is a fixed-length encoding, configurable in the total number of bits nbits and the number of exponent bits esbits. We identify a Posit with given nbits and esbits as Posit(nbits,esbits) (e.g. Posit(16,0)). It is composed of four fields:
• Sign field (1 bit)
• Regime field (variable length)
• Exponent field (maximum length of esbits). This field can be shorter, or even missing at all, for some representations, even when esbits > 0
• Fraction field (variable length): it can be missing too
Among the other fields, the regime one is particularly interesting. The bits composing that field are discovered at run-time as a bit-string composed only of 0s or 1s and terminated, respectively, by a single 1 or 0. The value of the regime field is then dictated by the number of leading 0s or 1s. Given a Posit(nbits,esbits) represented by the signed integer l, and letting e and f be respectively the exponent and fraction values, the real number r represented by that encoding is:

r = \begin{cases} 0 & \text{if } l = 0 \\ \text{NaR} & \text{if } l = -2^{nbits-1} \\ \operatorname{sign}(l) \cdot useed^{k} \cdot 2^{e} \cdot (1 + f) & \text{otherwise} \end{cases}

where useed = 2^{2^{esbits}} and k is the value dictated by the regime bits. Decoding a Posit presents interesting aspects:
• When the sign bit is 1, the remaining bits are complemented (two's complement) before decoding, removing the need for a redundant representation of a negative 0.
• The value k identified by the regime bits acts as a super-exponent that scales the value of useed. If the bit-string is composed of 0s, the value of k is negative. Note that, when we are dealing with consecutive 1s, the value of k is one less than the number of leading 1s, in order to be able to represent the value 0.
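To make the decoding procedure concrete, the following minimal sketch (our own illustration, not the cppPosit implementation; the function name decode_posit8 is ours) decodes an 8-bit Posit bit pattern into a double, following the formula above:

```cpp
#include <cmath>
#include <cstdint>

// Illustrative decoder for a Posit(8,esbits) bit pattern, computing
// r = sign(l) * useed^k * 2^e * (1 + f). Not the cppPosit implementation.
double decode_posit8(uint8_t p, unsigned esbits) {
    if (p == 0x00) return 0.0;
    if (p == 0x80) return NAN;                    // NaR (Not a Real)

    int sign = (p & 0x80) ? -1 : 1;
    if (sign < 0) p = static_cast<uint8_t>(-p);   // two's complement first

    // Regime: run of identical bits after the sign, ended by the opposite bit.
    int pos = 6;
    int first = (p >> pos) & 1;
    int run = 0;
    while (pos >= 0 && ((p >> pos) & 1) == first) { ++run; --pos; }
    int k = first ? run - 1 : -run;
    --pos;                                        // skip the terminating bit

    // Exponent: up to esbits bits; bits cut off by the encoding read as 0.
    int e = 0;
    for (unsigned i = 0; i < esbits; ++i, --pos)
        e = (e << 1) | (pos >= 0 ? (p >> pos) & 1 : 0);

    // Fraction: all the remaining bits, with an implicit leading 1.
    double f = 0.0;
    for (double w = 0.5; pos >= 0; --pos, w *= 0.5)
        f += ((p >> pos) & 1) * w;

    const double useed = std::pow(2.0, std::pow(2.0, double(esbits)));
    return sign * std::pow(useed, k) * std::pow(2.0, e) * (1.0 + f);
}
```

For example, decode_posit8(0x40, 0) yields 1.0, decode_posit8(0x60, 0) yields 2.0 and decode_posit8(0xC0, 0) yields -1.0.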

A. Projective reals
Unlike real numbers, Posits map onto a circle, called the Posit ring. If we split the ring into its four quadrants we can detect two important regions in Posit arithmetic: the unitary range [-1, 1], which contains exactly half of the representable values, and its complement; values falling in the unitary range are the ones for which several operations reduce to fast bit manipulations (see Table I).

B. The cppPosit library

cppPosit is a Posit arithmetic software library developed at the University of Pisa. It exploits modern C++ features, such as templates, to provide a flexible and interoperable way to handle Posit numbers. It is designed to decouple the front-end packed Posit representation from the back-end approach to mathematical operations.
An important aspect of the cppPosit library is the presence of four different operational levels, from L1 to L4, corresponding to different efficiencies of the operations on Posits. Level 1 (L1) operations are simple bit manipulations of the Posit representation, which can be performed at the cost of an integer ALU operation. Managing to design L1 activation functions for DNNs is crucial to speed up the activation layers when using Posits. Table I shows some important L1 operations implemented in cppPosit. Level 2 (L2) operations require the extraction of the Posit fields, without further computations; their cost is determined by the format encoding/decoding operations. Level 3 (L3) operations require the unpacked Posit version to be built, thus including the full computation of regime and exponent as an added cost with respect to L2. Level 4 (L4) operations require the conversion to the Float format, exploiting either software or hardware back-ends.

TABLE I: Some important L1 operations implemented in cppPosit

Operation          | Approximated | Requirements
FastSigmoid(x) [5] | yes          | esbits = 0
FastTanh(x) [6]    | yes          | esbits = 0
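As an example of an L1 operation, the fast sigmoid approximation proposed in [5] for Posit(nbits,0) reduces to flipping the sign bit of the raw representation and shifting it right by two positions. A minimal sketch for 8-bit posits (our own illustration, not the cppPosit code):

```cpp
#include <cstdint>

// L1 approximation of the sigmoid for Posit(8,0), acting directly on the
// raw bit pattern: flip the sign bit, then shift the byte right by two.
// The result is again a Posit(8,0) pattern approximating 1/(1 + exp(-x)).
uint8_t fast_sigmoid_p8(uint8_t p) {
    return static_cast<uint8_t>((p ^ 0x80u) >> 2);
}
```

For instance, the pattern 0x00 (posit value 0) maps to 0x20, which encodes 0.5 = Sigmoid(0); the whole operation costs one XOR and one shift on the integer ALU.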

III. THE EXPONENTIAL LINEAR UNIT (ELU)
As reported in Table I, both the Sigmoid and Tanh activation functions have an approximated fast version. However, the applicability of those activation functions in large DNNs may result in the well-known phenomenon of vanishing gradients. The ReLU activation function:

ReLU(x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{if } x \le 0 \end{cases}

has been introduced to cope with this kind of phenomenon, having an unbounded co-domain on the positive x-axis. The ELU activation function extends ReLU with a smooth, saturating behaviour on the negative x-axis:

ELU(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha (e^{x} - 1) & \text{if } x \le 0 \end{cases}

where \alpha is a positive parameter (\alpha = 1 in the following).
As pointed out in [15], if the ELU is scaled by a predetermined parameter, it applies a normalization across the layers of the network, without the need for additional normalization layers (e.g. batch normalization layers).

A. Fast approximation: FastELU
In order to build an L1 approximation of the ELU function we must focus on the negative x-axis, since on the first quadrant the ELU is simply the identity function. If we look at the Sigmoid function expression:

Sigmoid(x) = \frac{1}{e^{-x} + 1}   (1)

we can manipulate it with some algebraic steps:

\frac{1}{Sigmoid(x)} = e^{-x} + 1   (2)

\frac{1}{Sigmoid(x)} - 2 = e^{-x} - 1   (3)

Evaluating (3) at -x and substituting back the Sigmoid expression, we get the ELU expression for negative values of the argument:

ELU(x) = \frac{1}{Sigmoid(-x)} - 2, \quad x \le 0   (4)

Referring to Table I, we can prove that this expression is an L1 one.
Step (4) is L1 since the previous step produces a result in [0, 1].
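As a numeric sanity check of identity (4), here is a minimal sketch written with plain IEEE doubles rather than posits (on posits, the Sigmoid and the reciprocation would be replaced by their L1 counterparts); the function names are ours:

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>

// Textbook definitions (alpha = 1 for the ELU).
double sigmoid(double x) { return 1.0 / (std::exp(-x) + 1.0); }
double elu(double x)     { return x > 0.0 ? x : std::exp(x) - 1.0; }

// ELU rewritten through the Sigmoid as in Eq. (4):
// identity on the positive axis, 1/Sigmoid(-x) - 2 elsewhere.
double elu_via_sigmoid(double x) {
    return x > 0.0 ? x : 1.0 / sigmoid(-x) - 2.0;
}

int main() {
    for (double x = -8.0; x <= 8.0; x += 0.125)
        assert(std::fabs(elu(x) - elu_via_sigmoid(x)) < 1e-12);
    std::puts("Eq. (4) matches the ELU on the sampled range");
    return 0;
}
```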

IV. EXPERIMENTAL RESULTS
The benchmark used for the experimental analysis is an image classification task on two different datasets: MNIST and GTSRB (German Traffic Sign Recognition Benchmark). The LeNet-5 deep neural network model [16] has been used during the experimental phase. For the vectorization comparison we chose ARM SVE, since it allows changing the SIMD vector size at run-time, without the need of recompiling the code (unlike the Intel SIMD engine). The benchmark is compiled both with and without ARM SVE support to assess the performance increase, using the armclang++ 19.2 compiler. Table II shows the performance comparison for the GTSRB benchmark executed in an ARMv8 (AArch64) instruction emulator with different levels of vectorization. As reported, when increasing the length of the SVE registers the number of executed SVE instructions decreases, and thus the processing time for the benchmark decreases as well, showing the effectiveness of the vectorization.
V. RESULTS ON FASTELU USING POSITS

Tables III and IV show the performance comparison on the two datasets between different activation functions and different underlying information representations. As reported therein, the Float32 accuracy is easily matched by Posits with 16 down to 10 bits and, in particular, for GTSRB similar performance is obtained even with a Posit(8,0). According to these results, the adoption of Posits and of the ELU activation function can lead to nearly the same processing accuracy of Float32, but with a remarkable reduction, up to a factor of 4, of the data storage.

VI. CONCLUSIONS
In this work we have introduced a fast way to approximate the well-known ELU activation function in DNNs when using the novel Posit format for representing the reals, instead of classic IEEE-754 Floats. We have then reported the preliminary results of this activity, which show that vectorization has a positive effect on network processing time. Moreover, by combining the use of Posits and of the ELU activation function, DNN computation can achieve the same accuracy levels of IEEE-754 Floats, but with a reduction of up to a factor of 4 in data storage and transfer complexity. Future work will include the development of hardware accelerators for Posit operations.