FPGA based low latency, low power stream processing AI
The timing and power of an embedded neural network application is usually dominated by the access time and the energy cost per memory access. From a technical point of view, the hundreds of thousands of look-up tables (LUT) of a field programmable gate array (FPGA) circuit are nothing more than small but fast and energy-efficiently accessible memory blocks. If the accesses to the block memory can be reduced or, as in our case, avoided altogether, the resulting neural network would compute much faster and with far lower energy costs.
We have therefore developed a design scheme that uses precomputed convolutions and stores them in the LUT memories. This allows small (mostly one-dimensional) convolutional neural networks (CNN) to be executed without block memory accesses: Activations are stored in the local per LUT registers and the weights and biases of all neurons are encoded in the lookup tables. Each neuron is assigned its exclusive share of logic circuits. This completely avoids the need for memory accesses to reconfigure a neuron with new weights and allows us to perform weight optimisations at design time. However, it limits the applicability of the overall method to comparatively small neural networks, since we need several LUTs per neuron and even the largest FPGAs only provide hundreds of thousands of LUTs.
To make this "in LUT processing" possible, we had to limit the set of available neural network functions. We have identified and implemented a set of functions that are sufficient to make the neural network work, but which can all be implemented efficiently in an FPGA without memory access. Our philosophy is that it is better to adapt the FPGA during training to make the best use of the limited resources available than to try to optimise the functions in hardware, resulting from a non-limited neural network.
To make this design scheme usable, we had to develop a set of design tools, helping the AI designer to convert a given reference AI in TensorFlow into an equivalent network of the available hardware functions and to finetune the AI to help to compensate the accuracy loss from changing the implementation. The two most powerful optimization techniques we applied are a variable bitwidht quantization and a depth-wise separation of the convolution.
In order to demonstrate the performance of this method, we implemented a CNN based ECG detection. Our implementation only used 40% of the available LUTs on the Spartan S15 chip and none of the blockram or DSP circuits. The system processed 500 pre-recorded ECGs of 5575 samples in 281ms, using only 73mJ in total, resulting in 10 million samples per second and an energy cost of 26.2nJ per sample.