Energy Efficient In-memory Hyperdimensional Encoding for Spatio-temporal Signal Processing

The emerging brain-inspired computing paradigm known as hyperdimensional computing (HDC) has been shown to provide a lightweight learning framework for various cognitive tasks compared to the widely used deep learning-based approaches. Spatio-temporal (ST) signal processing, which encompasses biosignals such as electromyography (EMG) and electroencephalography (EEG), is one family of applications that could benefit from an HDC-based learning framework. At the core of HDC lie manipulations and comparisons of large bit patterns, which are inherently ill-suited to conventional computing platforms based on the von Neumann architecture. In this work, we propose an architecture for ST signal processing within the HDC framework using predominantly in-memory compute arrays. In particular, we introduce a methodology for the in-memory hyperdimensional encoding of ST data to be used together with an in-memory associative search module. We show that the in-memory HDC encoder for ST signals offers at least 1.80x energy efficiency gains, 3.36x area gains, and 9.74x throughput gains compared with a dedicated digital hardware implementation. At the same time, it achieves a peak classification accuracy within 0.04% of that of the baseline HDC framework.


I. INTRODUCTION
Almost all breakthroughs in artificial intelligence in the last decade are characterized by an underlying machine learning model that entails a higher complexity in terms of the number of operations and parameters compared to contemporary models. Such increased model complexity demands more energy to perform training and inference tasks. Nevertheless, the human brain, with its far-reaching cognitive capabilities, consumes several orders of magnitude less power. This disparity has paved the way to exploring brain-inspired alternatives. Hyperdimensional computing (HDC) [1] is one promising brain-inspired computing approach that relies on representing entities using high-dimensional (up to 10,000 dimensions) vectors called hypervectors. Similarly to the brain, where representations are spread across thousands of randomly originated neurons, a set of (pseudo)random orthogonal hypervectors forms the basis in the HDC framework. These hypervectors are then combined and compared using a well-defined set of algebraic operations to derive representations for composite entities and to find similarities, respectively.
A promising application domain for HDC is the spatio-temporal (ST) processing of signals, acquired by EMG sensors for example. The corresponding HDC algorithm was presented in [13], and later scaled up for high-density flexible EMG sensors [14], [18]. The same HDC algorithm has been used in a variety of applications such as EEG [16], iEEG [17], ExG [19] in general, as well as speech recognition [20], delivering higher classification accuracy than the established approaches. ST signal processing differs from other classes of applications because numerical data sequences received from multiple channels within a certain time window are considered as the input. Because they are often deployed at the edge of the Internet of Things and handle confidential input data, the ST applications identified above could benefit immensely from energy-efficient hardware platforms.
Earlier works address ST HDC signal processing in low-power hardware platforms, such as PULP [15] and ARM Cortex-A53 [21]. For example, PULP-HD [15] describes the implementation of the ST HD encoder on multiple cores in a PULP cluster for achieving a target of 10 ms detection latency in real time. Nevertheless, the energy efficiency could be improved further by using in-memory computing approaches [10], [22], [23]. In-memory computing is an emerging paradigm where the physical attributes of memory devices are exploited to compute in place [24]. Operations that require manipulation and comparison of large strings of bit patterns, which are at the core of the HDC framework, are particularly well suited for in-memory computing [10].
In this work, we propose an in-memory computing-based system for ST signal processing with HDC. An illustration of the system is given in Fig. 1, taking the EMG-based hand gesture recognition as a use case. Compared to prior work on in-memory HDC encoding [10], we present a novel in-memory computing HDC encoding architecture tailored for ST inputs. When coupled with an in-memory associative memory search module, such as the one presented in [10], we get a complete in-memory HDC processor for ST signal processing. We derive classification accuracy results from simulations using a statistical model of phase-change memory (PCM) crossbar arrays. Furthermore, we estimate the throughput, area, and energy efficiency of the in-memory ST HDC encoder and compare it with a dedicated digital encoder as well as an ST HDC encoder running on a low-power general purpose compute platform, PULP-HD [15].
©2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists,
Fig. 1. Concept of in-memory hyperdimensional encoding of spatio-temporal signals within the application of hand gesture recognition. First, the EMG signals are acquired from the electrodes connected to different parts of the subject's arm. After a pre-processing step, the data from different channels are embedded into one hypervector using spatial encoding and temporal encoding in a first memristive crossbar array, which is the focus of this work. The resulting query vector is passed to a second crossbar array to perform the associative memory search. The final result collected from the peripheral of the second crossbar predicts the class of the hand gesture.

II. ALGORITHM
A. Conventional ST HDC encoding algorithm
In the conventional ST HDC encoding algorithm [13], the input acquired from each channel goes through several preprocessing steps [15] and is converted to a stream of discrete time samples. An HDC encoder maintains a record of N consecutive time samples from M channels to generate an encoded N-gram hypervector embedding defined as:
where n ∈ {1, 2, ..., N} is the relative time index, m ∈ {1, 2, ..., M} denotes the source channel index, and {l_1, l_2, ..., l_L} are the L discrete quantization levels given in ascending order, l_1 < l_2 < ... < l_L. The output of the embedding function G is a D-dimensional binary hypervector.
First, the encoder projects the sample values s_{n,m} to a high-dimensional space using a so-called continuous item memory (CiM) [13]. As opposed to assigning quasi-orthogonal hypervectors to every unique discrete sample value, the CiM sets the Hamming distance (HamD) between two vectors to be proportional to the absolute difference between the corresponding sample values. This is achieved by choosing quasi-orthogonal hypervectors for the minimum and maximum levels l_1 and l_L, and hypervectors corresponding to the intermediate levels that satisfy:
The CiM-projected vectors are then bound to the relevant channel ID hypervector E_m. The channel ID hypervectors are stored in an item memory (IM) and are quasi-orthogonal. The channel-bound hypervectors are given by I_{m,n} = CiM(s_{n,m}) * E_m, where * denotes the element-wise XOR binding operation. The channel-bound hypervectors are then bundled together to produce spatial hypervectors S_n = Majority(I_{1,n}, I_{2,n}, ..., I_{M,n}), where, for each dimension, the majority function outputs 1 (0) if the majority of channels have a 1 (0). The spatial hypervectors S_1, S_2, ..., S_N then enter a temporal encoding stage, which outputs the final encoded N-gram G according to the following equation:
where t is the time step at which the N-th sample of the current data record enters the system, and ρ is the vector permutation operator, which is implemented as a circular right shift.
During the training phase, N-grams collected from the same class are further bundled to produce class prototype hypervectors P_c, where c ∈ {1, 2, ..., C}, and C is the number of classes. During the inference phase, the N-gram produced by the same encoder is called a query hypervector Q and is compared, using binary dot-product similarity, against each of the prototype hypervectors P_c. The class with the highest similarity is selected as the predicted class.
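The conventional encoding flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the reference implementation of [13]: the CiM is built by progressively flipping up to D/2 bits (one common way to obtain a Hamming distance that grows linearly with the level gap), older samples are assumed to be rotated more in the temporal stage, and ties in the majority vote fall to 0. All function names are hypothetical.

```python
import random

def random_hv(d, rng):
    """Random dense binary hypervector as a list of 0/1 bits."""
    return [rng.randint(0, 1) for _ in range(d)]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def make_cim(num_levels, d, rng):
    """Continuous item memory (assumed construction): level i flips the first
    k_i positions of a fixed random flip order, with k_i growing linearly up
    to d/2, so the end levels are quasi-orthogonal. Needs num_levels >= 2."""
    base = random_hv(d, rng)
    flip_order = rng.sample(range(d), d // 2)
    cim = []
    for i in range(num_levels):
        k = round(i * (d // 2) / (num_levels - 1))
        hv = base[:]
        for pos in flip_order[:k]:
            hv[pos] ^= 1
        cim.append(hv)
    return cim

def xor(a, b):
    """Element-wise XOR binding."""
    return [x ^ y for x, y in zip(a, b)]

def majority(hvs):
    """Bitwise majority; ties (even number of inputs) fall to 0 in this sketch."""
    half = len(hvs) / 2
    return [1 if sum(bits) > half else 0 for bits in zip(*hvs)]

def rotate(hv, k):
    """rho^k: circular right shift by k positions."""
    k %= len(hv)
    return hv[-k:] + hv[:-k] if k else hv[:]

def encode_ngram(record, cim, ids):
    """record[n][m] is the quantization-level index of sample n on channel m."""
    n_total = len(record)
    g = [0] * len(ids[0])
    for n, samples in enumerate(record):
        bound = [xor(cim[s], ids[m]) for m, s in enumerate(samples)]  # I_{m,n}
        s_n = majority(bound)                                         # S_n
        # Assumed convention: the oldest sample is rotated N-1 times.
        g = xor(g, rotate(s_n, n_total - 1 - n))
    return g

rng = random.Random(0)
cim = make_cim(num_levels=5, d=1000, rng=rng)
ids = [random_hv(1000, rng) for _ in range(3)]   # three channels, odd on purpose
ngram = encode_ngram([[0, 1, 2], [3, 4, 0], [1, 1, 1]], cim, ids)
```

Three channels are used in the example so that the majority vote never ties; an even channel count needs a tie-breaking rule.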

B. Adaptations to suit in-memory computing
We propose several adaptations to the conventional algorithm to suit in-memory computing. First, all channel-bound hypervectors are pre-computed and unrolled using:
instead of waiting for the spatial bundling given in (3). Finally, the hypervectors T_m are bundled to produce the adapted N-gram hypervector G', given by:
In summary, compared to the ST encoder in [13], the proposed in-memory ST encoding algorithm pre-computes channel-bound hypervectors and pushes the bundling portion of the spatial encoding step downstream of the temporal encoding step. This offers further flexibility to set individual quantization levels and N-gram sizes per channel, which is not possible with the conventional encoder. This is a useful feature for exploiting channel-specific spatial and temporal dynamics.
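A minimal Python sketch of the adapted flow follows, assuming that each per-channel hypervector T_m is accumulated by rotating the running result by one position and XOR-ing in the next pre-computed row. That update rule is an assumption, chosen because it matches a chain of XOR gates and registers; the function names and the toy 4-bit vectors are hypothetical.

```python
def xor(a, b):
    """Element-wise XOR binding."""
    return [x ^ y for x, y in zip(a, b)]

def rotate(hv, k):
    """rho^k: circular right shift by k positions."""
    k %= len(hv)
    return hv[-k:] + hv[:-k] if k else hv[:]

def majority(hvs):
    """Bitwise majority; assumes an odd number of inputs."""
    half = len(hvs) / 2
    return [1 if sum(bits) > half else 0 for bits in zip(*hvs)]

def precompute_bound(cim, ids):
    """table[m][l] = CiM(level l) XOR E_m, i.e. the M*L rows that would be
    stored once in the crossbar instead of being recomputed per sample."""
    return [[xor(level_hv, e_m) for level_hv in cim] for e_m in ids]

def encode_ngram_adapted(record, table):
    """record[n][m] is the level index of sample n on channel m.
    Temporal encoding runs per channel first; bundling happens last."""
    per_channel = []
    for m, rows in enumerate(table):
        t_m = [0] * len(rows[0])
        for samples in record:                 # chronological order
            row = rows[samples[m]]             # one pre-computed row read
            t_m = xor(rotate(t_m, 1), row)     # assumed update rule
        per_channel.append(t_m)
    return majority(per_channel)

# Toy example: D = 4, L = 2 levels, M = 3 channels, N = 2 samples.
cim = [[0, 0, 0, 0], [1, 1, 0, 0]]
ids = [[1, 0, 1, 0], [0, 1, 1, 0], [0, 0, 0, 1]]
table = precompute_bound(cim, ids)
g_adapted = encode_ngram_adapted([[0, 1, 0], [1, 0, 1]], table)
print(g_adapted)
```

Because each channel is encoded independently before bundling, a variant of `encode_ngram_adapted` could use a different number of levels or a different N per channel, which is the flexibility the text points out.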

III. ARCHITECTURE
The architecture of the in-memory ST HDC encoder is shown in Fig. 2. It consists of a memristive crossbar array and a few peripheral circuits, namely a circular buffer, a binder, and a bundler. The circular buffer maintains the last N samples and reads them out sequentially. It has M write pointers (wp_1, ..., wp_M), each synchronized to one of the low-frequency external input channels, which write data in parallel to the next allotted locations in the buffer. There is a single read pointer synchronized to the internal clock frequency, which is set at least N·M× faster than the external frequency to avoid any data loss. The read pointer (rp) traverses the whole input data record sequentially: it reads the samples in chronological order within each channel and repeats this over all channels.
As shown in Fig. 2, the pre-computed channel-bound hypervectors I^l_m given in (5) are stored along the rows of the M·L × D crossbar array. This allows us to save D·N·M XOR operations per input data record. Performing temporal encoding on each channel separately allows us to reduce the number of intermediate buffers in the digital domain that must possess read/write capability at the expense of additional read-only storage in the PCM crossbar array. This is an acceptable trade-off because the PCM device consumes approximately 23 fJ of energy per read operation [25]. This is just a fraction of the energy incurred by digital read/write buffers. Furthermore, thanks to their non-volatile nature, PCM devices do not consume energy when retaining their content in idle mode. The output of the crossbar array is connected to an array of sense amplifiers. Sequential processing of the data record allows time-sharing the sense amplifier array and the downstream binder module, allowing us to save a significant amount of energy and area in the peripherals. The binder module consists of an array of D XOR gates daisy-chained with D registers, which collectively implement (6).
The bundler module in the architecture implements (7). When the number of input channels is even, an optional scan chain is activated at the start of the encoding cycle. It propagates a random hypervector, generated bit-by-bit by a linear feedback shift register, so that ties are broken randomly. The majority function is implemented as an array of log_2 M-bit accumulators, followed by an array of comparators whose reference is set at ceil((M + 1)/2) − 0.5.
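The comparator reference can be checked with a short Python sketch. It assumes that the random tie-breaking hypervector is simply added to the per-dimension accumulators as one extra vote when M is even; Python's `random` module stands in for the linear feedback shift register, and the function names are hypothetical.

```python
import math
import random

def comparator_reference(num_channels):
    """Reference level from the text: ceil((M + 1) / 2) - 0.5."""
    return math.ceil((num_channels + 1) / 2) - 0.5

def bundle(hvs, rng):
    """Majority bundling with a random extra vote when the input count is even."""
    m = len(hvs)
    counts = [sum(bits) for bits in zip(*hvs)]        # per-dimension accumulators
    if m % 2 == 0:
        # Assumption: the tie-breaker contributes one extra vote per dimension.
        counts = [c + rng.randint(0, 1) for c in counts]
    ref = comparator_reference(m)
    return [1 if c > ref else 0 for c in counts]

rng = random.Random(1)
print(comparator_reference(3), comparator_reference(4), comparator_reference(5))
print(bundle([[1, 1, 0], [1, 0, 0], [0, 1, 0]], rng))
```

For M = 4 the reference is 2.5: three or more ones always win, one or fewer always lose, and a 2-2 split is decided by the extra vote. For odd M no extra vote is needed, and the same formula gives the usual strict-majority threshold.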
The controller module receives the encoding parameters and coordinates the flow across the rest of the modules. For example, it communicates the start/end addresses for each of the write pointers, the offset value added to circular buffer read data to derive the row address in the crossbar array, when to update the 1-bit register array in the binder module, when to update accumulators in the bundler, etc.
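The read-pointer traversal and the row-address derivation can be illustrated as follows. The channel-major row layout (row = m·L + level index, with m·L acting as the per-channel offset) is an assumption made for this sketch; the text only states that an offset is added to the buffer read data. The helper names and the example frequency are hypothetical.

```python
def row_address(channel, level, num_levels):
    """Assumed layout: the L rows of channel m occupy rows m*L .. m*L + L - 1."""
    return channel * num_levels + level

def read_schedule(record, num_levels):
    """Row addresses in read-pointer order: chronological within a channel,
    then repeated over all channels. record[n][m] is a level index."""
    num_samples = len(record)
    num_channels = len(record[0])
    return [row_address(m, record[n][m], num_levels)
            for m in range(num_channels)
            for n in range(num_samples)]

def min_internal_clock_hz(external_hz, ngram_size, num_channels):
    """Lower bound on the internal clock so that all N*M reads finish
    before the next external sample arrives."""
    return external_hz * ngram_size * num_channels

schedule = read_schedule([[0, 2], [1, 1]], num_levels=3)   # N = 2, M = 2, L = 3
print(schedule)
print(min_internal_clock_hz(external_hz=500, ngram_size=3, num_channels=4))
```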

A. Experimental setup
The proposed in-memory ST HDC architecture is benchmarked on the EMG hand gesture recognition dataset [13]. It includes data acquired from five subjects who perform five classes of gestures. The data is sampled at 500 Hz via four EMG electrodes attached to each subject's forearm. The class label and channel readings are provided at each time frame. We use 25% of the 175× down-sampled data to train a 10,000-D HDC model for each subject and test on 800 queries on average per subject. The results are averaged across the subjects to obtain the classification accuracy. For PCM simulations, the statistical model described in [10] is used, which captures non-ideal effects such as spatial and temporal variations in the PCM crossbar array.
Fig. 3 shows the classification accuracy obtained from the in-memory ST encoder running in software and from the same encoder simulated using the statistical model, and compares these results with the baseline ST encoder [13]. In all three encoder models, the in-memory associative memory search module is also simulated with the statistical model. The proposed in-memory ST encoder achieves a peak accuracy of 98.9% (see Fig. 3(a)) when N=9, L=15. This is only 0.04% lower than the peak accuracy of the baseline ST encoder. It is also 1.1 percentage points higher than the peak accuracy of 97.8% reported for the reference encoder [13], which uses the binding result of two channels to break ties while performing the associative memory search using Hamming distance.

B. Classification accuracy results
As the N-gram size decreases, the spatial encoding plays a more prominent role than the temporal encoding. Thus, the higher accuracy delivered by the in-memory ST encoder compared to the conventional ST encoder (see Fig. 3(a)) for smaller N-gram sizes can be explained by the spatial bundling operation being relocated downstream in the in-memory ST encoder. This allows retaining more useful spatial information in the encoded N-gram hypervector. Fig. 3(b) shows that the in-memory encoder simulated with the statistical model exhibits an increasing accuracy drop, compared to the same encoder running in software, as the number of quantization levels increases. This is because the PCM crossbar array size grows linearly with the number of quantization levels, thereby amplifying the negative effect of spatial PCM variations on the classification accuracy. However, given that the in-memory ST encoding operations in Equations (5) to (7) are element-wise operations, or operations that involve neighboring elements, the crossbar array can easily be split into several realistically sized subarrays [26] with simple single-wire connectivity between subarray peripherals. This facilitates the silicon realization with negligible additional cost in terms of energy and area, and helps mitigate the effect of PCM spatial variations by compensating for subarray-level conductance variation.

C. Energy efficiency study and benchmark
We performed an energy efficiency study of the in-memory ST encoder. The binary PCM device specifications given in [10] are used as the reference for obtaining power and timing numbers of the crossbars. The power and timing numbers for the digital peripherals are obtained from component-wise simulations of a post-synthesis netlist generated with 65 nm CMOS technology. For comparison, we considered an equivalent digital ST encoder operating entirely in CMOS, whose power and timing numbers are obtained from a component-wise simulation of the post-synthesis netlist generated at the same technology node. Both the digital peripherals and the equivalent CMOS encoder operate at 440 MHz and a 1.2 V supply voltage. We observe that the in-memory ST encoder is able to produce 31.5M, 18.9M, and 10.5M N-grams/s for N-gram sizes of 3, 5, and 9, respectively, which is a throughput improvement of 9.74× over the digital counterpart and a 0.28M× improvement over the 10 ms fixed latency PULP-HD [15].
The total area of the in-memory ST encoder is 0.37, 0.44, and 0.51 mm² when the number of quantization levels is set to 3, 12, and 21, respectively. This is a 3.36×, 5.46×, and 6.97× area reduction, respectively, compared to the digital CMOS ST encoder. When breaking down the area numbers further, we find that, irrespective of the number of quantization levels, a fixed area of 0.32 mm² is occupied by the digital peripheral logic, including the circular buffer, the binder, and the bundler; another 0.02 mm² is occupied by the row decoders and the sense amplifiers; the rest of the area is taken by the PCM device array itself.
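The throughput and area figures above can be cross-checked with simple arithmetic. The Python sketch below uses only numbers quoted in the text; the derived quantities (the product of N and the N-gram rate, the PCM array area, and the implied digital encoder area) are inferences from those numbers rather than reported results, and the names are hypothetical.

```python
# Figures quoted in the text.
THROUGHPUT = {3: 31_500_000, 5: 18_900_000, 9: 10_500_000}   # N -> N-grams/s
TOTAL_AREA_MM2 = {3: 0.37, 12: 0.44, 21: 0.51}               # L -> mm^2
AREA_REDUCTION = {3: 3.36, 12: 5.46, 21: 6.97}               # L -> factor
FIXED_AREA_MM2 = 0.32 + 0.02   # digital peripherals + row decoders/sense amps

def sample_steps_per_second(ngram_size):
    """N times the N-gram rate; a constant value suggests that the encoding
    time grows linearly with N."""
    return ngram_size * THROUGHPUT[ngram_size]

def pcm_array_area_mm2(levels):
    """Area left for the PCM array after subtracting the fixed overhead."""
    return TOTAL_AREA_MM2[levels] - FIXED_AREA_MM2

def implied_digital_area_mm2(levels):
    """Digital encoder area implied by the quoted reduction factor."""
    return TOTAL_AREA_MM2[levels] * AREA_REDUCTION[levels]

for n in THROUGHPUT:
    print(n, sample_steps_per_second(n))
for levels in TOTAL_AREA_MM2:
    print(levels,
          round(pcm_array_area_mm2(levels), 2),
          round(implied_digital_area_mm2(levels), 2))
```

The product of N and the N-gram rate is the same for all three N-gram sizes, and the PCM array share grows with the number of quantization levels while the peripheral area stays fixed.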
We estimated the energy efficiency of the in-memory ST encoder and compared it with the digital ST encoder as shown in Fig. 4. The in-memory ST encoder achieves a peak energy efficiency of 75.1M N-grams/s/W with N=3 and L=3. The total energy required for encoding an N-gram using this configuration is 13.3 nJ, 91.4% of which is spent on digital peripheral circuits, 8.4% on sense amplifiers/row decoders, and a mere 0.12% on PCM devices. This results in a 1.80× energy efficiency gain compared to a similarly configured digital ST encoder. The gain improves to a maximum of 8.83× as the N-gram sizes and quantization levels are increased (see Fig. 4). Table I presents physical and performance characteristics of 1-core and 4-core PULP-HD encoders compared with a dedicated digital CMOS encoder and the proposed PCM-based in-memory ST encoder with the same parameter configuration (N = 3, L = 21). The energy required for N-gram encoding is reduced from the µJ range for the PULP-HD implementations to the nJ range for the dedicated CMOS and PCM-based in-memory encoders. In summary, when compared with 1-core and 4-core PULP-HD encoders, the in-memory ST encoder achieves 1320× and 284× higher energy efficiency, respectively.
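The relation between the quoted efficiency and the energy per N-gram can be verified directly, since N-grams/s/W is the same unit as N-grams per joule. The sketch below derives the per-component energies from the quoted percentages; the digital encoder energy is an implied value that assumes the 1.80× gain is a pure energy-per-N-gram ratio. All names are hypothetical.

```python
PEAK_EFFICIENCY = 75.1e6   # N-grams/s/W, i.e. N-grams per joule (N=3, L=3)
SHARES_PERCENT = {
    "digital peripherals": 91.4,
    "sense amplifiers / row decoders": 8.4,
    "PCM devices": 0.12,
}
GAIN_OVER_DIGITAL = 1.80

def energy_per_ngram_nj(ngrams_per_joule):
    """Energy per encoded N-gram in nanojoules."""
    return 1e9 / ngrams_per_joule

total_nj = energy_per_ngram_nj(PEAK_EFFICIENCY)
breakdown_nj = {name: total_nj * pct / 100 for name, pct in SHARES_PERCENT.items()}
# Assumption: the efficiency gain equals the ratio of energies per N-gram.
implied_digital_nj = total_nj * GAIN_OVER_DIGITAL

print(round(total_nj, 1))
for name, value in breakdown_nj.items():
    print(name, round(value, 3))
print(round(implied_digital_nj, 1))
```

Under these assumptions the PCM devices account for roughly 16 pJ of the 13.3 nJ total, so nearly all of the remaining energy sits in the digital peripherals.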

V. CONCLUSION
In this paper, we have demonstrated HDC encoding on spatio-temporal signals using in-memory computing techniques on memristive crossbar arrays. This approach allows selecting separate parameter combinations for each channel, further enhancing the flexibility of the encoding process. By simulating our architecture with a phase-change memory statistical model, we obtain a peak classification accuracy of 98.9% (within 0.04% of the baseline), while achieving 1.80×-8.83× higher energy efficiency than a dedicated digital CMOS encoder and a 284× energy efficiency gain over an encoder running on a low-power general purpose computing platform.