Comparing the performance of Hebbian against backpropagation learning using convolutional neural networks

In this paper, we investigate Hebbian learning strategies applied to Convolutional Neural Network (CNN) training. We consider two unsupervised learning approaches, Hebbian Winner-Takes-All (HWTA) and Hebbian Principal Component Analysis (HPCA). The Hebbian learning rules are used to train the layers of a CNN in order to extract features that are then used for classification, without requiring backpropagation (backprop). Experimental comparisons are made with state-of-the-art unsupervised (but backprop-based) Variational Auto-Encoder (VAE) training. For completeness, we consider two supervised Hebbian learning variants, Supervised Hebbian Classifiers (SHC) and Contrastive Hebbian Learning (CHL), for training the final classification layer, which are compared to Stochastic Gradient Descent training. We also investigate hybrid learning methodologies, where some network layers are trained following the Hebbian approach, and others are trained by backprop. We tested our approaches on the MNIST, CIFAR10, and CIFAR100 datasets. Our results suggest that Hebbian learning is generally suitable for training early feature extraction layers, or for retraining higher network layers in fewer training epochs than backprop. Moreover, our experiments show that Hebbian learning outperforms VAE training, with HPCA performing generally better than HWTA.


Introduction
The error backpropagation algorithm (backprop) has been used with great success for training neural networks (e.g., [9,35]) on a variety of learning tasks. However, neuroscientists doubt that it is biologically plausible and that it models the real learning processes of the brain [27].
A possible biologically plausible learning mechanism could be based on the so-called Hebbian principle: "Neurons that fire together wire together." Starting from this simple principle, it is possible to formulate different variants of the Hebbian learning rule which are also interesting from the computer science point of view. For example, Hebbian learning with Winner-Takes-All (HWTA) competition [7] allows a group of neurons to learn to perform clustering on a set of data. Another interesting variant is Sanger's rule [33], which performs Principal Component Analysis (PCA) on the data in an online fashion. In essence, Hebbian algorithms can be employed to extract features of interest from data and provide a biologically plausible, efficient, and online solution for unsupervised learning tasks.
In the context of Convolutional Neural Networks (CNNs), the various network layers act as feature extractors, with lower layers extracting low-level features and subsequent layers extracting progressively higher-level features. Therefore, Hebbian learning algorithms could represent a promising option for training such networks.
Previous works [2,36,37] already showed that Hebbian learning variants are suitable for training relatively shallow networks (with two or three layers), which are appealing for applications on constrained devices. For instance, in [1], preliminary results showed that HWTA competition was effective to re-train higher layers of a pre-trained network, achieving results comparable with backprop, but requiring fewer training epochs, thus suggesting potential applications in the context of transfer learning.
In this work, we take a step further and apply Hebbian learning on deeper network architectures. We perform a more detailed investigation of the HWTA learning rule, and we analyze the Hebbian Principal Component Analysis (HPCA) learning rule [13,33] to train deep CNNs.
We compared Hebbian algorithms, which are unsupervised, with another popular unsupervised (but backprop-based) approach, namely the Variational Auto-Encoder (VAE) [14]. We also deemed it interesting to report the results obtained with supervised backprop training on an equivalent network, in order to give a more complete picture of the impact of different learning methodologies on the training process.
Specifically, a six layer try-out network was considered. The network was trained using the various learning approaches on the MNIST [20], CIFAR10, and CIFAR100 [17] datasets. We evaluated the quality of the features extracted from each layer by feeding these features to linear classifiers and evaluating the resulting accuracy. We decided to adopt a simplified network model because the focus of this work is not to evaluate the performance of a new complex network model, but rather to compare different learning approaches on an appropriate architecture. The six layer try-out network allows us to perform extensive experimentation, and to get insights on the effect of different learning paradigms on each network layer, evaluating the quality of the resulting feature extractors on a layer by layer basis.
Furthermore, in order to assess the impact of switching from backprop to Hebbian training layer by layer, we also considered hybrid models in which some network layers are trained with backprop and others with Hebbian learning. Such hybrid models were also studied in [1], but only preliminary results were presented, involving just the HWTA learning rule and just one dataset. In this work, we provide a more comprehensive evaluation of the HWTA rule, as well as the HPCA rule, using more datasets in our experiments.
Although Hebbian learning is an unsupervised approach, supervised variants were also proposed in the literature. Some of these [19,30,34] are based on the concept of a teacher neuron coupled with a purely Hebbian learning rule. In the following, we will refer to classifiers trained with such an approach as Supervised Hebbian Classifiers (SHCs). Other approaches [22,25] are based on the alternation between Hebbian and anti-Hebbian update phases, while also using a supervision signal. This kind of alternating strategy is called Contrastive Hebbian Learning (CHL). Another contribution of this paper is to provide an experimental evaluation of classifiers based on SHC and CHL on the various datasets.
Results in this paper confirm that Hebbian learning can be integrated with backprop, providing comparable accuracy when used to train lower or higher network layers, while requiring fewer training epochs. Moreover, they show that features learned by Hebbian training outperform VAE features in the classification task, with the HPCA variant performing generally better than HWTA.
The main contributions of this paper can be summarized as follows:
• Hebbian Winner-Takes-All (HWTA) and nonlinear Hebbian Principal Component Analysis (HPCA) learning rule variants, properly integrated with convolutional layers (Convolutional HWTA/HPCA), are applied to learn feature extractors in CNNs;
• The results on various datasets are compared with those obtained by unsupervised VAE, and the potentials and limitations of the methods are highlighted; we also deemed it interesting to report the results of supervised backprop training in our discussion;
• We also provide an experimental evaluation of hybrid neural network training (i.e., a scenario in which some network layers are trained with backprop and others with the Hebbian approach) and supervised Hebbian learning variants on various datasets.
The remainder of this paper is structured as follows: Sect. 2 provides a background on the related literature; Sect. 3 describes our scenario of investigation, including how Hebbian learning is integrated with convolutional layers, hybrid network models, SHC, and CHL classifiers; Sect. 4 delves into the details of our experimental setup; in Sect. 5, the results of our simulations are illustrated; finally, Sect. 6 presents our conclusions and outlines possible future developments.

Background and related work
Consider a single neuron with weight vector $\mathbf{w}$ and input $\mathbf{x}$. Call $y = \mathbf{w}^T \mathbf{x}$ the neuron output. The Hebbian learning rule, in its most basic form, can be expressed mathematically as [8]:
$$\mathbf{w}_{new} = \mathbf{w}_{old} + \Delta\mathbf{w} \tag{1}$$
where $\mathbf{w}_{new}$ is the updated weight vector, $\mathbf{w}_{old}$ is the old weight vector, and $\Delta\mathbf{w}$ is the weight update. The latter term is computed, according to Hebbian learning, as follows:
$$\Delta\mathbf{w} = \eta \, y \, \mathbf{x} \tag{2}$$
where $\eta$ is the learning rate. Basically, this rule states that the weight on a given synapse is reinforced when the input on that synapse and the output of the neuron are simultaneously high. Therefore, connections between neurons whose activations are correlated are reinforced.
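As a small illustration of the basic rule described above (weight change equal to learning rate times output times input), the following Python sketch applies it to a single linear neuron. The function name `hebbian_step`, the learning rate, and the toy input are illustrative assumptions, not part of the original experiments.

```python
def hebbian_step(w, x, lr=0.1):
    """One plain Hebbian update for a single linear neuron (illustrative sketch)."""
    y = sum(wi * xi for wi, xi in zip(w, x))           # neuron output y = w^T x
    return [wi + lr * y * xi for wi, xi in zip(w, x)]  # w_new = w_old + lr * y * x

# Toy example: the same input is presented three times.
w = [0.5, 0.5]
for _ in range(3):
    w = hebbian_step(w, [1.0, 0.0])
```

In this toy run the weight on the active input grows by a constant factor at every presentation while the other weight stays unchanged, which hints at why the plain rule needs some mechanism to keep the weights bounded.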

Hebbian WTA
To prevent weights from growing unbounded, a weight decay term is generally added. In the context of competitive learning [7,15,32], this is obtained as follows:
$$\Delta\mathbf{w} = \eta \, y \, (\mathbf{x} - \mathbf{w}) \tag{3}$$
This rule has an intuitive interpretation: when an input vector is presented to the neuron, its vector of weights is updated in order to move it closer to the input, so that the neuron will respond more strongly when a similar input is presented. When several similar inputs are presented to the neuron, the weight vector converges to the center of the cluster formed by these inputs (Fig. 1). When multiple neurons are involved in a complex network, the Winner-Takes-All (WTA) [7,32] strategy can be adopted to force different neurons to learn different patterns, corresponding to different clusters of inputs. When an input is presented to a WTA layer, the neuron whose weight vector is closest to the current input is elected as winner. Only the winner is allowed to perform a weight update, thus moving its weight vector closer to the current input (Fig. 2). If a similar input is presented again in the future, the same neuron will be more likely to win again. This strategy allows a group of neurons to perform clustering on a set of data points (Fig. 2).
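The clustering behavior of the WTA strategy can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the winner is chosen by squared Euclidean distance, the winner's activation is treated as 1 in its update, and the names `wta_step`, `weights`, and `data` as well as the toy points are hypothetical.

```python
def wta_step(weights, x, lr=0.1):
    """Winner-Takes-All update: only the closest weight vector moves toward x."""
    def sq_dist(w):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    winner = min(range(len(weights)), key=lambda i: sq_dist(weights[i]))
    # Assumption: the winner's activation is taken as 1, so its update is lr * (x - w).
    weights[winner] = [wi + lr * (xi - wi) for wi, xi in zip(weights[winner], x)]
    return winner

# Two toy clusters, centered near (0, 0) and (1, 1).
weights = [[0.3, 0.2], [0.6, 0.9]]
data = [[0.1, -0.1], [0.9, 1.1], [-0.1, 0.1], [1.1, 0.9]]
for _ in range(200):
    for x in data:
        wta_step(weights, x)
```

After training, each weight vector sits close to the center of one of the two clusters, with a small residual oscillation because the learning rate is kept constant.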
In recent works [36,37], WTA and the variant k-WTA (in which the k neurons with highest activations are elected as winners) were applied in the context of computer vision to train a three layer CNN to extract features from images, in order to perform classification. Similar paradigms were also studied in the context of Spiking Neural Networks (SNNs) [4,5]. These works showed that the approach is suitable to train relatively shallow networks (e.g., with two or three layers), achieving accuracy around 65-70% on CIFAR-10 and from 95% up to 98-99% on MNIST, which is comparable to backpropagation-based approaches on networks of the same depth.
In [1,19], the authors provided preliminary experiments on a single dataset (CIFAR10), by applying Hebbian-WTA learning to CNNs with up to six layers, comparing the results with those obtained by training the same network with backprop. The WTA approach, as it is, is unsupervised, but a supervised Hebbian learning variant was also proposed in order to train the final classification layer. The results confirmed that the approach was effective for training shallow networks. It was also found that the approach was effective for re-training the higher layers (including the final classifier) of a pre-trained network. In addition, the algorithm required far fewer epochs than backprop to converge.
The novel contributions of this work with respect to the previous one are that more extensive experimentation is performed using multiple datasets (MNIST, CIFAR10, CIFAR100), and a novel learning rule is also explored, in addition to Hebbian WTA. This is the Hebbian PCA learning rule, which is explained in the next sub-section. Moreover, we added experiments with VAE, for comparison with state-of-the-art backprop-based unsupervised learning. Finally, we performed experiments involving the supervised CHL and SHC methods, making comparisons between the two approaches and SGD training.

Hebbian PCA
According to the definition given above, WTA enforces a kind of quantized information encoding in the layers of a neural network: only one neuron activates to encode the presence of a given pattern in the input. On the other hand, neural networks trained with backpropagation exhibit a distributed representation, where multiple neurons activate combinatorially to encode different properties of the input, resulting in an improved coding power. The importance of distributed representations was also highlighted in [6,24]. A more distributed coding scheme could be obtained by having neurons extract principal components from data, which can be achieved with Hebbian-type learning rules [3,33]. In order to perform Hebbian PCA, a set of weight vectors has to be determined, for the various neurons, that minimize the representation error, defined as:
$$L(\mathbf{w}_i) = E\left[\left(\mathbf{x} - \sum_{j=1}^{i} y_j \mathbf{w}_j\right)^2\right] \tag{4}$$
where the subscript $i$ refers to the $i$-th neuron in a given layer and $E[\cdot]$ is the mean value operator. It can be pointed out that, in the case of linear neurons and zero centered data, this reduces to the classical PCA objective of maximizing the output variance, with the weight vectors subject to orthonormality constraints [3,13,33]. From now on, we assume that the input data are centered around zero. If this is not true, we just need to subtract the average $E[\mathbf{x}]$ from the inputs beforehand. It can be shown that the following learning rule minimizes the objective in Eq. 4 [33]:
$$\Delta\mathbf{w}_i = \eta \, y_i \left(\mathbf{x} - \sum_{j=1}^{i} y_j \mathbf{w}_j\right) \tag{5}$$
In case of nonlinear neurons, a solution to the problem can still be found [13]. Calling $f()$ the neuron activation function, the representation error
$$L(\mathbf{w}_i) = E\left[\left(\mathbf{x} - \sum_{j=1}^{i} f(y_j) \mathbf{w}_j\right)^2\right] \tag{6}$$
can be minimized with the following nonlinear version of the Hebbian PCA rule:
$$\Delta\mathbf{w}_i = \eta \, f(y_i) \left(\mathbf{x} - \sum_{j=1}^{i} f(y_j) \mathbf{w}_j\right) \tag{7}$$
Several variants of the Hebbian PCA approach were explored in the literature for the linear case [3,28,29,33], and applied in the context of computer vision [2], but only for relatively shallow networks. In our experiments, we applied the nonlinear version of the Hebbian PCA rule also on deeper networks, as explained in the following sections.
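To make the Hebbian PCA update more concrete, the sketch below implements one step in plain Python. It assumes the form in which neuron i is updated by its (possibly nonlinear) output times the residual of the input after subtracting the reconstruction from neurons 1 to i; with the default identity activation this is Sanger's rule. The names `hpca_step`, `W`, and `data`, the learning rate, and the toy data set are illustrative assumptions.

```python
def hpca_step(W, x, lr=0.001, f=lambda y: y):
    """One Hebbian PCA update (Sanger's rule when f is the identity).

    W is a list of weight vectors, one per neuron; it is updated in place.
    """
    ys = [f(sum(wi * xi for wi, xi in zip(w, x))) for w in W]
    recon = [0.0] * len(x)            # running sum over neurons j <= i of y_j * w_j
    updates = []
    for w, y in zip(W, ys):
        recon = [r + y * wi for r, wi in zip(recon, w)]
        updates.append([lr * y * (xi - ri) for xi, ri in zip(x, recon)])
    for w, dw in zip(W, updates):     # apply all updates after computing them
        for k in range(len(w)):
            w[k] += dw[k]

# Zero-centered toy data with variance 9 along the first axis and 1 along the second.
data = [[3.0, 1.0], [-3.0, 1.0], [3.0, -1.0], [-3.0, -1.0]]
W = [[0.3, 0.4], [0.4, -0.2]]
for _ in range(5000):
    for x in data:
        hpca_step(W, x)
```

On this toy data the first weight vector aligns (up to sign) with the first axis and the second with the second axis, i.e., with the two principal directions in order of decreasing variance.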

Supervised Hebbian learning
While the Hebbian approaches discussed so far are unsupervised, Hebbian learning can also be adapted to the supervised setting. We consider two approaches for doing so, the Supervised Hebbian Classifier (SHC) [19] and the Contrastive Hebbian Learning (CHL) [22] classifier. The SHC approach is based on the concept of a teacher neuron [30,34,37], which ideally provides the target signal to a trainable neuron. The teacher's signal replaces the actual output of the neuron so that, when the Hebbian principle is applied, it reinforces the correlation between the input and the teacher-provided output. In this way, when a similar input is presented again, the neuron tends to produce a similar response. The SHC is realized by applying this principle in combination with the learning rule in Eq. 3. More specifically, calling $t$ the teacher signal, the learning rule becomes:
$$\Delta\mathbf{w} = \eta \, t \, (\mathbf{x} - \mathbf{w}) \tag{8}$$
The teacher signal $t$ should be 1 if the input's class corresponds to that associated with the neuron, and 0 otherwise. The effect of this rule is that the neuron's weight vector will converge towards the centroid of the cluster formed by only those inputs associated with the target class that the neuron is supposed to detect.
In CHL, the network alternates between two processing stages, a free phase and a clamped phase. During the free phase, ordinary processing occurs. Let us denote the input and output of a neuron after the free phase as $\mathbf{x}^-$ and $y^-$, respectively. An anti-Hebbian update is computed after the free phase, according to the formula:
$$\Delta\mathbf{w}^- = -\eta \, y^- \, \mathbf{x}^- \tag{9}$$
During the clamped phase, the neuron outputs are clamped to a desired value. Call $\mathbf{x}^+$ and $y^+$ the input and output of a neuron after the clamped phase. At this point, a regular Hebbian update is performed:
$$\Delta\mathbf{w}^+ = \eta \, y^+ \, \mathbf{x}^+ \tag{10}$$
This approach was shown to be able to approximate backprop training under mild conditions [38], but in a biologically plausible and Hebbian fashion.
CHL can be applied for training a linear classification layer by replacing the classifier's output $y^+$ with the teacher signal $t$ during the clamped phase (while the inputs $\mathbf{x}^+ = \mathbf{x}^- = \mathbf{x}$ are the same for the two phases), thus leading to the total update:
$$\Delta\mathbf{w} = \Delta\mathbf{w}^+ + \Delta\mathbf{w}^- = \eta \, (t - y) \, \mathbf{x} \tag{11}$$
where $y = y^-$ is the free-phase output. Note that this update is equivalent to a gradient descent update of a linear classifier on a Mean Squared Error (MSE) loss [8,22,25].
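The two supervised updates, and the stated equivalence between the CHL update of a linear classifier and a gradient step on a squared-error loss, can be checked numerically with the following sketch. The function names, the learning rate, and the toy vectors are hypothetical; the loss is assumed to be 0.5 (t - y)^2 so that no extra constant factor appears.

```python
def shc_step(w, x, t, lr=0.1):
    """Supervised Hebbian update: the teacher signal t replaces the neuron output."""
    return [wi + lr * t * (xi - wi) for wi, xi in zip(w, x)]

def chl_step(w, x, t, lr=0.1):
    """CHL update for a linear unit: anti-Hebbian free phase plus Hebbian clamped phase."""
    y = sum(wi * xi for wi, xi in zip(w, x))   # free-phase output
    anti = [-lr * y * xi for xi in x]          # anti-Hebbian term (free phase)
    hebb = [lr * t * xi for xi in x]           # Hebbian term (output clamped to t)
    return [wi + a + h for wi, a, h in zip(w, anti, hebb)]

def mse_gd_step(w, x, t, lr=0.1):
    """Gradient descent step on 0.5 * (t - y)^2 for the same linear unit."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + lr * (t - y) * xi for wi, xi in zip(w, x)]

w0, x0, t0 = [0.2, 0.1], [1.0, 2.0], 1.0
w_chl = chl_step(w0, x0, t0)
w_gd = mse_gd_step(w0, x0, t0)
```

For the toy values above, both `w_chl` and `w_gd` come out as approximately `[0.26, 0.22]`, while `shc_step` leaves the weights untouched whenever the teacher signal is 0.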

Hebbian learning on deep CNNs
In the following, we describe our approach to using Hebbian learning with deep CNNs. We introduce the strategy used for integrating Hebbian learning methods with convolutional layers, and the technique used to extend the Hebbian learning approach to a supervised setting. In addition, we introduce the try-out neural network architecture used to evaluate our approach, and the hybrid (Hebbian-backprop) learning modality.

Convolutional HWTA/HPCA
In order to be able to use the Hebbian rules with CNNs, we had to define a proper way to integrate these rules with convolutional layers. In particular, neurons at different horizontal and vertical offsets of the convolutional layer are constrained to have shared weights. Previous works [2,36] handled convolutions with Hebbian learning by extracting random patches from the images, or by processing patches sequentially, one at a time, and feeding each patch to a single column of convolutional filters. This approach is poorly parallelizable, and does not exploit all the information contained in the image.
In order to meet the convolutional constraints, we considered a different approach, in which the learning rule was adapted as follows: each set of neurons looking at the same portion of the image computed their updates by applying the desired rule, the input x being the patch extracted from the image at the specific horizontal and vertical position. We then averaged the updates over the horizontal and vertical dimensions (Fig. 3). The resulting update was applied to the kernel shared by all the neurons at different horizontal and vertical locations. When mini-batches of inputs were used during training, the update averaging was performed also over the mini-batch dimension.
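The update-averaging scheme described above can be sketched as follows. This is a deliberately simplified illustration, not the implementation used in the experiments: it assumes single-channel images, a single kernel, stride 1 and no padding, and the helper names (`extract_patches`, `conv_hebbian_update`, `rule`) are hypothetical. Any per-patch Hebbian rule can be passed in as `rule`.

```python
def extract_patches(img, k):
    """All k x k patches of a 2-D image (stride 1, no padding), flattened row by row."""
    H, W = len(img), len(img[0])
    return [[img[r + i][c + j] for i in range(k) for j in range(k)]
            for r in range(H - k + 1) for c in range(W - k + 1)]

def conv_hebbian_update(kernel, batch, k, rule):
    """Average the per-location updates over spatial positions and over the mini-batch."""
    total = [0.0] * len(kernel)
    count = 0
    for img in batch:
        for patch in extract_patches(img, k):
            dw = rule(kernel, patch)               # update computed at this location
            total = [t + d for t, d in zip(total, dw)]
            count += 1
    return [t / count for t in total]              # single update for the shared kernel

# Toy rule: move the kernel toward the patch (learning rate 1, for readability).
rule = lambda w, x: [xi - wi for wi, xi in zip(w, x)]
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
update = conv_hebbian_update([0.0] * 4, [image], 2, rule)
```

With a zero kernel and this toy rule, `update` is simply the average of the four 2x2 patches of the image, which is the quantity that would be added to the shared kernel.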

SHC and CHL classifiers
In order to evaluate Hebbian learning also in the supervised setting, we implemented SHC and CHL classifiers. These classifiers are trained on top of the features extracted from pre-trained networks, freezing the already trained network layers.
SHCs are trained using the learning rule in Eq. 8. The teacher signal was set to the target output that the neuron was required to produce for a given input. Similarly, CHL classifiers are trained according to Eq. 11, where the free phase output is the ordinary output provided by the classifier, and the clamped phase output was set to the target value.

Network architecture and evaluation
The focus of this work is not to evaluate the performance of a complex network architecture. Rather, we aim at evaluating and comparing the effects of Hebbian learning approaches, supervised backprop, and VAE under various settings. Accordingly, we defined a try-out model, with which it is possible to perform a large number of experiments and get insights about the effect of the learning approach on various network layers, by evaluating the quality of the features extracted from the network on a layer by layer basis. This architecture also makes the experiments easier for other researchers to reproduce. The following subsections illustrate the try-out network architecture and the evaluation procedure.

Try-out neural network architecture
The deep neural network used in this work consists of six layers: five layers plus a final linear classifier. The various layers are interleaved with other processing stages (such as ReLU nonlinearities, max pooling, etc.), as shown in Fig. 4. The architecture is inspired by AlexNet [18], where one of the fully connected layers was removed and, in general, the number of neurons was slightly modified, to allow a finer grained analysis of the various learning approaches. In our experiments we compared both HWTA and HPCA learning approaches with supervised backprop and VAE. Below, we also discuss more details of the VAE and supervised backprop training.

Variational auto-encoder for unsupervised learning
We compared the unsupervised Hebbian approaches with another popular unsupervised method, namely the Variational Auto-Encoder (VAE) [14]. We considered the VAE architecture shown in Fig. 5: the try-out network model in Fig. 4, up to layer 5, acted as encoder, with a fully connected layer mapping the output feature map to a 256-dimensional Gaussian latent variable representation, while a mirrored network branch acted as decoder.

Backprop training for supervised learning
The first part of our experiments is mainly focused on comparing unsupervised learning approaches, i.e., Hebbian learning and VAE. Nonetheless, we also deemed it interesting to include the results provided by supervised backprop learning in our discussion. For this purpose, we also report the results obtained by training a network with the same architecture as the try-out model shown in Fig. 4, by using supervised end-to-end Stochastic Gradient Descent (SGD) training on a cross-entropy loss.

Evaluating internal network layers
As we will also discuss in Sect. 5, we aim at evaluating how the Hebbian approach affects the capability of learning feature extractors in the various layers of the try-out neural network, on a layer by layer basis. In order to evaluate the quality of the features extracted from the various layers of the trained models, we cut the try-out network at each of the various layers, and we placed a linear classifier on top of each already trained layer (for example, Fig. 6 shows a classifier on top of the first network layer). Then, we evaluated the accuracy achieved by classifying the corresponding features. This was done for the Hebbian-trained networks and for the VAE network, in order to compare the results, and also for the supervised backprop-trained network, as we also deemed it interesting to include these results in our discussion.
Fig. 4 The try-out neural network used for the experiments (image from [1])

Hybrid network models
We also implemented hybrid network learning, i.e., scenarios in which some network layers were trained with backprop and others were trained with the Hebbian approach (Fig. 7), in order to assess the impact on accuracy when replacing backprop layers with their Hebbian equivalents. The models were constructed by replacing the upper layers of a pre-trained network with new ones, and training from scratch using different learning algorithms. Meanwhile, the lower layers remained frozen, in order to avoid adaptation to the new upper layers. Various configurations of layers were considered.

Details of training
We implemented our experiments using PyTorch. All the hyperparameters discussed below resulted from a parameter search, based on Coordinate Descent (CD) [16], to maximize the validation accuracy in the respective scenarios. CD works as follows: starting from an initially selected point in hyperparameter space, one coordinate (i.e., hyperparameter) at a time is perturbed, and the resulting hyperparameter configuration is evaluated. Hyperparameters are updated in the direction of the perturbation that leads to an improvement in the result. The steps are the following: 1) get a hyperparameter set according to CD based on previous validation results; 2) train the model with the given hyperparameters and record the resulting validation accuracy; 3) repeat from step 1 until no further improvement is obtained.
Concerning the datasets that we used, the MNIST dataset contains 60,000 training samples and 10,000 test samples, divided in 10 classes representing handwritten digits from 0 to 9. In our experiments, we further divided the training samples into 50,000 samples that were actually used for training, and 10,000 for validation. The CIFAR10 and CIFAR100 datasets contain 50,000 training samples and 10,000 test samples, divided in 10 and 100 classes, respectively, representing natural images. In our experiments, we further divided the training samples into 40,000 samples that were actually used for training, and 10,000 for validation.
In order to obtain the best possible generalization, early stopping was used in each training session, i.e., we chose as final trained model the state of the network at the epoch when the highest validation accuracy was recorded.
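The coordinate descent search over hyperparameters described above can be sketched as a simple greedy loop. This is only an illustration under stated assumptions: the function `coordinate_descent`, the fixed additive perturbation sizes in `steps`, and the toy objective are hypothetical, and in practice `evaluate` would train a model with the given hyperparameters and return its validation accuracy.

```python
def coordinate_descent(evaluate, start, steps, max_rounds=20):
    """Greedy coordinate descent that maximizes evaluate(params)."""
    best = dict(start)
    best_score = evaluate(best)
    for _ in range(max_rounds):
        improved = False
        for name, step in steps.items():        # perturb one hyperparameter at a time
            for delta in (step, -step):
                trial = dict(best)
                trial[name] = best[name] + delta
                score = evaluate(trial)
                if score > best_score:          # keep the perturbation if it helps
                    best, best_score, improved = trial, score, True
                    break
        if not improved:                        # stop when no coordinate improves
            break
    return best, best_score

# Toy stand-in for "train and return validation accuracy" (maximum at a=3, b=-1).
toy_eval = lambda p: -(p["a"] - 3) ** 2 - (p["b"] + 1) ** 2
best, best_score = coordinate_descent(toy_eval, {"a": 0, "b": 0}, {"a": 1, "b": 1})
```

On the toy objective the search ends at `{"a": 3, "b": -1}`. Real hyperparameters such as learning rates are usually perturbed multiplicatively rather than additively, which would be a small change to the `trial[name]` line.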

Training the try-out network
We used the try-out network architecture shown in Fig. 4. The model was fed with RGB images of size 32x32 pixels as inputs. The network was trained using Stochastic Gradient Descent (SGD) with error backpropagation and cross-entropy loss, with the HPCA rule in Eq. 7 (in which the nonlinearity was set to the ReLU function), and with the HWTA rule. During Hebbian training, the final classifier was trained using the SHC approach, according to Eq. 8.
Training was performed in 20 epochs (although, for the Hebbian approach, convergence was typically achieved in far fewer epochs) using mini-batches of size 64.
For SGD training, the initial learning rate was set to $10^{-3}$ and kept constant for the first ten epochs, while it was halved every two epochs for the remaining ten epochs. We also used momentum coefficient 0.9, and Nesterov correction [10].
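The learning rate schedule for SGD can be written as a small function of the epoch index. The sketch below is one possible reading of the description: it assumes 0-indexed epochs and that the first halving takes effect at the start of the eleventh epoch; the exact timing of the first halving is an assumption, and the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-3, constant_epochs=10, halve_every=2):
    """Learning rate for a 0-indexed epoch: constant first, then halved periodically."""
    if epoch < constant_epochs:
        return base_lr
    # Assumption: the first halving is applied as soon as the constant phase ends.
    n_halvings = (epoch - constant_epochs) // halve_every + 1
    return base_lr * 0.5 ** n_halvings

schedule = [learning_rate(e) for e in range(20)]
```

Under this reading, the rate stays at 0.001 for epochs 0 to 9, drops to 0.0005 for epochs 10 and 11, and is halved five times in total by the last two epochs.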
In contrast to standard momentum (which first corrects the accumulated momentum with the current gradient estimate and then updates the weights in the resulting direction), the Nesterov method first updates the weights in the momentum direction, and then applies a correction to the accumulated momentum given by the gradient estimate at the new location. This look-ahead strategy helps correct optimization trajectories and improves convergence.
Dropout rate was set to 0.5. An L2 penalty was also used to improve regularization. We recall that this is a regularization term of the form $\lambda \|\mathbf{w}\|^2$ that is added to the loss function, in order to penalize large weights. Here, $\lambda$ is the weight decay coefficient, which was set to $5 \cdot 10^{-2}$ for MNIST and CIFAR10, and to $10^{-2}$ for CIFAR100.
In the HPCA and HWTA training, the learning rate was set to $10^{-3}$. No L2 regularization or dropout was used in this case, since the learning method did not present overfitting issues. In case of HWTA training, images were preprocessed by a whitening transformation as described in [17], although this step did not have any significant effect for the other training methods.
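The difference between the two momentum variants can be seen on a one-dimensional toy problem. The sketch below uses a common textbook formulation in which the velocity absorbs the learning rate; the function names, the step size, and the quadratic objective are illustrative assumptions and are not meant to mirror the exact update used by any particular library.

```python
def momentum_step(w, v, grad, lr=0.1, mu=0.9):
    """Standard momentum: correct the velocity with the gradient at the current point."""
    v = mu * v - lr * grad(w)
    return w + v, v

def nesterov_step(w, v, grad, lr=0.1, mu=0.9):
    """Nesterov momentum: evaluate the gradient at the look-ahead point."""
    lookahead = w + mu * v              # first move in the momentum direction
    v = mu * v - lr * grad(lookahead)   # then correct with the gradient found there
    return w + v, v

grad = lambda w: w                      # gradient of the toy objective 0.5 * w**2
w_m, v_m = 1.0, 0.0
w_n, v_n = 1.0, 0.0
for _ in range(200):
    w_m, v_m = momentum_step(w_m, v_m, grad)
    w_n, v_n = nesterov_step(w_n, v_n, grad)
```

Both variants drive the weight toward the minimum at zero; on this quadratic the look-ahead version contracts somewhat faster per step, which is consistent with the intuition given above.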

VAE training
VAE training of the network in Fig. 5 was performed in the same fashion as for the try-out network training but, obviously, in an unsupervised image encoding-decoding task. Specifically, the model was trained using the β-VAE [11] Variational Lower-Bound unsupervised criterion, with coefficient β = 0.5. Neither L2 penalty nor dropout was used in this case. Note that the decoder part was removed at test time and the features extracted from encoder layers were used for classification.
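For reference, a β-VAE criterion combines a reconstruction term with a KL divergence term weighted by β. The sketch below is a minimal illustration, assuming a diagonal Gaussian posterior, a standard normal prior, and a squared-error reconstruction term; the reconstruction term actually used in the experiments is not specified here, and the function names are hypothetical.

```python
import math

def gaussian_kl(mu, logvar):
    """KL divergence between N(mu, diag(exp(logvar))) and the standard normal."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, logvar))

def beta_vae_loss(x, x_recon, mu, logvar, beta=0.5):
    """Reconstruction error plus beta-weighted KL term (to be minimized)."""
    # Assumption: squared-error reconstruction; other likelihoods are possible.
    recon = sum((a - b) ** 2 for a, b in zip(x, x_recon))
    return recon + beta * gaussian_kl(mu, logvar)

loss = beta_vae_loss([1.0, 2.0], [1.0, 1.0], mu=[1.0, 0.0], logvar=[0.0, 0.0])
```

In the toy call the reconstruction error is 1 and the KL term is 0.5, so with β = 0.5 the loss is 1.25. Values of β below 1 put relatively more weight on reconstruction than the standard VAE objective.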

Training of classifiers on top of internal layers
The SGD linear classifiers placed on top of the various network layers, as shown in Fig. 6, were trained with supervision, in the same way as we described above for training the whole try-out network. Learning rate was set to $10^{-3}$ and the L2 penalty term was reduced to $5 \cdot 10^{-4}$. CHL classifiers were also trained as above, using the desired target as teacher signal, with learning rate set to $10^{-3}$ and L2 penalty $5 \cdot 10^{-4}$.
The SHC linear classifiers placed on top of the various network layers were trained with learning rate set to $10^{-3}$, but neither learning rate scheduling nor L2 regularization was needed in this case.

Hybrid network training
Hybrid network models were trained using various combinations of Hebbian and backprop layers, as in Fig. 7. Training was performed in a bottom-up approach, i.e., we first started by training the base try-out network with backprop, then we split the network at a desired point, removing all the layers on top, and replacing them with new Hebbian layers. The new Hebbian layers were trained using HWTA or HPCA, as described above, while the bottom layers remained frozen. This process produces a network whose bottom layers are trained with backprop, and whose top layers are trained with Hebbian learning. Again, a new splitting point can be chosen among the Hebbian layers, in order to remove all the Hebbian layers on top of the desired point, replacing them with backprop layers. Retraining the new layers with SGD, while the bottom layers are kept frozen, produces a network alternating backprop-Hebbian-backprop layers, as in Fig. 7. SGD training for the first or the last part of the hybrid networks (i.e., bottom layers or top layers) was performed as described above, but using L2 penalty $5 \cdot 10^{-4}$ for the top layers, when the last splitting point was right before the ultimate or penultimate layer (hence, for retraining the last or the last two layers), and $5 \cdot 10^{-2}$ in all the other cases.

Results
In the following subsections, we present the experimental results on the MNIST, CIFAR10, and CIFAR100 datasets. For each of these datasets, we present Tables 1, 3, 5, showing the accuracy obtained by a linear classifier trained on top of the features extracted from each network layer, in order to assess the quality of the respective features in the classification task. We compare the results of unsupervised HPCA, HWTA, and VAE training. Even though we mainly focus on comparing unsupervised methods, we also deemed it interesting to report the results of supervised backprop (BP) training in our discussion. We also report, in Tables 2, 4, 6, the results obtained when retraining higher layers of a network pre-trained with backprop, together with the number of epochs required for convergence, in order to assess the potential of Hebbian approaches for tasks that involve retraining of higher network layers. In these cases, the final classification layer was trained by SHC, because, as we observed from other experiments (see Appendix 1), this method performed better than CHL on higher network layers, in terms of trade-off between accuracy and training epochs.
Supplementary results, included in Appendix 1, show the results of hybrid training, and the comparison between SHC, CHL, and SGD classifiers.
We performed five independent iterations of each experiment, using different seeds, averaging the results and computing 95% confidence intervals.
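One common way to obtain a 95% confidence interval from five runs is to use Student's t distribution with four degrees of freedom. The sketch below follows that convention as an assumption, since the exact procedure is not detailed here; the function name is hypothetical and the accuracy values are made-up placeholders, not results from the experiments.

```python
import statistics

T_975_DF4 = 2.776  # two-sided 95% critical value of Student's t, 4 degrees of freedom

def mean_ci95(values, t_crit=T_975_DF4):
    """Mean and half-width of a 95% confidence interval for a small sample."""
    m = statistics.mean(values)
    half = t_crit * statistics.stdev(values) / len(values) ** 0.5
    return m, half

# Placeholder accuracies from five hypothetical seeds (illustration only).
mean_acc, half_width = mean_ci95([90.0, 91.0, 92.0, 93.0, 94.0])
```

The result would be reported as `mean_acc ± half_width`; note that the default critical value is only valid for exactly five runs.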

MNIST
In this sub-section we analyze the behavior of Hebbian learning approaches in a simple scenario of digit recognition on the MNIST dataset.

Classifiers on top of internal layers
In Table 1, we report the MNIST test accuracy obtained by classifiers placed on top of the various layers of the try-out network. We report the results obtained on the network trained with, respectively, supervised backprop (BP), VAE, HPCA, and HWTA.
Unsupervised approaches typically suffer from a decrease in performance when going deeper with the number of layers. The reason is that they are not able to exploit a supervision signal that enables the formation of task-specific features that are essential to boost the performance on higher layers. This can be observed both for HWTA and VAE training. With the HPCA approach, the problem seems to be alleviated, and the accuracy remains roughly constant when we move to deeper layers. In particular, the HPCA approach exhibits an increase of almost 2 percentage points with respect to HWTA on the features extracted from the fourth convolutional layer. The Hebbian features appear to behave comparably or better than VAE features, especially on higher layers, with an improvement of up to 8 percentage points on the fifth layer. Moreover, we can observe that both Hebbian approaches reach higher performance with respect to backprop for the features extracted from the first two layers, suggesting possible applications of Hebbian learning for training relatively shallow networks.

CIFAR10
In the previous sub-section, we considered a relatively simple image recognition task involving digits. In this section, we aim at analysing Hebbian learning approaches in a slightly more complex task involving natural image recognition on the CIFAR10 dataset.

Classifiers on top of internal layers
In Table 3, we report the CIFAR10 test accuracy obtained by classifiers placed on top of the various layers of the network. We report the results obtained on the try-out network trained with, respectively, supervised backprop (BP), VAE, HPCA, and HWTA. Also in this case, the HWTA and VAE approaches suffer from a decrease in performance when going deeper with the number of layers. With the HPCA approach, this problem seems to be alleviated, and the accuracy remains roughly constant when we move to deeper layers. In particular, the HPCA approach exhibits an increase of almost 5 percentage points with respect to HWTA on the features extracted from the fifth layer. Still, further research is needed in order to close the gap with backprop also when more layers are added, as it would be desirable to make the Hebbian approach suitable as a biologically plausible alternative to backprop for training deep networks. The Hebbian features appear to behave better than VAE features, especially on higher layers, with an improvement of up to 24 percentage points on the fifth layer. Moreover, we can observe that both Hebbian approaches reach higher or comparable performance with respect to backprop for the features extracted from the first two layers, suggesting possible applications of Hebbian learning for training relatively shallow networks.
Table 4 aims to show that it is possible to replace the last two network layers (including the final classifier) with new ones, and re-train them with the Hebbian approach (in this case, the supervised Hebbian algorithm is used to train the final classifier), achieving accuracy comparable to backprop (with a peak performance drop of just 2-3 percentage points when the last two layers are replaced), but requiring fewer training epochs (1 vs 12, respectively). This suggests potential applications in the context of transfer learning [39].
In the tables, underline represents the best overall result, and bold represents the best result among unsupervised methods.
The Hebbian approaches appear to perform better than VAE, especially when higher layer features are considered. Moreover, HPCA improves over HWTA on higher layer features. It is also possible to observe that Hebbian training achieves comparable results with backprop when lower layer features are concerned.

CIFAR100
In this sub-section, we want to further analyse the scalability of Hebbian learning to a more complex task of natural image recognition involving more classes, namely CIFAR100. In this case, we evaluated the top-5 accuracy, given that CIFAR100 contains a much larger number of classes than the previous datasets.
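As an illustration of the metric, the following is a minimal sketch of how top-5 accuracy can be computed from a matrix of class scores. It assumes NumPy; the function name `top_k_accuracy` and the toy scores are hypothetical and are not taken from our experimental code.

```python
import numpy as np

def top_k_accuracy(scores, labels, k=5):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    # Indices of the k largest scores in each row (order inside the top k is irrelevant).
    top_k = np.argpartition(scores, -k, axis=1)[:, -k:]
    # A sample counts as correct if its label appears anywhere in its top-k set.
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())

# Toy data (hypothetical): 3 samples, 6 classes.
scores = np.array([
    [0.10, 0.90, 0.30, 0.20, 0.05, 0.00],
    [0.60, 0.50, 0.40, 0.30, 0.20, 0.10],
    [0.00, 0.10, 0.20, 0.30, 0.40, 0.50],
])
labels = np.array([1, 4, 0])

top5 = top_k_accuracy(scores, labels, k=5)
top1 = top_k_accuracy(scores, labels, k=1)
```

In this toy case, the second sample is misclassified under top-1 but counted as correct under top-5, which is why the top-5 metric is more forgiving when the number of classes is large.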

Classifiers on top of internal layers
In Table 5, we report the CIFAR100 top-5 test accuracy obtained by classifiers placed on top of the various layers of the try-out network, trained with supervised backprop (BP), VAE, HPCA, and HWTA, respectively. Again, the VAE and HWTA approaches suffer from a decrease in performance as the number of layers grows. The HPCA approach seems to alleviate this problem: its accuracy remains roughly constant when we move to deeper layers. In particular, HPCA exhibits an increase of almost 24% points with respect to HWTA on the features extracted from the fourth convolutional layer. The Hebbian features appear to behave comparably to or better than VAE features, especially on higher layers, with an improvement of up to 36% points on the fifth layer. Moreover, both Hebbian approaches reach competitive performance with respect to backprop for the features extracted from the first three layers, with HPCA in particular improving by 9% points over BP on the first layer, suggesting possible applications of Hebbian learning for training relatively shallow networks. Table 6 shows that it is possible to replace the last two network layers (including the final classifier) with new ones and re-train them with the Hebbian approach (in this case, the supervised Hebbian algorithm is used to train the final classifier). The resulting accuracy is comparable to backprop (with a performance drop smaller than 3% points when the last two layers are re-trained with HPCA), while requiring fewer training epochs (1 vs 7, respectively). This suggests potential applications in the context of transfer learning [39]. Moreover, it can be observed that HPCA performs better than HWTA.

Pros and cons of Hebbian learning
We conclude this Section with a list of pros and cons of Hebbian learning approaches, emerging from the observed results.
Pros of Hebbian learning:
• Effective for training low-level feature extractors;
• Produces better features than VAE for the classification task;

Conclusions and future work
In summary, our results suggest that the Hebbian approach is suitable for training early feature extraction layers or for re-training the final layers of a pre-trained deep neural network, requiring fewer training epochs than other methods. This suggests potential applications in the context of transfer learning, where an experimenter wants to re-train or fine-tune higher network layers of a pre-trained model on a new task. Hebbian approaches outperform VAE training, reducing the gap between unsupervised methods and supervised backprop training. Moreover, the HPCA method seems to perform generally better than HWTA.
In addition, supplementary results in Appendix 1 show that some hybrid combinations of backprop and Hebbian layers appear to be helpful in some cases, offering performance higher than either Hebbian or supervised backprop training alone.
Integration of Hebbian learning and deep learning is still an emerging topic. However, our results are encouraging, motivating further interest in this direction.
In future work, further improvements might come from exploring more complex feature extraction strategies, which can also be formulated as Hebbian learning variants, such as Independent Component Analysis (ICA) [12] and sparse coding [23,24,31]. It might also be promising to apply Hebbian learning to enhance current state-of-the-art network architectures, either as a stand-alone learning algorithm, or in combination with backprop, as an inductive bias for regularization [26], in a semi-supervised fashion.
Hebbian learning has already found application in the context of meta-learning, with the differentiable plasticity model [21]. In this case, the simple Hebbian learning rule, $\Delta w = \eta \, y \, x$, was used, but further improvements might come from applying more advanced Hebbian rules, such as those studied in this paper.
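For concreteness, the simple Hebbian rule mentioned above can be sketched as follows. This is only an illustrative sketch, assuming NumPy and a single linear neuron whose output is y = w · x; the function name `hebbian_step` and the numerical values are hypothetical and do not reproduce the differentiable plasticity model of [21].

```python
import numpy as np

def hebbian_step(w, x, eta=0.01):
    """One plain Hebbian update for a single linear neuron: delta_w = eta * y * x."""
    y = float(w @ x)          # assumed linear activation: y = w . x
    return w + eta * y * x    # weights move along the input, scaled by the output

# Toy values (hypothetical).
w = np.array([0.5, -0.5])
x = np.array([1.0, 0.0])
w_new = hebbian_step(w, x, eta=0.1)
```

Under this linear-neuron assumption, each update leaves the weight norm unchanged or larger, which is one reason why normalized or competitive variants of the rule are usually preferred in practice.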
Finally, an exploration of the behavior of such algorithms with respect to adversarial examples also deserves attention.

Appendix 1: Supplementary results
In this Appendix, we present the additional results on MNIST, CIFAR10, and CIFAR100 datasets. Tables 7, 9

Hybrid network models
In Table 7, we report the results obtained on the MNIST test set with hybrid networks. Each row reports the results for a network with a different combination of Hebbian and backprop layers. We use the letter ''H'' to denote layers trained with the Hebbian approach, the letter ''B'' for layers trained with backprop, and the letter ''G'' for the final classifier (corresponding to the sixth layer), which was trained with SGD in all cases in order to make comparisons on an equal footing. The last two columns show the accuracy obtained with the corresponding combination of layers. Table 7 thus shows the effect of switching a specific layer (or group of layers) in a network from backprop to Hebbian training. The first row below the header represents the network fully trained with backprop. In the next rows, a single layer is switched: both HPCA and HWTA achieve results comparable to full backprop training. A result slightly higher than full backprop is observed when layer 5 is replaced, suggesting that some combinations of layers might actually help to increase performance. In the subsequent rows, more layers are switched from backprop to Hebbian training and a slight performance drop is observed, but the HPCA approach seems to perform generally better than HWTA when more Hebbian layers are involved. The most prominent difference appears when all the network layers are finally replaced with their Hebbian equivalents, in which case the HPCA approach shows an increase of more than 2% points over HWTA.
Table 8 shows a comparison between SHC and SGD classifiers placed on the various layers of a network pre-trained with backprop. The results suggest that SHC is effective in classifying high-level features, achieving accuracy comparable to SGD but requiring fewer training epochs. On the other hand, SHC is less effective on lower layer features, although convergence is still fast, suggesting that the supervised Hebbian approach benefits from more abstract latent representations. CHL appears to perform comparably to SGD training.
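The ''H''/''B''/''G'' notation used for the rows of the hybrid network table can be made concrete with a small sketch. The six-letter string encoding, the function name `parse_hybrid_row`, and the method labels are hypothetical conventions introduced here for illustration only; they are not part of our experimental code.

```python
# Hypothetical mapping from the table letters to training methods.
TRAINING_METHODS = {
    "H": "hebbian",
    "B": "backprop",
    "G": "sgd_classifier",
}

def parse_hybrid_row(row):
    """Return (layer_index, method) pairs for a six-letter row code such as 'BBBBHG'."""
    if len(row) != 6:
        raise ValueError("expected one letter per layer (six layers)")
    if row[-1] != "G":
        raise ValueError("the sixth layer is the SGD-trained classifier ('G')")
    if any(letter not in "HB" for letter in row[:-1]):
        raise ValueError("layers 1-5 must be 'H' or 'B'")
    return [(i + 1, TRAINING_METHODS[letter]) for i, letter in enumerate(row)]

# Network fully trained with backprop, and the same network with layer 5 switched to Hebbian.
full_backprop = parse_hybrid_row("BBBBBG")
layer5_hebbian = parse_hybrid_row("BBBBHG")
```

Read this way, each row is simply a per-layer assignment of a training method, with the classifier fixed to SGD so that rows differ only in the feature extraction layers.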

Hybrid network models
In Table 9, we report the results obtained on the CIFAR10 test set with hybrid networks. The table, which has the same structure as that of the previous sub-section, shows the effect of switching a specific layer (or group of layers) in a network from backprop to Hebbian training. The first row represents the network fully trained with backprop. In the next rows, a single layer is switched. Both HPCA and HWTA exhibit competitive results with respect to full backprop training when they are used to train the first or the fifth network layer. A small, but more significant, drop is observed when inner layers are switched from backprop to Hebbian training. In the subsequent rows, more layers are switched from backprop to Hebbian training and a larger performance drop is observed, but the HPCA approach seems to perform better than HWTA when more Hebbian layers are involved. The most prominent difference appears when all the deep network layers are finally replaced with their Hebbian equivalents, in which case the HPCA approach shows an increase of 15% points over HWTA.
Table 10 shows a comparison between SHC, CHL, and SGD classifiers placed on the various layers of a network pre-trained with backprop. The results suggest that SHC is effective in classifying high-level features, achieving accuracy comparable to SGD but requiring fewer training epochs. On the other hand, SHC is less effective on lower layer features, although convergence is still fast, suggesting that the supervised Hebbian approach benefits from more abstract latent representations. CHL appears to perform comparably to SGD training.

CIFAR100
Hybrid network models
In Table 11, we report the results obtained on the CIFAR100 test set with hybrid networks. The table, which has the same structure as those of the previous sub-sections, shows the effect of switching a specific layer (or group of layers) in a network from backprop to Hebbian training. The first row represents our network fully trained with backprop. In the next rows, a single layer is switched. HWTA exhibits competitive results with respect to full backprop when it is used to train the first or the fifth network layer. A small, but more significant, drop is observed when inner layers are switched from backprop to HWTA. On the other hand, the HPCA approach seems to perform generally better than HWTA. In particular, it slightly outperforms full backprop (by 2% points) when used to train the fifth network layer, suggesting that this kind of hybrid combination might be useful when more complex tasks are involved. In the subsequent rows, more layers are switched from backprop to Hebbian training and a larger performance drop is observed, but the HPCA approach still behaves better than HWTA. The most prominent difference appears when all the network layers are finally replaced with their Hebbian equivalents, in which case the HPCA approach shows an increase of 22% points over HWTA.

Comparison of SHC and SGD
Table 12 shows a comparison between SHC, CHL, and SGD classifiers placed on the various layers of a network pre-trained with backprop. In this case, SHC achieves accuracy comparable to SGD (even with a slight improvement of 6% points on layer 3) while requiring fewer training epochs, suggesting that the approach might be especially useful when more complex tasks are involved.
On the other hand, lower performance is observed in this case when CHL is used, suggesting that this approach has more difficulty scaling to more complex datasets.